linux-spdx.vger.kernel.org archive mirror
* [PATCH v8 00/12] syfs: generic deadlock fix with module removal
@ 2021-09-27 16:37 Luis Chamberlain
  2021-09-27 16:37 ` [PATCH v8 01/12] LICENSES: Add the copyleft-next-0.3.1 license Luis Chamberlain
                   ` (11 more replies)
  0 siblings, 12 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-09-27 16:37 UTC
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: bvanassche, dan.j.williams, joe, tglx, mcgrof, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel

This is a follow up to my v7 series of fixes for the zram driver [0]
which ended up uncovering a generic deadlock issue with sysfs and module
removal. I first reported this issue and proposed a few patches in
March 2021 [1]. At the end of this email you will find an itemized list
of changes since that v1 series. You can also find these changes on my
branch 20210927-sysfs-generic-deadlock-fix [4], which is based on
linux-next tag next-20210927.

Just a heads up, I'm going on vacation in two days and won't be back
until Monday October 11th.

On this v8 I incorporate feedback from the v7 series, namely:

 - Tejun requested I move the struct module to the last attribute when
   extending functions
 - As per discussion with Tejun, trimmed and clarified the commit log
   and documentation on the generic fix on patch 7
 - As requested by Bart Van Assche, I simplified the setting of the
   struct test_config *config into one line instead of two in many
   places on patch 3 which adds the new sysfs selftest
 - Dan Williams had some questions about patch 7, so I clarified these
   using a more elaborate example in the commit log to show
   where the lock call was happening.
 - Trimmed the Cc list considerably as it was way too long before
 - Rebased onto linux-next tag next-20210927

Below is a list of changes to this patch set since its inception:

On v1:
  - Open coded the sysfs deadlock race to only be localized by the zram
    driver
Changes on v2:
  - used bdgrab() as well for another race which was speculated by
    Minchan
  - improved documentation of fixes
Changes on v3:
  - used localized zram macros for the sysfs attributes instead of
    open coding on each routine
  - replaced bdget() stuff with a generic get_device() and bus_get() on
    dev_attr_show() / dev_attr_store() for the issue speculated by
    Minchan
Changes on v4:
  - Cosmetic fixes on the zram fixes as requested by Greg
  - Split out the driver core fix as requested by Greg for the
    issue speculated by Minchan. This fix ended up getting up to its 4th
    patch iteration [2] and eventually hit linux-next. We got a 0day
    suspend stress failure for this patch [3]
Changes on v5:
  - I ended up writing a test_sysfs driver and with it I ended up
    proving that the issue speculated by Minchan was not possible and
    so I asked Greg to drop the patch from his queue titled
    "sysfs: fix kobject refcount to address races with kobject removal"
  - checkpatch fixes for the zram changes
Changes on v6:
  - I submitted my test_sysfs driver for inclusion upstream which easily
    abstracted the deadlock issue in a driver generically [4]
  - I rebased the zram fixes and also added a new patch for zram to use
    ATTRIBUTE_GROUPS. As per Minchan I sent the patches to be merged
    through Andrew Morton.
  - Greg ended up NACK'ing the patchset because he was still not sure
    the fix was correct
Changes on v7:
  - Formalizes the original proposed generic sysfs fix instead of using
    macro helpers to work around the issue
  - I decided it is best to merge all the effort together into
    one patch set because communication was being lost when I split the
    patches up. This was not helping in any way to either fix the zram
    issues or come to consensus on a generic solution. The patches are
    also merged now because they are all related now.
  - Running checkpatch exposed that S_IRWXUGO and S_IRWXU|S_IRUGO|S_IXUGO
    should be replaced, so I did that in this series in two new patches
  - Adds a try_module_get() documentation extension with tribal
    knowledge and new information which I suspect some folks still do
    not believe. The new test_sysfs selftest however proves this
    information to be correct, and the same selftest can be used to try
    to prove that documentation incorrect
  - Because the fix is now generic, zram's deadlock can easily be fixed
    by just making it use ATTRIBUTE_GROUPS().
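
To illustrate the try_module_get() item above: the rule being documented
is that a reference can no longer be taken once module removal has
begun, and that a regular (non-forced) removal does not start while
references are held. The following is only a single-threaded userspace
model of that rule. struct model_module and the model_*() helpers are
hypothetical names made up for this sketch; this is not the kernel
implementation and it ignores concurrency entirely.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for a module; not struct module. */
struct model_module {
	bool live;		/* false once removal has started */
	unsigned int refs;	/* references currently held by users */
};

/* Take a reference only while the module is still live. */
bool model_try_get(struct model_module *mod)
{
	if (!mod->live)
		return false;
	mod->refs++;
	return true;
}

/* Drop a reference taken with model_try_get(). */
void model_put(struct model_module *mod)
{
	if (mod->refs > 0)
		mod->refs--;
}

/* Removal may only begin when nobody holds a reference. */
bool model_start_removal(struct model_module *mod)
{
	if (mod->refs != 0)
		return false;
	mod->live = false;
	return true;
}
```

In this model a caller that wins model_try_get() blocks removal until it
calls model_put(), and a caller that arrives after removal has started
gets a failure back instead of waiting.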

[0] https://lkml.kernel.org/r/YUjLAbnEB5qPfnL8@slm.duckdns.org
[1] https://lkml.kernel.org/r/20210306022035.11266-1-mcgrof@kernel.org
[2] https://lkml.kernel.org/r/20210623215007.862787-1-mcgrof@kernel.org
[3] https://lkml.kernel.org/r/20210701022737.GC21279@xsang-OptiPlex-9020
[4] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210927-sysfs-generic-deadlock-fix

Luis Chamberlain (12):
  LICENSES: Add the copyleft-next-0.3.1 license
  testing: use the copyleft-next-0.3.1 SPDX tag
  selftests: add tests_sysfs module
  kernfs: add initial failure injection support
  test_sysfs: add support to use kernfs failure injection
  kernel/module: add documentation for try_module_get()
  fs/kernfs/symlink.c: replace S_IRWXUGO with 0777 on
    kernfs_create_link()
  fs/sysfs/dir.c: replace S_IRWXU|S_IRUGO|S_IXUGO with 0755
    sysfs_create_dir_ns()
  sysfs: fix deadlock race with module removal
  test_sysfs: enable deadlock tests by default
  zram: fix crashes with cpu hotplug multistate
  zram: use ATTRIBUTE_GROUPS to fix sysfs deadlock module removal

 .../fault-injection/fault-injection.rst       |   22 +
 LICENSES/dual/copyleft-next-0.3.1             |  237 +++
 MAINTAINERS                                   |    9 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c        |    4 +-
 drivers/block/zram/zram_drv.c                 |   74 +-
 fs/kernfs/Makefile                            |    1 +
 fs/kernfs/dir.c                               |   44 +-
 fs/kernfs/failure-injection.c                 |   91 ++
 fs/kernfs/file.c                              |   19 +-
 fs/kernfs/kernfs-internal.h                   |   75 +-
 fs/kernfs/symlink.c                           |    4 +-
 fs/sysfs/dir.c                                |    5 +-
 fs/sysfs/file.c                               |    6 +-
 fs/sysfs/group.c                              |    3 +-
 include/linux/kernfs.h                        |   19 +-
 include/linux/module.h                        |   34 +-
 include/linux/sysfs.h                         |   52 +-
 kernel/cgroup/cgroup.c                        |    2 +-
 lib/Kconfig.debug                             |   25 +
 lib/Makefile                                  |    1 +
 lib/test_kmod.c                               |   12 +-
 lib/test_sysctl.c                             |   12 +-
 lib/test_sysfs.c                              |  952 ++++++++++++
 tools/testing/selftests/kmod/kmod.sh          |   13 +-
 tools/testing/selftests/sysctl/sysctl.sh      |   12 +-
 tools/testing/selftests/sysfs/Makefile        |   12 +
 tools/testing/selftests/sysfs/config          |    5 +
 tools/testing/selftests/sysfs/sysfs.sh        | 1383 +++++++++++++++++
 28 files changed, 3026 insertions(+), 102 deletions(-)
 create mode 100644 LICENSES/dual/copyleft-next-0.3.1
 create mode 100644 fs/kernfs/failure-injection.c
 create mode 100644 lib/test_sysfs.c
 create mode 100644 tools/testing/selftests/sysfs/Makefile
 create mode 100644 tools/testing/selftests/sysfs/config
 create mode 100755 tools/testing/selftests/sysfs/sysfs.sh

-- 
2.30.2



* [PATCH v8 01/12] LICENSES: Add the copyleft-next-0.3.1 license
  2021-09-27 16:37 [PATCH v8 00/12] syfs: generic deadlock fix with module removal Luis Chamberlain
@ 2021-09-27 16:37 ` Luis Chamberlain
       [not found]   ` <202110050907.35FBD2A1@keescook>
  2021-09-27 16:37 ` [PATCH v8 02/12] testing: use the copyleft-next-0.3.1 SPDX tag Luis Chamberlain
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-09-27 16:37 UTC
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: bvanassche, dan.j.williams, joe, tglx, mcgrof, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel, Goldwyn Rodrigues, Kuno Woudt,
	Richard Fontana, copyleft-next, Ciaran Farrell,
	Christopher De Nicolo, Christoph Hellwig, Jonathan Corbet,
	Thorsten Leemhuis

Add the full text of the copyleft-next-0.3.1 license to the kernel
tree as well as the required tags for reference and tooling.
The license text was copied directly from the copyleft-next project's
git tree [0].

Discussion of using copyleft-next-0.3.1 on Linux started in June
2016 [1]. In the end Linus' preference was to have drivers use
MODULE_LICENSE("GPL") to make it clear that the GPL applies when it
comes to Linux [2]. Additionally, even though copyleft-next-0.3.1 has
been found to be GPLv2 compatible by three attorneys at SUSE and
Red Hat [3], to err on the side of caution we simply recommend
always using the "OR" language for this license [4].

Even though it has been a goal of the project to be GPLv2 compatible,
to be certain, in 2016 I asked for a clarification about what makes
copyleft-next GPLv2 compatible and also asked for a summary of
benefits. This prompted some minor changes to make compatibility
even clearer, and as of copyleft-next 0.3.1 compatibility should
be crystal clear [5].

The summary of why copyleft-next 0.3.1 is compatible with GPLv2
is explained as follows:

  Like GPLv2, copyleft-next requires distribution of derivative works
  ("Derived Works" in copyleft-next 0.3.x) to be under the same license.
  Ordinarily this would make the two licenses incompatible. However,
  copyleft-next 0.3.1 says: "If the Derived Work includes material
  licensed under the GPL, You may instead license the Derived Work under
  the GPL." "GPL" is defined to include GPLv2.

In practice this means copyleft-next code in Linux may be licensed
under the GPLv2. However, there are obvious additional gains from
bringing contributions from Linux outbound to projects where
copyleft-next is preferred. A summary of the benefits that might lead
projects outside of Linux to prefer copyleft-next >= 0.3.1 over GPLv2:

o It is much shorter and simpler
o It has an explicit patent license grant, unlike GPLv2
o Its notice preservation conditions are clearer
o More free software/open source licenses are compatible
  with it (via section 4)
o The source code requirement triggered by binary distribution
  is much simpler in a procedural sense
o Recipients potentially have a contract claim against distributors
  who are noncompliant with the source code requirement
o There is a built-in inbound=outbound policy for upstream
  contributions (cf. Apache License 2.0 section 5)
o There are disincentives to engage in the controversial practice
  of copyleft/proprietary dual-licensing
o In 15 years copyleft expires, which can be advantageous
  for legacy code
o There are explicit disincentives to bringing patent infringement
  claims accusing the licensed work of infringement (see 10b)
o There is a cure period for licensees who are not compliant
  with the license (there is no cure opportunity in GPLv2)
o copyleft-next has a 'built-in or-later' provision

The first driver submission to Linux under this dual strategy was
lib/test_sysctl.c through commit 9308f2f9e7f05 ("test_sysctl: add
dedicated proc sysctl test driver") merged in July 2017. Shortly after
that I also added test_kmod through commit d9c6a72d6fa29 ("kmod: add
test driver to stress test the module loader") in the same month. These
two drivers went in just a few months before the SPDX license practice
kicked in. In 2018 Kuno Woudt went through the process to get SPDX
identifiers for copyleft-next [6] [7]. Although there are SPDX tags
for copyleft-next-0.3.0, we only document use in Linux starting from
copyleft-next-0.3.1 which makes GPLv2 compatibility crystal clear.

This patch will let us update the two Linux selftest drivers in
subsequent patches with their respective SPDX license identifiers and
let us remove repetitive license boilerplate.
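
As a rough illustration of what tooling could look for once the tag is
documented, here is a small userspace sketch that checks whether a
source line carries an SPDX tag offering copyleft-next-0.3.1 as one side
of an "OR" expression. The helper names are hypothetical and the
matching is deliberately simplistic (a single " OR ", no parentheses);
it is not how any existing kernel script works.

```c
#include <stdbool.h>
#include <string.h>

#define SPDX_PREFIX		"SPDX-License-Identifier: "
#define COPYLEFT_NEXT_ID	"copyleft-next-0.3.1"

/* True if s begins with the identifier followed by a terminator. */
static bool starts_with_id(const char *s)
{
	size_t n = strlen(COPYLEFT_NEXT_ID);

	if (strncmp(s, COPYLEFT_NEXT_ID, n) != 0)
		return false;
	return s[n] == '\0' || s[n] == '\n' || s[n] == ' ';
}

/*
 * Hypothetical helper: true when line holds an SPDX tag of the form
 * "<license> OR copyleft-next-0.3.1" or the reverse order.
 */
bool spdx_line_is_dual_copyleft_next(const char *line)
{
	const char *expr = strstr(line, SPDX_PREFIX);
	const char *or_pos;

	if (!expr)
		return false;
	expr += strlen(SPDX_PREFIX);

	or_pos = strstr(expr, " OR ");
	if (!or_pos)
		return false;

	/* Identifier on the left of the OR ... */
	if (starts_with_id(expr) &&
	    expr + strlen(COPYLEFT_NEXT_ID) == or_pos)
		return true;

	/* ... or on the right of it. */
	return starts_with_id(or_pos + strlen(" OR "));
}
```

Note that an identifier written with a space instead of the last hyphen
("copyleft-next 0.3.1") is treated as a different string and rejected.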

[0] https://github.com/copyleft-next/copyleft-next/blob/master/Releases/copyleft-next-0.3.1
[1] https://lore.kernel.org/lkml/1465929311-13509-1-git-send-email-mcgrof@kernel.org/
[2] https://lore.kernel.org/lkml/CA+55aFyhxcvD+q7tp+-yrSFDKfR0mOHgyEAe=f_94aKLsOu0Og@mail.gmail.com/
[3] https://lore.kernel.org/lkml/20170516232702.GL17314@wotan.suse.de/
[4] https://lkml.kernel.org/r/1495234558.7848.122.camel@linux.intel.com
[5] https://lists.fedorahosted.org/archives/list/copyleft-next@lists.fedorahosted.org/thread/JTGV56DDADWGKU7ZKTZA4DLXTGTLNJ57/#SQMDIKBRAVDOCT4UVNOOCRGBN2UJIKHZ
[6] https://spdx.org/licenses/copyleft-next-0.3.0.html
[7] https://spdx.org/licenses/copyleft-next-0.3.1.html

Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Kuno Woudt <kuno@frob.nl>
Cc: Richard Fontana <fontana@sharpeleven.org>
Cc: copyleft-next@lists.fedorahosted.org
Cc: Ciaran Farrell <Ciaran.Farrell@suse.com>
Cc: Christopher De Nicolo <Christopher.DeNicolo@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thorsten Leemhuis <linux@leemhuis.info>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 LICENSES/dual/copyleft-next-0.3.1 | 237 ++++++++++++++++++++++++++++++
 1 file changed, 237 insertions(+)
 create mode 100644 LICENSES/dual/copyleft-next-0.3.1

diff --git a/LICENSES/dual/copyleft-next-0.3.1 b/LICENSES/dual/copyleft-next-0.3.1
new file mode 100644
index 000000000000..086bcb74b478
--- /dev/null
+++ b/LICENSES/dual/copyleft-next-0.3.1
@@ -0,0 +1,237 @@
+Valid-License-Identifier: copyleft-next-0.3.1
+SPDX-URL: https://spdx.org/licenses/copyleft-next-0.3.1
+Usage-Guide:
+  This license can be used in code, it has been found to be GPLv2 compatible
+  by attorneys at Redhat and SUSE, however to err on the side of caution,
+  it's best to only use it together with a GPL2 compatible license using "OR".
+  To use the copyleft-next-0.3.1 license put the following SPDX tag/value
+  pair into a comment according to the placement guidelines in the
+  licensing rules documentation:
+    SPDX-License-Identifier: GPL-2.0 OR copyleft-next-0.3.1
+    SPDX-License-Identifier: GPL-2.0-only OR copyleft-next-0.3.1
+    SPDX-License-Identifier: GPL-2.0+ OR copyleft-next-0.3.1
+    SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
+License-Text:
+
+=======================================================================
+
+                      copyleft-next 0.3.1 ("this License")
+                            Release date: 2016-04-29
+
+1. License Grants; No Trademark License
+
+   Subject to the terms of this License, I grant You:
+
+   a) A non-exclusive, worldwide, perpetual, royalty-free, irrevocable
+      copyright license, to reproduce, Distribute, prepare derivative works
+      of, publicly perform and publicly display My Work.
+
+   b) A non-exclusive, worldwide, perpetual, royalty-free, irrevocable
+      patent license under Licensed Patents to make, have made, use, sell,
+      offer for sale, and import Covered Works.
+
+   This License does not grant any rights in My name, trademarks, service
+   marks, or logos.
+
+2. Distribution: General Conditions
+
+   You may Distribute Covered Works, provided that You (i) inform
+   recipients how they can obtain a copy of this License; (ii) satisfy the
+   applicable conditions of sections 3 through 6; and (iii) preserve all
+   Legal Notices contained in My Work (to the extent they remain
+   pertinent). "Legal Notices" means copyright notices, license notices,
+   license texts, and author attributions, but does not include logos,
+   other graphical images, trademarks or trademark legends.
+
+3. Conditions for Distributing Derived Works; Outbound GPL Compatibility
+
+   If You Distribute a Derived Work, You must license the entire Derived
+   Work as a whole under this License, with prominent notice of such
+   licensing. This condition may not be avoided through such means as
+   separate Distribution of portions of the Derived Work.
+
+   If the Derived Work includes material licensed under the GPL, You may
+   instead license the Derived Work under the GPL.
+   
+4. Condition Against Further Restrictions; Inbound License Compatibility
+
+   When Distributing a Covered Work, You may not impose further
+   restrictions on the exercise of rights in the Covered Work granted under
+   this License. This condition is not excused merely because such
+   restrictions result from Your compliance with conditions or obligations
+   extrinsic to this License (such as a court order or an agreement with a
+   third party).
+
+   However, You may Distribute a Covered Work incorporating material
+   governed by a license that is both OSI-Approved and FSF-Free as of the
+   release date of this License, provided that compliance with such
+   other license would not conflict with any conditions stated in other
+   sections of this License.
+
+5. Conditions for Distributing Object Code
+
+   You may Distribute an Object Code form of a Covered Work, provided that
+   you accompany the Object Code with a URL through which the Corresponding
+   Source is made available, at no charge, by some standard or customary
+   means of providing network access to source code.
+
+   If you Distribute the Object Code in a physical product or tangible
+   storage medium ("Product"), the Corresponding Source must be available
+   through such URL for two years from the date of Your most recent
+   Distribution of the Object Code in the Product. However, if the Product
+   itself contains or is accompanied by the Corresponding Source (made
+   available in a customarily accessible manner), You need not also comply
+   with the first paragraph of this section.
+
+   Each direct and indirect recipient of the Covered Work from You is an
+   intended third-party beneficiary of this License solely as to this
+   section 5, with the right to enforce its terms.
+
+6. Symmetrical Licensing Condition for Upstream Contributions
+
+   If You Distribute a work to Me specifically for inclusion in or
+   modification of a Covered Work (a "Patch"), and no explicit licensing
+   terms apply to the Patch, You license the Patch under this License, to
+   the extent of Your copyright in the Patch. This condition does not
+   negate the other conditions of this License, if applicable to the Patch.
+
+7. Nullification of Copyleft/Proprietary Dual Licensing
+
+   If I offer to license, for a fee, a Covered Work under terms other than
+   a license that is OSI-Approved or FSF-Free as of the release date of this
+   License or a numbered version of copyleft-next released by the
+   Copyleft-Next Project, then the license I grant You under section 1 is no
+   longer subject to the conditions in sections 3 through 5.
+
+8. Copyleft Sunset
+
+   The conditions in sections 3 through 5 no longer apply once fifteen
+   years have elapsed from the date of My first Distribution of My Work
+   under this License.
+
+9. Pass-Through
+
+   When You Distribute a Covered Work, the recipient automatically receives
+   a license to My Work from Me, subject to the terms of this License.
+
+10. Termination
+
+    Your license grants under section 1 are automatically terminated if You
+
+    a) fail to comply with the conditions of this License, unless You cure
+       such noncompliance within thirty days after becoming aware of it, or
+
+    b) initiate a patent infringement litigation claim (excluding
+       declaratory judgment actions, counterclaims, and cross-claims)
+       alleging that any part of My Work directly or indirectly infringes
+       any patent.
+
+    Termination of Your license grants extends to all copies of Covered
+    Works You subsequently obtain. Termination does not terminate the
+    rights of those who have received copies or rights from You subject to
+    this License.
+
+    To the extent permission to make copies of a Covered Work is necessary
+    merely for running it, such permission is not terminable.
+
+11. Later License Versions
+
+    The Copyleft-Next Project may release new versions of copyleft-next,
+    designated by a distinguishing version number ("Later Versions").
+    Unless I explicitly remove the option of Distributing Covered Works
+    under Later Versions, You may Distribute Covered Works under any Later
+    Version.
+
+** 12. No Warranty                                                       **
+**                                                                       **
+**     My Work is provided "as-is", without warranty. You bear the risk  **
+**     of using it. To the extent permitted by applicable law, each      **
+**     Distributor of My Work excludes the implied warranties of title,  **
+**     merchantability, fitness for a particular purpose and             **
+**     non-infringement.                                                 **
+
+** 13. Limitation of Liability                                           **
+**                                                                       **
+**     To the extent permitted by applicable law, in no event will any   **
+**     Distributor of My Work be liable to You for any damages           **
+**     whatsoever, whether direct, indirect, special, incidental, or     **
+**     consequential damages, whether arising under contract, tort       **
+**     (including negligence), or otherwise, even where the Distributor  **
+**     knew or should have known about the possibility of such damages.  **
+
+14. Severability
+
+    The invalidity or unenforceability of any provision of this License
+    does not affect the validity or enforceability of the remainder of
+    this License. Such provision is to be reformed to the minimum extent
+    necessary to make it valid and enforceable.
+
+15. Definitions
+
+    "Copyleft-Next Project" means the project that maintains the source
+    code repository at <https://github.com/copyleft-next/copyleft-next.git/>
+    as of the release date of this License.
+
+    "Corresponding Source" of a Covered Work in Object Code form means (i)
+    the Source Code form of the Covered Work; (ii) all scripts,
+    instructions and similar information that are reasonably necessary for
+    a skilled developer to generate such Object Code from the Source Code
+    provided under (i); and (iii) a list clearly identifying all Separate
+    Works (other than those provided in compliance with (ii)) that were
+    specifically used in building and (if applicable) installing the
+    Covered Work (for example, a specified proprietary compiler including
+    its version number). Corresponding Source must be machine-readable.
+
+    "Covered Work" means My Work or a Derived Work.
+
+    "Derived Work" means a work of authorship that copies from, modifies,
+    adapts, is based on, is a derivative work of, transforms, translates or
+    contains all or part of My Work, such that copyright permission is
+    required. The following are not Derived Works: (i) Mere Aggregation;
+    (ii) a mere reproduction of My Work; and (iii) if My Work fails to
+    explicitly state an expectation otherwise, a work that merely makes
+    reference to My Work.
+
+    "Distribute" means to distribute, transfer or make a copy available to
+    someone else, such that copyright permission is required.
+
+    "Distributor" means Me and anyone else who Distributes a Covered Work.
+
+    "FSF-Free" means classified as 'free' by the Free Software Foundation.
+
+    "GPL" means a version of the GNU General Public License or the GNU
+    Affero General Public License.
+
+    "I"/"Me"/"My" refers to the individual or legal entity that places My
+    Work under this License. "You"/"Your" refers to the individual or legal
+    entity exercising rights in My Work under this License. A legal entity
+    includes each entity that controls, is controlled by, or is under
+    common control with such legal entity. "Control" means (a) the power to
+    direct the actions of such legal entity, whether by contract or
+    otherwise, or (b) ownership of more than fifty percent of the
+    outstanding shares or beneficial ownership of such legal entity.
+
+    "Licensed Patents" means all patent claims licensable royalty-free by
+    Me, now or in the future, that are necessarily infringed by making,
+    using, or selling My Work, and excludes claims that would be infringed
+    only as a consequence of further modification of My Work.
+
+    "Mere Aggregation" means an aggregation of a Covered Work with a
+    Separate Work.
+
+    "My Work" means the particular work of authorship I license to You
+    under this License.
+
+    "Object Code" means any form of a work that is not Source Code.
+
+    "OSI-Approved" means approved as 'Open Source' by the Open Source
+    Initiative.
+
+    "Separate Work" means a work that is separate from and independent of a
+    particular Covered Work and is not by its nature an extension or
+    enhancement of the Covered Work, and/or a runtime library, standard
+    library or similar component that is used to generate an Object Code
+    form of a Covered Work.
+
+    "Source Code" means the preferred form of a work for making
+    modifications to it.
-- 
2.30.2



* [PATCH v8 02/12] testing: use the copyleft-next-0.3.1 SPDX tag
  2021-09-27 16:37 [PATCH v8 00/12] syfs: generic deadlock fix with module removal Luis Chamberlain
  2021-09-27 16:37 ` [PATCH v8 01/12] LICENSES: Add the copyleft-next-0.3.1 license Luis Chamberlain
@ 2021-09-27 16:37 ` Luis Chamberlain
  2021-10-05 16:11   ` Kees Cook
  2021-09-27 16:37 ` [PATCH v8 03/12] selftests: add tests_sysfs module Luis Chamberlain
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-09-27 16:37 UTC
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: bvanassche, dan.j.williams, joe, tglx, mcgrof, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel, Goldwyn Rodrigues, Kuno Woudt,
	Richard Fontana, copyleft-next, Ciaran Farrell,
	Christopher De Nicolo, Christoph Hellwig, Jonathan Corbet,
	Thorsten Leemhuis

Two selftest drivers exist under the copyleft-next license.
These drivers were added prior to SPDX practice taking full swing
in the kernel. Now that we have an SPDX tag for copyleft-next-0.3.1
documented, embrace it and remove the boilerplate.

Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Kuno Woudt <kuno@frob.nl>
Cc: Richard Fontana <fontana@sharpeleven.org>
Cc: copyleft-next@lists.fedorahosted.org
Cc: Ciaran Farrell <Ciaran.Farrell@suse.com>
Cc: Christopher De Nicolo <Christopher.DeNicolo@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thorsten Leemhuis <linux@leemhuis.info>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 lib/test_kmod.c                          | 12 +-----------
 lib/test_sysctl.c                        | 12 +-----------
 tools/testing/selftests/kmod/kmod.sh     | 13 +------------
 tools/testing/selftests/sysctl/sysctl.sh | 12 +-----------
 4 files changed, 4 insertions(+), 45 deletions(-)

diff --git a/lib/test_kmod.c b/lib/test_kmod.c
index ce1589391413..d62afd89dc63 100644
--- a/lib/test_kmod.c
+++ b/lib/test_kmod.c
@@ -1,18 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
 /*
  * kmod stress test driver
  *
  * Copyright (C) 2017 Luis R. Rodriguez <mcgrof@kernel.org>
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; either version 2 of the License, or at your option any
- * later version; or, when distributed separately from the Linux kernel or
- * when incorporated into other software packages, subject to the following
- * license:
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of copyleft-next (version 0.3.1 or later) as published
- * at http://copyleft-next.org/.
  */
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
diff --git a/lib/test_sysctl.c b/lib/test_sysctl.c
index 3750323973f4..9e5bd10a930a 100644
--- a/lib/test_sysctl.c
+++ b/lib/test_sysctl.c
@@ -1,18 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
 /*
  * proc sysctl test driver
  *
  * Copyright (C) 2017 Luis R. Rodriguez <mcgrof@kernel.org>
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; either version 2 of the License, or at your option any
- * later version; or, when distributed separately from the Linux kernel or
- * when incorporated into other software packages, subject to the following
- * license:
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of copyleft-next (version 0.3.1 or later) as published
- * at http://copyleft-next.org/.
  */
 
 /*
diff --git a/tools/testing/selftests/kmod/kmod.sh b/tools/testing/selftests/kmod/kmod.sh
index afd42387e8b2..7189715d7960 100755
--- a/tools/testing/selftests/kmod/kmod.sh
+++ b/tools/testing/selftests/kmod/kmod.sh
@@ -1,18 +1,7 @@
 #!/bin/bash
-#
+# SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
 # Copyright (C) 2017 Luis R. Rodriguez <mcgrof@kernel.org>
 #
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of the GNU General Public License as published by the Free
-# Software Foundation; either version 2 of the License, or at your option any
-# later version; or, when distributed separately from the Linux kernel or
-# when incorporated into other software packages, subject to the following
-# license:
-#
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of copyleft-next (version 0.3.1 or later) as published
-# at http://copyleft-next.org/.
-
 # This is a stress test script for kmod, the kernel module loader. It uses
 # test_kmod which exposes a series of knobs for the API for us so we can
 # tweak each test in userspace rather than in kernelspace.
diff --git a/tools/testing/selftests/sysctl/sysctl.sh b/tools/testing/selftests/sysctl/sysctl.sh
index 19515dcb7d04..2046c603a4d4 100755
--- a/tools/testing/selftests/sysctl/sysctl.sh
+++ b/tools/testing/selftests/sysctl/sysctl.sh
@@ -1,16 +1,6 @@
 #!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
 # Copyright (C) 2017 Luis R. Rodriguez <mcgrof@kernel.org>
-#
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of the GNU General Public License as published by the Free
-# Software Foundation; either version 2 of the License, or at your option any
-# later version; or, when distributed separately from the Linux kernel or
-# when incorporated into other software packages, subject to the following
-# license:
-#
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of copyleft-next (version 0.3.1 or later) as published
-# at http://copyleft-next.org/.
 
 # This performs a series tests against the proc sysctl interface.
 
-- 
2.30.2



* [PATCH v8 03/12] selftests: add tests_sysfs module
  2021-09-27 16:37 [PATCH v8 00/12] syfs: generic deadlock fix with module removal Luis Chamberlain
  2021-09-27 16:37 ` [PATCH v8 01/12] LICENSES: Add the copyleft-next-0.3.1 license Luis Chamberlain
  2021-09-27 16:37 ` [PATCH v8 02/12] testing: use the copyleft-next-0.3.1 SPDX tag Luis Chamberlain
@ 2021-09-27 16:37 ` Luis Chamberlain
  2021-10-05 14:16   ` Greg KH
                     ` (2 more replies)
  2021-09-27 16:37 ` [PATCH v8 04/12] kernfs: add initial failure injection support Luis Chamberlain
                   ` (8 subsequent siblings)
  11 siblings, 3 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-09-27 16:37 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: bvanassche, dan.j.williams, joe, tglx, mcgrof, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel

This adds a new selftest module which can be used to test sysfs, which
would otherwise require using an existing driver. This lets us muck
with a template driver to test breaking things without affecting
system behaviour or requiring the dependencies of a real device
driver.

A series of 28 tests is added. Two device types are supported:

  * misc
  * block

Contrary to sysctls, sysfs requires a full write to happen at once, and
so we reduce the digit tests to single writes. Two main sysfs knobs are
provided for testing reading/storing, one which doesn't incur any
delays and another which can incur programmed delays. Which locks are
held, if any, is configurable either at module load time or dynamically
at run time.
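
As a rough illustration of the single-write requirement, here is a minimal
sketch of how userspace could store and read back a knob. The helper names
sysfs_store and sysfs_show are hypothetical and are not part of this patch;
the sketch only assumes the value is short enough for the shell to flush it
in one go.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the patch.

# Store VALUE into KNOB under DIR. The whole value is emitted by a single
# printf, so a short value is expected to reach the file in one write.
sysfs_store()
{
	local dir="$1" knob="$2" value="$3"

	printf '%s\n' "$value" > "${dir}/${knob}"
}

# Read a knob back.
sysfs_show()
{
	local dir="$1" knob="$2"

	cat "${dir}/${knob}"
}
```

With the misc device loaded, this could be used for instance as
"sysfs_store /sys/devices/virtual/misc/test_sysfs0 config_write_delay_msec_y 100"
followed by a store to test_dev_y to exercise the delayed path.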

Since sysfs is technically a filesystem, albeit a pseudo one which
requires a kernel user, our test_sysfs module and its test script
embrace the fstests format for tests in the kernel ring buffer. Likewise,
a scraper for kernel crashes is provided which matches what fstests does
as well.

Two tests are kept disabled as they currently cause a deadlock, and so
this provides a mechanism to easily show proof and demo how the deadlock
can happen:

Demos the deadlock with a device specific lock
./tools/testing/selftests/sysfs/sysfs.sh -t 0027

Demos the deadlock with rtnl_lock()
./tools/testing/selftests/sysfs/sysfs.sh -t 0028
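
For readers who want to picture the race without reading the script, the
following is a sketch under stated assumptions and not necessarily how tests
0027 and 0028 are implemented. It uses only the module parameters and sysfs
paths added by this patch; the function names are hypothetical. The idea is
to keep a store in flight on test_dev_x while module removal takes the same
lock before tearing down the sysfs files.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the patch.

# Print module parameters which make both the sysfs ops and rmmod take
# the same lock on the block device type. Pass "rtnl" to use rtnl_lock().
deadlock_modprobe_args()
{
	local lock_type="$1"
	local args="devtype=1 enable_lock=1 enable_lock_on_rmmod=1"

	if [ "$lock_type" = "rtnl" ]; then
		args="$args use_rtnl_lock=1"
	fi
	echo "$args"
}

# Sketch only: needs root, and on a kernel without a fix this may hang.
deadlock_race_sketch()
{
	local dir="/sys/devices/virtual/block/test_sysfs0"

	# Word splitting of the argument list is intended here.
	modprobe test_sysfs $(deadlock_modprobe_args "$1") || return 1

	# Keep stores in flight for as long as the device exists ...
	( while [ -d "$dir" ]; do echo 1 > "$dir/test_dev_x"; done ) 2>/dev/null &

	# ... while removal grabs the same lock and waits for sysfs ops.
	modprobe -r test_sysfs
	wait
}
```

Neither function runs anything when the file is sourced; calling
"deadlock_race_sketch rtnl" would be the rtnl_lock() flavour.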

Two separate solutions to the deadlock issue have been proposed,
and so now it's a matter of either documenting this limitation or
eventually adopting a generic fix.

This selftest will shortly be expanded upon with more tests which
require further kernel changes in order to provide better test
coverage.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 MAINTAINERS                            |    7 +
 lib/Kconfig.debug                      |   12 +
 lib/Makefile                           |    1 +
 lib/test_sysfs.c                       |  921 ++++++++++++++++++
 tools/testing/selftests/sysfs/Makefile |   12 +
 tools/testing/selftests/sysfs/config   |    2 +
 tools/testing/selftests/sysfs/sysfs.sh | 1208 ++++++++++++++++++++++++
 7 files changed, 2163 insertions(+)
 create mode 100644 lib/test_sysfs.c
 create mode 100644 tools/testing/selftests/sysfs/Makefile
 create mode 100644 tools/testing/selftests/sysfs/config
 create mode 100755 tools/testing/selftests/sysfs/sysfs.sh

diff --git a/MAINTAINERS b/MAINTAINERS
index 0f28fb4b4e5c..1b4cefcb064c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18216,6 +18216,13 @@ L:	linux-mmc@vger.kernel.org
 S:	Maintained
 F:	drivers/mmc/host/sdhci-pci-dwc-mshc.c
 
+SYSFS TEST DRIVER
+M:	Luis Chamberlain <mcgrof@kernel.org>
+L:	linux-kernel@vger.kernel.org
+S:	Maintained
+F:	lib/test_sysfs.c
+F:	tools/testing/selftests/sysfs/
+
 SYSTEM CONFIGURATION (SYSCON)
 M:	Lee Jones <lee.jones@linaro.org>
 M:	Arnd Bergmann <arnd@arndb.de>
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index dbff322207ad..ae19bf1a21b8 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2343,6 +2343,18 @@ config TEST_SYSCTL
 
 	  If unsure, say N.
 
+config TEST_SYSFS
+	tristate "sysfs test driver"
+	depends on SYSFS
+	depends on NET
+	depends on BLOCK
+	help
+	  This builds the "test_sysfs" module. This driver enables testing the
+	  sysfs file system safely without affecting production knobs which
+	  might alter system functionality.
+
+	  If unsure, say N.
+
 config BITFIELD_KUNIT
 	tristate "KUnit test bitfield functions at runtime"
 	depends on KUNIT
diff --git a/lib/Makefile b/lib/Makefile
index 2cfd33917ad5..5143d65f90d6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -61,6 +61,7 @@ obj-$(CONFIG_TEST_FIRMWARE) += test_firmware.o
 obj-$(CONFIG_TEST_BITOPS) += test_bitops.o
 CFLAGS_test_bitops.o += -Werror
 obj-$(CONFIG_TEST_SYSCTL) += test_sysctl.o
+obj-$(CONFIG_TEST_SYSFS) += test_sysfs.o
 obj-$(CONFIG_TEST_HASH) += test_hash.o test_siphash.o
 obj-$(CONFIG_TEST_IDA) += test_ida.o
 obj-$(CONFIG_KASAN_KUNIT_TEST) += test_kasan.o
diff --git a/lib/test_sysfs.c b/lib/test_sysfs.c
new file mode 100644
index 000000000000..2043ca494af8
--- /dev/null
+++ b/lib/test_sysfs.c
@@ -0,0 +1,921 @@
+// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
+/*
+ * sysfs test driver
+ *
+ * Copyright (C) 2021 Luis Chamberlain <mcgrof@kernel.org>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or at your option any
+ * later version; or, when distributed separately from the Linux kernel or
+ * when incorporated into other software packages, subject to the following
+ * license:
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of copyleft-next (version 0.3.1 or later) as published
+ * at http://copyleft-next.org/.
+ */
+
+/*
+ * This module allows us to add race conditions which we can test for
+ * against the sysfs filesystem.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/async.h>
+#include <linux/delay.h>
+#include <linux/vmalloc.h>
+#include <linux/debugfs.h>
+#include <linux/rtnetlink.h>
+#include <linux/genhd.h>
+#include <linux/blkdev.h>
+
+static bool enable_lock;
+module_param(enable_lock, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_lock,
+		 "enable locking on reads / stores from the start");
+
+static bool enable_lock_on_rmmod;
+module_param(enable_lock_on_rmmod, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_lock_on_rmmod,
+		 "enable locking on rmmod");
+
+static bool use_rtnl_lock;
+module_param(use_rtnl_lock, bool_enable_only, 0644);
+MODULE_PARM_DESC(use_rtnl_lock,
+		 "use an rtnl_lock instead of the device mutex_lock");
+
+static unsigned int write_delay_msec_y = 500;
+module_param_named(write_delay_msec_y, write_delay_msec_y, uint, 0644);
+MODULE_PARM_DESC(write_delay_msec_y, "msec write delay for writes to y");
+
+static unsigned int test_devtype;
+module_param_named(devtype, test_devtype, uint, 0644);
+MODULE_PARM_DESC(devtype, "device type to register");
+
+static bool enable_busy_alloc;
+module_param(enable_busy_alloc, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_busy_alloc, "do a fake allocation during writes");
+
+static bool enable_debugfs;
+module_param(enable_debugfs, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_debugfs, "enable a few debugfs files");
+
+static bool enable_verbose_writes;
+module_param(enable_verbose_writes, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_verbose_writes, "enable stores to print verbose information");
+
+static unsigned int delay_rmmod_ms;
+module_param_named(delay_rmmod_ms, delay_rmmod_ms, uint, 0644);
+MODULE_PARM_DESC(delay_rmmod_ms, "if set how many ms to delay rmmod before device deletion");
+
+static bool enable_verbose_rmmod;
+module_param(enable_verbose_rmmod, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_verbose_rmmod, "enable verbose print messages on rmmod");
+
+static int sysfs_test_major;
+
+/**
+ * struct test_config - used for configuring how the sysfs test device will behave
+ *
+ * @enable_lock: if enabled a lock will be used when reading/storing variables
+ * @enable_lock_on_rmmod: if enabled a lock will be used when reading/storing
+ *	sysfs attributes, but it will also be used to lock on rmmod. This is
+ *	useful to test for a deadlock.
+ * @use_rtnl_lock: if enabled instead of configuration specific mutex, we'll
+ *	use the rtnl_lock. If your test case is modifying this on the fly
+ *	while doing other stores / reads, things will break as a lock can be
+ *	left contending. Best is that tests use this knob serially, without
+ *	allowing userspace to modify other knobs while this one changes.
+ * @write_delay_msec_y: the amount of delay to use when writing to y
+ * @enable_busy_alloc: if enabled we'll do a large allocation between
+ *	writes and free it right away. We also schedule to give the
+ *	kernel some time to re-use any memory we don't need. This is intended
+ *	to mimic typical driver behaviour.
+ */
+struct test_config {
+	bool enable_lock;
+	bool enable_lock_on_rmmod;
+	bool use_rtnl_lock;
+	unsigned int write_delay_msec_y;
+	bool enable_busy_alloc;
+};
+
+/**
+ * enum sysfs_test_devtype - sysfs device type
+ * @TESTDEV_TYPE_MISC: misc device type
+ * @TESTDEV_TYPE_BLOCK: use a block device for the sysfs test device.
+ */
+enum sysfs_test_devtype {
+	TESTDEV_TYPE_MISC = 0,
+	TESTDEV_TYPE_BLOCK,
+};
+
+/**
+ * struct sysfs_test_device - test device to help test sysfs
+ *
+ * @devtype: the type of device to use
+ * @config: configuration for the test
+ * @config_mutex: protects configuration of test
+ * @misc_dev: we use a misc device under the hood
+ * @disk: represents a disk when used as a block device
+ * @dev: pointer to misc_dev's own struct device
+ * @dev_idx: unique ID for test device
+ * @x: variable we can use to test read / store
+ * @y: slow variable we can use to test read / store
+ */
+struct sysfs_test_device {
+	enum sysfs_test_devtype devtype;
+	struct test_config config;
+	struct mutex config_mutex;
+	struct miscdevice misc_dev;
+	struct gendisk *disk;
+	struct device *dev;
+	int dev_idx;
+	int x;
+	int y;
+};
+
+static struct sysfs_test_device *first_test_dev;
+
+static struct miscdevice *dev_to_misc_dev(struct device *dev)
+{
+	return dev_get_drvdata(dev);
+}
+
+static struct sysfs_test_device *misc_dev_to_test_dev(struct miscdevice *misc_dev)
+{
+	return container_of(misc_dev, struct sysfs_test_device, misc_dev);
+}
+
+static struct sysfs_test_device *devblock_to_test_dev(struct device *dev)
+{
+	return (struct sysfs_test_device *)dev_to_disk(dev)->private_data;
+}
+
+static struct sysfs_test_device *devmisc_to_testdev(struct device *dev)
+{
+	struct miscdevice *misc_dev;
+
+	misc_dev = dev_to_misc_dev(dev);
+	return misc_dev_to_test_dev(misc_dev);
+}
+
+static struct sysfs_test_device *dev_to_test_dev(struct device *dev)
+{
+	if (test_devtype == TESTDEV_TYPE_MISC)
+		return devmisc_to_testdev(dev);
+	else if (test_devtype == TESTDEV_TYPE_BLOCK)
+		return devblock_to_test_dev(dev);
+	return NULL;
+}
+
+static void test_dev_config_lock(struct sysfs_test_device *test_dev)
+{
+	struct test_config *config = &test_dev->config;
+
+	if (config->enable_lock) {
+		if (config->use_rtnl_lock)
+			rtnl_lock();
+		else
+			mutex_lock(&test_dev->config_mutex);
+	}
+}
+
+static void test_dev_config_unlock(struct sysfs_test_device *test_dev)
+{
+	struct test_config *config = &test_dev->config;
+
+	if (config->enable_lock) {
+		if (config->use_rtnl_lock)
+			rtnl_unlock();
+		else
+			mutex_unlock(&test_dev->config_mutex);
+	}
+}
+
+static void test_dev_config_lock_rmmod(struct sysfs_test_device *test_dev)
+{
+	struct test_config *config = &test_dev->config;
+
+	if (config->enable_lock_on_rmmod)
+		test_dev_config_lock(test_dev);
+}
+
+static void test_dev_config_unlock_rmmod(struct sysfs_test_device *test_dev)
+{
+	struct test_config *config = &test_dev->config;
+
+	if (config->enable_lock_on_rmmod)
+		test_dev_config_unlock(test_dev);
+}
+
+static void free_test_dev_sysfs(struct sysfs_test_device *test_dev)
+{
+	if (test_dev) {
+		kfree_const(test_dev->misc_dev.name);
+		test_dev->misc_dev.name = NULL;
+		kfree(test_dev);
+		test_dev = NULL;
+	}
+}
+
+static void test_sysfs_reset_vals(struct sysfs_test_device *test_dev)
+{
+	test_dev->x = 3;
+	test_dev->y = 4;
+}
+
+static ssize_t config_show(struct device *dev,
+			   struct device_attribute *attr,
+			   char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config = &test_dev->config;
+	int len = 0;
+
+	test_dev_config_lock(test_dev);
+
+	len += snprintf(buf, PAGE_SIZE,
+			"Configuration for: %s\n",
+			dev_name(dev));
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"x:\t%d\n",
+			test_dev->x);
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"y:\t%d\n",
+			test_dev->y);
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"enable_lock:\t%s\n",
+			config->enable_lock ? "true" : "false");
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"enable_lock_on_rmmod:\t%s\n",
+			config->enable_lock_on_rmmod ? "true" : "false");
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"use_rtnl_lock:\t%s\n",
+			config->use_rtnl_lock ? "true" : "false");
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"write_delay_msec_y:\t%u\n",
+			config->write_delay_msec_y);
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"enable_busy_alloc:\t%s\n",
+			config->enable_busy_alloc ? "true" : "false");
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"enable_debugfs:\t%s\n",
+			enable_debugfs ? "true" : "false");
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"enable_verbose_writes:\t%s\n",
+			enable_verbose_writes ? "true" : "false");
+
+	test_dev_config_unlock(test_dev);
+
+	return len;
+}
+static DEVICE_ATTR_RO(config);
+
+static ssize_t reset_store(struct device *dev,
+			   struct device_attribute *attr,
+			   const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config = &test_dev->config;
+
+	/*
+	 * We compromise and simplify this condition and do not use a lock
+	 * here as the lock type can change.
+	 */
+	config->enable_lock = false;
+	config->enable_lock_on_rmmod = false;
+	config->use_rtnl_lock = false;
+	config->enable_busy_alloc = false;
+	test_sysfs_reset_vals(test_dev);
+
+	dev_info(dev, "reset\n");
+
+	return count;
+}
+static DEVICE_ATTR_WO(reset);
+
+static void test_dev_busy_alloc(struct sysfs_test_device *test_dev)
+{
+	struct test_config *config = &test_dev->config;
+	char *ignore;
+
+	if (!config->enable_busy_alloc)
+		return;
+
+	ignore = kzalloc(sizeof(struct sysfs_test_device) * 10, GFP_KERNEL);
+	kfree(ignore);
+
+	schedule();
+}
+
+static ssize_t test_dev_x_store(struct device *dev,
+				struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	int ret;
+
+	test_dev_busy_alloc(test_dev);
+	test_dev_config_lock(test_dev);
+
+	ret = kstrtoint(buf, 10, &test_dev->x);
+	if (ret)
+		count = ret;
+
+	if (enable_verbose_writes)
+		dev_info(test_dev->dev, "wrote x = %d\n", test_dev->x);
+
+	test_dev_config_unlock(test_dev);
+
+	return count;
+}
+
+static ssize_t test_dev_x_show(struct device *dev,
+			       struct device_attribute *attr,
+			       char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	int ret;
+
+	test_dev_config_lock(test_dev);
+	ret = snprintf(buf, PAGE_SIZE, "%d\n", test_dev->x);
+	test_dev_config_unlock(test_dev);
+
+	return ret;
+}
+static DEVICE_ATTR_RW(test_dev_x);
+
+static ssize_t test_dev_y_store(struct device *dev,
+				struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+	int y = 0; /* stays 0 if kstrtoint() fails, avoids an uninitialized read */
+	int ret;
+
+	test_dev_busy_alloc(test_dev);
+	test_dev_config_lock(test_dev);
+
+	config = &test_dev->config;
+
+	ret = kstrtoint(buf, 10, &y);
+	if (ret)
+		count = ret;
+
+	msleep(config->write_delay_msec_y);
+	test_dev->y = test_dev->x + y + 7;
+
+	if (enable_verbose_writes)
+		dev_info(test_dev->dev, "wrote y = %d\n", test_dev->y);
+
+	test_dev_config_unlock(test_dev);
+
+	return count;
+}
+
+static ssize_t test_dev_y_show(struct device *dev,
+			       struct device_attribute *attr,
+			       char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	int ret;
+
+	test_dev_config_lock(test_dev);
+	ret = snprintf(buf, PAGE_SIZE, "%d\n", test_dev->y);
+	test_dev_config_unlock(test_dev);
+
+	return ret;
+}
+static DEVICE_ATTR_RW(test_dev_y);
+
+static ssize_t config_enable_lock_store(struct device *dev,
+					struct device_attribute *attr,
+					const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config = &test_dev->config;
+	int ret;
+	int val;
+
+	ret = kstrtoint(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	/*
+	 * We compromise for simplicity and do not lock when changing
+	 * locking configuration, with the assumption userspace tests
+	 * will know this.
+	 */
+	if (val)
+		config->enable_lock = true;
+	else
+		config->enable_lock = false;
+
+	return count;
+}
+
+static ssize_t config_enable_lock_show(struct device *dev,
+				       struct device_attribute *attr,
+				       char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config = &test_dev->config;
+	ssize_t ret;
+
+	test_dev_config_lock(test_dev);
+	ret = snprintf(buf, PAGE_SIZE, "%d\n", config->enable_lock);
+	test_dev_config_unlock(test_dev);
+
+	return ret;
+}
+static DEVICE_ATTR_RW(config_enable_lock);
+
+static ssize_t config_enable_lock_on_rmmod_store(struct device *dev,
+						 struct device_attribute *attr,
+						 const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config = &test_dev->config;
+	int ret;
+	int val;
+
+	ret = kstrtoint(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	test_dev_config_lock(test_dev);
+	if (val)
+		config->enable_lock_on_rmmod = true;
+	else
+		config->enable_lock_on_rmmod = false;
+	test_dev_config_unlock(test_dev);
+
+	return count;
+}
+
+static ssize_t config_enable_lock_on_rmmod_show(struct device *dev,
+						struct device_attribute *attr,
+						char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config = &test_dev->config;
+	ssize_t ret;
+
+	test_dev_config_lock(test_dev);
+	ret = snprintf(buf, PAGE_SIZE, "%d\n", config->enable_lock_on_rmmod);
+	test_dev_config_unlock(test_dev);
+
+	return ret;
+}
+static DEVICE_ATTR_RW(config_enable_lock_on_rmmod);
+
+static ssize_t config_use_rtnl_lock_store(struct device *dev,
+					  struct device_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config = &test_dev->config;
+	int ret;
+	int val;
+
+	ret = kstrtoint(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	/*
+	 * We compromise and simplify this condition and do not use a lock
+	 * here as the lock type can change.
+	 */
+	if (val)
+		config->use_rtnl_lock = true;
+	else
+		config->use_rtnl_lock = false;
+
+	return count;
+}
+
+static ssize_t config_use_rtnl_lock_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config = &test_dev->config;
+
+	return snprintf(buf, PAGE_SIZE, "%d\n", config->use_rtnl_lock);
+}
+static DEVICE_ATTR_RW(config_use_rtnl_lock);
+
+static ssize_t config_write_delay_msec_y_store(struct device *dev,
+					       struct device_attribute *attr,
+					       const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config = &test_dev->config;
+	int ret;
+	int val;
+
+	ret = kstrtoint(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	test_dev_config_lock(test_dev);
+	config->write_delay_msec_y = val;
+	test_dev_config_unlock(test_dev);
+
+	return count;
+}
+
+static ssize_t config_write_delay_msec_y_show(struct device *dev,
+					      struct device_attribute *attr,
+					      char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config = &test_dev->config;
+
+	return snprintf(buf, PAGE_SIZE, "%u\n", config->write_delay_msec_y);
+}
+static DEVICE_ATTR_RW(config_write_delay_msec_y);
+
+static ssize_t config_enable_busy_alloc_store(struct device *dev,
+					      struct device_attribute *attr,
+					      const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config = &test_dev->config;
+	int ret;
+	int val;
+
+	ret = kstrtoint(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	test_dev_config_lock(test_dev);
+	config->enable_busy_alloc = val;
+	test_dev_config_unlock(test_dev);
+
+	return count;
+}
+
+static ssize_t config_enable_busy_alloc_show(struct device *dev,
+					     struct device_attribute *attr,
+					     char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config = &test_dev->config;
+
+	return snprintf(buf, PAGE_SIZE, "%d\n", config->enable_busy_alloc);
+}
+static DEVICE_ATTR_RW(config_enable_busy_alloc);
+
+#define TEST_SYSFS_DEV_ATTR(name)		(&dev_attr_##name.attr)
+
+static struct attribute *test_dev_attrs[] = {
+	/* Generic driver knobs go here */
+	TEST_SYSFS_DEV_ATTR(config),
+	TEST_SYSFS_DEV_ATTR(reset),
+
+	/* These are used to test sysfs */
+	TEST_SYSFS_DEV_ATTR(test_dev_x),
+	TEST_SYSFS_DEV_ATTR(test_dev_y),
+
+	/*
+	 * These are configuration knobs to modify how we test sysfs when
+	 * doing reads / stores.
+	 */
+	TEST_SYSFS_DEV_ATTR(config_enable_lock),
+	TEST_SYSFS_DEV_ATTR(config_enable_lock_on_rmmod),
+	TEST_SYSFS_DEV_ATTR(config_use_rtnl_lock),
+	TEST_SYSFS_DEV_ATTR(config_write_delay_msec_y),
+	TEST_SYSFS_DEV_ATTR(config_enable_busy_alloc),
+
+	NULL,
+};
+
+ATTRIBUTE_GROUPS(test_dev);
+
+static int sysfs_test_dev_alloc_miscdev(struct sysfs_test_device *test_dev)
+{
+	struct miscdevice *misc_dev;
+
+	misc_dev = &test_dev->misc_dev;
+	misc_dev->minor = MISC_DYNAMIC_MINOR;
+	misc_dev->name = kasprintf(GFP_KERNEL, "test_sysfs%d", test_dev->dev_idx);
+	if (!misc_dev->name) {
+		pr_err("Cannot alloc misc_dev->name\n");
+		return -ENOMEM;
+	}
+	misc_dev->groups = test_dev_groups;
+
+	return 0;
+}
+
+static int testdev_open(struct block_device *bdev, fmode_t mode)
+{
+	return -EINVAL;
+}
+
+static blk_qc_t testdev_submit_bio(struct bio *bio)
+{
+	return BLK_QC_T_NONE;
+}
+
+static void testdev_slot_free_notify(struct block_device *bdev,
+				     unsigned long index)
+{
+}
+
+static int testdev_rw_page(struct block_device *bdev, sector_t sector,
+			   struct page *page, unsigned int op)
+{
+	return -EOPNOTSUPP;
+}
+
+static const struct block_device_operations sysfs_testdev_ops = {
+	.open = testdev_open,
+	.submit_bio = testdev_submit_bio,
+	.swap_slot_free_notify = testdev_slot_free_notify,
+	.rw_page = testdev_rw_page,
+	.owner = THIS_MODULE
+};
+
+static int sysfs_test_dev_alloc_blockdev(struct sysfs_test_device *test_dev)
+{
+	int ret = -ENOMEM;
+
+	test_dev->disk = blk_alloc_disk(NUMA_NO_NODE);
+	if (!test_dev->disk) {
+		pr_err("Error allocating disk structure for device %d\n",
+		       test_dev->dev_idx);
+		goto out;
+	}
+
+	test_dev->disk->major = sysfs_test_major;
+	test_dev->disk->first_minor = test_dev->dev_idx + 1;
+	test_dev->disk->fops = &sysfs_testdev_ops;
+	test_dev->disk->private_data = test_dev;
+	snprintf(test_dev->disk->disk_name, 16, "test_sysfs%d",
+		 test_dev->dev_idx);
+	set_capacity(test_dev->disk, 0);
+	blk_queue_flag_set(QUEUE_FLAG_NONROT, test_dev->disk->queue);
+	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, test_dev->disk->queue);
+	blk_queue_physical_block_size(test_dev->disk->queue, PAGE_SIZE);
+	blk_queue_max_discard_sectors(test_dev->disk->queue, UINT_MAX);
+	blk_queue_flag_set(QUEUE_FLAG_DISCARD, test_dev->disk->queue);
+
+	return 0;
+out:
+	return ret;
+}
+
+static struct sysfs_test_device *alloc_test_dev_sysfs(int idx)
+{
+	struct sysfs_test_device *test_dev;
+	int ret;
+
+	switch (test_devtype) {
+	case TESTDEV_TYPE_MISC:
+		fallthrough;
+	case TESTDEV_TYPE_BLOCK:
+		break;
+	default:
+		return NULL;
+	}
+
+	test_dev = kzalloc(sizeof(struct sysfs_test_device), GFP_KERNEL);
+	if (!test_dev)
+		goto err_out;
+
+	mutex_init(&test_dev->config_mutex);
+	test_dev->dev_idx = idx;
+	test_dev->devtype = test_devtype;
+
+	if (test_dev->devtype == TESTDEV_TYPE_MISC) {
+		ret = sysfs_test_dev_alloc_miscdev(test_dev);
+		if (ret)
+			goto err_out_free;
+	} else if (test_dev->devtype == TESTDEV_TYPE_BLOCK) {
+		ret = sysfs_test_dev_alloc_blockdev(test_dev);
+		if (ret)
+			goto err_out_free;
+	}
+	return test_dev;
+
+err_out_free:
+	kfree(test_dev);
+	test_dev = NULL;
+err_out:
+	return NULL;
+}
+
+static int register_test_dev_sysfs_misc(struct sysfs_test_device *test_dev)
+{
+	int ret;
+
+	ret = misc_register(&test_dev->misc_dev);
+	if (ret)
+		return ret;
+
+	test_dev->dev = test_dev->misc_dev.this_device;
+
+	return 0;
+}
+
+static int register_test_dev_sysfs_block(struct sysfs_test_device *test_dev)
+{
+	device_add_disk(NULL, test_dev->disk, test_dev_groups);
+	test_dev->dev = disk_to_dev(test_dev->disk);
+
+	return 0;
+}
+
+static struct sysfs_test_device *register_test_dev_sysfs(void)
+{
+	struct sysfs_test_device *test_dev = NULL;
+	int ret;
+
+	test_dev = alloc_test_dev_sysfs(0);
+	if (!test_dev)
+		goto out;
+
+	if (test_dev->devtype == TESTDEV_TYPE_MISC) {
+		ret = register_test_dev_sysfs_misc(test_dev);
+		if (ret) {
+			pr_err("could not register misc device: %d\n", ret);
+			goto out_free_dev;
+		}
+	} else if (test_dev->devtype == TESTDEV_TYPE_BLOCK) {
+		ret = register_test_dev_sysfs_block(test_dev);
+		if (ret) {
+			pr_err("could not register block device: %d\n", ret);
+			goto out_free_dev;
+		}
+	}
+
+	dev_info(test_dev->dev, "interface ready\n");
+
+out:
+	return test_dev;
+out_free_dev:
+	free_test_dev_sysfs(test_dev);
+	return NULL;
+}
+
+static struct sysfs_test_device *register_test_dev_set_config(void)
+{
+	struct sysfs_test_device *test_dev;
+	struct test_config *config;
+
+	test_dev = register_test_dev_sysfs();
+	if (!test_dev)
+		return NULL;
+
+	config = &test_dev->config;
+
+	if (enable_lock)
+		config->enable_lock = true;
+	if (enable_lock_on_rmmod)
+		config->enable_lock_on_rmmod = true;
+	if (use_rtnl_lock)
+		config->use_rtnl_lock = true;
+	if (enable_busy_alloc)
+		config->enable_busy_alloc = true;
+
+	config->write_delay_msec_y = write_delay_msec_y;
+	test_sysfs_reset_vals(test_dev);
+
+	return test_dev;
+}
+
+static void unregister_test_dev_sysfs_misc(struct sysfs_test_device *test_dev)
+{
+	misc_deregister(&test_dev->misc_dev);
+}
+
+static void unregister_test_dev_sysfs_block(struct sysfs_test_device *test_dev)
+{
+	del_gendisk(test_dev->disk);
+	blk_cleanup_disk(test_dev->disk);
+}
+
+static void unregister_test_dev_sysfs(struct sysfs_test_device *test_dev)
+{
+	test_dev_config_lock_rmmod(test_dev);
+
+	dev_info(test_dev->dev, "removing interface\n");
+
+	if (test_dev->devtype == TESTDEV_TYPE_MISC)
+		unregister_test_dev_sysfs_misc(test_dev);
+	else if (test_dev->devtype == TESTDEV_TYPE_BLOCK)
+		unregister_test_dev_sysfs_block(test_dev);
+
+	test_dev_config_unlock_rmmod(test_dev);
+
+	free_test_dev_sysfs(test_dev);
+}
+
+static struct dentry *debugfs_dir;
+
+/* When read represents how many times we have reset the first_test_dev */
+static u8 reset_first_test_dev;
+
+static ssize_t read_reset_first_test_dev(struct file *file,
+					 char __user *user_buf,
+					 size_t count, loff_t *ppos)
+{
+	ssize_t len;
+	char buf[32];
+
+	reset_first_test_dev++;
+	len = sprintf(buf, "%d\n", reset_first_test_dev);
+	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static ssize_t write_reset_first_test_dev(struct file *file,
+					  const char __user *user_buf,
+					  size_t count, loff_t *ppos)
+{
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	if (!first_test_dev) {
+		module_put(THIS_MODULE);
+		return -ENODEV;
+	}
+
+	dev_info(first_test_dev->dev, "going to reset first interface ...\n");
+
+	unregister_test_dev_sysfs(first_test_dev);
+	first_test_dev = register_test_dev_set_config();
+
+	dev_info(first_test_dev->dev, "first interface reset complete\n");
+
+	module_put(THIS_MODULE);
+
+	return count;
+}
+
+static const struct file_operations fops_reset_first_test_dev = {
+	.read = read_reset_first_test_dev,
+	.write = write_reset_first_test_dev,
+	.open = simple_open,
+	.owner = THIS_MODULE,
+	.llseek = default_llseek,
+};
+
+static int __init test_sysfs_init(void)
+{
+	first_test_dev = register_test_dev_set_config();
+	if (!first_test_dev)
+		return -ENOMEM;
+
+	if (!enable_debugfs)
+		return 0;
+
+	debugfs_dir = debugfs_create_dir("test_sysfs", NULL);
+	if (IS_ERR(debugfs_dir)) {
+		unregister_test_dev_sysfs(first_test_dev);
+		return -ENOMEM;
+	}
+
+	debugfs_create_file("reset_first_test_dev", 0600, debugfs_dir,
+			    NULL, &fops_reset_first_test_dev);
+	return 0;
+}
+module_init(test_sysfs_init);
+
+static void __exit test_sysfs_exit(void)
+{
+	if (enable_debugfs)
+		debugfs_remove(debugfs_dir);
+	if (delay_rmmod_ms)
+		msleep(delay_rmmod_ms);
+	unregister_test_dev_sysfs(first_test_dev);
+	if (enable_verbose_rmmod)
+		pr_info("unregister_test_dev_sysfs() completed\n");
+	first_test_dev = NULL;
+}
+module_exit(test_sysfs_exit);
+
+MODULE_AUTHOR("Luis Chamberlain <mcgrof@kernel.org>");
+MODULE_LICENSE("GPL");
diff --git a/tools/testing/selftests/sysfs/Makefile b/tools/testing/selftests/sysfs/Makefile
new file mode 100644
index 000000000000..fde99caa2338
--- /dev/null
+++ b/tools/testing/selftests/sysfs/Makefile
@@ -0,0 +1,12 @@
+# SPDX-License-Identifier: GPL-2.0-only
+# Makefile for sysfs selftests.
+
+# No binaries, but make sure arg-less "make" doesn't trigger "run_tests".
+all:
+
+TEST_PROGS := sysfs.sh
+
+include ../lib.mk
+
+# Nothing to clean up.
+clean:
diff --git a/tools/testing/selftests/sysfs/config b/tools/testing/selftests/sysfs/config
new file mode 100644
index 000000000000..9196f452ecd5
--- /dev/null
+++ b/tools/testing/selftests/sysfs/config
@@ -0,0 +1,2 @@
+CONFIG_SYSFS=y
+CONFIG_TEST_SYSFS=m
diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh
new file mode 100755
index 000000000000..b3f4c2236c7f
--- /dev/null
+++ b/tools/testing/selftests/sysfs/sysfs.sh
@@ -0,0 +1,1208 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
+# Copyright (C) 2021 Luis Chamberlain <mcgrof@kernel.org>
+#
+# This program is free software; you can redistribute it and/or modify it
+# under the terms of the GNU General Public License as published by the Free
+# Software Foundation; either version 2 of the License, or at your option any
+# later version; or, when distributed separately from the Linux kernel or
+# when incorporated into other software packages, subject to the following
+# license:
+#
+# This program is free software; you can redistribute it and/or modify it
+# under the terms of copyleft-next (version 0.3.1 or later) as published
+# at http://copyleft-next.org/.
+
+# This performs a series of tests against the sysfs filesystem.
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+TEST_NAME="sysfs"
+TEST_DRIVER="test_${TEST_NAME}"
+TEST_DIR=$(dirname $0)
+TEST_FILE=$(mktemp)
+
+# This represents
+#
+# TEST_ID:TEST_COUNT:ENABLED:TARGET:DEVTYPE
+#
+# TEST_ID: is the test id number
+# TEST_COUNT: number of times we should run the test
+# ENABLED: 1 if enabled, 0 otherwise
+# TARGET: test target file required on the test_sysfs module
+# DEVTYPE: device type the test runs against, misc or block
+# Once these are enabled please leave them as-is. Write your own test,
+# we have tons of space.
+ALL_TESTS="0001:3:1:test_dev_x:misc"
+ALL_TESTS="$ALL_TESTS 0002:3:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0003:3:1:test_dev_x:misc"
+ALL_TESTS="$ALL_TESTS 0004:3:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0005:1:1:test_dev_x:misc"
+ALL_TESTS="$ALL_TESTS 0006:1:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0007:1:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0008:1:1:test_dev_x:misc"
+ALL_TESTS="$ALL_TESTS 0009:1:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0010:1:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0011:1:1:test_dev_x:misc"
+ALL_TESTS="$ALL_TESTS 0012:1:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0013:1:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0014:3:1:test_dev_x:block" # block equivalent set
+ALL_TESTS="$ALL_TESTS 0015:3:1:test_dev_x:block"
+ALL_TESTS="$ALL_TESTS 0016:3:1:test_dev_x:block"
+ALL_TESTS="$ALL_TESTS 0017:3:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0018:1:1:test_dev_x:block"
+ALL_TESTS="$ALL_TESTS 0019:1:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0020:1:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0021:1:1:test_dev_x:block"
+ALL_TESTS="$ALL_TESTS 0022:1:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0023:1:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0024:1:1:test_dev_x:block"
+ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test
+ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rtnl_lock
+
+allow_user_defaults()
+{
+	if [ -z $DIR ]; then
+		case $TEST_DEV_TYPE in
+		misc)
+			DIR="/sys/devices/virtual/misc/${TEST_DRIVER}0"
+			;;
+		block)
+			DIR="/sys/devices/virtual/block/${TEST_DRIVER}0"
+			;;
+		*)
+			DIR="/sys/devices/virtual/misc/${TEST_DRIVER}0"
+			;;
+		esac
+	fi
+	case $TEST_DEV_TYPE in
+		misc)
+			MODPROBE_TESTDEV_TYPE=""
+			;;
+		block)
+			MODPROBE_TESTDEV_TYPE="devtype=1"
+			;;
+		*)
+			MODPROBE_TESTDEV_TYPE=""
+			;;
+	esac
+	if [ -z $SYSFS_DEBUGFS_DIR ]; then
+		SYSFS_DEBUGFS_DIR="/sys/kernel/debug/test_sysfs"
+	fi
+	if [ -z $PAGE_SIZE ]; then
+		PAGE_SIZE=$(getconf PAGESIZE)
+	fi
+	if [ -z $MAX_DIGITS ]; then
+		MAX_DIGITS=$(($PAGE_SIZE/8))
+	fi
+	if [ -z $INT_MAX ]; then
+		INT_MAX=$(getconf INT_MAX)
+	fi
+	if [ -z $UINT_MAX ]; then
+		UINT_MAX=$(getconf UINT_MAX)
+	fi
+}
+
+test_reqs()
+{
+	uid=$(id -u)
+	if [ $uid -ne 0 ]; then
+		echo "$0: must be run as root" >&2
+		exit $ksft_skip
+	fi
+
+	if ! which modprobe 2> /dev/null > /dev/null; then
+		echo "$0: You need modprobe installed" >&2
+		exit $ksft_skip
+	fi
+	if ! which getconf 2> /dev/null > /dev/null; then
+		echo "$0: You need getconf installed"
+		exit $ksft_skip
+	fi
+	if ! which diff 2> /dev/null > /dev/null; then
+		echo "$0: You need diff installed"
+		exit $ksft_skip
+	fi
+	if ! which perl 2> /dev/null > /dev/null; then
+		echo "$0: You need perl installed"
+		exit $ksft_skip
+	fi
+}
+
+call_modprobe()
+{
+	modprobe $TEST_DRIVER $MODPROBE_TESTDEV_TYPE $FIRST_MODPROBE_ARGS $MODPROBE_ARGS
+	return $?
+}
+
+modprobe_reset()
+{
+	modprobe -q -r $TEST_DRIVER
+	call_modprobe
+	return $?
+}
+
+modprobe_reset_enable_debugfs()
+{
+	FIRST_MODPROBE_ARGS="enable_debugfs=1"
+	modprobe_reset
+	unset FIRST_MODPROBE_ARGS
+}
+
+modprobe_reset_enable_lock_on_rmmod()
+{
+	FIRST_MODPROBE_ARGS="enable_lock=1 enable_lock_on_rmmod=1 enable_verbose_writes=1"
+	modprobe_reset
+	unset FIRST_MODPROBE_ARGS
+}
+
+modprobe_reset_enable_rtnl_lock_on_rmmod()
+{
+	FIRST_MODPROBE_ARGS="enable_lock=1 use_rtnl_lock=1 enable_lock_on_rmmod=1"
+	FIRST_MODPROBE_ARGS="$FIRST_MODPROBE_ARGS enable_verbose_writes=1"
+	modprobe_reset
+	unset FIRST_MODPROBE_ARGS
+}
+
+load_req_mod()
+{
+	modprobe_reset
+	if [ ! -d $DIR ]; then
+		if ! modprobe -q -n $TEST_DRIVER; then
+			echo "$0: module $TEST_DRIVER not found [SKIP]"
+			echo "You must set CONFIG_TEST_SYSFS=m in your kernel" >&2
+			exit $ksft_skip
+		fi
+		call_modprobe
+		if [ $? -ne 0 ]; then
+			echo "$0: modprobe $TEST_DRIVER failed."
+			exit 1
+		fi
+	fi
+}
+
+config_reset()
+{
+	if ! echo -n "1" >"$DIR"/reset; then
+		echo "$0: reset should have worked" >&2
+		exit 1
+	fi
+}
+
+debugfs_reset_first_test_dev_ignore_errors()
+{
+	echo -n "1" >"$SYSFS_DEBUGFS_DIR"/reset_first_test_dev
+}
+
+set_orig()
+{
+	if [[ ! -z $TARGET ]] && [[ ! -z $ORIG ]]; then
+		if [ -f ${TARGET} ]; then
+			echo "${ORIG}" > "${TARGET}"
+		fi
+	fi
+}
+
+set_test()
+{
+	echo "${TEST_STR}" > "${TARGET}"
+}
+
+set_test_ignore_errors()
+{
+	echo "${TEST_STR}" > "${TARGET}" 2> /dev/null
+}
+
+verify()
+{
+	local seen
+	seen=$(cat "$1")
+	target_short=$(basename $TARGET)
+	case $target_short in
+	test_dev_x)
+		if [ "${seen}" != "${TEST_STR}" ]; then
+			return 1
+		fi
+		;;
+	test_dev_y)
+		DIRNAME=$(dirname $1)
+		EXPECTED_RESULT=""
+		# If our target was the test file then what we write to it
+		# is the same as what we expect when we read from it.
+		# When we write to test_dev_y directly though we expect
+		# a computed value which is driver specific.
+		if [[ "$DIRNAME" == "/tmp" ]]; then
+			let EXPECTED_RESULT="${TEST_STR}"
+		else
+			x=$(cat ${DIR}/test_dev_x)
+			let EXPECTED_RESULT="$x+${TEST_STR}+7"
+		fi
+
+		if [[ "${seen}" != "${EXPECTED_RESULT}" ]]; then
+			return 1
+		fi
+		;;
+	*)
+		echo "Unsupported target type update test script: $target_short"
+		exit 1
+	esac
+	return 0
+}
+
+verify_diff_w()
+{
+	echo "$TEST_STR" | diff -q -w -u - $1 > /dev/null
+	return $?
+}
+
+test_rc()
+{
+	if [[ $rc != 0 ]]; then
+		echo "Failed test, return value: $rc" >&2
+		exit $rc
+	fi
+}
+
+test_finish()
+{
+	set_orig
+	rm -f "${TEST_FILE}"
+
+	if [ ! -z ${old_strict} ]; then
+		echo ${old_strict} > ${WRITES_STRICT}
+	fi
+	exit $rc
+}
+
+# kernfs requires us to write everything we want in one shot because
+# there is no easy way for it to know if userspace is only doing a partial
+# write, so partial writes are not supported. The entire buffer is expected
+# to come on the first write. If you're writing a value, first read the file,
+# modify only the value you're changing, then write the entire buffer back.
+# Since we are only testing digits we just test full single writes.
+# For more details, refer to kernfs_fop_write_iter().
+run_numerictests_single_write()
+{
+	echo "== Testing sysfs behavior against ${TARGET} =="
+
+	rc=0
+
+	echo -n "Writing test file ... "
+	echo "${TEST_STR}" > "${TEST_FILE}"
+	if ! verify "${TEST_FILE}"; then
+		echo "FAIL" >&2
+		exit 1
+	else
+		echo "ok"
+	fi
+
+	echo -n "Checking the sysfs file is not set to test value ... "
+	if verify "${TARGET}"; then
+		echo "FAIL" >&2
+		exit 1
+	else
+		echo "ok"
+	fi
+
+	echo -n "Writing to sysfs file from shell ... "
+	set_test
+	if ! verify "${TARGET}"; then
+		echo "FAIL" >&2
+		exit 1
+	else
+		echo "ok"
+	fi
+
+	echo -n "Resetting sysfs file to original value ... "
+	set_orig
+	if verify "${TARGET}"; then
+		echo "FAIL" >&2
+		exit 1
+	else
+		echo "ok"
+	fi
+
+	# Now that we've validated the sanity of "set_test" and "set_orig",
+	# we can use those functions to set starting states before running
+	# specific behavioral tests.
+
+	echo -n "Writing to the entire sysfs file in a single write ... "
+	set_orig
+	dd if="${TEST_FILE}" of="${TARGET}" bs=4096 2>/dev/null
+	if ! verify "${TARGET}"; then
+		echo "FAIL" >&2
+		rc=1
+	else
+		echo "ok"
+	fi
+
+	echo -n "Writing to the sysfs file with multiple long writes ... "
+	set_orig
+	(perl -e 'print "A" x 50;'; echo "${TEST_STR}") | \
+		dd of="${TARGET}" bs=50 2>/dev/null
+	if verify "${TARGET}"; then
+		echo "FAIL" >&2
+		rc=1
+	else
+		echo "ok"
+	fi
+	test_rc
+}
+
+reset_vals()
+{
+	echo -n 3 > $DIR/test_dev_x
+	echo -n 4 > $DIR/test_dev_y
+}
+
+check_failure()
+{
+	echo -n "Testing that $1 fails as expected..."
+	reset_vals
+	TEST_STR="$1"
+	orig="$(cat $TARGET)"
+	echo -n "$TEST_STR" > $TARGET 2> /dev/null
+
+	# write should fail and $TARGET should retain its original value
+	if [ $? = 0 ] || [ "$(cat $TARGET)" != "$orig" ]; then
+		echo "FAIL" >&2
+		rc=1
+	else
+		echo "ok"
+	fi
+	test_rc
+}
+
+load_modreqs()
+{
+	export TEST_DEV_TYPE=$(get_test_type $1)
+	unset DIR
+	allow_user_defaults
+	load_req_mod
+}
+
+target_exists()
+{
+	TARGET="${DIR}/$1"
+	TEST_ID="$2"
+
+	if [ ! -f ${TARGET} ] ; then
+		echo "Target for test $TEST_ID: $TARGET does not exist, skipping test ..."
+		return 0
+	fi
+	return 1
+}
+
+config_enable_lock()
+{
+	if ! echo -n 1 > $DIR/config_enable_lock; then
+		echo "$0: Unable to enable locks" >&2
+		exit 1
+	fi
+}
+
+config_write_delay_msec_y()
+{
+	if ! echo -n $1 > $DIR/config_write_delay_msec_y ; then
+		echo "$0: Unable to set write_delay_msec_y to $1" >&2
+		exit 1
+	fi
+}
+
+# Default filter for dmesg scanning.
+# Ignore lockdep complaining about its own bugginess when scanning dmesg
+# output, because we shouldn't be failing filesystem tests on account of
+# lockdep.
+_check_dmesg_filter()
+{
+	grep -E -v -e "BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low" \
+		-e "BUG: MAX_STACK_TRACE_ENTRIES too low"
+}
+
+check_dmesg()
+{
+	# filter out intentional WARNINGs or Oopses
+	local filter=${1:-_check_dmesg_filter}
+
+	_dmesg_since_test_start | $filter >$seqres.dmesg
+	grep -E -q -e "kernel BUG at" \
+	     -e "WARNING:" \
+	     -e "\bBUG:" \
+	     -e "Oops:" \
+	     -e "possible recursive locking detected" \
+	     -e "Internal error" \
+	     -e "(INFO|ERR): suspicious RCU usage" \
+	     -e "INFO: possible circular locking dependency detected" \
+	     -e "general protection fault:" \
+	     -e "BUG .* remaining" \
+	     -e "UBSAN:" \
+	     $seqres.dmesg
+	if [ $? -eq 0 ]; then
+		echo "something found in dmesg (see $seqres.dmesg)"
+		return 1
+	else
+		if [ "$KEEP_DMESG" != "yes" ]; then
+			rm -f $seqres.dmesg
+		fi
+		return 0
+	fi
+}
+
+log_kernel_fstest_dmesg()
+{
+	export FSTYP="$1"
+	export seqnum="$FSTYP/$2"
+	export date_time=$(date +"%F %T")
+	echo "run fstests $seqnum at $date_time" > /dev/kmsg
+}
+
+modprobe_loop()
+{
+	while true; do
+		call_modprobe > /dev/null 2>&1
+		modprobe -r $TEST_DRIVER > /dev/null 2>&1
+	done > /dev/null 2>&1
+}
+
+write_loop()
+{
+	while true; do
+		set_test_ignore_errors > /dev/null 2>&1
+		TEST_STR=$(( $TEST_STR + 1 ))
+	done > /dev/null 2>&1
+}
+
+write_loop_reset()
+{
+	while true; do
+		set_test_ignore_errors > /dev/null 2>&1
+		debugfs_reset_first_test_dev_ignore_errors > /dev/null 2>&1
+	done > /dev/null 2>&1
+}
+
+write_loop_bg()
+{
+	BG_WRITES=1000 > /dev/null 2>&1
+	while true; do
+		for i in $(seq 1 $BG_WRITES); do
+			set_test_ignore_errors > /dev/null 2>&1 &
+			TEST_STR=$(( $TEST_STR + 1 ))
+		done > /dev/null 2>&1
+		wait
+	done > /dev/null 2>&1
+	wait
+}
+
+reset_loop()
+{
+	while true; do
+		debugfs_reset_first_test_dev_ignore_errors > /dev/null 2>&1
+	done > /dev/null 2>&1
+}
+
+kill_trigger_loop()
+{
+
+	local my_first_loop_pid=$1
+	local my_second_loop_pid=$2
+	local my_sleep_max=$3
+	local my_loop=0
+
+	while true; do
+		sleep 1
+		if [[ $my_loop -ge $my_sleep_max ]]; then
+			break
+		fi
+		let my_loop=$my_loop+1
+	done
+
+	kill -s TERM $my_first_loop_pid > /dev/null 2>&1
+	kill -s TERM $my_second_loop_pid > /dev/null 2>&1
+}
+
+_dmesg_since_test_start()
+{
+	# search the dmesg log of last run of $seqnum for possible failures
+	# use sed \cregexpc address type, since $seqnum contains "/"
+	dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac
+}
+
+sysfs_test_0001()
+{
+	TARGET="${DIR}/$(get_test_target 0001)"
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	run_numerictests_single_write
+}
+
+sysfs_test_0002()
+{
+	TARGET="${DIR}/$(get_test_target 0002)"
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	run_numerictests_single_write
+}
+
+sysfs_test_0003()
+{
+	TARGET="${DIR}/$(get_test_target 0003)"
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	config_enable_lock
+
+	run_numerictests_single_write
+}
+
+sysfs_test_0004()
+{
+	TARGET="${DIR}/$(get_test_target 0004)"
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	config_enable_lock
+
+	run_numerictests_single_write
+}
+
+sysfs_test_0005()
+{
+	TARGET="${DIR}/$(get_test_target 0005)"
+	modprobe_reset
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing x while loading/unloading the module... "
+
+	modprobe_loop &
+	modprobe_pid=$!
+
+	write_loop &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0006()
+{
+	TARGET="${DIR}/$(get_test_target 0006)"
+	modprobe_reset
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing y while loading/unloading the module... "
+	modprobe_loop &
+	modprobe_pid=$!
+
+	write_loop &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0007()
+{
+	TARGET="${DIR}/$(get_test_target 0007)"
+	modprobe_reset
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing y with a larger delay while loading/unloading the module... "
+
+	MODPROBE_ARGS="write_delay_msec_y=1500"
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+	unset MODPROBE_ARGS
+
+	write_loop &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0008()
+{
+	TARGET="${DIR}/$(get_test_target 0008)"
+	modprobe_reset
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop busy writing x while loading/unloading the module... "
+
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+
+	write_loop_bg > /dev/null 2>&1 &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0009()
+{
+	TARGET="${DIR}/$(get_test_target 0009)"
+	modprobe_reset
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop busy writing y while loading/unloading the module... "
+
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+
+	write_loop_bg > /dev/null 2>&1 &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0010()
+{
+	TARGET="${DIR}/$(get_test_target 0010)"
+	modprobe_reset
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop busy writing y with a larger delay while loading/unloading the module... "
+	modprobe -q -r $TEST_DRIVER > /dev/null 2>&1
+
+	MODPROBE_ARGS="write_delay_msec_y=1500"
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+	unset MODPROBE_ARGS
+
+	write_loop_bg > /dev/null 2>&1 &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0011()
+{
+	TARGET="${DIR}/$(get_test_target 0011)"
+	modprobe_reset_enable_debugfs
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing x and resetting ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	reset_loop > /dev/null 2>&1 &
+	reset_pid=$!
+
+	kill_trigger_loop $write_pid $reset_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0012()
+{
+	TARGET="${DIR}/$(get_test_target 0012)"
+	modprobe_reset_enable_debugfs
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing y and resetting ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	reset_loop > /dev/null 2>&1 &
+	reset_pid=$!
+
+	kill_trigger_loop $write_pid $reset_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0013()
+{
+	TARGET="${DIR}/$(get_test_target 0013)"
+	modprobe_reset_enable_debugfs
+	config_reset
+	reset_vals
+	config_write_delay_msec_y 1500
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing y with a larger delay and resetting ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	reset_loop > /dev/null 2>&1 &
+	reset_pid=$!
+
+	kill_trigger_loop $write_pid $reset_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0014()
+{
+	sysfs_test_0001
+}
+
+sysfs_test_0015()
+{
+	sysfs_test_0002
+}
+
+sysfs_test_0016()
+{
+	sysfs_test_0003
+}
+
+sysfs_test_0017()
+{
+	sysfs_test_0004
+}
+
+sysfs_test_0018()
+{
+	sysfs_test_0005
+}
+
+sysfs_test_0019()
+{
+	sysfs_test_0006
+}
+
+sysfs_test_0020()
+{
+	sysfs_test_0007
+}
+
+sysfs_test_0021()
+{
+	sysfs_test_0008
+}
+
+sysfs_test_0022()
+{
+	sysfs_test_0009
+}
+
+sysfs_test_0023()
+{
+	sysfs_test_0010
+}
+
+sysfs_test_0024()
+{
+	sysfs_test_0011
+}
+
+sysfs_test_0025()
+{
+	sysfs_test_0012
+}
+
+sysfs_test_0026()
+{
+	sysfs_test_0013
+}
+
+sysfs_test_0027()
+{
+	TARGET="${DIR}/$(get_test_target 0027)"
+	modprobe_reset_enable_lock_on_rmmod
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Test for possible rmmod deadlock while writing x ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	MODPROBE_ARGS="enable_lock=1 enable_lock_on_rmmod=1 enable_verbose_writes=1"
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+	unset MODPROBE_ARGS
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0028()
+{
+	TARGET="${DIR}/$(get_test_target 0028)"
+	modprobe_reset_enable_rtnl_lock_on_rmmod
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Test for possible rmmod deadlock using rtnl_lock while writing x ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	MODPROBE_ARGS="enable_lock=1 enable_lock_on_rmmod=1 use_rtnl_lock=1 enable_verbose_writes=1"
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+	unset MODPROBE_ARGS
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+test_gen_desc()
+{
+	echo -n "$1 x $(get_test_count $1)"
+}
+
+list_tests()
+{
+	echo "Test ID list:"
+	echo
+	echo "TEST_ID x NUM_TESTS"
+	echo "TEST_ID:   Test ID"
+	echo "NUM_TESTS: Number of recommended times to run the test"
+	echo
+	echo "$(test_gen_desc 0001) - misc test writing x in different ways"
+	echo "$(test_gen_desc 0002) - misc test writing y in different ways"
+	echo "$(test_gen_desc 0003) - misc test writing x in different ways using a mutex lock"
+	echo "$(test_gen_desc 0004) - misc test writing y in different ways using a mutex lock"
+	echo "$(test_gen_desc 0005) - misc test writing x load and remove the test_sysfs module"
+	echo "$(test_gen_desc 0006) - misc test writing y load and remove the test_sysfs module"
+	echo "$(test_gen_desc 0007) - misc test writing y larger delay, load, remove test_sysfs"
+	echo "$(test_gen_desc 0008) - misc test busy writing x remove test_sysfs module"
+	echo "$(test_gen_desc 0009) - misc test busy writing y remove the test_sysfs module"
+	echo "$(test_gen_desc 0010) - misc test busy writing y larger delay, remove test_sysfs"
+	echo "$(test_gen_desc 0011) - misc test writing x and resetting device"
+	echo "$(test_gen_desc 0012) - misc test writing y and resetting device"
+	echo "$(test_gen_desc 0013) - misc test writing y with a larger delay and resetting device"
+	echo "$(test_gen_desc 0014) - block test writing x in different ways"
+	echo "$(test_gen_desc 0015) - block test writing y in different ways"
+	echo "$(test_gen_desc 0016) - block test writing x in different ways using a mutex lock"
+	echo "$(test_gen_desc 0017) - block test writing y in different ways using a mutex lock"
+	echo "$(test_gen_desc 0018) - block test writing x load and remove the test_sysfs module"
+	echo "$(test_gen_desc 0019) - block test writing y load and remove the test_sysfs module"
+	echo "$(test_gen_desc 0020) - block test writing y larger delay, load, remove test_sysfs"
+	echo "$(test_gen_desc 0021) - block test busy writing x remove the test_sysfs module"
+	echo "$(test_gen_desc 0022) - block test busy writing y remove the test_sysfs module"
+	echo "$(test_gen_desc 0023) - block test busy writing y larger delay, remove test_sysfs"
+	echo "$(test_gen_desc 0024) - block test writing x and resetting device"
+	echo "$(test_gen_desc 0025) - block test writing y and resetting device"
+	echo "$(test_gen_desc 0026) - block test writing y larger delay and resetting device"
+	echo "$(test_gen_desc 0027) - test rmmod deadlock while writing x ... "
+	echo "$(test_gen_desc 0028) - test rmmod deadlock using rtnl_lock while writing x ..."
+}
+
+usage()
+{
+	NUM_TESTS=$(grep -o ' ' <<<"$ALL_TESTS" | grep -c .)
+	let NUM_TESTS=$NUM_TESTS+1
+	MAX_TEST=$(printf "%04d\n" $NUM_TESTS)
+	echo "Usage: $0 [ -t <4-number-digit> ] | [ -w <4-number-digit> ] |"
+	echo "		 [ -s <4-number-digit> ] | [ -c <4-number-digit> <test-count> ] |"
+	echo "           [ all ] [ -h | --help ] [ -l ]"
+	echo ""
+	echo "Valid tests: 0001-$MAX_TEST"
+	echo ""
+	echo "    all     Runs all tests (default)"
+	echo "    -t      Run test ID the recommended number of times"
+	echo "    -w      Watch test ID run until it runs into an error"
+	echo "    -s      Run test ID once"
+	echo "    -c      Run test ID x test-count number of times"
+	echo "    -l      List all test IDs"
+	echo " -h|--help  Help"
+	echo
+	echo "If an error ever occurs execution will immediately terminate."
+	echo "If you are adding a new test try using -w <test-ID> first to"
+	echo "make sure the test passes a series of tests."
+	echo
+	echo Example uses:
+	echo
+	echo "$TEST_NAME.sh            -- executes all tests"
+	echo "$TEST_NAME.sh -t 0002    -- Executes test ID 0002 the recommended number of times"
+	echo "$TEST_NAME.sh -w 0002    -- Watch test ID 0002 run until an error occurs"
+	echo "$TEST_NAME.sh -s 0002    -- Run test ID 0002 once"
+	echo "$TEST_NAME.sh -c 0002 3  -- Run test ID 0002 three times"
+	echo
+	list_tests
+	exit 1
+}
+
+test_num()
+{
+	re='^[0-9]+$'
+	if ! [[ $1 =~ $re ]]; then
+		usage
+	fi
+}
+
+get_test_count()
+{
+	test_num $1
+	TEST_NUM=$(echo $1 | sed 's/^0*//')
+	TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}')
+	echo ${TEST_DATA} | awk -F":" '{print $2}'
+}
+
+get_test_enabled()
+{
+	test_num $1
+	TEST_NUM=$(echo $1 | sed 's/^0*//')
+	TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}')
+	echo ${TEST_DATA} | awk -F":" '{print $3}'
+}
+
+get_test_target()
+{
+	test_num $1
+	TEST_NUM=$(echo $1 | sed 's/^0*//')
+	TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}')
+	echo ${TEST_DATA} | awk -F":" '{print $4}'
+}
+
+get_test_type()
+{
+	test_num $1
+	TEST_NUM=$(echo $1 | sed 's/^0*//')
+	TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}')
+	echo ${TEST_DATA} | awk -F":" '{print $5}'
+}
+
+run_all_tests()
+{
+	for i in $ALL_TESTS ; do
+		TEST_ID=$(echo $i | awk -F":" '{print $1}')
+		ENABLED=$(get_test_enabled $TEST_ID)
+		TEST_COUNT=$(get_test_count $TEST_ID)
+		TEST_TARGET=$(get_test_target $TEST_ID)
+		if [[ $ENABLED -eq "1" ]]; then
+			test_case $TEST_ID $TEST_COUNT $TEST_TARGET
+		else
+			echo -n "Skipping test $TEST_ID as it's disabled, likely "
+			echo "could crash your system ..."
+		fi
+	done
+}
+
+watch_log()
+{
+	if [ $# -ne 3 ]; then
+		clear
+	fi
+	echo "Running test: $2 - run #$1"
+}
+
+watch_case()
+{
+	i=0
+	while [ 1 ]; do
+		if [ $# -eq 1 ]; then
+			test_num $1
+			watch_log $i ${TEST_NAME}_test_$1
+			log_kernel_fstest_dmesg sysfs $1
+			RUN_TEST=${TEST_NAME}_test_$1
+			$RUN_TEST
+			check_dmesg
+			if [[ $? -ne 0 ]]; then
+				exit 1
+			fi
+		else
+			watch_log $i all
+			run_all_tests
+		fi
+		let i=$i+1
+	done
+}
+
+test_case()
+{
+	NUM_TESTS=$2
+
+	i=0
+
+	load_modreqs $1
+	if target_exists $3 $1; then
+		return
+	fi
+
+	while [[ $i -lt $NUM_TESTS ]]; do
+		test_num $1
+		watch_log $i ${TEST_NAME}_test_$1 noclear
+		log_kernel_fstest_dmesg sysfs $1
+		RUN_TEST=${TEST_NAME}_test_$1
+		$RUN_TEST
+		let i=$i+1
+	done
+	check_dmesg
+	if [[ $? -ne 0 ]]; then
+		exit 1
+	fi
+}
+
+parse_args()
+{
+	if [ $# -eq 0 ]; then
+		run_all_tests
+	else
+		if [[ "$1" = "all" ]]; then
+			run_all_tests
+		elif [[ "$1" = "-w" ]]; then
+			shift
+			watch_case $@
+		elif [[ "$1" = "-t" ]]; then
+			shift
+			test_num $1
+			test_case $1 $(get_test_count $1) $(get_test_target $1)
+			shift
+		elif [[ "$1" = "-c" ]]; then
+			shift
+			test_num $1
+			test_num $2
+			test_case $1 $2 $(get_test_target $1)
+			shift
+			shift
+		elif [[ "$1" = "-s" ]]; then
+			shift
+			test_case $1 1 $(get_test_target $1)
+			shift
+		elif [[ "$1" = "-l" ]]; then
+			list_tests
+			shift
+		elif [[ "$1" = "-h" || "$1" = "--help" ]]; then
+			usage
+		else
+			usage
+		fi
+	fi
+}
+
+test_reqs
+allow_user_defaults
+
+trap "test_finish" EXIT
+
+parse_args $@
+
+exit 0
-- 
2.30.2



* [PATCH v8 04/12] kernfs: add initial failure injection support
  2021-09-27 16:37 [PATCH v8 00/12] syfs: generic deadlock fix with module removal Luis Chamberlain
                   ` (2 preceding siblings ...)
  2021-09-27 16:37 ` [PATCH v8 03/12] selftests: add tests_sysfs module Luis Chamberlain
@ 2021-09-27 16:37 ` Luis Chamberlain
  2021-10-05 19:47   ` Kees Cook
  2021-09-27 16:37 ` [PATCH v8 05/12] test_sysfs: add support to use kernfs failure injection Luis Chamberlain
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-09-27 16:37 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: bvanassche, dan.j.williams, joe, tglx, mcgrof, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel

This adds initial failure injection support to kernfs. We start
off with debug knobs which, when enabled, allow test drivers such as
test_sysfs to force certain difficult races to take place with a
high degree of certainty.

This only adds runtime code *iff* the new bool CONFIG_FAIL_KERNFS_KNOBS is
enabled in your kernel. If you don't have this enabled this provides
no new functionality. When CONFIG_FAIL_KERNFS_KNOBS is disabled the new
routine kernfs_debug_should_wait() ends up being transformed to if
(false), and so the compiler should optimize it out as dead code,
producing no effective binary changes.

We start off with enabling failure injections in kernfs by allowing us to
alter the way kernfs_fop_write_iter() behaves. We allow for the routine
kernfs_fop_write_iter() to wait for a certain condition in the kernel to
occur, after which it will sleep a predefined amount of time. This lets
kernfs users time exactly when they want kernfs_fop_write_iter() to
complete, which allows for constructing race conditions and testing for
correctness in kernfs.

You'd boot with this enabled on your kernel command line:

fail_kernfs_fop_write_iter=1,100,0,1

The values are <interval>,<probability>,<space>,<times>. We don't care
for space, so for now we ignore it. The above ensures a failure will
trigger only once.
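
To make the field order concrete, here is a small illustrative sketch
that splits such a string into named fields. The helper name
describe_fault_attr is made up for this example and is not part of the
patch; the third field is the one the fault injection documentation
calls space.

```shell
#!/bin/sh
# Hypothetical helper, not part of this patch: split a fault attribute
# string of the form <interval>,<probability>,<space>,<times> into
# named fields so the boot parameter is easier to read.
describe_fault_attr()
{
	old_ifs=$IFS
	IFS=,
	# Unquoted on purpose: split the argument on commas.
	set -- $1
	IFS=$old_ifs
	echo "interval=$1 probability=$2 space=$3 times=$4"
}

# prints: interval=1 probability=100 space=0 times=1
describe_fault_attr "1,100,0,1"
```

With times=1 the injection is allowed to trigger at most once, which
is what the example boot parameter above relies on.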

*How* we allow for this routine to change behaviour is left to knobs we
expose under debugfs:

 # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
wait_after_active
wait_after_mutex
wait_at_start
wait_before_mutex

A debugfs entry also exists to allow us to sleep a configurable amount
of time after the completion:

/sys/kernel/debug/kernfs/sleep_after_wait_ms

These two sets of knobs allow us to construct races and demonstrate
how the kernfs active reference should suffice to protect against
races.
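
As a rough sketch of how a test might drive these knobs from
userspace, assuming debugfs is mounted at /sys/kernel/debug and the
kernel has CONFIG_FAIL_KERNFS_KNOBS enabled. The function name
kernfs_wait_config and the KERNFS_DEBUGFS override are assumptions
made for this example only; they are not provided by the patch.

```shell
#!/bin/sh
# Hypothetical sketch, not part of this patch. KERNFS_DEBUGFS can be
# overridden, which is handy for dry runs against a scratch directory.
KERNFS_DEBUGFS="${KERNFS_DEBUGFS:-/sys/kernel/debug/kernfs}"

# $1: wait point, one of at_start, before_mutex, after_mutex, after_active
# $2: milliseconds to sleep once the wait has completed
kernfs_wait_config()
{
	knob="$KERNFS_DEBUGFS/config_fail_kernfs_fop_write_iter/wait_$1"
	if [ ! -e "$knob" ]; then
		echo "no such knob: $knob" >&2
		return 1
	fi
	# Enable the selected wait point, then set the sleep time.
	echo 1 > "$knob" || return 1
	echo "$2" > "$KERNFS_DEBUGFS/sleep_after_wait_ms"
}
```

Running "kernfs_wait_config after_active 100" as root would then ask
kernfs_fop_write_iter() to wait after grabbing the active reference
and sleep 100 ms once the wait completes.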

Enabling CONFIG_FAULT_INJECTION_DEBUG_FS enables us to configure the
different fault injection parameters for the new fail_kernfs_fop_write_iter
fault injection at run time:

ls -1 /sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter/
interval
probability
space
task-filter
times
verbose
verbose_ratelimit_burst
verbose_ratelimit_interval_ms
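
For illustration, the runtime counterpart of the boot parameter
example could look roughly like the sketch below. The function name
set_fault_attr_once and the FAIL_ATTR_DIR override are assumptions for
this example; only the directory and file names come from the listing
above.

```shell
#!/bin/sh
# Hypothetical sketch, not part of this patch. FAIL_ATTR_DIR can be
# overridden to point at a scratch directory for a dry run.
FAIL_ATTR_DIR="${FAIL_ATTR_DIR:-/sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter}"

# Intended to mirror booting with fail_kernfs_fop_write_iter=1,100,0,1
set_fault_attr_once()
{
	for kv in interval=1 probability=100 space=0 times=1; do
		# ${kv%%=*} is the attribute name, ${kv#*=} is its value.
		echo "${kv#*=}" > "$FAIL_ATTR_DIR/${kv%%=*}" || return 1
	done
}
```

Calling set_fault_attr_once as root should then allow a single
injection without needing a reboot.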

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 .../fault-injection/fault-injection.rst       | 22 +++++
 MAINTAINERS                                   |  2 +-
 fs/kernfs/Makefile                            |  1 +
 fs/kernfs/failure-injection.c                 | 91 +++++++++++++++++++
 fs/kernfs/file.c                              | 13 +++
 fs/kernfs/kernfs-internal.h                   | 72 +++++++++++++++
 include/linux/kernfs.h                        |  5 +
 lib/Kconfig.debug                             | 10 ++
 8 files changed, 215 insertions(+), 1 deletion(-)
 create mode 100644 fs/kernfs/failure-injection.c

diff --git a/Documentation/fault-injection/fault-injection.rst b/Documentation/fault-injection/fault-injection.rst
index 4a25c5eb6f07..d4d34b082f47 100644
--- a/Documentation/fault-injection/fault-injection.rst
+++ b/Documentation/fault-injection/fault-injection.rst
@@ -28,6 +28,28 @@ Available fault injection capabilities
 
   injects kernel RPC client and server failures.
 
+- fail_kernfs_fop_write_iter
+
+  Allows for failures to be enabled inside kernfs_fop_write_iter(). Enabling
+  this does not immediately enable any errors to occur. You must configure
+  how you want this routine to fail or change behaviour by using the debugfs
+  knobs for it:
+
+  # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
+  wait_after_active
+  wait_after_mutex
+  wait_at_start
+  wait_before_mutex
+
+  You can also configure how long to sleep after a wait under
+
+  /sys/kernel/debug/kernfs/sleep_after_wait_ms
+
+  If you enable CONFIG_FAULT_INJECTION_DEBUG_FS the fail_kernfs_fop_write_iter
+  failure injection parameters are placed under:
+
+  /sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter/
+
 - fail_make_request
 
   injects disk IO errors on devices permitted by setting
diff --git a/MAINTAINERS b/MAINTAINERS
index 1b4cefcb064c..fadfd961ad80 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10384,7 +10384,7 @@ M:	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 M:	Tejun Heo <tj@kernel.org>
 S:	Supported
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git
-F:	fs/kernfs/
+F:	fs/kernfs/*
 F:	include/linux/kernfs.h
 
 KEXEC
diff --git a/fs/kernfs/Makefile b/fs/kernfs/Makefile
index 4ca54ff54c98..bc5b32ca39f9 100644
--- a/fs/kernfs/Makefile
+++ b/fs/kernfs/Makefile
@@ -4,3 +4,4 @@
 #
 
 obj-y		:= mount.o inode.o dir.o file.o symlink.o
+obj-$(CONFIG_FAIL_KERNFS_KNOBS)    += failure-injection.o
diff --git a/fs/kernfs/failure-injection.c b/fs/kernfs/failure-injection.c
new file mode 100644
index 000000000000..4130d202c13b
--- /dev/null
+++ b/fs/kernfs/failure-injection.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/fault-inject.h>
+#include <linux/delay.h>
+
+#include "kernfs-internal.h"
+
+static DECLARE_FAULT_ATTR(fail_kernfs_fop_write_iter);
+struct kernfs_config_fail kernfs_config_fail;
+
+#define kernfs_config_fail(when) \
+	kernfs_config_fail.kernfs_fop_write_iter_fail.wait_ ## when
+
+#define kernfs_config_fail(when) \
+	kernfs_config_fail.kernfs_fop_write_iter_fail.wait_ ## when
+
+static int __init setup_fail_kernfs_fop_write_iter(char *str)
+{
+	return setup_fault_attr(&fail_kernfs_fop_write_iter, str);
+}
+
+__setup("fail_kernfs_fop_write_iter=", setup_fail_kernfs_fop_write_iter);
+
+struct dentry *kernfs_debugfs_root;
+struct dentry *config_fail_kernfs_fop_write_iter;
+
+static int __init kernfs_init_failure_injection(void)
+{
+	kernfs_config_fail.sleep_after_wait_ms = 100;
+	kernfs_debugfs_root = debugfs_create_dir("kernfs", NULL);
+
+	fault_create_debugfs_attr("fail_kernfs_fop_write_iter",
+				  kernfs_debugfs_root, &fail_kernfs_fop_write_iter);
+
+	config_fail_kernfs_fop_write_iter =
+		debugfs_create_dir("config_fail_kernfs_fop_write_iter",
+				   kernfs_debugfs_root);
+
+	debugfs_create_u32("sleep_after_wait_ms", 0600,
+			   kernfs_debugfs_root,
+			   &kernfs_config_fail.sleep_after_wait_ms);
+
+	debugfs_create_bool("wait_at_start", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(at_start));
+	debugfs_create_bool("wait_before_mutex", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(before_mutex));
+	debugfs_create_bool("wait_after_mutex", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(after_mutex));
+	debugfs_create_bool("wait_after_active", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(after_active));
+	return 0;
+}
+late_initcall(kernfs_init_failure_injection);
+
+int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate)
+{
+	if (!evaluate)
+		return 0;
+
+	return should_fail(&fail_kernfs_fop_write_iter, 0);
+}
+
+DECLARE_COMPLETION(kernfs_debug_wait_completion);
+EXPORT_SYMBOL_NS_GPL(kernfs_debug_wait_completion, KERNFS_DEBUG_PRIVATE);
+
+void kernfs_debug_wait(void)
+{
+	unsigned long timeout;
+
+	timeout = wait_for_completion_timeout(&kernfs_debug_wait_completion,
+					      msecs_to_jiffies(3000));
+	if (!timeout)
+		pr_info("%s waiting for kernfs_debug_wait_completion timed out\n",
+			__func__);
+	else
+		pr_info("%s received completion with time left on timeout %u ms\n",
+			__func__, jiffies_to_msecs(timeout));
+
+	/**
+	 * The goal is wait for an event, and *then* once we have
+	 * reached it, the other side will try to do something which
+	 * it thinks will break. So we must give it some time to do
+	 * that. The amount of time is configurable.
+	 */
+	msleep(kernfs_config_fail.sleep_after_wait_ms);
+	pr_info("%s ended\n", __func__);
+}
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 60e2a86c535e..4479c6580333 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -259,6 +259,9 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	const struct kernfs_ops *ops;
 	char *buf;
 
+	if (kernfs_debug_should_wait(kernfs_fop_write_iter, at_start))
+		kernfs_debug_wait();
+
 	if (of->atomic_write_len) {
 		if (len > of->atomic_write_len)
 			return -E2BIG;
@@ -280,17 +283,27 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	}
 	buf[len] = '\0';	/* guarantee string termination */
 
+	if (kernfs_debug_should_wait(kernfs_fop_write_iter, before_mutex))
+		kernfs_debug_wait();
+
 	/*
 	 * @of->mutex nests outside active ref and is used both to ensure that
 	 * the ops aren't called concurrently for the same open file.
 	 */
 	mutex_lock(&of->mutex);
+
+	if (kernfs_debug_should_wait(kernfs_fop_write_iter, after_mutex))
+		kernfs_debug_wait();
+
 	if (!kernfs_get_active(of->kn)) {
 		mutex_unlock(&of->mutex);
 		len = -ENODEV;
 		goto out_free;
 	}
 
+	if (kernfs_debug_should_wait(kernfs_fop_write_iter, after_active))
+		kernfs_debug_wait();
+
 	ops = kernfs_ops(of->kn);
 	if (ops->write)
 		len = ops->write(of, buf, len, iocb->ki_pos);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index f9cc912c31e1..9e3abf597e2d 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -18,6 +18,7 @@
 
 #include <linux/kernfs.h>
 #include <linux/fs_context.h>
+#include <linux/stringify.h>
 
 struct kernfs_iattrs {
 	kuid_t			ia_uid;
@@ -147,4 +148,75 @@ void kernfs_drain_open_files(struct kernfs_node *kn);
  */
 extern const struct inode_operations kernfs_symlink_iops;
 
+/*
+ * failure-injection.c
+ */
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+
+/**
+ * struct kernfs_fop_write_iter_fail - how kernfs_fop_write_iter() fails
+ *
+ * This lets you configure what part of kernfs_fop_write_iter() should behave
+ * in a specific way to allow userspace to capture possible failures in
+ * kernfs. The wait knobs let you design tests which capture possible
+ * race conditions which would otherwise be difficult to reproduce. A
+ * secondary driver would tell kernfs's wait completion when it is done.
+ *
+ * The point of the wait completion failure injection tests is to confirm
+ * that the kernfs active refcount suffices to ensure other objects in other
+ * layers are also guaranteed to exist, even if they are opaque to kernfs. This
+ * includes kobjects, devices, and other objects built on top of this, like
+ * the block layer when using sysfs block device attributes.
+ *
+ * @wait_at_start: waits for completion from a third party at the start of
+ *	the routine.
+ * @wait_before_mutex: waits for completion from a third party before we
+ *	are allowed to continue before the of->mutex is held.
+ * @wait_after_mutex: waits for completion from a third party after we
+ *	have held the of->mutex.
+ * @wait_after_active: waits for completion from a third party after we
+ *	have refcounted the struct kernfs_node.
+ */
+struct kernfs_fop_write_iter_fail {
+	bool wait_at_start;
+	bool wait_before_mutex;
+	bool wait_after_mutex;
+	bool wait_after_active;
+};
+
+/**
+ * struct kernfs_config_fail - kernfs configuration for failure injection
+ *
+ * You can enable kernfs failure injection on boot, though we currently
+ * only support failures for kernfs_fop_write_iter(). However, we don't
+ * want to always enable errors on this call when failure injection is enabled
+ * as this routine is used by many parts of the kernel for proper functionality.
+ * The compromise we make is we let userspace start enabling which parts it
+ * wants to fail after boot, if and only if failure injection has been enabled.
+ *
+ * @kernfs_fop_write_iter_fail: configuration for how we want to allow
+ *	for failure injection on kernfs_fop_write_iter()
+ * @sleep_after_wait_ms: how many ms to wait after completion is received.
+ */
+struct kernfs_config_fail {
+	struct kernfs_fop_write_iter_fail kernfs_fop_write_iter_fail;
+	u32 sleep_after_wait_ms;
+};
+
+extern struct kernfs_config_fail kernfs_config_fail;
+
+#define __kernfs_config_wait_var(func, when) \
+	(kernfs_config_fail.  func  ## _fail.wait_  ## when)
+#define __kernfs_debug_should_wait_func_name(func) __kernfs_debug_should_wait_## func
+
+#define kernfs_debug_should_wait(func, when) \
+	__kernfs_debug_should_wait_func_name(func)(__kernfs_config_wait_var(func, when))
+int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate);
+void kernfs_debug_wait(void);
+#else
+static inline void kernfs_init_failure_injection(void) {}
+#define kernfs_debug_should_wait(func, when) (false)
+static inline void kernfs_debug_wait(void) {}
+#endif
+
 #endif	/* __KERNFS_INTERNAL_H */
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3ccce6f24548..cd968ee2b503 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -411,6 +411,11 @@ void kernfs_init(void);
 
 struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
 						   u64 id);
+
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+extern struct completion kernfs_debug_wait_completion;
+#endif
+
 #else	/* CONFIG_KERNFS */
 
 static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ae19bf1a21b8..a29b7d398c4e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1902,6 +1902,16 @@ config FAULT_INJECTION_USERCOPY
 	  Provides fault-injection capability to inject failures
 	  in usercopy functions (copy_from_user(), get_user(), ...).
 
+config FAIL_KERNFS_KNOBS
+	bool "Fault-injection support in kernfs"
+	depends on FAULT_INJECTION
+	help
+	  Provide fault-injection capability for kernfs. This only enables
+	  the error injection functionality. To use it you must configure
+	  which path you want to trigger an error on using debugfs under
+	  /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/. By
+	  default all of these are disabled.
+
 config FAIL_MAKE_REQUEST
 	bool "Fault-injection capability for disk IO"
 	depends on FAULT_INJECTION && BLOCK
-- 
2.30.2


^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH v8 05/12] test_sysfs: add support to use kernfs failure injection
  2021-09-27 16:37 [PATCH v8 00/12] syfs: generic deadlock fix with module removal Luis Chamberlain
                   ` (3 preceding siblings ...)
  2021-09-27 16:37 ` [PATCH v8 04/12] kernfs: add initial failure injection support Luis Chamberlain
@ 2021-09-27 16:37 ` Luis Chamberlain
  2021-10-05 19:51   ` Kees Cook
  2021-09-27 16:37 ` [PATCH v8 06/12] kernel/module: add documentation for try_module_get() Luis Chamberlain
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-09-27 16:37 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: bvanassche, dan.j.williams, joe, tglx, mcgrof, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel

This extends test_sysfs with support for using the failure injection
wait completion and knobs to force a few race conditions which
demonstrate that kernfs active reference protection is sufficient
for kobject / device protection at higher layers.

This adds 4 new tests which try to remove the device attribute
store operation in 4 different situations:

  1) at the start of kernfs_fop_write_iter()
  2) before the of->mutex is held in kernfs_fop_write_iter()
  3) after the of->mutex is held in kernfs_fop_write_iter()
  4) after the kernfs node active reference is taken

A write fails in all cases except the last one, test #32. There
is a good explanation for this: *once* kernfs_get_active() gets called
we have a guarantee that the kernfs entry cannot be removed. If
kernfs_get_active() succeeds that entry cannot be removed and so
anything trying to remove that entry will have to wait. It is perhaps
not obvious, but a sysfs write will eventually trigger a
kernfs_get_active() call, and *only* if this succeeds will the sysfs
op be called. This, and the fact that you cannot remove the kernfs
entry while the kernfs entry is active, implies that a module that
created the respective sysfs / kernfs entry *cannot* possibly be
removed during a sysfs operation. Test number 32 provides us with
proof of this. If it were not true test #32 should crash.
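
The active reference rule described above can be modelled in a few
lines of userspace C. This is only a single-threaded sketch, not kernel
code: every model_* name is hypothetical, plain ints stand in for the
kernel's atomics and wait queues, and the bias trick is an assumption
about how "deactivated" is encoded.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel API. */
#define MODEL_DEACTIVATED_BIAS (INT_MIN + 1)

struct model_node {
	int active;	/* >= 0: node is live, value is the active ref count */
};

/* Models kernfs_get_active(): succeeds only while the node is live. */
static bool model_get_active(struct model_node *kn)
{
	if (kn->active < 0)
		return false;
	kn->active++;
	return true;
}

/* Models dropping an active reference once the op has returned. */
static void model_put_active(struct model_node *kn)
{
	/* There must be at least one reference left to drop. */
	assert(kn->active != 0 && kn->active != MODEL_DEACTIVATED_BIAS);
	kn->active--;
}

/* First step of removal: refuse any new active references. */
static void model_deactivate(struct model_node *kn)
{
	assert(kn->active >= 0);
	kn->active += MODEL_DEACTIVATED_BIAS;
}

/* Removal has to wait until every active reference has been dropped. */
static bool model_removal_may_proceed(const struct model_node *kn)
{
	return kn->active == MODEL_DEACTIVATED_BIAS;
}
```

In this model a writer that already holds an active reference keeps
removal waiting, and a writer that arrives after deactivation is turned
away before the op is ever called, which mirrors the outcome of tests
29-31 versus test 32.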

No null dereferences are reproduced, even though this has been observed
in some complex testing cases [0]. If this issue really exists we should
have enough tools in the test_sysfs toolbox now to try to reproduce
this easily without having to poke around other drivers. It is very
likely that the issue reported [0] was a side effect of
the first bug, which was zram specific. This is why it is important to
isolate the issue and try to reproduce it in a generic form using the
test_sysfs driver.

[0] https://lkml.kernel.org/r/20210623215007.862787-1-mcgrof@kernel.org

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 lib/Kconfig.debug                      |   3 +
 lib/test_sysfs.c                       |  31 +++++
 tools/testing/selftests/sysfs/config   |   3 +
 tools/testing/selftests/sysfs/sysfs.sh | 175 +++++++++++++++++++++++++
 4 files changed, 212 insertions(+)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a29b7d398c4e..176b822654e5 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2358,6 +2358,9 @@ config TEST_SYSFS
 	depends on SYSFS
 	depends on NET
 	depends on BLOCK
+	select FAULT_INJECTION
+	select FAULT_INJECTION_DEBUG_FS
+	select FAIL_KERNFS_KNOBS
 	help
 	  This builds the "test_sysfs" module. This driver enables to test the
 	  sysfs file system safely without affecting production knobs which
diff --git a/lib/test_sysfs.c b/lib/test_sysfs.c
index 2043ca494af8..c6e62de61403 100644
--- a/lib/test_sysfs.c
+++ b/lib/test_sysfs.c
@@ -38,6 +38,11 @@
 #include <linux/rtnetlink.h>
 #include <linux/genhd.h>
 #include <linux/blkdev.h>
+#include <linux/kernfs.h>
+
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+MODULE_IMPORT_NS(KERNFS_DEBUG_PRIVATE);
+#endif
 
 static bool enable_lock;
 module_param(enable_lock, bool_enable_only, 0644);
@@ -82,6 +87,13 @@ static bool enable_verbose_rmmod;
 module_param(enable_verbose_rmmod, bool_enable_only, 0644);
 MODULE_PARM_DESC(enable_verbose_rmmod, "enable verbose print messages on rmmod");
 
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+static bool enable_completion_on_rmmod;
+module_param(enable_completion_on_rmmod, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_completion_on_rmmod,
+		 "enable sending a kernfs completion on rmmod");
+#endif
+
 static int sysfs_test_major;
 
 /**
@@ -285,6 +297,12 @@ static ssize_t config_show(struct device *dev,
 			"enable_verbose_writes:\t%s\n",
 			enable_verbose_writes ? "true" : "false");
 
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"enable_completion_on_rmmod:\t%s\n",
+			enable_completion_on_rmmod ? "true" : "false");
+#endif
+
 	test_dev_config_unlock(test_dev);
 
 	return len;
@@ -904,10 +922,23 @@ static int __init test_sysfs_init(void)
 }
 module_init(test_sysfs_init);
 
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+/* The goal is to race our device removal with a pending kernfs -> store call */
+static void test_sysfs_kernfs_send_completion_rmmod(void)
+{
+	if (!enable_completion_on_rmmod)
+		return;
+	complete(&kernfs_debug_wait_completion);
+}
+#else
+static inline void test_sysfs_kernfs_send_completion_rmmod(void) {}
+#endif
+
 static void __exit test_sysfs_exit(void)
 {
 	if (enable_debugfs)
 		debugfs_remove(debugfs_dir);
+	test_sysfs_kernfs_send_completion_rmmod();
 	if (delay_rmmod_ms)
 		msleep(delay_rmmod_ms);
 	unregister_test_dev_sysfs(first_test_dev);
diff --git a/tools/testing/selftests/sysfs/config b/tools/testing/selftests/sysfs/config
index 9196f452ecd5..2876a229f95b 100644
--- a/tools/testing/selftests/sysfs/config
+++ b/tools/testing/selftests/sysfs/config
@@ -1,2 +1,5 @@
 CONFIG_SYSFS=m
 CONFIG_TEST_SYSFS=m
+CONFIG_FAULT_INJECTION=y
+CONFIG_FAULT_INJECTION_DEBUG_FS=y
+CONFIG_FAIL_KERNFS_KNOBS=y
diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh
index b3f4c2236c7f..f928635d0e35 100755
--- a/tools/testing/selftests/sysfs/sysfs.sh
+++ b/tools/testing/selftests/sysfs/sysfs.sh
@@ -62,6 +62,10 @@ ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block"
 ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block"
 ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test
 ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rntl_lock
+ALL_TESTS="$ALL_TESTS 0029:1:1:test_dev_x:block" # kernfs race removal of store
+ALL_TESTS="$ALL_TESTS 0030:1:1:test_dev_x:block" # kernfs race removal before mutex
+ALL_TESTS="$ALL_TESTS 0031:1:1:test_dev_x:block" # kernfs race removal after mutex
+ALL_TESTS="$ALL_TESTS 0032:1:1:test_dev_x:block" # kernfs race removal after active
 
 allow_user_defaults()
 {
@@ -92,6 +96,9 @@ allow_user_defaults()
 	if [ -z $SYSFS_DEBUGFS_DIR ]; then
 		SYSFS_DEBUGFS_DIR="/sys/kernel/debug/test_sysfs"
 	fi
+	if [ -z $KERNFS_DEBUGFS_DIR ]; then
+		KERNFS_DEBUGFS_DIR="/sys/kernel/debug/kernfs"
+	fi
 	if [ -z $PAGE_SIZE ]; then
 		PAGE_SIZE=$(getconf PAGESIZE)
 	fi
@@ -167,6 +174,14 @@ modprobe_reset_enable_rtnl_lock_on_rmmod()
 	unset FIRST_MODPROBE_ARGS
 }
 
+modprobe_reset_enable_completion()
+{
+	FIRST_MODPROBE_ARGS="enable_completion_on_rmmod=1 enable_verbose_writes=1"
+	FIRST_MODPROBE_ARGS="$FIRST_MODPROBE_ARGS enable_verbose_rmmod=1 delay_rmmod_ms=0"
+	modprobe_reset
+	unset FIRST_MODPROBE_ARGS
+}
+
 load_req_mod()
 {
 	modprobe_reset
@@ -197,6 +212,63 @@ debugfs_reset_first_test_dev_ignore_errors()
 	echo -n "1" >"$SYSFS_DEBUGFS_DIR"/reset_first_test_dev
 }
 
+debugfs_kernfs_kernfs_fop_write_iter_exists()
+{
+	KNOB_DIR="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter"
+	if [[ ! -d $KNOB_DIR ]]; then
+		echo "kernfs debugfs does not exist $KNOB_DIR"
+		return 0;
+	fi
+	KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
+	if [[ ! -d $KNOB_DEBUGFS ]]; then
+		echo -n "kernfs debugfs for configuring fail_kernfs_fop_write_iter "
+		echo "does not exist $KNOB_DEBUGFS"
+		return 0;
+	fi
+	return 1
+}
+
+debugfs_kernfs_kernfs_fop_write_iter_set_fail_once()
+{
+	KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
+	echo 1 > $KNOB_DEBUGFS/interval
+	echo 100 > $KNOB_DEBUGFS/probability
+	echo 0 > $KNOB_DEBUGFS/space
+	# Disable verbose messages on the kernel ring buffer which may
+	# confuse developers with a kernel panic.
+	echo 0 > $KNOB_DEBUGFS/verbose
+
+	# Fail only once
+	echo 1 > $KNOB_DEBUGFS/times
+}
+
+debugfs_kernfs_kernfs_fop_write_iter_set_fail_never()
+{
+	KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
+	echo 0 > $KNOB_DEBUGFS/times
+}
+
+debugfs_kernfs_set_wait_ms()
+{
+	SLEEP_AFTER_WAIT_MS="${KERNFS_DEBUGFS_DIR}/sleep_after_wait_ms"
+	echo $1 > $SLEEP_AFTER_WAIT_MS
+}
+
+debugfs_kernfs_disable_wait_kernfs_fop_write_iter()
+{
+	ENABLE_WAIT_KNOB="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter/wait_"
+	for KNOB in ${ENABLE_WAIT_KNOB}*; do
+		echo 0 > $KNOB
+	done
+}
+
+debugfs_kernfs_enable_wait_kernfs_fop_write_iter()
+{
+	ENABLE_WAIT_KNOB="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter/wait_$1"
+	echo -n "1" > $ENABLE_WAIT_KNOB
+	return $?
+}
+
 set_orig()
 {
 	if [[ ! -z $TARGET ]] && [[ ! -z $ORIG ]]; then
@@ -972,6 +1044,105 @@ sysfs_test_0028()
 	fi
 }
 
+sysfs_race_kernfs_kernfs_fop_write_iter()
+{
+	TARGET="${DIR}/$(get_test_target $1)"
+	WAIT_AT=$2
+	EXPECT_WRITE_RETURNS=$3
+	MSDELAY=$4
+
+	modprobe_reset_enable_completion
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	echo -n "Test racing removal of sysfs store op with kernfs $WAIT_AT ... "
+
+	if debugfs_kernfs_kernfs_fop_write_iter_exists; then
+		echo -n "skipping test as CONFIG_FAIL_KERNFS_KNOBS "
+		echo " or CONFIG_FAULT_INJECTION_DEBUG_FS is disabled"
+		return $ksft_skip
+	fi
+
+	# Allow for failing the kernfs_fop_write_iter() call once,
+	# we'll provide exact context shortly afterwards.
+	debugfs_kernfs_kernfs_fop_write_iter_set_fail_once
+
+	# First disable all waits
+	debugfs_kernfs_disable_wait_kernfs_fop_write_iter
+
+	# Enable a wait_for_completion(&kernfs_debug_wait_completion) at the
+	# specified location inside the kernfs_fop_write_iter() routine
+	debugfs_kernfs_enable_wait_kernfs_fop_write_iter $WAIT_AT
+
+	# Configure kernfs so that after its wait_for_completion() it
+	# will msleep() this amount of time and schedule(). We figure this
+	# will be sufficient time to allow for our module removal to complete.
+	debugfs_kernfs_set_wait_ms $MSDELAY
+
+	# Now we trigger a kernfs write op, which will run kernfs_fop_write_iter,
+	# but will wait until our driver sends a respective completion
+	set_test_ignore_errors &
+	write_pid=$!
+
+	# At this point kernfs_fop_write_iter() hasn't run our op, it's
+	# waiting for our completion at the specified time $WAIT_AT.
+	# We now remove our module which will send a
+	# complete(&kernfs_debug_wait_completion) right before we deregister
+	# our device and the sysfs device attributes are removed.
+	#
+	# After the completion is sent, the test_sysfs driver races with
+	# kernfs to do the device deregistration with the kernfs msleep
+	# and schedule(). This should mean we've forced trying to remove the
+	# module prior to allowing kernfs to run our store operation. If the
+	# race did happen we'll panic with a null dereference on the store op.
+	#
+	# If no race happens we should see no write operation triggered.
+	modprobe -r $TEST_DRIVER > /dev/null 2>&1
+
+	debugfs_kernfs_kernfs_fop_write_iter_set_fail_never
+
+	wait $write_pid
+	if [[ $? -eq $EXPECT_WRITE_RETURNS ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0029()
+{
+	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+		echo "Using delay-after-completion: $delay"
+		sysfs_race_kernfs_kernfs_fop_write_iter 0029 at_start 1 $delay
+	done
+}
+
+sysfs_test_0030()
+{
+	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+		echo "Using delay-after-completion: $delay"
+		sysfs_race_kernfs_kernfs_fop_write_iter 0030 before_mutex 1 $delay
+	done
+}
+
+sysfs_test_0031()
+{
+	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+		echo "Using delay-after-completion: $delay"
+		sysfs_race_kernfs_kernfs_fop_write_iter 0031 after_mutex 1 $delay
+	done
+}
+
+# A write only succeeds *iff* a module removal happens *after* the
+# kernfs active reference is obtained with kernfs_get_active().
+sysfs_test_0032()
+{
+	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+		echo "Using delay-after-completion: $delay"
+		sysfs_race_kernfs_kernfs_fop_write_iter 0032 after_active 0 $delay
+	done
+}
+
 test_gen_desc()
 {
 	echo -n "$1 x $(get_test_count $1)"
@@ -1013,6 +1184,10 @@ list_tests()
 	echo "$(test_gen_desc 0026) - block test writing y larger delay and resetting device"
 	echo "$(test_gen_desc 0027) - test rmmod deadlock while writing x ... "
 	echo "$(test_gen_desc 0028) - test rmmod deadlock using rtnl_lock while writing x ..."
+	echo "$(test_gen_desc 0029) - racing removal of store op with kernfs at start"
+	echo "$(test_gen_desc 0030) - racing removal of store op with kernfs before mutex"
+	echo "$(test_gen_desc 0031) - racing removal of store op with kernfs after mutex"
+	echo "$(test_gen_desc 0032) - racing removal of store op with kernfs after active"
 }
 
 usage()
-- 
2.30.2


^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH v8 06/12] kernel/module: add documentation for try_module_get()
  2021-09-27 16:37 [PATCH v8 00/12] syfs: generic deadlock fix with module removal Luis Chamberlain
                   ` (4 preceding siblings ...)
  2021-09-27 16:37 ` [PATCH v8 05/12] test_sysfs: add support to use kernfs failure injection Luis Chamberlain
@ 2021-09-27 16:37 ` Luis Chamberlain
  2021-10-05 19:58   ` Kees Cook
  2021-09-27 16:38 ` [PATCH v8 07/12] fs/kernfs/symlink.c: replace S_IRWXUGO with 0777 on kernfs_create_link() Luis Chamberlain
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-09-27 16:37 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: bvanassche, dan.j.williams, joe, tglx, mcgrof, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel

There is quite a bit of tribal knowledge around proper use of
try_module_get() and that it must be used only in a context which
can ensure the module won't be gone during the operation. Document
this little bit of tribal knowledge.

I'm extending this tribal knowledge with new developments which it
seems some folks do not yet believe to be true: we can be sure a
module will exist during the lifetime of a sysfs file operation.
For proof, refer to test_sysfs test #32:

./tools/testing/selftests/sysfs/sysfs.sh -t 0032

Without this being true, the write would fail or worse,
a crash would happen, in this test. It does not.
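
For readers unfamiliar with the semantics being documented, here is a
minimal userspace model of the "yield to removal" behaviour. It is a
sketch under simplifying assumptions: the model_* names are
hypothetical, there is no locking or atomics, and removal is assumed to
be refused whenever a reference is held (forced removal is not
modelled).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel API. */
enum model_state { MODEL_LIVE, MODEL_GOING };

struct model_module {
	enum model_state state;
	int refcnt;	/* references held by users of the module */
};

/* Models try_module_get(): refuse once removal is under way. */
static bool model_try_module_get(struct model_module *mod)
{
	if (mod->state == MODEL_GOING)
		return false;
	mod->refcnt++;
	return true;
}

/* Models module_put(): pairs with a successful get. */
static void model_module_put(struct model_module *mod)
{
	assert(mod->refcnt > 0);
	mod->refcnt--;
}

/* Models the start of removal: only allowed with no references held. */
static bool model_begin_removal(struct model_module *mod)
{
	if (mod->refcnt > 0)
		return false;
	mod->state = MODEL_GOING;
	return true;
}
```

A caller that gets true back must later call the put helper; a caller
that gets false should fail whatever it was about to do, which is the
yielding behaviour the new kernel-doc describes.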

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 include/linux/module.h | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index c9f1200b2312..22eacd5e1e85 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -609,10 +609,40 @@ void symbol_put_addr(void *addr);
    to handle the error case (which only happens with rmmod --wait). */
 extern void __module_get(struct module *module);
 
-/* This is the Right Way to get a module: if it fails, it's being removed,
- * so pretend it's not there. */
+/**
+ * try_module_get() - yields to module removal and bumps refcnt otherwise
+ * @module: the module we should check for
+ *
+ * This can be used to try to bump the reference count of a module, so to
+ * prevent module removal. The reference count of a module is not allowed
+ * to be incremented if the module is already being removed.
+ *
+ * Care must be taken to ensure the module cannot be removed during the call to
+ * try_module_get(). This can be done by having another entity other than the
+ * module itself increment the module reference count, or through some other
+ * means which guarantees the module could not be removed during an operation.
+ * An example of this latter case is using try_module_get() in a sysfs file
+ * which the module created. The sysfs store / read file operations are
+ * guaranteed to exist through the use of kernfs's active reference (see
+ * kernfs_active()). If a sysfs file operation is being run, the module which
+ * created it must still exist as the module is in charge of removing the same
+ * sysfs file being read. Also, a sysfs / kernfs file removal cannot happen
+ * unless the same file is not active.
+ *
+ * One of the real values to try_module_get() is the module_is_live() check
+ * which ensures the caller of try_module_get() can yield to userspace
+ * module removal requests and fail whatever it was about to process.
+ */
 extern bool try_module_get(struct module *module);
 
+/**
+ * module_put() - release a reference count to a module
+ * @module: the module we should release a reference count for
+ *
+ * If you successfully bump a reference count to a module with try_module_get(),
+ * when you are finished you must call module_put() to release that reference
+ * count.
+ */
 extern void module_put(struct module *module);
 
 #else /*!CONFIG_MODULE_UNLOAD*/
-- 
2.30.2


^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH v8 07/12] fs/kernfs/symlink.c: replace S_IRWXUGO with 0777 on kernfs_create_link()
  2021-09-27 16:37 [PATCH v8 00/12] syfs: generic deadlock fix with module removal Luis Chamberlain
                   ` (5 preceding siblings ...)
  2021-09-27 16:37 ` [PATCH v8 06/12] kernel/module: add documentation for try_module_get() Luis Chamberlain
@ 2021-09-27 16:38 ` Luis Chamberlain
  2021-10-05 19:59   ` Kees Cook
  2021-09-27 16:38 ` [PATCH v8 08/12] fs/sysfs/dir.c: replace S_IRWXU|S_IRUGO|S_IXUGO with 0755 sysfs_create_dir_ns() Luis Chamberlain
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-09-27 16:38 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: bvanassche, dan.j.williams, joe, tglx, mcgrof, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel

If one ends up extending this line checkpatch will complain about the
use of S_IRWXUGO suggesting it is not preferred and that 0777
should be used instead. Take the tip from checkpatch and do that
change before we do our subsequent changes.

This makes no functional changes.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/kernfs/symlink.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index c8f8e41b8411..19a6c71c6ff5 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,8 +36,7 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
 		gid = target->iattr->ia_gid;
 	}
 
-	kn = kernfs_new_node(parent, name, S_IFLNK|S_IRWXUGO, uid, gid,
-			     KERNFS_LINK);
+	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
-- 
2.30.2


^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH v8 08/12] fs/sysfs/dir.c: replace S_IRWXU|S_IRUGO|S_IXUGO with 0755 sysfs_create_dir_ns()
  2021-09-27 16:37 [PATCH v8 00/12] syfs: generic deadlock fix with module removal Luis Chamberlain
                   ` (6 preceding siblings ...)
  2021-09-27 16:38 ` [PATCH v8 07/12] fs/kernfs/symlink.c: replace S_IRWXUGO with 0777 on kernfs_create_link() Luis Chamberlain
@ 2021-09-27 16:38 ` Luis Chamberlain
  2021-10-05 16:05   ` Kees Cook
  2021-09-27 16:38 ` [PATCH v8 09/12] sysfs: fix deadlock race with module removal Luis Chamberlain
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-09-27 16:38 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: bvanassche, dan.j.williams, joe, tglx, mcgrof, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel

If one ends up expanding on this line checkpatch will complain that the
combination S_IRWXU|S_IRUGO|S_IXUGO should just be replaced with the
octal 0755. Do that.

This makes no functional changes.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/sysfs/dir.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 59dffd5ca517..b6b6796e1616 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -56,8 +56,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
 
 	kobject_get_ownership(kobj, &uid, &gid);
 
-	kn = kernfs_create_dir_ns(parent, kobject_name(kobj),
-				  S_IRWXU | S_IRUGO | S_IXUGO, uid, gid,
+	kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
 				  kobj, ns);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
-- 
2.30.2


^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH v8 09/12] sysfs: fix deadlock race with module removal
  2021-09-27 16:37 [PATCH v8 00/12] syfs: generic deadlock fix with module removal Luis Chamberlain
                   ` (7 preceding siblings ...)
  2021-09-27 16:38 ` [PATCH v8 08/12] fs/sysfs/dir.c: replace S_IRWXU|S_IRUGO|S_IXUGO with 0755 sysfs_create_dir_ns() Luis Chamberlain
@ 2021-09-27 16:38 ` Luis Chamberlain
  2021-10-05  9:24   ` Ming Lei
  2021-10-05 20:50   ` Kees Cook
  2021-09-27 16:38 ` [PATCH v8 10/12] test_sysfs: enable deadlock tests by default Luis Chamberlain
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-09-27 16:38 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: bvanassche, dan.j.williams, joe, tglx, mcgrof, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel

When driver sysfs attributes use a lock also used on module removal we
can race to deadlock. This happens when, for instance, a sysfs file on
a driver is used at the same time that module removal is
triggered. The module removal code holds a lock, and then the
driver's sysfs file entry waits for the same lock. While holding the
lock the module removal tries to remove the sysfs entries, but these
cannot be removed yet as one is waiting for a lock. This won't complete
as the lock is already held. Likewise module removal cannot complete,
and so we deadlock.

This can now be easily reproduced with our sysfs selftest as follows:

./tools/testing/selftests/sysfs/sysfs.sh -t 0027

This uses a local driver lock. Test 0028 can also be used, that uses
the rtnl_lock():

./tools/testing/selftests/sysfs/sysfs.sh -t 0028

To fix this we extend the struct kernfs_node with a module reference
and use the try_module_get() after kernfs_get_active() is called. As
documented in the prior patch, we now know that once kernfs_get_active()
is called the module is implicitly guarded to exist and cannot be removed.
This is because the module is the one in charge of removing the same
sysfs file it created, and removal of sysfs files on module exit will wait
until they don't have any active references. By using a try_module_get()
after kernfs_get_active() we yield to let module removal trump calls to
process a sysfs operation, while also preventing module removal if a sysfs
operation is already in progress. This prevents the deadlock.
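
To make the ordering above concrete, here is a minimal userspace sketch,
assuming C11 atomics in place of the kernel primitives. Every mock_* name is
hypothetical, and the module refcount is modelled as "1 == idle, 0 == going
away", which is an assumption of this sketch rather than something stated in
this patch. It only illustrates the two outcomes: an in-flight operation
blocks removal, and a started removal makes new operations bail out early.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not the kernel's real types. */
struct mock_module {
	atomic_int refcnt;	/* 1 == idle base reference, 0 == going away */
};

struct mock_node {
	atomic_int active;		/* negative once the node is deactivated */
	struct mock_module *owner;	/* NULL when no module owns the node */
};

static void mock_module_init(struct mock_module *mod)
{
	atomic_init(&mod->refcnt, 1);
}

static void mock_node_init(struct mock_node *kn, struct mock_module *owner)
{
	atomic_init(&kn->active, 0);
	kn->owner = owner;
}

/* Increment *v unless it is below 'floor'; returns true on success. */
static bool inc_unless_below(atomic_int *v, int floor)
{
	int old = atomic_load(v);

	while (old >= floor) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/* Stand-in for try_module_get(): fails once removal has started. */
static bool mock_try_module_get(struct mock_module *mod)
{
	return !mod || inc_unless_below(&mod->refcnt, 1);
}

static void mock_module_put(struct mock_module *mod)
{
	if (mod)
		atomic_fetch_sub(&mod->refcnt, 1);
}

/* Models the start of removal: succeeds only if no operation holds a ref. */
static bool mock_try_stop_module(struct mock_module *mod)
{
	int idle = 1;

	return atomic_compare_exchange_strong(&mod->refcnt, &idle, 0);
}

/* Active reference first, then the module reference, as in the patch. */
static struct mock_node *mock_get_active(struct mock_node *kn)
{
	if (!kn || !inc_unless_below(&kn->active, 0))
		return NULL;

	if (!mock_try_module_get(kn->owner)) {
		/* Removal is in progress: back out instead of blocking. */
		atomic_fetch_sub(&kn->active, 1);
		return NULL;
	}
	return kn;
}

static void mock_put_active(struct mock_node *kn)
{
	int prev = atomic_fetch_sub(&kn->active, 1);

	assert(prev > 0);	/* a put must pair with a successful get */
	(void)prev;
	mock_module_put(kn->owner);
}
```

In this model a caller that gets NULL from mock_get_active() simply returns an
error to userspace without ever reaching the driver's lock, which is the
property the fix relies on.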

This deadlock was first reported with the zram driver, however the live
patching folks have acknowledged they have observed this as well with
live patching, when a live patch is removed. I was then able to
reproduce easily by creating a dedicated selftest for it.

A sketch of how this can happen follows. Consider foo, a mutex local
to a driver, which is used both in the driver's module exit routine and
in one of its sysfs ops:

foo.c:
static DEFINE_MUTEX(foo);
static ssize_t foo_store(struct device *dev,
			 struct device_attribute *attr,
			 const char *buf, size_t count)
{
	...
	mutex_lock(&foo);
	...
	mutex_unlock(&foo);
	...
}
static DEVICE_ATTR_RW(foo);
...
void foo_exit(void)
{
	mutex_lock(&foo);
	...
	mutex_unlock(&foo);
}
module_exit(foo_exit);

And this can lead to this condition:

CPU A                              CPU B
                                   foo_store()
foo_exit()
  mutex_lock(&foo)
                                   mutex_lock(&foo)
   del_gendisk(some_struct->disk);
     device_del()
       device_remove_groups()

In this situation foo_store() is waiting for the mutex foo to
become unlocked, but that won't happen until module removal is complete.
But module removal won't complete until the sysfs operation being poked at
completes, and that operation is waiting for a lock that is already held.
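
The circular wait described above can be written down as a tiny wait-for
graph. This is only an illustrative userspace sketch; the task names and the
graph helpers are hypothetical and are not part of the patch. An edge "A waits
for B" means A cannot make progress until B does, and a cycle in the graph is
the deadlock.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TASKS 4

/* Hypothetical labels for the two contexts in the diagram above. */
enum { TASK_EXIT = 0, TASK_STORE = 1 };

struct wait_graph {
	bool waits_for[MAX_TASKS][MAX_TASKS];	/* [a][b]: a waits for b */
};

static void add_wait(struct wait_graph *g, int waiter, int holder)
{
	assert(waiter >= 0 && waiter < MAX_TASKS);
	assert(holder >= 0 && holder < MAX_TASKS);
	g->waits_for[waiter][holder] = true;
}

/* Depth-first search: can 'from' reach 'to' by following wait edges? */
static bool reaches(const struct wait_graph *g, int from, int to, bool *seen)
{
	for (int next = 0; next < MAX_TASKS; next++) {
		if (!g->waits_for[from][next])
			continue;
		if (next == to)
			return true;
		if (!seen[next]) {
			seen[next] = true;
			if (reaches(g, next, to, seen))
				return true;
		}
	}
	return false;
}

/* A task that transitively waits for itself can never make progress. */
static bool has_deadlock(const struct wait_graph *g)
{
	for (int t = 0; t < MAX_TASKS; t++) {
		bool seen[MAX_TASKS] = { false };

		if (reaches(g, t, t, seen))
			return true;
	}
	return false;
}
```

Adding only the edge TASK_STORE -> TASK_EXIT (the store waits for the mutex)
gives no cycle. Adding TASK_EXIT -> TASK_STORE as well (removal waits for the
store's active reference to drain) closes the loop, which is the situation in
the diagram.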

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +-
 fs/kernfs/dir.c                        | 44 ++++++++++++++++++----
 fs/kernfs/file.c                       |  6 ++-
 fs/kernfs/kernfs-internal.h            |  3 +-
 fs/kernfs/symlink.c                    |  3 +-
 fs/sysfs/dir.c                         |  2 +-
 fs/sysfs/file.c                        |  6 ++-
 fs/sysfs/group.c                       |  3 +-
 include/linux/kernfs.h                 | 14 ++++---
 include/linux/sysfs.h                  | 52 ++++++++++++++++++++------
 kernel/cgroup/cgroup.c                 |  2 +-
 11 files changed, 105 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b57b3db9a6a7..4edf3b37fd2c 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -209,7 +209,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
 
 	kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
-				  0, rft->kf_ops, rft, NULL, NULL);
+				  0, rft->kf_ops, rft, NULL, NULL, NULL);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 
@@ -2482,7 +2482,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
 
 	kn = __kernfs_create_file(parent_kn, name, 0444,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
-				  &kf_mondata_ops, priv, NULL, NULL);
+				  &kf_mondata_ops, priv, NULL, NULL, NULL);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index ba581429bf7b..e841201fd11b 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -14,6 +14,7 @@
 #include <linux/slab.h>
 #include <linux/security.h>
 #include <linux/hash.h>
+#include <linux/module.h>
 
 #include "kernfs-internal.h"
 
@@ -414,12 +415,29 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
  */
 struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
 {
+	int v;
+
 	if (unlikely(!kn))
 		return NULL;
 
 	if (!atomic_inc_unless_negative(&kn->active))
 		return NULL;
 
+	/*
+	 * If a module created the kernfs_node, the module cannot possibly be
+	 * removed if the above atomic_inc_unless_negative() succeeded. So the
+	 * try_module_get() below is not to protect the lifetime of the module
+	 * as that is already guaranteed. The try_module_get() below is used
+	 * to ensure that we don't deadlock in case a kernfs operation and
+	 * module removal used a shared lock.
+	 */
+	if (!try_module_get(kn->owner)) {
+		v = atomic_dec_return(&kn->active);
+		if (unlikely(v == KN_DEACTIVATED_BIAS))
+			wake_up_all(&kernfs_root(kn)->deactivate_waitq);
+		return NULL;
+	}
+
 	if (kernfs_lockdep(kn))
 		rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
 	return kn;
@@ -442,6 +460,13 @@ void kernfs_put_active(struct kernfs_node *kn)
 	if (kernfs_lockdep(kn))
 		rwsem_release(&kn->dep_map, _RET_IP_);
 	v = atomic_dec_return(&kn->active);
+
+	/*
+	 * We prevent module exit *until* we know for sure all possible
+	 * kernfs ops are done.
+	 */
+	module_put(kn->owner);
+
 	if (likely(v != KN_DEACTIVATED_BIAS))
 		return;
 
@@ -572,7 +597,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 					     struct kernfs_node *parent,
 					     const char *name, umode_t mode,
 					     kuid_t uid, kgid_t gid,
-					     unsigned flags)
+					     unsigned flags,
+					     struct module *owner)
 {
 	struct kernfs_node *kn;
 	u32 id_highbits;
@@ -607,6 +633,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 	kn->name = name;
 	kn->mode = mode;
 	kn->flags = flags;
+	kn->owner = owner;
 
 	if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) {
 		struct iattr iattr = {
@@ -640,12 +667,13 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
 				    const char *name, umode_t mode,
 				    kuid_t uid, kgid_t gid,
-				    unsigned flags)
+				    unsigned flags,
+				    struct module *owner)
 {
 	struct kernfs_node *kn;
 
 	kn = __kernfs_new_node(kernfs_root(parent), parent,
-			       name, mode, uid, gid, flags);
+			       name, mode, uid, gid, flags, owner);
 	if (kn) {
 		kernfs_get(parent);
 		kn->parent = parent;
@@ -927,7 +955,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 
 	kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO,
 			       GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
-			       KERNFS_DIR);
+			       KERNFS_DIR, NULL);
 	if (!kn) {
 		idr_destroy(&root->ino_idr);
 		kfree(root);
@@ -969,20 +997,22 @@ void kernfs_destroy_root(struct kernfs_root *root)
  * @gid: gid of the new directory
  * @priv: opaque data associated with the new directory
  * @ns: optional namespace tag of the directory
+ * @owner: if set, the module owner of this directory
  *
  * Returns the created node on success, ERR_PTR() value on failure.
  */
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
 					 kuid_t uid, kgid_t gid,
-					 void *priv, const void *ns)
+					 void *priv, const void *ns,
+					 struct module *owner)
 {
 	struct kernfs_node *kn;
 	int rc;
 
 	/* allocate */
 	kn = kernfs_new_node(parent, name, mode | S_IFDIR,
-			     uid, gid, KERNFS_DIR);
+			     uid, gid, KERNFS_DIR, owner);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
@@ -1014,7 +1044,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 
 	/* allocate */
 	kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR,
-			     GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
+			     GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR, NULL);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 4479c6580333..0e125287e050 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -978,6 +978,7 @@ const struct file_operations kernfs_file_fops = {
  * @priv: private data for the file
  * @ns: optional namespace tag of the file
  * @key: lockdep key for the file's active_ref, %NULL to disable lockdep
+ * @owner: if set, the module owner of the file
  *
  * Returns the created node on success, ERR_PTR() value on error.
  */
@@ -987,7 +988,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 loff_t size,
 					 const struct kernfs_ops *ops,
 					 void *priv, const void *ns,
-					 struct lock_class_key *key)
+					 struct lock_class_key *key,
+					 struct module *owner)
 {
 	struct kernfs_node *kn;
 	unsigned flags;
@@ -996,7 +998,7 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 	flags = KERNFS_FILE;
 
 	kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG,
-			     uid, gid, flags);
+			     uid, gid, flags, owner);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 9e3abf597e2d..6d275d661987 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -134,7 +134,8 @@ int kernfs_add_one(struct kernfs_node *kn);
 struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
 				    const char *name, umode_t mode,
 				    kuid_t uid, kgid_t gid,
-				    unsigned flags);
+				    unsigned flags,
+				    struct module *owner);
 
 /*
  * file.c
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 19a6c71c6ff5..5a053eebee52 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,7 +36,8 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
 		gid = target->iattr->ia_gid;
 	}
 
-	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
+	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK,
+			     target->owner);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index b6b6796e1616..9763c2fde3c7 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -57,7 +57,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
 	kobject_get_ownership(kobj, &uid, &gid);
 
 	kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
-				  kobj, ns);
+				  kobj, ns, NULL);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, kobject_name(kobj));
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 42dcf96881b6..af9e91fd1a92 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -292,7 +292,8 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent,
 #endif
 
 	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
-				  PAGE_SIZE, ops, (void *)attr, ns, key);
+				  PAGE_SIZE, ops, (void *)attr, ns, key,
+				  attr->owner);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, attr->name);
@@ -327,7 +328,8 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent,
 #endif
 
 	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
-				  battr->size, ops, (void *)attr, ns, key);
+				  battr->size, ops, (void *)attr, ns, key,
+				  attr->owner);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, attr->name);
diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index eeb0e3099421..372864d1cb54 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -135,7 +135,8 @@ static int internal_create_group(struct kobject *kobj, int update,
 		} else {
 			kn = kernfs_create_dir_ns(kobj->sd, grp->name,
 						  S_IRWXU | S_IRUGO | S_IXUGO,
-						  uid, gid, kobj, NULL);
+						  uid, gid, kobj, NULL,
+						  grp->owner);
 			if (IS_ERR(kn)) {
 				if (PTR_ERR(kn) == -EEXIST)
 					sysfs_warn_dup(kobj->sd, grp->name);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index cd968ee2b503..818b00ebd107 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -161,6 +161,7 @@ struct kernfs_node {
 	unsigned short		flags;
 	umode_t			mode;
 	struct kernfs_iattrs	*iattr;
+	struct module           *owner;
 };
 
 /*
@@ -370,7 +371,8 @@ void kernfs_destroy_root(struct kernfs_root *root);
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
 					 kuid_t uid, kgid_t gid,
-					 void *priv, const void *ns);
+					 void *priv, const void *ns,
+					 struct module *owner);
 struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 					    const char *name);
 struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
@@ -379,7 +381,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 loff_t size,
 					 const struct kernfs_ops *ops,
 					 void *priv, const void *ns,
-					 struct lock_class_key *key);
+					 struct lock_class_key *key,
+					 struct module *owner);
 struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
 				       const char *name,
 				       struct kernfs_node *target);
@@ -472,14 +475,15 @@ static inline void kernfs_destroy_root(struct kernfs_root *root) { }
 static inline struct kernfs_node *
 kernfs_create_dir_ns(struct kernfs_node *parent, const char *name,
 		     umode_t mode, kuid_t uid, kgid_t gid,
-		     void *priv, const void *ns)
+		     void *priv, const void *ns, struct module *owner)
 { return ERR_PTR(-ENOSYS); }
 
 static inline struct kernfs_node *
 __kernfs_create_file(struct kernfs_node *parent, const char *name,
 		     umode_t mode, kuid_t uid, kgid_t gid,
 		     loff_t size, const struct kernfs_ops *ops,
-		     void *priv, const void *ns, struct lock_class_key *key)
+		     void *priv, const void *ns, struct lock_class_key *key,
+		     struct module *owner)
 { return ERR_PTR(-ENOSYS); }
 
 static inline struct kernfs_node *
@@ -566,7 +570,7 @@ kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
 {
 	return kernfs_create_dir_ns(parent, name, mode,
 				    GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
-				    priv, NULL);
+				    priv, NULL, parent->owner);
 }
 
 static inline int kernfs_remove_by_name(struct kernfs_node *parent,
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index e3f1e8ac1f85..babbabb460dc 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -30,6 +30,7 @@ enum kobj_ns_type;
 struct attribute {
 	const char		*name;
 	umode_t			mode;
+	struct module           *owner;
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	bool			ignore_lockdep:1;
 	struct lock_class_key	*key;
@@ -80,6 +81,7 @@ do {							\
  * @attrs:	Pointer to NULL terminated list of attributes.
  * @bin_attrs:	Pointer to NULL terminated list of binary attributes.
  *		Either attrs or bin_attrs or both must be provided.
+ * @owner:	If set, module responsible for this attribute group
  */
 struct attribute_group {
 	const char		*name;
@@ -89,6 +91,7 @@ struct attribute_group {
 						  struct bin_attribute *, int);
 	struct attribute	**attrs;
 	struct bin_attribute	**bin_attrs;
+	struct module           *owner;
 };
 
 /*
@@ -100,38 +103,52 @@ struct attribute_group {
 
 #define __ATTR(_name, _mode, _show, _store) {				\
 	.attr = {.name = __stringify(_name),				\
-		 .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
+		 .mode = VERIFY_OCTAL_PERMISSIONS(_mode),               \
+		 .owner  = THIS_MODULE,                                 \
+	},                                                             \
 	.show	= _show,						\
 	.store	= _store,						\
 }
 
 #define __ATTR_PREALLOC(_name, _mode, _show, _store) {			\
 	.attr = {.name = __stringify(_name),				\
-		 .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode) },\
+		 .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode),\
+		 .owner  = THIS_MODULE,                                 \
+	},                                                              \
 	.show	= _show,						\
 	.store	= _store,						\
 }
 
 #define __ATTR_RO(_name) {						\
-	.attr	= { .name = __stringify(_name), .mode = 0444 },		\
+	.attr	= { .name = __stringify(_name),                         \
+		    .mode = 0444,					\
+		    .owner  = THIS_MODULE,				\
+		},                                                     \
 	.show	= _name##_show,						\
 }
 
 #define __ATTR_RO_MODE(_name, _mode) {					\
 	.attr	= { .name = __stringify(_name),				\
-		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
+		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode),            \
+		    .owner  = THIS_MODULE,				\
+	},                                                              \
 	.show	= _name##_show,						\
 }
 
 #define __ATTR_RW_MODE(_name, _mode) {					\
 	.attr	= { .name = __stringify(_name),				\
-		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
+		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode),            \
+		    .owner  = THIS_MODULE,                              \
+	},								\
 	.show	= _name##_show,						\
 	.store	= _name##_store,					\
 }
 
 #define __ATTR_WO(_name) {						\
-	.attr	= { .name = __stringify(_name), .mode = 0200 },		\
+	.attr	= { .name = __stringify(_name),                         \
+		    .mode = 0200,					\
+		    .owner  = THIS_MODULE,				\
+	},                                                              \
 	.store	= _name##_store,					\
 }
 
@@ -141,8 +158,11 @@ struct attribute_group {
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 #define __ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) {	\
-	.attr = {.name = __stringify(_name), .mode = _mode,	\
-			.ignore_lockdep = true },		\
+	.attr = {.name = __stringify(_name),                    \
+		 .mode = _mode,					\
+		 .ignore_lockdep = true,                        \
+		 .owner  = THIS_MODULE,                         \
+	},							\
 	.show		= _show,				\
 	.store		= _store,				\
 }
@@ -159,6 +179,7 @@ static const struct attribute_group *_name##_groups[] = {	\
 #define ATTRIBUTE_GROUPS(_name)					\
 static const struct attribute_group _name##_group = {		\
 	.attrs = _name##_attrs,					\
+	.owner = THIS_MODULE,					\
 };								\
 __ATTRIBUTE_GROUPS(_name)
 
@@ -199,20 +220,29 @@ struct bin_attribute {
 
 /* macros to create static binary attributes easier */
 #define __BIN_ATTR(_name, _mode, _read, _write, _size) {		\
-	.attr = { .name = __stringify(_name), .mode = _mode },		\
+	.attr = { .name = __stringify(_name),                           \
+		   .mode = _mode,					\
+		   .owner = THIS_MODULE,				\
+	},								\
 	.read	= _read,						\
 	.write	= _write,						\
 	.size	= _size,						\
 }
 
 #define __BIN_ATTR_RO(_name, _size) {					\
-	.attr	= { .name = __stringify(_name), .mode = 0444 },		\
+	.attr	= { .name = __stringify(_name),                         \
+		    .mode = 0444,					\
+		    .owner = THIS_MODULE,				\
+	},								\
 	.read	= _name##_read,						\
 	.size	= _size,						\
 }
 
 #define __BIN_ATTR_WO(_name, _size) {					\
-	.attr	= { .name = __stringify(_name), .mode = 0200 },		\
+	.attr	= { .name = __stringify(_name),                         \
+		    .mode = 0200,					\
+		    .owner = THIS_MODULE,				\
+	},								\
 	.write	= _name##_write,					\
 	.size	= _size,						\
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9e0390000025..c6b0a28f599c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3975,7 +3975,7 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
 				  cgroup_file_mode(cft),
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
 				  0, cft->kf_ops, cft,
-				  NULL, key);
+				  NULL, key, NULL);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 
-- 
2.30.2



* [PATCH v8 10/12] test_sysfs: enable deadlock tests by default
  2021-09-27 16:37 [PATCH v8 00/12] syfs: generic deadlock fix with module removal Luis Chamberlain
                   ` (8 preceding siblings ...)
  2021-09-27 16:38 ` [PATCH v8 09/12] sysfs: fix deadlock race with module removal Luis Chamberlain
@ 2021-09-27 16:38 ` Luis Chamberlain
  2021-09-27 16:38 ` [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate Luis Chamberlain
  2021-09-27 16:38 ` [PATCH v8 12/12] zram: use ATTRIBUTE_GROUPS to fix sysfs deadlock module removal Luis Chamberlain
  11 siblings, 0 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-09-27 16:38 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: bvanassche, dan.j.williams, joe, tglx, mcgrof, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel

Now that sysfs has the deadlock race with module removal fixed,
enable the module removal deadlock tests. They were left disabled
by default as otherwise you would deadlock your system.

./tools/testing/selftests/sysfs/sysfs.sh -t 0027
Running test: sysfs_test_0027 - run #0
Test for possible rmmod deadlock while writing x ... ok

./tools/testing/selftests/sysfs/sysfs.sh -t 0028
Running test: sysfs_test_0028 - run #0
Test for possible rmmod deadlock using rtnl_lock while writing x ... ok

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 tools/testing/selftests/sysfs/sysfs.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh
index f928635d0e35..4047ac48e764 100755
--- a/tools/testing/selftests/sysfs/sysfs.sh
+++ b/tools/testing/selftests/sysfs/sysfs.sh
@@ -60,8 +60,8 @@ ALL_TESTS="$ALL_TESTS 0023:1:1:test_dev_y:block"
 ALL_TESTS="$ALL_TESTS 0024:1:1:test_dev_x:block"
 ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block"
 ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block"
-ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test
-ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rntl_lock
+ALL_TESTS="$ALL_TESTS 0027:1:1:test_dev_x:block" # deadlock test
+ALL_TESTS="$ALL_TESTS 0028:1:1:test_dev_x:block" # deadlock test with rntl_lock
 ALL_TESTS="$ALL_TESTS 0029:1:1:test_dev_x:block" # kernfs race removal of store
 ALL_TESTS="$ALL_TESTS 0030:1:1:test_dev_x:block" # kernfs race removal before mutex
 ALL_TESTS="$ALL_TESTS 0031:1:1:test_dev_x:block" # kernfs race removal after mutex
-- 
2.30.2



* [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-09-27 16:37 [PATCH v8 00/12] syfs: generic deadlock fix with module removal Luis Chamberlain
                   ` (9 preceding siblings ...)
  2021-09-27 16:38 ` [PATCH v8 10/12] test_sysfs: enable deadlock tests by default Luis Chamberlain
@ 2021-09-27 16:38 ` Luis Chamberlain
  2021-10-05 20:55   ` Kees Cook
  2021-10-14  1:55   ` Ming Lei
  2021-09-27 16:38 ` [PATCH v8 12/12] zram: use ATTRIBUTE_GROUPS to fix sysfs deadlock module removal Luis Chamberlain
  11 siblings, 2 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-09-27 16:38 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: bvanassche, dan.j.williams, joe, tglx, mcgrof, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel

Provide a simple state machine to fix races between driver exit, where we
remove the CPU hotplug multistate callbacks, and the re-initialization or
creation of new per-CPU instances which should be managed by these callbacks.

The zram driver makes use of cpu hotplug multistate support, whereby it
associates a struct zcomp per CPU. Each struct zcomp represents a
compression algorithm in charge of managing compression streams per
CPU. Although a compiled zram driver only supports a fixed set of
compression algorithms, each zram device gets a struct zcomp allocated
per CPU. The "multi" in CPU hotplug multistate refers to these per
cpu struct zcomp instances. Each of these will have the CPU hotplug
callback called for it on CPU plug / unplug. The kernel's CPU hotplug
multistate keeps a linked list of these different structures so that
it will iterate over them on CPU transitions.

By default at driver initialization we will create just one zram device
(num_devices=1), and a zcomp structure is then set up for the now-default
lzo-rle compression algorithm. At driver removal we first remove each
zram device, and so we destroy the associated struct zcomp per CPU. But
since we expose sysfs attributes to create new devices or reset /
initialize existing zram devices, we can easily end up re-initializing
a struct zcomp for a zram device before the exit routine of the module
removes the cpu hotplug callback. When this happens the kernel's CPU
hotplug will detect that at least one instance (struct zcomp for us)
exists. This can happen in the following situation:

CPU 1                            CPU 2

                                disksize_store(...);
class_unregister(...);
idr_for_each(...);
zram_debugfs_destroy();

idr_destroy(...);
unregister_blkdev(...);
cpuhp_remove_multi_state(...);

The warning comes up on cpuhp_remove_multi_state() when it sees that the
state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list.
In this case a struct zcomp still exists: the driver allowed its
per-CPU creation even though we could have just freed them per CPU
through a call on another CPU, and we are then later trying to remove the
hotplug callback.
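
As a rough illustration of why the warning fires, the following userspace
sketch models one hotplug state with a linked list of instances. The mock_*
names are hypothetical and the real cpuhp code is considerably more involved;
the sketch only shows that removing a state while an instance is still linked
is the condition being complained about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one per-device instance (struct zcomp for us). */
struct mock_instance {
	struct mock_instance *next;
};

/* Hypothetical stand-in for one multi-instance hotplug state. */
struct mock_state {
	struct mock_instance *instances;
};

static void mock_add_instance(struct mock_state *st, struct mock_instance *inst)
{
	assert(st && inst);
	inst->next = st->instances;
	st->instances = inst;
}

static void mock_remove_instance(struct mock_state *st,
				 struct mock_instance *inst)
{
	struct mock_instance **pp = &st->instances;

	/* Walk the list until we find the link pointing at 'inst'. */
	while (*pp && *pp != inst)
		pp = &(*pp)->next;
	if (*pp)
		*pp = inst->next;
}

/*
 * Returns true when the state could be removed cleanly, false when
 * instances were still linked (the case the kernel warns about).
 */
static bool mock_remove_state(struct mock_state *st)
{
	bool clean = (st->instances == NULL);

	st->instances = NULL;
	return clean;
}
```

Replaying the race in this model: the exit path removes the original instance,
a concurrent store adds a fresh one, and the final mock_remove_state() then
reports that instances were left.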

Fix all this by providing a zram initialization boolean, protected
by the driver's shared zram_index_mutex, which we can use to
annotate when sysfs attributes are safe to use or not -- that is,
once the driver is properly initialized. When the driver is going
down we also make sure not to let userspace muck with attributes
which may affect each per-CPU struct zcomp.
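
The gating pattern can be sketched in userspace as follows, assuming a pthread
mutex in place of zram_index_mutex. All mock_* names are hypothetical and the
"instance" counter merely stands in for per-device state; this is a sketch of
the idea, not the driver code.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t mock_index_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool mock_up;		/* guarded by mock_index_mutex */
static int mock_instances;	/* guarded by mock_index_mutex */

/* Models a sysfs store op: refuse to touch state unless the driver is up. */
static int mock_attr_store(void)
{
	int err = 0;

	pthread_mutex_lock(&mock_index_mutex);
	if (!mock_up) {
		err = -ENODEV;
		goto out;
	}
	mock_instances++;	/* stands in for (re)creating an instance */
out:
	pthread_mutex_unlock(&mock_index_mutex);
	return err;
}

/* Models init: create the default device, then mark the driver usable. */
static void mock_driver_init(void)
{
	pthread_mutex_lock(&mock_index_mutex);
	assert(!mock_up);
	mock_instances = 1;
	mock_up = true;
	pthread_mutex_unlock(&mock_index_mutex);
}

/* Models exit: flip the flag first so later stores cannot add instances. */
static void mock_driver_exit(void)
{
	pthread_mutex_lock(&mock_index_mutex);
	mock_up = false;
	mock_instances = 0;	/* stands in for tearing down every device */
	pthread_mutex_unlock(&mock_index_mutex);
}

static int mock_instance_count(void)
{
	int n;

	pthread_mutex_lock(&mock_index_mutex);
	n = mock_instances;
	pthread_mutex_unlock(&mock_index_mutex);
	return n;
}
```

Because the flag is only read and written under the same mutex, a store that
runs after exit has started sees it cleared and returns -ENODEV, so nothing is
left behind for the later callback removal to trip over.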

This also fixes a series of possible memory leaks. The
crashes and memory leaks can easily be caused by running
the zram02.sh script from the LTP project [0] in a loop
in two separate windows:

  cd testcases/kernel/device-drivers/zram
  while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done

You end up with a splat as follows:

kernel: zram: Removed device: zram0
kernel: zram: Added device: zram0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: Adding 104857596k swap on /dev/zram0.  <etc>
kernel: zram0: detected capacity change from 209715200 to 0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: ------------[ cut here ]------------
kernel: Error: Removing state 63 which has instances left.
kernel: WARNING: CPU: 7 PID: 70457 at \
	kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G            \
	E     5.12.0-rc1-next-20210304 #3
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), \
	BIOS 1.14.0-2 04/01/2014
kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Code: <etc>
kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: ffffa800c139bcb8
kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:<etc>
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
kernel: Call Trace:
kernel: __cpuhp_remove_state+0x2e/0x80
kernel: __do_sys_delete_module+0x190/0x2a0
kernel:  do_syscall_64+0x33/0x80
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae

The "Error: Removing state 63 which has instances left" refers
to the zram per CPU struct zcomp instances left.

[0] https://github.com/linux-test-project/ltp.git

Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/block/zram/zram_drv.c | 63 ++++++++++++++++++++++++++++++-----
 1 file changed, 55 insertions(+), 8 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f61910c65f0f..b26abcb955cc 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex);
 static int zram_major;
 static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
 
+static bool zram_up;
+
 /* Module params (documentation at end) */
 static unsigned int num_devices = 1;
 /*
@@ -1704,6 +1706,7 @@ static void zram_reset_device(struct zram *zram)
 	comp = zram->comp;
 	disksize = zram->disksize;
 	zram->disksize = 0;
+	zram->comp = NULL;
 
 	set_capacity_and_notify(zram->disk, 0);
 	part_stat_set_all(zram->disk->part0, 0);
@@ -1724,9 +1727,18 @@ static ssize_t disksize_store(struct device *dev,
 	struct zram *zram = dev_to_zram(dev);
 	int err;
 
+	mutex_lock(&zram_index_mutex);
+
+	if (!zram_up) {
+		err = -ENODEV;
+		goto out;
+	}
+
 	disksize = memparse(buf, NULL);
-	if (!disksize)
-		return -EINVAL;
+	if (!disksize) {
+		err = -EINVAL;
+		goto out;
+	}
 
 	down_write(&zram->init_lock);
 	if (init_done(zram)) {
@@ -1754,12 +1766,16 @@ static ssize_t disksize_store(struct device *dev,
 	set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
 	up_write(&zram->init_lock);
 
+	mutex_unlock(&zram_index_mutex);
+
 	return len;
 
 out_free_meta:
 	zram_meta_free(zram, disksize);
 out_unlock:
 	up_write(&zram->init_lock);
+out:
+	mutex_unlock(&zram_index_mutex);
 	return err;
 }
 
@@ -1775,8 +1791,17 @@ static ssize_t reset_store(struct device *dev,
 	if (ret)
 		return ret;
 
-	if (!do_reset)
-		return -EINVAL;
+	mutex_lock(&zram_index_mutex);
+
+	if (!zram_up) {
+		len = -ENODEV;
+		goto out;
+	}
+
+	if (!do_reset) {
+		len = -EINVAL;
+		goto out;
+	}
 
 	zram = dev_to_zram(dev);
 	bdev = zram->disk->part0;
@@ -1785,7 +1810,8 @@ static ssize_t reset_store(struct device *dev,
 	/* Do not reset an active device or claimed device */
 	if (bdev->bd_openers || zram->claim) {
 		mutex_unlock(&bdev->bd_disk->open_mutex);
-		return -EBUSY;
+		len = -EBUSY;
+		goto out;
 	}
 
 	/* From now on, anyone can't open /dev/zram[0-9] */
@@ -1800,6 +1826,8 @@ static ssize_t reset_store(struct device *dev,
 	zram->claim = false;
 	mutex_unlock(&bdev->bd_disk->open_mutex);
 
+out:
+	mutex_unlock(&zram_index_mutex);
 	return len;
 }
 
@@ -2010,6 +2038,10 @@ static ssize_t hot_add_show(struct class *class,
 	int ret;
 
 	mutex_lock(&zram_index_mutex);
+	if (!zram_up) {
+		mutex_unlock(&zram_index_mutex);
+		return -ENODEV;
+	}
 	ret = zram_add();
 	mutex_unlock(&zram_index_mutex);
 
@@ -2037,6 +2069,11 @@ static ssize_t hot_remove_store(struct class *class,
 
 	mutex_lock(&zram_index_mutex);
 
+	if (!zram_up) {
+		ret = -ENODEV;
+		goto out;
+	}
+
 	zram = idr_find(&zram_index_idr, dev_id);
 	if (zram) {
 		ret = zram_remove(zram);
@@ -2046,6 +2083,7 @@ static ssize_t hot_remove_store(struct class *class,
 		ret = -ENODEV;
 	}
 
+out:
 	mutex_unlock(&zram_index_mutex);
 	return ret ? ret : count;
 }
@@ -2072,12 +2110,15 @@ static int zram_remove_cb(int id, void *ptr, void *data)
 
 static void destroy_devices(void)
 {
+	mutex_lock(&zram_index_mutex);
+	zram_up = false;
 	class_unregister(&zram_control_class);
 	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
 	zram_debugfs_destroy();
 	idr_destroy(&zram_index_idr);
 	unregister_blkdev(zram_major, "zram");
 	cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
+	mutex_unlock(&zram_index_mutex);
 }
 
 static int __init zram_init(void)
@@ -2105,15 +2146,21 @@ static int __init zram_init(void)
 		return -EBUSY;
 	}
 
+	mutex_lock(&zram_index_mutex);
+
 	while (num_devices != 0) {
-		mutex_lock(&zram_index_mutex);
 		ret = zram_add();
-		mutex_unlock(&zram_index_mutex);
-		if (ret < 0)
+		if (ret < 0) {
+			mutex_unlock(&zram_index_mutex);
 			goto out_error;
+		}
 		num_devices--;
 	}
 
+	zram_up = true;
+
+	mutex_unlock(&zram_index_mutex);
+
 	return 0;
 
 out_error:
-- 
2.30.2



* [PATCH v8 12/12] zram: use ATTRIBUTE_GROUPS to fix sysfs deadlock module removal
  2021-09-27 16:37 [PATCH v8 00/12] syfs: generic deadlock fix with module removal Luis Chamberlain
                   ` (10 preceding siblings ...)
  2021-09-27 16:38 ` [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate Luis Chamberlain
@ 2021-09-27 16:38 ` Luis Chamberlain
  2021-10-05 20:57   ` Kees Cook
  11 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-09-27 16:38 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: bvanassche, dan.j.williams, joe, tglx, mcgrof, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel

ATTRIBUTE_GROUPS is typically used to avoid boilerplate code which
would otherwise be repeated in many drivers. Embracing ATTRIBUTE_GROUPS
was long overdue on the zram driver; in addition, a recent fix for sysfs
allows users of ATTRIBUTE_GROUPS to also associate a module with the
group attribute.

In zram's case this also allows us to fix a race which triggers
a deadlock on the zram driver. This deadlock happens when a sysfs attribute
uses a lock also used on module removal. This happens when, for instance, a
sysfs file on a driver is used at the same time as module removal is
triggered. The module removal code holds a lock, and then
the sysfs file entry waits for the same lock. While holding the lock the
module removal tries to remove the sysfs entries, but these cannot be
removed yet as one is waiting for a lock. This won't complete as the lock
is already held. Likewise module removal cannot complete, and so we
deadlock.

Sysfs fixes this as follows: when the group attributes have a module
associated with them, sysfs will *try* to get a refcount on the module,
when a shared lock is used, prior to mucking with a sysfs attribute. If
this fails we just give up right away.
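
The "try" semantics above can be modelled in userspace. This is only a
rough sketch, not the sysfs or module code: struct fake_module, REF_BASE
and all fake_*() helpers are hypothetical names made up for illustration.
It assumes a single counter which starts at a base value, where removal
may only begin if nobody else holds a reference, and where a get fails
once removal has begun:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define REF_BASE 1	/* hypothetical: the count held while the module is live */

struct fake_module {
	atomic_int refcnt;
};

static void fake_module_init(struct fake_module *m)
{
	atomic_init(&m->refcnt, REF_BASE);
}

/* Take a reference unless removal has already begun (count dropped to 0). */
static bool fake_try_get(struct fake_module *m)
{
	int old = atomic_load(&m->refcnt);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&m->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

static void fake_put(struct fake_module *m)
{
	int old = atomic_fetch_sub(&m->refcnt, 1);

	assert(old > REF_BASE);	/* a put must pair with an earlier get */
}

/* Removal may only begin when no attribute operation holds a reference. */
static bool fake_begin_removal(struct fake_module *m)
{
	int expected = REF_BASE;

	return atomic_compare_exchange_strong(&m->refcnt, &expected, 0);
}

/* Models a sysfs store: give up right away if the module is going away. */
static int fake_store(struct fake_module *m)
{
	if (!fake_try_get(m))
		return -1;	/* think -ENODEV */
	/* The driver's store op, which may take the driver lock, runs here. */
	fake_put(m);
	return 0;
}
```

In this model a store which arrives after removal has begun never reaches
the driver lock, and removal cannot begin while a store is in flight, so
the two sides never wait on each other.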

This deadlock was first reported with the zram driver. A sketch of how
this can happen follows:

CPU A                              CPU B
                                   whatever_store()
module_unload
  mutex_lock(foo)
                                   mutex_lock(foo)
   del_gendisk(zram->disk);
     device_del()
       device_remove_groups()

In this situation whatever_store() is waiting for the mutex foo to
become unlocked, but that won't happen until module removal is complete.
But module removal won't complete until the sysfs file being poked
completes which is waiting for a lock already held.

This issue can be reproduced easily on the zram driver as follows:

Loop 1 on one terminal:

while true;
	do modprobe zram;
	modprobe -r zram;
done

Loop 2 on a second terminal:
while true; do
	echo 1024 >  /sys/block/zram0/disksize;
	echo 1 > /sys/block/zram0/reset;
done

Without this patch we end up in a deadlock, and the following
stack trace is produced which hints to us what the issue was:

INFO: task bash:888 blocked for more than 120 seconds.
      Tainted: G            E 5.12.0-rc1-next-20210304+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:bash            state:D stack:    0 pid:  888 ppid: 887 flags:<etc>
Call Trace:
 __schedule+0x2e4/0x900
 schedule+0x46/0xb0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.constprop.0+0x2c3/0x490
 ? _kstrtoull+0x35/0xd0
 reset_store+0x6c/0x160 [zram]
 kernfs_fop_write_iter+0x124/0x1b0
 new_sync_write+0x11c/0x1b0
 vfs_write+0x1c2/0x260
 ksys_write+0x5f/0xe0
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f34f2c3df33
RSP: 002b:00007ffe751df6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f34f2c3df33
RDX: 0000000000000002 RSI: 0000561ccb06ec10 RDI: 0000000000000001
RBP: 0000561ccb06ec10 R08: 000000000000000a R09: 0000000000000001
R10: 0000561ccb157590 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f34f2d0e6a0 R14: 0000000000000002 R15: 00007f34f2d0e8a0
INFO: task modprobe:1104 can't die for more than 120 seconds.
task:modprobe        state:D stack:    0 pid: 1104 ppid: 916 flags:<etc>
Call Trace:
 __schedule+0x2e4/0x900
 schedule+0x46/0xb0
 __kernfs_remove.part.0+0x228/0x2b0
 ? finish_wait+0x80/0x80
 kernfs_remove_by_name_ns+0x50/0x90
 remove_files+0x2b/0x60
 sysfs_remove_group+0x38/0x80
 sysfs_remove_groups+0x29/0x40
 device_remove_attrs+0x4a/0x80
 device_del+0x183/0x3e0
 ? mutex_lock+0xe/0x30
 del_gendisk+0x27a/0x2d0
 zram_remove+0x8a/0xb0 [zram]
 ? hot_remove_store+0xf0/0xf0 [zram]
 zram_remove_cb+0xd/0x10 [zram]
 idr_for_each+0x5e/0xd0
 destroy_devices+0x39/0x6f [zram]
 __do_sys_delete_module+0x190/0x2a0
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f32adf727d7
RSP: 002b:00007ffc08bb38a8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 000055eea23cbb10 RCX: 00007f32adf727d7
RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055eea23cbb78
RBP: 000055eea23cbb10 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f32adfe5ac0 R11: 0000000000000206 R12: 000055eea23cbb78
R13: 0000000000000000 R14: 0000000000000000 R15: 000055eea23cbc20

[0] https://lkml.kernel.org/r/20210401235925.GR4332@42.do-not-panic.com

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/block/zram/zram_drv.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index b26abcb955cc..60a55ae8cd91 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1902,14 +1902,7 @@ static struct attribute *zram_disk_attrs[] = {
 	NULL,
 };
 
-static const struct attribute_group zram_disk_attr_group = {
-	.attrs = zram_disk_attrs,
-};
-
-static const struct attribute_group *zram_disk_attr_groups[] = {
-	&zram_disk_attr_group,
-	NULL,
-};
+ATTRIBUTE_GROUPS(zram_disk);
 
 /*
  * Allocate and initialize new zram device. the function returns
@@ -1981,7 +1974,7 @@ static int zram_add(void)
 		blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
 
 	blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, zram->disk->queue);
-	device_add_disk(NULL, zram->disk, zram_disk_attr_groups);
+	device_add_disk(NULL, zram->disk, zram_disk_groups);
 
 	strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
 
-- 
2.30.2



* Re: [PATCH v8 09/12] sysfs: fix deadlock race with module removal
  2021-09-27 16:38 ` [PATCH v8 09/12] sysfs: fix deadlock race with module removal Luis Chamberlain
@ 2021-10-05  9:24   ` Ming Lei
  2021-10-11 21:25     ` Luis Chamberlain
  2021-10-05 20:50   ` Kees Cook
  1 sibling, 1 reply; 94+ messages in thread
From: Ming Lei @ 2021-10-05  9:24 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel

On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> When driver sysfs attributes use a lock also used on module removal we
> can race to deadlock. This happens when for instance a sysfs file on
> a driver is used, then at the same time we have module removal call
> trigger. The module removal call code holds a lock, and then the
> driver's sysfs file entry waits for the same lock. While holding the
> lock the module removal tries to remove the sysfs entries, but these
> cannot be removed yet as one is waiting for a lock. This won't complete
> as the lock is already held. Likewise module removal cannot complete,
> and so we deadlock.
> 
> This can now be easily reproduced with our sysfs selftest as follows:
> 
> ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
> 
> This uses a local driver lock. Test 0028 can also be used, which uses
> the rtnl_lock():
> 
> ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
> 
> To fix this we extend the struct kernfs_node with a module reference
> and use the try_module_get() after kernfs_get_active() is called. As
> documented in the prior patch, we now know that once kernfs_get_active()
> is called the module is implicitly guarded to exist and cannot be removed.
> This is because the module is the one in charge of removing the same
> sysfs file it created, and removal of sysfs files on module exit will wait
> until they don't have any active references. By using a try_module_get()
> after kernfs_get_active() we yield to let module removal trump calls to
> process a sysfs operation, while also preventing module removal if a sysfs
> operation is already in progress. This prevents the deadlock.
> 
> This deadlock was first reported with the zram driver, however the live

I don't see the lock pattern you mentioned in the zram driver. Can you
share the related zram code?

> patching folks have acknowledged they have observed this as well with
> live patching, when a live patch is removed. I was then able to
> reproduce easily by creating a dedicated selftest for it.
> 
> A sketch of how this can happen follows, consider foo a local mutex
> part of a driver, and used on the driver's module exit routine and
> on one of its sysfs ops:
> 
> foo.c:
> static DEFINE_MUTEX(foo);
> static ssize_t foo_store(struct device *dev,
> 			 struct device_attribute *attr,
> 			 const char *buf, size_t count)
> {
> 	...
> 	mutex_lock(&foo);
> 	...
> 	mutex_unlock(&foo);
> 	...
> }
> static DEVICE_ATTR_RW(foo);
> ...
> void foo_exit(void)
> {
> 	mutex_lock(&foo);
> 	...
> 	mutex_unlock(&foo);
> }
> module_exit(foo_exit);
> 
> And this can lead to this condition:
> 
> CPU A                              CPU B
>                                    foo_store()
> foo_exit()
>   mutex_lock(&foo)
>                                    mutex_lock(&foo)
>    del_gendisk(some_struct->disk);
>      device_del()
>        device_remove_groups()

I guess the deadlock exists if foo_exit() is called anywhere. If so, it
looks like the issue may not be directly related to module removal, right?



Thanks,
Ming



* Re: [PATCH v8 03/12] selftests: add tests_sysfs module
  2021-09-27 16:37 ` [PATCH v8 03/12] selftests: add tests_sysfs module Luis Chamberlain
@ 2021-10-05 14:16   ` Greg KH
  2021-10-05 16:57     ` Tim.Bird
  2021-10-11 17:38     ` Luis Chamberlain
  2021-10-07 14:23   ` Miroslav Benes
       [not found]   ` <202110050912.3DF681ED@keescook>
  2 siblings, 2 replies; 94+ messages in thread
From: Greg KH @ 2021-10-05 14:16 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, akpm, minchan, jeyu, shuah, bvanassche, dan.j.williams, joe,
	tglx, keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel

On Mon, Sep 27, 2021 at 09:37:56AM -0700, Luis Chamberlain wrote:
> --- /dev/null
> +++ b/lib/test_sysfs.c
> @@ -0,0 +1,921 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
> +/*
> + * sysfs test driver
> + *
> + * Copyright (C) 2021 Luis Chamberlain <mcgrof@kernel.org>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the Free
> + * Software Foundation; either version 2 of the License, or at your option any
> + * later version; or, when distributed separately from the Linux kernel or
> + * when incorporated into other software packages, subject to the following
> + * license:
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of copyleft-next (version 0.3.1 or later) as published
> + * at http://copyleft-next.org/.

Independent of the fact that I don't like sysfs code attempting to be
accessed in the kernel with licenses other than GPLv2, you do not need
the license "boilerplate" text at all in files.  That's what the SPDX
line is for.

thanks,

greg k-h


* Re: [PATCH v8 08/12] fs/sysfs/dir.c: replace S_IRWXU|S_IRUGO|S_IXUGO with 0755 sysfs_create_dir_ns()
  2021-09-27 16:38 ` [PATCH v8 08/12] fs/sysfs/dir.c: replace S_IRWXU|S_IRUGO|S_IXUGO with 0755 sysfs_create_dir_ns() Luis Chamberlain
@ 2021-10-05 16:05   ` Kees Cook
  0 siblings, 0 replies; 94+ messages in thread
From: Kees Cook @ 2021-10-05 16:05 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, rostedt, linux-spdx, linux-doc,
	linux-block, linux-fsdevel, linux-kselftest, linux-kernel

On Mon, Sep 27, 2021 at 09:38:01AM -0700, Luis Chamberlain wrote:
> If one ends up expanding on this line checkpatch will complain that the
> combination S_IRWXU|S_IRUGO|S_IXUGO should just be replaced with the
> octal 0755. Do that.
> 
> This makes no functional changes.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

It could be helpful to add a link too:
https://www.kernel.org/doc/html/latest/dev-tools/checkpatch.html?highlight=non_octal#permissions
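
The equivalence itself can be checked from userspace. A minimal sketch,
assuming the kernel's definitions of S_IRUGO and S_IXUGO (the C library
headers generally do not provide them, so they are redefined here), with
sysfs_dir_mode_symbolic() being a made-up helper name:

```c
#include <assert.h>
#include <sys/stat.h>

/* Assumed to match the kernel's combined permission macros. */
#ifndef S_IRUGO
#define S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)
#endif
#ifndef S_IXUGO
#define S_IXUGO (S_IXUSR | S_IXGRP | S_IXOTH)
#endif

/* The symbolic mode the patch replaces with an octal constant. */
static unsigned int sysfs_dir_mode_symbolic(void)
{
	return S_IRWXU | S_IRUGO | S_IXUGO;
}
```

Comparing the return value against 0755 confirms the change is purely
cosmetic: rwx for the owner, r-x for group and others.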

Reviewed-by: Kees Cook <keescook@chromium.org>

> ---
>  fs/sysfs/dir.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
> index 59dffd5ca517..b6b6796e1616 100644
> --- a/fs/sysfs/dir.c
> +++ b/fs/sysfs/dir.c
> @@ -56,8 +56,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
>  
>  	kobject_get_ownership(kobj, &uid, &gid);
>  
> -	kn = kernfs_create_dir_ns(parent, kobject_name(kobj),
> -				  S_IRWXU | S_IRUGO | S_IXUGO, uid, gid,
> +	kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
>  				  kobj, ns);
>  	if (IS_ERR(kn)) {
>  		if (PTR_ERR(kn) == -EEXIST)
> -- 
> 2.30.2
> 

-- 
Kees Cook


* Re: [PATCH v8 02/12] testing: use the copyleft-next-0.3.1 SPDX tag
  2021-09-27 16:37 ` [PATCH v8 02/12] testing: use the copyleft-next-0.3.1 SPDX tag Luis Chamberlain
@ 2021-10-05 16:11   ` Kees Cook
  0 siblings, 0 replies; 94+ messages in thread
From: Kees Cook @ 2021-10-05 16:11 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, rostedt, linux-spdx, linux-doc,
	linux-block, linux-fsdevel, linux-kselftest, linux-kernel,
	Goldwyn Rodrigues, Kuno Woudt, Richard Fontana, copyleft-next,
	Ciaran Farrell, Christopher De Nicolo, Christoph Hellwig,
	Jonathan Corbet, Thorsten Leemhuis

On Mon, Sep 27, 2021 at 09:37:55AM -0700, Luis Chamberlain wrote:
> Two selftests drivers exist under the copyleft-next license.
> These drivers were added prior to SPDX practice taking full swing
> in the kernel. Now that we have an SPDX tag for copyleft-next-0.3.1
> documented, embrace it and remove the boiler plate.
> 
> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Cc: Kuno Woudt <kuno@frob.nl>
> Cc: Richard Fontana <fontana@sharpeleven.org>
> Cc: copyleft-next@lists.fedorahosted.org
> Cc: Ciaran Farrell <Ciaran.Farrell@suse.com>
> Cc: Christopher De Nicolo <Christopher.DeNicolo@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Thorsten Leemhuis <linux@leemhuis.info>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

You're the primary author, and it cleans up boilerplate, so LGTM.

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook


* RE: [PATCH v8 03/12] selftests: add tests_sysfs module
  2021-10-05 14:16   ` Greg KH
@ 2021-10-05 16:57     ` Tim.Bird
  2021-10-11 17:40       ` Luis Chamberlain
  2021-10-11 17:38     ` Luis Chamberlain
  1 sibling, 1 reply; 94+ messages in thread
From: Tim.Bird @ 2021-10-05 16:57 UTC (permalink / raw)
  To: gregkh, mcgrof
  Cc: tj, akpm, minchan, jeyu, shuah, bvanassche, dan.j.williams, joe,
	tglx, keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel



> -----Original Message-----
> From: Greg KH <gregkh@linuxfoundation.org>
> Sent: Tuesday, October 5, 2021 8:17 AM
> To: Luis Chamberlain <mcgrof@kernel.org>
> Cc: tj@kernel.org; akpm@linux-foundation.org; minchan@kernel.org; jeyu@kernel.org; shuah@kernel.org; bvanassche@acm.org;
> dan.j.williams@intel.com; joe@perches.com; tglx@linutronix.de; keescook@chromium.org; rostedt@goodmis.org; linux-
> spdx@vger.kernel.org; linux-doc@vger.kernel.org; linux-block@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-
> kselftest@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v8 03/12] selftests: add tests_sysfs module
> 
> On Mon, Sep 27, 2021 at 09:37:56AM -0700, Luis Chamberlain wrote:
> > --- /dev/null
> > +++ b/lib/test_sysfs.c
> > @@ -0,0 +1,921 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
> > +/*
> > + * sysfs test driver
> > + *
> > + * Copyright (C) 2021 Luis Chamberlain <mcgrof@kernel.org>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms of the GNU General Public License as published by the Free
> > + * Software Foundation; either version 2 of the License, or at your option any
> > + * later version; or, when distributed separately from the Linux kernel or
> > + * when incorporated into other software packages, subject to the following
> > + * license:

This is a very strange license grant, which I'm not sure is covered by any
current SPDX syntax.
" when distributed separately from the Linux kernel or when incorporated into
other software packages, subject to the following license:"

Why would we care about the license used when the code is used in a non-kernel
project?  If it is desired for the code to be available outside the kernel under a
different license, then surely the easiest thing is to make it available separately
under that license.  I'm not sure why the kernel needs to carry this license for
non-kernel use of the code.

I would recommend giving this a GPLv2 SPDX header, and maybe in the comment
at the top of the file put a reference to a git repository where the code can be
obtained under a different license.

Just my 2 cents.
 -- Tim

> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms of copyleft-next (version 0.3.1 or later) as published
> > + * at http://copyleft-next.org/.
> 
> Independent of the fact that I don't like sysfs code attempting to be
> accessed in the kernel with licenses other than GPLv2, you do not need
> the license "boilerplate" text at all in files.  That's what the SPDX
> line is for.
> 
> thanks,
> 
> greg k-h


* Re: [PATCH v8 04/12] kernfs: add initial failure injection support
  2021-09-27 16:37 ` [PATCH v8 04/12] kernfs: add initial failure injection support Luis Chamberlain
@ 2021-10-05 19:47   ` Kees Cook
  2021-10-11 20:44     ` Luis Chamberlain
  0 siblings, 1 reply; 94+ messages in thread
From: Kees Cook @ 2021-10-05 19:47 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, rostedt, linux-spdx, linux-doc,
	linux-block, linux-fsdevel, linux-kselftest, linux-kernel

On Mon, Sep 27, 2021 at 09:37:57AM -0700, Luis Chamberlain wrote:
> This adds initial failure injection support to kernfs. We start
> off with debug knobs which when enabled allow test drivers, such as
> test_sysfs, to then make use of these to try to force certain
> difficult races to take place with a high degree of certainty.
> 
> This only adds runtime code *iff* the new bool CONFIG_FAIL_KERNFS_KNOBS is
> enabled in your kernel. If you don't have this enabled this provides
> no new functionality. When CONFIG_FAIL_KERNFS_KNOBS is disabled the new
> routine kernfs_debug_should_wait() ends up being transformed to if
> (false), and so the compiler should optimize these out as dead code
> producing no new effective binary changes.
> 
> We start off with enabling failure injections in kernfs by allowing us to
> alter the way kernfs_fop_write_iter() behaves. We allow for the routine
> kernfs_fop_write_iter() to wait for a certain condition in the kernel to
> occur, after which it will sleep a predefined amount of time. This lets
> kernfs users time exactly when they want kernfs_fop_write_iter() to
> complete, allowing them to construct race conditions and test for
> correctness in kernfs.
> 
> You'd boot with this enabled on your kernel command line:
> 
> fail_kernfs_fop_write_iter=1,100,0,1
> 
> The values are <interval,probability,space,times>, we don't care for
> space, so for now we ignore it. The above ensures a failure will trigger
> only once.
> 
> *How* we allow for this routine to change behaviour is left to knobs we
> expose under debugfs:
> 
>  # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/

I'd expect this to live under /sys/kernel/debug/fail_kernfs, like the
other fault injectors.

> wait_after_active
> wait_after_mutex
> wait_at_start
> wait_before_mutex
> 
> A debugfs entry also exists to allow us to sleep a configurable amount
> of time after the completion:
> 
> /sys/kernel/debug/kernfs/sleep_after_wait_ms
> 
> These two sets of knobs allow us to construct races and demonstrate
> how the kernfs active reference should suffice to protect against
> races.
> 
> Enabling CONFIG_FAULT_INJECTION_DEBUG_FS enables us to configure the
> different fault injection parameters for the new fail_kernfs_fop_write_iter
> fault injection at run time:
> 
> ls -1 /sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter/
> interval
> probability
> space
> times
> task-filter
> verbose
> verbose_ratelimit_burst
> verbose_ratelimit_interval_ms
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  .../fault-injection/fault-injection.rst       | 22 +++++
>  MAINTAINERS                                   |  2 +-
>  fs/kernfs/Makefile                            |  1 +
>  fs/kernfs/failure-injection.c                 | 91 +++++++++++++++++++
>  fs/kernfs/file.c                              | 13 +++
>  fs/kernfs/kernfs-internal.h                   | 72 +++++++++++++++
>  include/linux/kernfs.h                        |  5 +
>  lib/Kconfig.debug                             | 10 ++
>  8 files changed, 215 insertions(+), 1 deletion(-)
>  create mode 100644 fs/kernfs/failure-injection.c
> 
> diff --git a/Documentation/fault-injection/fault-injection.rst b/Documentation/fault-injection/fault-injection.rst
> index 4a25c5eb6f07..d4d34b082f47 100644
> --- a/Documentation/fault-injection/fault-injection.rst
> +++ b/Documentation/fault-injection/fault-injection.rst
> @@ -28,6 +28,28 @@ Available fault injection capabilities
>  
>    injects kernel RPC client and server failures.
>  
> +- fail_kernfs_fop_write_iter
> +
> +  Allows for failures to be enabled inside kernfs_fop_write_iter(). Enabling
> +  this does not immediately enable any errors to occur. You must configure
> +  how you want this routine to fail or change behaviour by using the debugfs
> +  knobs for it:
> +
> +  # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
> +  wait_after_active
> +  wait_after_mutex
> +  wait_at_start
> +  wait_before_mutex

This should be split up and detailed in the "debugfs entries" section
below here.

> +
> +  You can also configure how long to sleep after a wait under
> +
> +  /sys/kernel/debug/kernfs/sleep_after_wait_ms
> +
> +  If you enable CONFIG_FAULT_INJECTION_DEBUG_FS the fail_kernfs_fop_write_iter
> +  failure injection parameters are placed under:
> +
> +  /sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter/
> +
>  - fail_make_request
>  
>    injects disk IO errors on devices permitted by setting
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1b4cefcb064c..fadfd961ad80 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -10384,7 +10384,7 @@ M:	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>  M:	Tejun Heo <tj@kernel.org>
>  S:	Supported
>  T:	git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git
> -F:	fs/kernfs/
> +F:	fs/kernfs/*
>  F:	include/linux/kernfs.h
>  
>  KEXEC
> diff --git a/fs/kernfs/Makefile b/fs/kernfs/Makefile
> index 4ca54ff54c98..bc5b32ca39f9 100644
> --- a/fs/kernfs/Makefile
> +++ b/fs/kernfs/Makefile
> @@ -4,3 +4,4 @@
>  #
>  
>  obj-y		:= mount.o inode.o dir.o file.o symlink.o
> +obj-$(CONFIG_FAIL_KERNFS_KNOBS)    += failure-injection.o
> diff --git a/fs/kernfs/failure-injection.c b/fs/kernfs/failure-injection.c
> new file mode 100644
> index 000000000000..4130d202c13b
> --- /dev/null
> +++ b/fs/kernfs/failure-injection.c

I'd name this fault_inject.c, which matches the more common case:

$ find . -type f -name '*fault*inject*.c'
./fs/nfsd/fault_inject.c
./drivers/nvme/host/fault_inject.c
./drivers/scsi/ufs/ufs-fault-injection.c
./lib/fault-inject.c
./lib/fault-inject-usercopy.c

> @@ -0,0 +1,91 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/fault-inject.h>
> +#include <linux/delay.h>
> +
> +#include "kernfs-internal.h"
> +
> +static DECLARE_FAULT_ATTR(fail_kernfs_fop_write_iter);
> +struct kernfs_config_fail kernfs_config_fail;
> +
> +#define kernfs_config_fail(when) \
> +	kernfs_config_fail.kernfs_fop_write_iter_fail.wait_ ## when
> +
> +#define kernfs_config_fail(when) \
> +	kernfs_config_fail.kernfs_fop_write_iter_fail.wait_ ## when
> +
> +static int __init setup_fail_kernfs_fop_write_iter(char *str)
> +{
> +	return setup_fault_attr(&fail_kernfs_fop_write_iter, str);
> +}
> +
> +__setup("fail_kernfs_fop_write_iter=", setup_fail_kernfs_fop_write_iter);
> +
> +struct dentry *kernfs_debugfs_root;
> +struct dentry *config_fail_kernfs_fop_write_iter;
> +
> +static int __init kernfs_init_failure_injection(void)
> +{
> +	kernfs_config_fail.sleep_after_wait_ms = 100;
> +	kernfs_debugfs_root = debugfs_create_dir("kernfs", NULL);
> +
> +	fault_create_debugfs_attr("fail_kernfs_fop_write_iter",
> +				  kernfs_debugfs_root, &fail_kernfs_fop_write_iter);
> +
> +	config_fail_kernfs_fop_write_iter =
> +		debugfs_create_dir("config_fail_kernfs_fop_write_iter",
> +				   kernfs_debugfs_root);
> +
> +	debugfs_create_u32("sleep_after_wait_ms", 0600,
> +			   kernfs_debugfs_root,
> +			   &kernfs_config_fail.sleep_after_wait_ms);
> +
> +	debugfs_create_bool("wait_at_start", 0600,
> +			    config_fail_kernfs_fop_write_iter,
> +			    &kernfs_config_fail(at_start));
> +	debugfs_create_bool("wait_before_mutex", 0600,
> +			    config_fail_kernfs_fop_write_iter,
> +			    &kernfs_config_fail(before_mutex));
> +	debugfs_create_bool("wait_after_mutex", 0600,
> +			    config_fail_kernfs_fop_write_iter,
> +			    &kernfs_config_fail(after_mutex));
> +	debugfs_create_bool("wait_after_active", 0600,
> +			    config_fail_kernfs_fop_write_iter,
> +			    &kernfs_config_fail(after_active));
> +	return 0;
> +}
> +late_initcall(kernfs_init_failure_injection);
> +
> +int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate)
> +{
> +	if (!evaluate)
> +		return 0;
> +
> +	return should_fail(&fail_kernfs_fop_write_iter, 0);
> +}

Every caller ends up doing the wait, so how about just including that
here instead? It should make things much less intrusive and more readable.

And for the naming, other fault injectors use "should_fail_$topic", so
maybe better here would be something like may_wait_kernfs(...).

> +
> +DECLARE_COMPLETION(kernfs_debug_wait_completion);
> +EXPORT_SYMBOL_NS_GPL(kernfs_debug_wait_completion, KERNFS_DEBUG_PRIVATE);
> +
> +void kernfs_debug_wait(void)
> +{
> +	unsigned long timeout;
> +
> +	timeout = wait_for_completion_timeout(&kernfs_debug_wait_completion,
> +					      msecs_to_jiffies(3000));
> +	if (!timeout)
> +		pr_info("%s waiting for kernfs_debug_wait_completion timed out\n",
> +			__func__);
> +	else
> +		pr_info("%s received completion with time left on timeout %u ms\n",
> +			__func__, jiffies_to_msecs(timeout));
> +
> +	/**
> +	 * The goal is wait for an event, and *then* once we have
> +	 * reached it, the other side will try to do something which
> +	 * it thinks will break. So we must give it some time to do
> +	 * that. The amount of time is configurable.
> +	 */
> +	msleep(kernfs_config_fail.sleep_after_wait_ms);
> +	pr_info("%s ended\n", __func__);
> +}

All the uses of "__func__" here seem redundant; I would drop them.

> diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
> index 60e2a86c535e..4479c6580333 100644
> --- a/fs/kernfs/file.c
> +++ b/fs/kernfs/file.c
> @@ -259,6 +259,9 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
>  	const struct kernfs_ops *ops;
>  	char *buf;
>  
> +	if (kernfs_debug_should_wait(kernfs_fop_write_iter, at_start))
> +		kernfs_debug_wait();

So this could just be:

	may_wait_kernfs(kernfs_fop_write_iter, at_start);
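
To illustrate the shape I have in mind, here is a rough userspace sketch.
Everything in it is a stand-in: fake_knobs, fake_should_fail(),
fake_wait() and may_wait_fake() are hypothetical names rather than
anything in the patch, and fake_should_fail() always fires instead of
consulting a fault attribute. The point is only that the check and the
wait live in one helper, so each call site is a single line:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical knobs, mirroring two of the debugfs booleans. */
struct fake_knobs {
	bool wait_at_start;
	bool wait_before_mutex;
};

static struct fake_knobs fake_knobs;
static int fake_waits;	/* counts how many times we "waited" */

/* Stand-in for should_fail(); always fires in this sketch. */
static bool fake_should_fail(void)
{
	return true;
}

/* Stand-in for the wait routine; it only records the call. */
static void fake_wait(void)
{
	fake_waits++;
}

/* The check and the wait are folded into one helper. */
static void may_wait_fake_knob(bool knob)
{
	if (knob && fake_should_fail())
		fake_wait();
}

#define may_wait_fake(when) may_wait_fake_knob(fake_knobs.wait_ ## when)

/* A caller now needs one line per injection point. */
static int fake_write_iter(void)
{
	may_wait_fake(at_start);
	/* ... buffer setup would happen here ... */
	may_wait_fake(before_mutex);
	return 0;
}
```

With all knobs off, fake_write_iter() never waits; enabling one knob
makes exactly that injection point wait.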

> +
>  	if (of->atomic_write_len) {
>  		if (len > of->atomic_write_len)
>  			return -E2BIG;
> @@ -280,17 +283,27 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
>  	}
>  	buf[len] = '\0';	/* guarantee string termination */
>  
> +	if (kernfs_debug_should_wait(kernfs_fop_write_iter, before_mutex))
> +		kernfs_debug_wait();
> +
>  	/*
>  	 * @of->mutex nests outside active ref and is used both to ensure that
>  	 * the ops aren't called concurrently for the same open file.
>  	 */
>  	mutex_lock(&of->mutex);
> +
> +	if (kernfs_debug_should_wait(kernfs_fop_write_iter, after_mutex))
> +		kernfs_debug_wait();
> +
>  	if (!kernfs_get_active(of->kn)) {
>  		mutex_unlock(&of->mutex);
>  		len = -ENODEV;
>  		goto out_free;
>  	}
>  
> +	if (kernfs_debug_should_wait(kernfs_fop_write_iter, after_active))
> +		kernfs_debug_wait();
> +
>  	ops = kernfs_ops(of->kn);
>  	if (ops->write)
>  		len = ops->write(of, buf, len, iocb->ki_pos);
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index f9cc912c31e1..9e3abf597e2d 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -18,6 +18,7 @@
>  
>  #include <linux/kernfs.h>
>  #include <linux/fs_context.h>
> +#include <linux/stringify.h>
>  
>  struct kernfs_iattrs {
>  	kuid_t			ia_uid;
> @@ -147,4 +148,75 @@ void kernfs_drain_open_files(struct kernfs_node *kn);
>   */
>  extern const struct inode_operations kernfs_symlink_iops;
>  
> +/*
> + * failure-injection.c
> + */
> +#ifdef CONFIG_FAIL_KERNFS_KNOBS
> +
> +/**
> + * struct kernfs_fop_write_iter_fail - how kernfs_fop_write_iter_fail fails
> + *
> + * This lets you configure what part of kernfs_fop_write_iter() should behave
> + * in a specific way to allow userspace to capture possible failures in
> + * kernfs. The wait knobs let you construct possible race conditions
> + * which would otherwise be difficult to reproduce. A secondary driver
> + * would signal kernfs's wait completion when it is done.
> + *
> + * The point of the wait completion failure injection tests is to confirm
> + * that the kernfs active refcount suffices to ensure other objects in other
> + * layers are also guaranteed to exist, even if they are opaque to kernfs. This
> + * includes kobjects, devices, and other objects built on top of this, like
> + * the block layer when using sysfs block device attributes.
> + *
> + * @wait_at_start: waits for completion from a third party at the start of
> + *	the routine.
> + * @wait_before_mutex: waits for completion from a third party before we
> + *	are allowed to continue before the of->mutex is held.
> + * @wait_after_mutex: waits for completion from a third party after we
> + *	have held the of->mutex.
> + * @wait_after_active: waits for completion from a third party after we
> + *	have refcounted the struct kernfs_node.
> + */
> +struct kernfs_fop_write_iter_fail {
> +	bool wait_at_start;
> +	bool wait_before_mutex;
> +	bool wait_after_mutex;
> +	bool wait_after_active;
> +};
> +
> +/**
> + * struct kernfs_config_fail - kernfs configuration for failure injection
> + *
> + * You can enable kernfs failure injection on boot, and in particular we currently
> + * only support failures for kernfs_fop_write_iter(). However, we don't
> + * want to always enable errors on this call when failure injection is enabled
> + * as this routine is used by many parts of the kernel for proper functionality.
> + * The compromise we make is we let userspace start enabling which parts it
> + * wants to fail after boot, if and only if failure injection has been enabled.
> + *
> + * @kernfs_fop_write_iter_fail: configuration for how we want to allow
> + *	for failure injection on kernfs_fop_write_iter()
> + * @sleep_after_wait_ms: how many ms to wait after completion is received.
> + */
> +struct kernfs_config_fail {
> +	struct kernfs_fop_write_iter_fail kernfs_fop_write_iter_fail;
> +	u32 sleep_after_wait_ms;
> +};
> +
> +extern struct kernfs_config_fail kernfs_config_fail;
> +
> +#define __kernfs_config_wait_var(func, when) \
> +	(kernfs_config_fail.  func  ## _fail.wait_  ## when)
                            ^^     ^               ^
nit: needless spaces
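
To illustrate the nit: whitespace around `##` is not significant to the preprocessor, so dropping it changes nothing in the expansion. The following is a minimal userspace sketch under that assumption; the `demo_*` names and the trimmed struct copies are hypothetical stand-ins, not the kernel header.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, trimmed copies of the structs from the patch. */
struct demo_write_iter_fail {
	bool wait_at_start;
	bool wait_after_mutex;
};

struct demo_config_fail {
	struct demo_write_iter_fail kernfs_fop_write_iter_fail;
};

static struct demo_config_fail demo_config_fail;

/* Spelling as posted, with the extra spaces. */
#define demo_wait_var_spaced(func, when) \
	(demo_config_fail.  func  ## _fail.wait_  ## when)

/* Spelling suggested by the nit. */
#define demo_wait_var_tight(func, when) \
	(demo_config_fail.func##_fail.wait_##when)

/* Both macros must name the very same struct member. */
static void demo_selfcheck(void)
{
	assert(&demo_wait_var_spaced(kernfs_fop_write_iter, after_mutex) ==
	       &demo_wait_var_tight(kernfs_fop_write_iter, after_mutex));

	/* A write through one spelling is visible through the other. */
	demo_wait_var_tight(kernfs_fop_write_iter, at_start) = true;
	assert(demo_wait_var_spaced(kernfs_fop_write_iter, at_start));
}
```

Calling demo_selfcheck() from a small main() is enough to confirm the two spellings are interchangeable.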

> +#define __kernfs_debug_should_wait_func_name(func) __kernfs_debug_should_wait_## func
> +
> +#define kernfs_debug_should_wait(func, when) \
> +	__kernfs_debug_should_wait_func_name(func)(__kernfs_config_wait_var(func, when))
> +int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate);
> +void kernfs_debug_wait(void);
> +#else
> +static inline void kernfs_init_failure_injection(void) {}
> +#define kernfs_debug_should_wait(func, when) (false)
> +static inline void kernfs_debug_wait(void) {}
> +#endif
> +
>  #endif	/* __KERNFS_INTERNAL_H */
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 3ccce6f24548..cd968ee2b503 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -411,6 +411,11 @@ void kernfs_init(void);
>  
>  struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
>  						   u64 id);
> +
> +#ifdef CONFIG_FAIL_KERNFS_KNOBS
> +extern struct completion kernfs_debug_wait_completion;
> +#endif
> +
>  #else	/* CONFIG_KERNFS */
>  
>  static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index ae19bf1a21b8..a29b7d398c4e 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1902,6 +1902,16 @@ config FAULT_INJECTION_USERCOPY
>  	  Provides fault-injection capability to inject failures
>  	  in usercopy functions (copy_from_user(), get_user(), ...).
>  
> +config FAIL_KERNFS_KNOBS
> +	bool "Fault-injection support in kernfs"
> +	depends on FAULT_INJECTION
> +	help
> +	  Provide fault-injection capability for kernfs. This only enables
> +	  the error injection functionality. To use it you must configure which
> +	  path you want to trigger an error on using debugfs under
> +	  /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/. By
> +	  default all of these are disabled.
> +
>  config FAIL_MAKE_REQUEST
>  	bool "Fault-injection capability for disk IO"
>  	depends on FAULT_INJECTION && BLOCK
> -- 
> 2.30.2
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 05/12] test_sysfs: add support to use kernfs failure injection
  2021-09-27 16:37 ` [PATCH v8 05/12] test_sysfs: add support to use kernfs failure injection Luis Chamberlain
@ 2021-10-05 19:51   ` Kees Cook
  2021-10-11 20:56     ` Luis Chamberlain
  0 siblings, 1 reply; 94+ messages in thread
From: Kees Cook @ 2021-10-05 19:51 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, rostedt, linux-spdx, linux-doc,
	linux-block, linux-fsdevel, linux-kselftest, linux-kernel

On Mon, Sep 27, 2021 at 09:37:58AM -0700, Luis Chamberlain wrote:
> This extends test_sysfs with support for using the failure injection
> wait completion and knobs to force a few race conditions which
> demonstrates that kernfs active reference protection is sufficient
> for kobject / device protection at higher layers.
> 
> This adds 4 new tests which try to remove the device attribute
> store operation in 4 different situations:
> 
>   1) at the start of kernfs_fop_write_iter()
>   2) before the of->mutex is held in kernfs_fop_write_iter()
>   3) after the of->mutex is held in kernfs_fop_write_iter()
>   4) after the kernfs node active reference is taken
> 
> A write fails in all cases except the last one, test number #32. There
> is a good explanation for this: *once* kernfs_get_active() gets called
> we have a guarantee that the kernfs entry cannot be removed. If
> kernfs_get_active() succeeds that entry cannot be removed and so
> anything trying to remove that entry will have to wait. It is perhaps
> not obvious but since a sysfs write will trigger eventually a
> kernfs_get_active() call, and *only* if this succeeds will the sysfs
> op be called, this and the fact that you cannot remove the kernfs
> entry while the kernfs entry is active implies that a module that
> created the respective sysfs / kernfs entry *cannot* possibly be
> removed during a sysfs operation. And test number 32 provides us with
> proof of this. If it were not true test #32 should crash.
> 
> No null dereferences are reproduced, even though this has been observed
> in some complex testing cases [0]. If this issue really exists we should
> have enough tools on the sysfs_test toolbox now to try to reproduce
> this easily without having to poke around other drivers. It very likely
> was the case that the issue reported [0] was possibly a side issue after
> the first bug which was zram specific. This is why it is important to
> isolate the issue and try to reproduce it in a generic form using the
> test_sysfs driver.
> 
> [0] https://lkml.kernel.org/r/20210623215007.862787-1-mcgrof@kernel.org
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  lib/Kconfig.debug                      |   3 +
>  lib/test_sysfs.c                       |  31 +++++
>  tools/testing/selftests/sysfs/config   |   3 +
>  tools/testing/selftests/sysfs/sysfs.sh | 175 +++++++++++++++++++++++++
>  4 files changed, 212 insertions(+)
> 
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index a29b7d398c4e..176b822654e5 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -2358,6 +2358,9 @@ config TEST_SYSFS
>  	depends on SYSFS
>  	depends on NET
>  	depends on BLOCK
> +	select FAULT_INJECTION
> +	select FAULT_INJECTION_DEBUG_FS
> +	select FAIL_KERNFS_KNOBS

I don't like seeing "select" for user-configurable CONFIGs -- things
tend to end up weird. This should simply be:

	depends on FAIL_KERNFS_KNOBS

>  	help
>  	  This builds the "test_sysfs" module. This driver enables to test the
>  	  sysfs file system safely without affecting production knobs which
> diff --git a/lib/test_sysfs.c b/lib/test_sysfs.c
> index 2043ca494af8..c6e62de61403 100644
> --- a/lib/test_sysfs.c
> +++ b/lib/test_sysfs.c
> @@ -38,6 +38,11 @@
>  #include <linux/rtnetlink.h>
>  #include <linux/genhd.h>
>  #include <linux/blkdev.h>
> +#include <linux/kernfs.h>
> +
> +#ifdef CONFIG_FAIL_KERNFS_KNOBS

This isn't an optional config here (and following)?

> +MODULE_IMPORT_NS(KERNFS_DEBUG_PRIVATE);
> +#endif
>  
>  static bool enable_lock;
>  module_param(enable_lock, bool_enable_only, 0644);
> @@ -82,6 +87,13 @@ static bool enable_verbose_rmmod;
>  module_param(enable_verbose_rmmod, bool_enable_only, 0644);
>  MODULE_PARM_DESC(enable_verbose_rmmod, "enable verbose print messages on rmmod");
>  
> +#ifdef CONFIG_FAIL_KERNFS_KNOBS
> +static bool enable_completion_on_rmmod;
> +module_param(enable_completion_on_rmmod, bool_enable_only, 0644);
> +MODULE_PARM_DESC(enable_completion_on_rmmod,
> +		 "enable sending a kernfs completion on rmmod");
> +#endif
> +
>  static int sysfs_test_major;
>  
>  /**
> @@ -285,6 +297,12 @@ static ssize_t config_show(struct device *dev,
>  			"enable_verbose_writes:\t%s\n",
>  			enable_verbose_writes ? "true" : "false");
>  
> +#ifdef CONFIG_FAIL_KERNFS_KNOBS
> +	len += snprintf(buf+len, PAGE_SIZE - len,
> +			"enable_completion_on_rmmod:\t%s\n",
> +			enable_completion_on_rmmod ? "true" : "false");
> +#endif
> +
>  	test_dev_config_unlock(test_dev);
>  
>  	return len;
> @@ -904,10 +922,23 @@ static int __init test_sysfs_init(void)
>  }
>  module_init(test_sysfs_init);
>  
> +#ifdef CONFIG_FAIL_KERNFS_KNOBS
> +/* The goal is to race our device removal with a pending kernfs -> store call */
> +static void test_sysfs_kernfs_send_completion_rmmod(void)
> +{
> +	if (!enable_completion_on_rmmod)
> +		return;
> +	complete(&kernfs_debug_wait_completion);
> +}
> +#else
> +static inline void test_sysfs_kernfs_send_completion_rmmod(void) {}
> +#endif
> +
>  static void __exit test_sysfs_exit(void)
>  {
>  	if (enable_debugfs)
>  		debugfs_remove(debugfs_dir);
> +	test_sysfs_kernfs_send_completion_rmmod();
>  	if (delay_rmmod_ms)
>  		msleep(delay_rmmod_ms);
>  	unregister_test_dev_sysfs(first_test_dev);
> diff --git a/tools/testing/selftests/sysfs/config b/tools/testing/selftests/sysfs/config
> index 9196f452ecd5..2876a229f95b 100644
> --- a/tools/testing/selftests/sysfs/config
> +++ b/tools/testing/selftests/sysfs/config
> @@ -1,2 +1,5 @@
>  CONFIG_SYSFS=m
>  CONFIG_TEST_SYSFS=m
> +CONFIG_FAULT_INJECTION=y
> +CONFIG_FAULT_INJECTION_DEBUG_FS=y
> +CONFIG_FAIL_KERNFS_KNOBS=y
> diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh
> index b3f4c2236c7f..f928635d0e35 100755
> --- a/tools/testing/selftests/sysfs/sysfs.sh
> +++ b/tools/testing/selftests/sysfs/sysfs.sh
> @@ -62,6 +62,10 @@ ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block"
>  ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block"
>  ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test
>  ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rntl_lock
> +ALL_TESTS="$ALL_TESTS 0029:1:1:test_dev_x:block" # kernfs race removal of store
> +ALL_TESTS="$ALL_TESTS 0030:1:1:test_dev_x:block" # kernfs race removal before mutex
> +ALL_TESTS="$ALL_TESTS 0031:1:1:test_dev_x:block" # kernfs race removal after mutex
> +ALL_TESTS="$ALL_TESTS 0032:1:1:test_dev_x:block" # kernfs race removal after active
>  
>  allow_user_defaults()
>  {
> @@ -92,6 +96,9 @@ allow_user_defaults()
>  	if [ -z $SYSFS_DEBUGFS_DIR ]; then
>  		SYSFS_DEBUGFS_DIR="/sys/kernel/debug/test_sysfs"
>  	fi
> +	if [ -z $KERNFS_DEBUGFS_DIR ]; then
> +		KERNFS_DEBUGFS_DIR="/sys/kernel/debug/kernfs"
> +	fi
>  	if [ -z $PAGE_SIZE ]; then
>  		PAGE_SIZE=$(getconf PAGESIZE)
>  	fi
> @@ -167,6 +174,14 @@ modprobe_reset_enable_rtnl_lock_on_rmmod()
>  	unset FIRST_MODPROBE_ARGS
>  }
>  
> +modprobe_reset_enable_completion()
> +{
> +	FIRST_MODPROBE_ARGS="enable_completion_on_rmmod=1 enable_verbose_writes=1"
> +	FIRST_MODPROBE_ARGS="$FIRST_MODPROBE_ARGS enable_verbose_rmmod=1 delay_rmmod_ms=0"
> +	modprobe_reset
> +	unset FIRST_MODPROBE_ARGS
> +}
> +
>  load_req_mod()
>  {
>  	modprobe_reset
> @@ -197,6 +212,63 @@ debugfs_reset_first_test_dev_ignore_errors()
>  	echo -n "1" >"$SYSFS_DEBUGFS_DIR"/reset_first_test_dev
>  }
>  
> +debugfs_kernfs_kernfs_fop_write_iter_exists()
> +{
> +	KNOB_DIR="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter"
> +	if [[ ! -d $KNOB_DIR ]]; then
> +		echo "kernfs debugfs does not exist $KNOB_DIR"
> +		return 0;
> +	fi
> +	KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
> +	if [[ ! -d $KNOB_DEBUGFS ]]; then
> +		echo -n "kernfs debugfs for configuring fail_kernfs_fop_write_iter "
> +		echo "does not exist $KNOB_DEBUGFS"
> +		return 0;
> +	fi
> +	return 1
> +}
> +
> +debugfs_kernfs_kernfs_fop_write_iter_set_fail_once()
> +{
> +	KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
> +	echo 1 > $KNOB_DEBUGFS/interval
> +	echo 100 > $KNOB_DEBUGFS/probability
> +	echo 0 > $KNOB_DEBUGFS/space
> +	# Disable verbose messages on the kernel ring buffer which may
> +	# confuse developers with a kernel panic.
> +	echo 0 > $KNOB_DEBUGFS/verbose
> +
> +	# Fail only once
> +	echo 1 > $KNOB_DEBUGFS/times
> +}
> +
> +debugfs_kernfs_kernfs_fop_write_iter_set_fail_never()
> +{
> +	KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
> +	echo 0 > $KNOB_DEBUGFS/times
> +}
> +
> +debugfs_kernfs_set_wait_ms()
> +{
> +	SLEEP_AFTER_WAIT_MS="${KERNFS_DEBUGFS_DIR}/sleep_after_wait_ms"
> +	echo $1 > $SLEEP_AFTER_WAIT_MS
> +}
> +
> +debugfs_kernfs_disable_wait_kernfs_fop_write_iter()
> +{
> +	ENABLE_WAIT_KNOB="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter/wait_"
> +	for KNOB in ${ENABLE_WAIT_KNOB}*; do
> +		echo 0 > $KNOB
> +	done
> +}
> +
> +debugfs_kernfs_enable_wait_kernfs_fop_write_iter()
> +{
> +	ENABLE_WAIT_KNOB="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter/wait_$1"
> +	echo -n "1" > $ENABLE_WAIT_KNOB
> +	return $?
> +}
> +
>  set_orig()
>  {
>  	if [[ ! -z $TARGET ]] && [[ ! -z $ORIG ]]; then
> @@ -972,6 +1044,105 @@ sysfs_test_0028()
>  	fi
>  }
>  
> +sysfs_race_kernfs_kernfs_fop_write_iter()
> +{
> +	TARGET="${DIR}/$(get_test_target $1)"
> +	WAIT_AT=$2
> +	EXPECT_WRITE_RETURNS=$3
> +	MSDELAY=$4
> +
> +	modprobe_reset_enable_completion
> +	ORIG=$(cat "${TARGET}")
> +	TEST_STR=$(( $ORIG + 1 ))
> +
> +	echo -n "Test racing removal of sysfs store op with kernfs $WAIT_AT ... "
> +
> +	if debugfs_kernfs_kernfs_fop_write_iter_exists; then
> +		echo -n "skipping test as CONFIG_FAIL_KERNFS_KNOBS "
> +		echo " or CONFIG_FAULT_INJECTION_DEBUG_FS is disabled"
> +		return $ksft_skip
> +	fi
> +
> +	# Allow for failing the kernfs_kernfs_fop_write_iter call once,
> +	# we'll provide exact context shortly afterwards.
> +	debugfs_kernfs_kernfs_fop_write_iter_set_fail_once
> +
> +	# First disable all waits
> +	debugfs_kernfs_disable_wait_kernfs_fop_write_iter
> +
> +	# Enable a wait_for_completion(&kernfs_debug_wait_completion) at the
> +	# specified location inside the kernfs_fop_write_iter() routine
> +	debugfs_kernfs_enable_wait_kernfs_fop_write_iter $WAIT_AT
> +
> +	# Configure kernfs so that after its wait_for_completion() it
> +	# will msleep() this amount of time and schedule(). We figure this
> +	# will be sufficient time to allow for our module removal to complete.
> +	debugfs_kernfs_set_wait_ms $MSDELAY
> +
> +	# Now we trigger a kernfs write op, which will run kernfs_fop_write_iter,
> +	# but will wait until our driver sends a respective completion
> +	set_test_ignore_errors &
> +	write_pid=$!
> +
> +	# At this point kernfs_fop_write_iter() hasn't run our op, it's
> +	# waiting for our completion at the specified time $WAIT_AT.
> +	# We now remove our module which will send a
> +	# complete(&kernfs_debug_wait_completion) right before we deregister
> +	# our device and the sysfs device attributes are removed.
> +	#
> +	# After the completion is sent, the test_sysfs driver races with
> +	# kernfs to do the device deregistration with the kernfs msleep
> +	# and schedule(). This should mean we've forced trying to remove the
> +	# module prior to allowing kernfs to run our store operation. If the
> +	# race did happen we'll panic with a null dereference on the store op.
> +	#
> +	# If no race happens we should see no write operation triggered.
> +	modprobe -r $TEST_DRIVER > /dev/null 2>&1
> +
> +	debugfs_kernfs_kernfs_fop_write_iter_set_fail_never
> +
> +	wait $write_pid
> +	if [[ $? -eq $EXPECT_WRITE_RETURNS ]]; then
> +		echo "ok"
> +	else
> +		echo "FAIL" >&2
> +	fi
> +}
> +
> +sysfs_test_0029()
> +{
> +	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
> +		echo "Using delay-after-completion: $delay"
> +		sysfs_race_kernfs_kernfs_fop_write_iter 0029 at_start 1 $delay
> +	done
> +}
> +
> +sysfs_test_0030()
> +{
> +	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
> +		echo "Using delay-after-completion: $delay"
> +		sysfs_race_kernfs_kernfs_fop_write_iter 0030 before_mutex 1 $delay
> +	done
> +}
> +
> +sysfs_test_0031()
> +{
> +	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
> +		echo "Using delay-after-completion: $delay"
> +		sysfs_race_kernfs_kernfs_fop_write_iter 0031 after_mutex 1 $delay
> +	done
> +}
> +
> +# A write only succeeds *iff* a module removal happens *after* the
> +# kernfs active reference is obtained with kernfs_get_active().
> +sysfs_test_0032()
> +{
> +	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
> +		echo "Using delay-after-completion: $delay"
> +		sysfs_race_kernfs_kernfs_fop_write_iter 0032 after_active 0 $delay
> +	done
> +}
> +
>  test_gen_desc()
>  {
>  	echo -n "$1 x $(get_test_count $1)"
> @@ -1013,6 +1184,10 @@ list_tests()
>  	echo "$(test_gen_desc 0026) - block test writing y larger delay and resetting device"
>  	echo "$(test_gen_desc 0027) - test rmmod deadlock while writing x ... "
>  	echo "$(test_gen_desc 0028) - test rmmod deadlock using rtnl_lock while writing x ..."
> +	echo "$(test_gen_desc 0029) - racing removal of store op with kernfs at start"
> +	echo "$(test_gen_desc 0030) - racing removal of store op with kernfs before mutex"
> +	echo "$(test_gen_desc 0031) - racing removal of store op with kernfs after mutex"
> +	echo "$(test_gen_desc 0032) - racing removal of store op with kernfs after active"
>  }
>  
>  usage()
> -- 
> 2.30.2
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 06/12] kernel/module: add documentation for try_module_get()
  2021-09-27 16:37 ` [PATCH v8 06/12] kernel/module: add documentation for try_module_get() Luis Chamberlain
@ 2021-10-05 19:58   ` Kees Cook
  2021-10-11 21:16     ` Luis Chamberlain
  0 siblings, 1 reply; 94+ messages in thread
From: Kees Cook @ 2021-10-05 19:58 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, rostedt, linux-spdx, linux-doc,
	linux-block, linux-fsdevel, linux-kselftest, linux-kernel

On Mon, Sep 27, 2021 at 09:37:59AM -0700, Luis Chamberlain wrote:
> There is quite a bit of tribal knowledge around proper use of
> try_module_get() and that it must be used only in a context which
> can ensure the module won't be gone during the operation. Document
> this little bit of tribal knowledge.
> 
> I'm extending this tribal knowledge with new developments which it
> seems some folks do not yet believe to be true: we can be sure a
> module will exist during the lifetime of a sysfs file operation.
> For proof, refer to test_sysfs test #32:
> 
> ./tools/testing/selftests/sysfs/sysfs.sh -t 0032
> 
> Without this being true, the write would fail or worse,
> a crash would happen, in this test. It does not.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  include/linux/module.h | 34 ++++++++++++++++++++++++++++++++--
>  1 file changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/module.h b/include/linux/module.h
> index c9f1200b2312..22eacd5e1e85 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -609,10 +609,40 @@ void symbol_put_addr(void *addr);
>     to handle the error case (which only happens with rmmod --wait). */
>  extern void __module_get(struct module *module);
>  
> -/* This is the Right Way to get a module: if it fails, it's being removed,
> - * so pretend it's not there. */
> +/**
> + * try_module_get() - yields to module removal and bumps refcnt otherwise

I find this hard to parse. How about:
	"Take module refcount unless module is being removed"

> + * @module: the module we should check for
> + *
> + * This can be used to try to bump the reference count of a module, so to
> + * prevent module removal. The reference count of a module is not allowed
> + * to be incremented if the module is already being removed.

This I understand.

> + *
> + * Care must be taken to ensure the module cannot be removed during the call to
> + * try_module_get(). This can be done by having another entity other than the
> + * module itself increment the module reference count, or through some other
> + * means which guarantees the module could not be removed during an operation.
> + * An example of this latter case is using try_module_get() in a sysfs file
> + * which the module created. The sysfs store / read file operations are
> + * guaranteed to exist through the use of kernfs's active reference (see
> + * kernfs_active()). If a sysfs file operation is being run, the module which
> + * created it must still exist as the module is in charge of removing the same
> + * sysfs file being read. Also, a sysfs / kernfs file removal cannot happen
> + * unless the same file is not active.

I can't understand this paragraph at all. "Care must be taken ..."? Why?
Shouldn't callers of try_module_get() be satisfied with the results? I
don't follow the example at all. It seems to just say "sysfs store/read
functions don't need try_module_get() because whatever opened the sysfs
file is already keeping the module referenced." ?

> + *
> + * One of the real values to try_module_get() is the module_is_live() check
> + * which ensures the caller of try_module_get() can yield to userspace
> + * module removal requests and fail whatever it was about to process.

Please document the return value explicitly.
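
As a rough illustration of what such a return-value description would cover, here is a simplified, single-threaded userspace model. It is only a sketch: `struct demo_module` and the `demo_*` helpers are hypothetical, and the real implementation uses atomics and the module state machine rather than a plain bool and int.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; not the kernel's struct module. */
struct demo_module {
	bool going;	/* set once removal has started */
	int refcnt;	/* references currently held by users */
};

/*
 * Returns true if a reference was taken, or if there is no module to pin
 * (mod == NULL). Returns false if the module is already being removed, in
 * which case the caller must back off and must not call demo_module_put().
 */
static bool demo_try_module_get(struct demo_module *mod)
{
	if (!mod)
		return true;
	if (mod->going)
		return false;
	mod->refcnt++;
	return true;
}

/* Drop a reference previously obtained with demo_try_module_get(). */
static void demo_module_put(struct demo_module *mod)
{
	if (!mod)
		return;
	assert(mod->refcnt > 0);
	mod->refcnt--;
}
```

The point of the model is the contract, not the mechanism: a true return obliges the caller to do a matching put, a false return means "pretend the module is not there".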

> + */
>  extern bool try_module_get(struct module *module);
>  
> +/**
> + * module_put() - release a reference count to a module
> + * @module: the module we should release a reference count for
> + *
> + * If you successfully bump a reference count to a module with try_module_get(),
> + * when you are finished you must call module_put() to release that reference
> + * count.
> + */
>  extern void module_put(struct module *module);
>  
>  #else /*!CONFIG_MODULE_UNLOAD*/
> -- 
> 2.30.2
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 07/12] fs/kernfs/symlink.c: replace S_IRWXUGO with 0777 on kernfs_create_link()
  2021-09-27 16:38 ` [PATCH v8 07/12] fs/kernfs/symlink.c: replace S_IRWXUGO with 0777 on kernfs_create_link() Luis Chamberlain
@ 2021-10-05 19:59   ` Kees Cook
  0 siblings, 0 replies; 94+ messages in thread
From: Kees Cook @ 2021-10-05 19:59 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, rostedt, linux-spdx, linux-doc,
	linux-block, linux-fsdevel, linux-kselftest, linux-kernel

On Mon, Sep 27, 2021 at 09:38:00AM -0700, Luis Chamberlain wrote:
> If one ends up extending this line checkpatch will complain about the
> use of S_IRWXUGO suggesting it is not preferred and that 0777
> should be used instead. Take the tip from checkpatch and do that
> change before we do our subsequent changes.
> 
> This makes no functional changes.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Reviewed-by: Kees Cook <keescook@chromium.org>

> ---
>  fs/kernfs/symlink.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
> index c8f8e41b8411..19a6c71c6ff5 100644
> --- a/fs/kernfs/symlink.c
> +++ b/fs/kernfs/symlink.c
> @@ -36,8 +36,7 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
>  		gid = target->iattr->ia_gid;
>  	}
>  
> -	kn = kernfs_new_node(parent, name, S_IFLNK|S_IRWXUGO, uid, gid,
> -			     KERNFS_LINK);
> +	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
>  	if (!kn)
>  		return ERR_PTR(-ENOMEM);
>  
> -- 
> 2.30.2
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 09/12] sysfs: fix deadlock race with module removal
  2021-09-27 16:38 ` [PATCH v8 09/12] sysfs: fix deadlock race with module removal Luis Chamberlain
  2021-10-05  9:24   ` Ming Lei
@ 2021-10-05 20:50   ` Kees Cook
  2021-10-11 22:26     ` Luis Chamberlain
  1 sibling, 1 reply; 94+ messages in thread
From: Kees Cook @ 2021-10-05 20:50 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, rostedt, linux-spdx, linux-doc,
	linux-block, linux-fsdevel, linux-kselftest, linux-kernel

On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> When driver sysfs attributes use a lock also used on module removal we
> can race to deadlock. This happens when for instance a sysfs file on
> a driver is used, then at the same time we have module removal call
> trigger. The module removal call code holds a lock, and then the
> driver's sysfs file entry waits for the same lock. While holding the
> lock the module removal tries to remove the sysfs entries, but these
> cannot be removed yet as one is waiting for a lock. This won't complete
> as the lock is already held. Likewise module removal cannot complete,
> and so we deadlock.
> 
> This is now easily reproducible with our sysfs selftest as follows:
> 
> ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
> 
> This uses a local driver lock. Test 0028 can also be used, that uses
> the rtnl_lock():
> 
> ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
> 
> To fix this we extend the struct kernfs_node with a module reference
> and use the try_module_get() after kernfs_get_active() is called. As

I would agree: kernfs must know about the module containing the ops
structure it has been given. (Without this, there are, at the very least,
removal races for looking at kernfs_ops structures.)

In other places in the kernel, function callback dependencies are more
explicit in that if code is holding such things, it has already taken a
module reference, etc. But kernfs is special in the sense that just
because a kernfs entry exists, we don't want to pin the module use count
too.

But simple locking isn't workable to solve this because kernfs_remove()
must be able to be called from a module_exit routine without deadlocking.
(i.e. we would create exactly the situation that caused this condition
to get noticed in the first place.)

> documented in the prior patch, we now know that once kernfs_get_active()
> is called the module is implicitly guarded to exist and cannot be removed.
> This is because the module is the one in charge of removing the same
> sysfs file it created, and removal of sysfs files on module exit will wait
> until they don't have any active references. By using a try_module_get()
> after kernfs_get_active() we yield to let module removal trump calls to
> process a sysfs operation, while also preventing module removal if a sysfs
> operation is already in progress. This prevents the deadlock.
> 
> This deadlock was first reported with the zram driver, however the live
> patching folks have acknowledged they have observed this as well with
> live patching, when a live patch is removed. I was then able to
> reproduce easily by creating a dedicated selftest for it.
> 
> A sketch of how this can happen follows, consider foo a local mutex
> part of a driver, and used on the driver's module exit routine and
> on one of its sysfs ops:
> 
> foo.c:
> static DEFINE_MUTEX(foo);
> static ssize_t foo_store(struct device *dev,
> 			 struct device_attribute *attr,
> 			 const char *buf, size_t count)
> {
> 	...
> 	mutex_lock(&foo);
> 	...
> 	mutex_unlock(&foo);
> 	...
> }
> static DEVICE_ATTR_RW(foo);
> ...
> void foo_exit(void)
> {
> 	mutex_lock(&foo);
> 	...
> 	mutex_unlock(&foo);
> }
> module_exit(foo_exit);
> 
> And this can lead to this condition:
> 
> CPU A                              CPU B
>                                    foo_store()
> foo_exit()
>   mutex_lock(&foo)
>                                    mutex_lock(&foo)
>    del_gendisk(some_struct->disk);
>      device_del()
>        device_remove_groups()

Please expand this further, where does device_remove_groups() end up
waiting for that never happens?

> 
> In this situation foo_store() is waiting for the mutex foo to
> become unlocked, but that won't happen until module removal is complete.
> But module removal won't complete until the sysfs file being poked at
> completes which is waiting for a lock already held.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +-
>  fs/kernfs/dir.c                        | 44 ++++++++++++++++++----
>  fs/kernfs/file.c                       |  6 ++-
>  fs/kernfs/kernfs-internal.h            |  3 +-
>  fs/kernfs/symlink.c                    |  3 +-
>  fs/sysfs/dir.c                         |  2 +-
>  fs/sysfs/file.c                        |  6 ++-
>  fs/sysfs/group.c                       |  3 +-
>  include/linux/kernfs.h                 | 14 ++++---
>  include/linux/sysfs.h                  | 52 ++++++++++++++++++++------
>  kernel/cgroup/cgroup.c                 |  2 +-
>  11 files changed, 105 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index b57b3db9a6a7..4edf3b37fd2c 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -209,7 +209,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
>  
>  	kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
>  				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
> -				  0, rft->kf_ops, rft, NULL, NULL);
> +				  0, rft->kf_ops, rft, NULL, NULL, NULL);
>  	if (IS_ERR(kn))
>  		return PTR_ERR(kn);
>  
> @@ -2482,7 +2482,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
>  
>  	kn = __kernfs_create_file(parent_kn, name, 0444,
>  				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
> -				  &kf_mondata_ops, priv, NULL, NULL);
> +				  &kf_mondata_ops, priv, NULL, NULL, NULL);
>  	if (IS_ERR(kn))
>  		return PTR_ERR(kn);
>  
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index ba581429bf7b..e841201fd11b 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -14,6 +14,7 @@
>  #include <linux/slab.h>
>  #include <linux/security.h>
>  #include <linux/hash.h>
> +#include <linux/module.h>
>  
>  #include "kernfs-internal.h"
>  
> @@ -414,12 +415,29 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
>   */
>  struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
>  {
> +	int v;
> +
>  	if (unlikely(!kn))
>  		return NULL;
>  
>  	if (!atomic_inc_unless_negative(&kn->active))
>  		return NULL;
>  
> +	/*
> +	 * If a module created the kernfs_node, the module cannot possibly be
> +	 * removed if the above atomic_inc_unless_negative() succeeded. So the
> +	 * try_module_get() below is not to protect the lifetime of the module
> +	 * as that is already guaranteed. The try_module_get() below is used
> +	 * to ensure that we don't deadlock in case a kernfs operation and
> +	 * module removal used a shared lock.
> +	 */
> +	if (!try_module_get(kn->owner)) {
> +		v = atomic_dec_return(&kn->active);
> +		if (unlikely(v == KN_DEACTIVATED_BIAS))
> +			wake_up_all(&kernfs_root(kn)->deactivate_waitq);
> +		return NULL;
> +	}

The special casing in here makes me think this isn't happening in the right
place (i.e. this looks like an open-coded version of kernfs_put_active()).
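
For readers following along, the counting pattern being open-coded in the hunk above can be modelled in userspace roughly as follows. This is a sketch using C11 atomics, not the kernfs code: the `demo_*` names are hypothetical, the bias value mirrors KN_DEACTIVATED_BIAS only by assumption, and the wake-up of waiters is reduced to a boolean return.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed to mirror KN_DEACTIVATED_BIAS; any large negative value works. */
#define DEMO_DEACTIVATED_BIAS (INT_MIN + 1)

struct demo_node {
	atomic_int active;	/* >= 0: live, < 0: deactivated */
};

/* Userspace stand-in for atomic_inc_unless_negative(). */
static bool demo_get_active(struct demo_node *kn)
{
	int v = atomic_load(&kn->active);

	while (v >= 0) {
		/* On failure v is reloaded and the sign is re-checked. */
		if (atomic_compare_exchange_weak(&kn->active, &v, v + 1))
			return true;
	}
	return false;
}

/*
 * Drop an active reference. Returns true when this was the last active
 * user of a deactivated node, i.e. when a remover would be woken up.
 */
static bool demo_put_active(struct demo_node *kn)
{
	int v = atomic_fetch_sub(&kn->active, 1) - 1;

	return v == DEMO_DEACTIVATED_BIAS;
}

/* Removal side: block new users by pushing the counter negative. */
static void demo_deactivate(struct demo_node *kn)
{
	assert(atomic_load(&kn->active) >= 0);
	atomic_fetch_add(&kn->active, DEMO_DEACTIVATED_BIAS);
}
```

In this model the failure path in the posted hunk is exactly demo_put_active() followed by returning NULL, which is why reusing the existing put helper looks preferable to duplicating the decrement-and-wake logic.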

> +
>  	if (kernfs_lockdep(kn))
>  		rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
>  	return kn;
> @@ -442,6 +460,13 @@ void kernfs_put_active(struct kernfs_node *kn)
>  	if (kernfs_lockdep(kn))
>  		rwsem_release(&kn->dep_map, _RET_IP_);
>  	v = atomic_dec_return(&kn->active);
> +
> +	/*
> +	 * We prevent module exit *until* we know for sure all possible
> +	 * kernfs ops are done.
> +	 */
> +	module_put(kn->owner);
> +
>  	if (likely(v != KN_DEACTIVATED_BIAS))
>  		return;

What I don't understand, however, is what kernfs_get/put_active() is
intending to do -- it looks like it's trying to provide an interruption
point for open kernfs file operations?

This all seems extremely complex for what seems like it should just be a
global "am I being removed?" bool?

Regardless, while I do see the logic of associating the module get/put
with get/put of kernfs "active", why is it not better tied to strictly
kernfs open/close? That would seem to be much simpler and not require
any special handling?

For example, why does this not work?


diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 60e2a86c535e..e44502ac244d 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -525,6 +525,9 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
 {
 	struct kernfs_open_node *on, *new_on = NULL;
 
+	if (!try_module_get(kn->owner))
+		return -ENODEV;
+
  retry:
 	mutex_lock(&kernfs_open_file_mutex);
 	spin_lock_irq(&kernfs_open_node_lock);
@@ -592,6 +595,7 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
 	mutex_unlock(&kernfs_open_file_mutex);
 
 	kfree(on);
+	module_put(kn->owner);
 }
 
 static int kernfs_fop_open(struct inode *inode, struct file *file)
@@ -719,6 +723,7 @@ static int kernfs_fop_open(struct inode *inode, struct file *file)
 	kfree(of);
 err_out:
 	kernfs_put_active(kn);
+	module_put(kn->owner);
 	return error;
 }
 


>  
> @@ -572,7 +597,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
>  					     struct kernfs_node *parent,
>  					     const char *name, umode_t mode,
>  					     kuid_t uid, kgid_t gid,
> -					     unsigned flags)
> +					     unsigned flags,
> +					     struct module *owner)
>  {
>  	struct kernfs_node *kn;
>  	u32 id_highbits;
> @@ -607,6 +633,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
>  	kn->name = name;
>  	kn->mode = mode;
>  	kn->flags = flags;
> +	kn->owner = owner;
>  
>  	if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) {
>  		struct iattr iattr = {
> @@ -640,12 +667,13 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
>  struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
>  				    const char *name, umode_t mode,
>  				    kuid_t uid, kgid_t gid,
> -				    unsigned flags)
> +				    unsigned flags,
> +				    struct module *owner)
>  {
>  	struct kernfs_node *kn;
>  
>  	kn = __kernfs_new_node(kernfs_root(parent), parent,
> -			       name, mode, uid, gid, flags);
> +			       name, mode, uid, gid, flags, owner);
>  	if (kn) {
>  		kernfs_get(parent);
>  		kn->parent = parent;
> @@ -927,7 +955,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
>  
>  	kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO,
>  			       GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
> -			       KERNFS_DIR);
> +			       KERNFS_DIR, NULL);
>  	if (!kn) {
>  		idr_destroy(&root->ino_idr);
>  		kfree(root);
> @@ -969,20 +997,22 @@ void kernfs_destroy_root(struct kernfs_root *root)
>   * @gid: gid of the new directory
>   * @priv: opaque data associated with the new directory
>   * @ns: optional namespace tag of the directory
> + * @owner: if set, the module owner of this directory
>   *
>   * Returns the created node on success, ERR_PTR() value on failure.
>   */
>  struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
>  					 const char *name, umode_t mode,
>  					 kuid_t uid, kgid_t gid,
> -					 void *priv, const void *ns)
> +					 void *priv, const void *ns,
> +					 struct module *owner)
>  {
>  	struct kernfs_node *kn;
>  	int rc;
>  
>  	/* allocate */
>  	kn = kernfs_new_node(parent, name, mode | S_IFDIR,
> -			     uid, gid, KERNFS_DIR);
> +			     uid, gid, KERNFS_DIR, owner);
>  	if (!kn)
>  		return ERR_PTR(-ENOMEM);
>  
> @@ -1014,7 +1044,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
>  
>  	/* allocate */
>  	kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR,
> -			     GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
> +			     GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR, NULL);
>  	if (!kn)
>  		return ERR_PTR(-ENOMEM);
>  
> diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
> index 4479c6580333..0e125287e050 100644
> --- a/fs/kernfs/file.c
> +++ b/fs/kernfs/file.c
> @@ -978,6 +978,7 @@ const struct file_operations kernfs_file_fops = {
>   * @priv: private data for the file
>   * @ns: optional namespace tag of the file
>   * @key: lockdep key for the file's active_ref, %NULL to disable lockdep
> + * @owner: if set, the module owner of the file
>   *
>   * Returns the created node on success, ERR_PTR() value on error.
>   */
> @@ -987,7 +988,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
>  					 loff_t size,
>  					 const struct kernfs_ops *ops,
>  					 void *priv, const void *ns,
> -					 struct lock_class_key *key)
> +					 struct lock_class_key *key,
> +					 struct module *owner)
>  {
>  	struct kernfs_node *kn;
>  	unsigned flags;
> @@ -996,7 +998,7 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
>  	flags = KERNFS_FILE;
>  
>  	kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG,
> -			     uid, gid, flags);
> +			     uid, gid, flags, owner);
>  	if (!kn)
>  		return ERR_PTR(-ENOMEM);
>  
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index 9e3abf597e2d..6d275d661987 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -134,7 +134,8 @@ int kernfs_add_one(struct kernfs_node *kn);
>  struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
>  				    const char *name, umode_t mode,
>  				    kuid_t uid, kgid_t gid,
> -				    unsigned flags);
> +				    unsigned flags,
> +				    struct module *owner);
>  
>  /*
>   * file.c
> diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
> index 19a6c71c6ff5..5a053eebee52 100644
> --- a/fs/kernfs/symlink.c
> +++ b/fs/kernfs/symlink.c
> @@ -36,7 +36,8 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
>  		gid = target->iattr->ia_gid;
>  	}
>  
> -	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
> +	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK,
> +			     target->owner);
>  	if (!kn)
>  		return ERR_PTR(-ENOMEM);
>  
> diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
> index b6b6796e1616..9763c2fde3c7 100644
> --- a/fs/sysfs/dir.c
> +++ b/fs/sysfs/dir.c
> @@ -57,7 +57,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
>  	kobject_get_ownership(kobj, &uid, &gid);
>  
>  	kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
> -				  kobj, ns);
> +				  kobj, ns, NULL);
>  	if (IS_ERR(kn)) {
>  		if (PTR_ERR(kn) == -EEXIST)
>  			sysfs_warn_dup(parent, kobject_name(kobj));
> diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
> index 42dcf96881b6..af9e91fd1a92 100644
> --- a/fs/sysfs/file.c
> +++ b/fs/sysfs/file.c
> @@ -292,7 +292,8 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent,
>  #endif
>  
>  	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
> -				  PAGE_SIZE, ops, (void *)attr, ns, key);
> +				  PAGE_SIZE, ops, (void *)attr, ns, key,
> +				  attr->owner);
>  	if (IS_ERR(kn)) {
>  		if (PTR_ERR(kn) == -EEXIST)
>  			sysfs_warn_dup(parent, attr->name);
> @@ -327,7 +328,8 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent,
>  #endif
>  
>  	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
> -				  battr->size, ops, (void *)attr, ns, key);
> +				  battr->size, ops, (void *)attr, ns, key,
> +				  attr->owner);
>  	if (IS_ERR(kn)) {
>  		if (PTR_ERR(kn) == -EEXIST)
>  			sysfs_warn_dup(parent, attr->name);
> diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
> index eeb0e3099421..372864d1cb54 100644
> --- a/fs/sysfs/group.c
> +++ b/fs/sysfs/group.c
> @@ -135,7 +135,8 @@ static int internal_create_group(struct kobject *kobj, int update,
>  		} else {
>  			kn = kernfs_create_dir_ns(kobj->sd, grp->name,
>  						  S_IRWXU | S_IRUGO | S_IXUGO,
> -						  uid, gid, kobj, NULL);
> +						  uid, gid, kobj, NULL,
> +						  grp->owner);
>  			if (IS_ERR(kn)) {
>  				if (PTR_ERR(kn) == -EEXIST)
>  					sysfs_warn_dup(kobj->sd, grp->name);
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index cd968ee2b503..818b00ebd107 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -161,6 +161,7 @@ struct kernfs_node {
>  	unsigned short		flags;
>  	umode_t			mode;
>  	struct kernfs_iattrs	*iattr;
> +	struct module           *owner;
>  };
>  
>  /*
> @@ -370,7 +371,8 @@ void kernfs_destroy_root(struct kernfs_root *root);
>  struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
>  					 const char *name, umode_t mode,
>  					 kuid_t uid, kgid_t gid,
> -					 void *priv, const void *ns);
> +					 void *priv, const void *ns,
> +					 struct module *owner);
>  struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
>  					    const char *name);
>  struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
> @@ -379,7 +381,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
>  					 loff_t size,
>  					 const struct kernfs_ops *ops,
>  					 void *priv, const void *ns,
> -					 struct lock_class_key *key);
> +					 struct lock_class_key *key,
> +					 struct module *owner);
>  struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
>  				       const char *name,
>  				       struct kernfs_node *target);
> @@ -472,14 +475,15 @@ static inline void kernfs_destroy_root(struct kernfs_root *root) { }
>  static inline struct kernfs_node *
>  kernfs_create_dir_ns(struct kernfs_node *parent, const char *name,
>  		     umode_t mode, kuid_t uid, kgid_t gid,
> -		     void *priv, const void *ns)
> +		     void *priv, const void *ns, struct module *owner)
>  { return ERR_PTR(-ENOSYS); }
>  
>  static inline struct kernfs_node *
>  __kernfs_create_file(struct kernfs_node *parent, const char *name,
>  		     umode_t mode, kuid_t uid, kgid_t gid,
>  		     loff_t size, const struct kernfs_ops *ops,
> -		     void *priv, const void *ns, struct lock_class_key *key)
> +		     void *priv, const void *ns, struct lock_class_key *key,
> +		     struct module *owner)
>  { return ERR_PTR(-ENOSYS); }
>  
>  static inline struct kernfs_node *
> @@ -566,7 +570,7 @@ kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
>  {
>  	return kernfs_create_dir_ns(parent, name, mode,
>  				    GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
> -				    priv, NULL);
> +				    priv, NULL, parent->owner);
>  }
>  
>  static inline int kernfs_remove_by_name(struct kernfs_node *parent,
> diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
> index e3f1e8ac1f85..babbabb460dc 100644
> --- a/include/linux/sysfs.h
> +++ b/include/linux/sysfs.h
> @@ -30,6 +30,7 @@ enum kobj_ns_type;
>  struct attribute {
>  	const char		*name;
>  	umode_t			mode;
> +	struct module           *owner;
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>  	bool			ignore_lockdep:1;
>  	struct lock_class_key	*key;
> @@ -80,6 +81,7 @@ do {							\
>   * @attrs:	Pointer to NULL terminated list of attributes.
>   * @bin_attrs:	Pointer to NULL terminated list of binary attributes.
>   *		Either attrs or bin_attrs or both must be provided.
> + * @owner:	If set, module responsible for this attribute group
>   */
>  struct attribute_group {
>  	const char		*name;
> @@ -89,6 +91,7 @@ struct attribute_group {
>  						  struct bin_attribute *, int);
>  	struct attribute	**attrs;
>  	struct bin_attribute	**bin_attrs;
> +	struct module           *owner;
>  };
>  
>  /*
> @@ -100,38 +103,52 @@ struct attribute_group {
>  
>  #define __ATTR(_name, _mode, _show, _store) {				\
>  	.attr = {.name = __stringify(_name),				\
> -		 .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
> +		 .mode = VERIFY_OCTAL_PERMISSIONS(_mode),               \
> +		 .owner  = THIS_MODULE,                                 \
> +	},                                                             \
>  	.show	= _show,						\
>  	.store	= _store,						\
>  }
>  
>  #define __ATTR_PREALLOC(_name, _mode, _show, _store) {			\
>  	.attr = {.name = __stringify(_name),				\
> -		 .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode) },\
> +		 .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode),\
> +		 .owner  = THIS_MODULE,                                 \
> +	},                                                              \
>  	.show	= _show,						\
>  	.store	= _store,						\
>  }
>  
>  #define __ATTR_RO(_name) {						\
> -	.attr	= { .name = __stringify(_name), .mode = 0444 },		\
> +	.attr	= { .name = __stringify(_name),                         \
> +		    .mode = 0444,					\
> +		    .owner  = THIS_MODULE,				\
> +		},                                                     \
>  	.show	= _name##_show,						\
>  }
>  
>  #define __ATTR_RO_MODE(_name, _mode) {					\
>  	.attr	= { .name = __stringify(_name),				\
> -		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
> +		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode),            \
> +		    .owner  = THIS_MODULE,				\
> +	},                                                              \
>  	.show	= _name##_show,						\
>  }
>  
>  #define __ATTR_RW_MODE(_name, _mode) {					\
>  	.attr	= { .name = __stringify(_name),				\
> -		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
> +		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode),            \
> +		    .owner  = THIS_MODULE,                              \
> +	},								\
>  	.show	= _name##_show,						\
>  	.store	= _name##_store,					\
>  }
>  
>  #define __ATTR_WO(_name) {						\
> -	.attr	= { .name = __stringify(_name), .mode = 0200 },		\
> +	.attr	= { .name = __stringify(_name),                         \
> +		    .mode = 0200,					\
> +		    .owner  = THIS_MODULE,				\
> +	},                                                              \
>  	.store	= _name##_store,					\
>  }
>  
> @@ -141,8 +158,11 @@ struct attribute_group {
>  
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>  #define __ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) {	\
> -	.attr = {.name = __stringify(_name), .mode = _mode,	\
> -			.ignore_lockdep = true },		\
> +	.attr = {.name = __stringify(_name),                    \
> +		 .mode = _mode,					\
> +		 .ignore_lockdep = true,                        \
> +		 .owner  = THIS_MODULE,                         \
> +	},							\
>  	.show		= _show,				\
>  	.store		= _store,				\
>  }
> @@ -159,6 +179,7 @@ static const struct attribute_group *_name##_groups[] = {	\
>  #define ATTRIBUTE_GROUPS(_name)					\
>  static const struct attribute_group _name##_group = {		\
>  	.attrs = _name##_attrs,					\
> +	.owner = THIS_MODULE,					\
>  };								\
>  __ATTRIBUTE_GROUPS(_name)
>  
> @@ -199,20 +220,29 @@ struct bin_attribute {
>  
>  /* macros to create static binary attributes easier */
>  #define __BIN_ATTR(_name, _mode, _read, _write, _size) {		\
> -	.attr = { .name = __stringify(_name), .mode = _mode },		\
> +	.attr = { .name = __stringify(_name),                           \
> +		   .mode = _mode,					\
> +		   .owner = THIS_MODULE,				\
> +	},								\
>  	.read	= _read,						\
>  	.write	= _write,						\
>  	.size	= _size,						\
>  }
>  
>  #define __BIN_ATTR_RO(_name, _size) {					\
> -	.attr	= { .name = __stringify(_name), .mode = 0444 },		\
> +	.attr	= { .name = __stringify(_name),                         \
> +		    .mode = 0444,					\
> +		    .owner = THIS_MODULE,				\
> +	},								\
>  	.read	= _name##_read,						\
>  	.size	= _size,						\
>  }
>  
>  #define __BIN_ATTR_WO(_name, _size) {					\
> -	.attr	= { .name = __stringify(_name), .mode = 0200 },		\
> +	.attr	= { .name = __stringify(_name),                         \
> +		    .mode = 0200,					\
> +		    .owner = THIS_MODULE,				\
> +	},								\
>  	.write	= _name##_write,					\
>  	.size	= _size,						\
>  }
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 9e0390000025..c6b0a28f599c 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3975,7 +3975,7 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
>  				  cgroup_file_mode(cft),
>  				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
>  				  0, cft->kf_ops, cft,
> -				  NULL, key);
> +				  NULL, key, NULL);
>  	if (IS_ERR(kn))
>  		return PTR_ERR(kn);
>  
> -- 
> 2.30.2
> 

-- 
Kees Cook


* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-09-27 16:38 ` [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate Luis Chamberlain
@ 2021-10-05 20:55   ` Kees Cook
  2021-10-11 18:27     ` Luis Chamberlain
  2021-10-14  1:55   ` Ming Lei
  1 sibling, 1 reply; 94+ messages in thread
From: Kees Cook @ 2021-10-05 20:55 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, rostedt, linux-spdx, linux-doc,
	linux-block, linux-fsdevel, linux-kselftest, linux-kernel

On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
> Provide a simple state machine to fix races with driver exit where we
> remove the CPU multistate callbacks and re-initialization / creation of
> new per CPU instances which should be managed by these callbacks.
> 
> The zram driver makes use of cpu hotplug multistate support, whereby it
> associates a struct zcomp per CPU. Each struct zcomp represents a
> compression algorithm in charge of managing compression streams per
> CPU. Although a compiled zram driver only supports a fixed set of
> compression algorithms, each zram device gets a struct zcomp allocated
> per CPU. The "multi" in CPU hotplug multistate refers to these per
> cpu struct zcomp instances. Each of these will have the CPU hotplug
> callback called for it on CPU plug / unplug. The kernel's CPU hotplug
> multistate keeps a linked list of these different structures so that
> it will iterate over them on CPU transitions.
> 
> By default at driver initialization we will create just one zram device
> (num_devices=1) and a zcomp structure then set for the now default
> lzo-rle compression algorithm. At driver removal we first remove each
> zram device, and so we destroy the associated struct zcomp per CPU. But
> since we expose sysfs attributes to create new devices or reset /
> initialize existing zram devices, we can easily end up re-initializing
> a struct zcomp for a zram device before the exit routine of the module
> removes the cpu hotplug callback. When this happens the kernel's CPU
> hotplug will detect that at least one instance (struct zcomp for us)
> exists. This can happen in the following situation:
> 
> CPU 1                            CPU 2
> 
>                                 disksize_store(...);
> class_unregister(...);
> idr_for_each(...);
> zram_debugfs_destroy();
> 
> idr_destroy(...);
> unregister_blkdev(...);
> cpuhp_remove_multi_state(...);

So this is strictly separate from the sysfs/module unloading race?

-Kees

> 
> The warning comes up on cpuhp_remove_multi_state() when it sees that the
> state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list.
> In this case a struct zcomp still exists: the driver allowed its
> creation per CPU even though we could have just freed them per CPU
> through a call on another CPU, and we are then later trying to remove the
> hotplug callback.
> 
> Fix all this by providing a zram initialization boolean
> protected by the driver's shared zram_index_mutex, which we
> can use to annotate when sysfs attributes are safe to use or
> not -- once the driver is properly initialized. When the driver
> is going down we also are sure to not let userspace muck with
> attributes which may affect each per cpu struct zcomp.
> 
> This also fixes a series of possible memory leaks. The
> crashes and memory leaks can easily be caused by issuing
> the zram02.sh script from the LTP project [0] in a loop
> in two separate windows:
> 
>   cd testcases/kernel/device-drivers/zram
>   while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done
> 
> You end up with a splat as follows:
> 
> kernel: zram: Removed device: zram0
> kernel: zram: Added device: zram0
> kernel: zram0: detected capacity change from 0 to 209715200
> kernel: Adding 104857596k swap on /dev/zram0.  <etc>
> kernel: zram0: detected capacity change from 209715200 to 0
> kernel: zram0: detected capacity change from 0 to 209715200
> kernel: ------------[ cut here ]------------
> kernel: Error: Removing state 63 which has instances left.
> kernel: WARNING: CPU: 7 PID: 70457 at \
> 	kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
> kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
> kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G            \
> 	E     5.12.0-rc1-next-20210304 #3
> kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), \
> 	BIOS 1.14.0-2 04/01/2014
> kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
> kernel: Code: <etc>
> kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
> kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
> kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
> kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: ffffa800c139bcb8
> kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
> kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
> kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:<etc>
> kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
> kernel: Call Trace:
> kernel: __cpuhp_remove_state+0x2e/0x80
> kernel: __do_sys_delete_module+0x190/0x2a0
> kernel:  do_syscall_64+0x33/0x80
> kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> The "Error: Removing state 63 which has instances left" refers
> to the zram per CPU struct zcomp instances left.
> 
> [0] https://github.com/linux-test-project/ltp.git
> 
> Acked-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  drivers/block/zram/zram_drv.c | 63 ++++++++++++++++++++++++++++++-----
>  1 file changed, 55 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index f61910c65f0f..b26abcb955cc 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex);
>  static int zram_major;
>  static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
>  
> +static bool zram_up;
> +
>  /* Module params (documentation at end) */
>  static unsigned int num_devices = 1;
>  /*
> @@ -1704,6 +1706,7 @@ static void zram_reset_device(struct zram *zram)
>  	comp = zram->comp;
>  	disksize = zram->disksize;
>  	zram->disksize = 0;
> +	zram->comp = NULL;
>  
>  	set_capacity_and_notify(zram->disk, 0);
>  	part_stat_set_all(zram->disk->part0, 0);
> @@ -1724,9 +1727,18 @@ static ssize_t disksize_store(struct device *dev,
>  	struct zram *zram = dev_to_zram(dev);
>  	int err;
>  
> +	mutex_lock(&zram_index_mutex);
> +
> +	if (!zram_up) {
> +		err = -ENODEV;
> +		goto out;
> +	}
> +
>  	disksize = memparse(buf, NULL);
> -	if (!disksize)
> -		return -EINVAL;
> +	if (!disksize) {
> +		err = -EINVAL;
> +		goto out;
> +	}
>  
>  	down_write(&zram->init_lock);
>  	if (init_done(zram)) {
> @@ -1754,12 +1766,16 @@ static ssize_t disksize_store(struct device *dev,
>  	set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
>  	up_write(&zram->init_lock);
>  
> +	mutex_unlock(&zram_index_mutex);
> +
>  	return len;
>  
>  out_free_meta:
>  	zram_meta_free(zram, disksize);
>  out_unlock:
>  	up_write(&zram->init_lock);
> +out:
> +	mutex_unlock(&zram_index_mutex);
>  	return err;
>  }
>  
> @@ -1775,8 +1791,17 @@ static ssize_t reset_store(struct device *dev,
>  	if (ret)
>  		return ret;
>  
> -	if (!do_reset)
> -		return -EINVAL;
> +	mutex_lock(&zram_index_mutex);
> +
> +	if (!zram_up) {
> +		len = -ENODEV;
> +		goto out;
> +	}
> +
> +	if (!do_reset) {
> +		len = -EINVAL;
> +		goto out;
> +	}
>  
>  	zram = dev_to_zram(dev);
>  	bdev = zram->disk->part0;
> @@ -1785,7 +1810,8 @@ static ssize_t reset_store(struct device *dev,
>  	/* Do not reset an active device or claimed device */
>  	if (bdev->bd_openers || zram->claim) {
>  		mutex_unlock(&bdev->bd_disk->open_mutex);
> -		return -EBUSY;
> +		len = -EBUSY;
> +		goto out;
>  	}
>  
>  	/* From now on, anyone can't open /dev/zram[0-9] */
> @@ -1800,6 +1826,8 @@ static ssize_t reset_store(struct device *dev,
>  	zram->claim = false;
>  	mutex_unlock(&bdev->bd_disk->open_mutex);
>  
> +out:
> +	mutex_unlock(&zram_index_mutex);
>  	return len;
>  }
>  
> @@ -2010,6 +2038,10 @@ static ssize_t hot_add_show(struct class *class,
>  	int ret;
>  
>  	mutex_lock(&zram_index_mutex);
> +	if (!zram_up) {
> +		mutex_unlock(&zram_index_mutex);
> +		return -ENODEV;
> +	}
>  	ret = zram_add();
>  	mutex_unlock(&zram_index_mutex);
>  
> @@ -2037,6 +2069,11 @@ static ssize_t hot_remove_store(struct class *class,
>  
>  	mutex_lock(&zram_index_mutex);
>  
> +	if (!zram_up) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +
>  	zram = idr_find(&zram_index_idr, dev_id);
>  	if (zram) {
>  		ret = zram_remove(zram);
> @@ -2046,6 +2083,7 @@ static ssize_t hot_remove_store(struct class *class,
>  		ret = -ENODEV;
>  	}
>  
> +out:
>  	mutex_unlock(&zram_index_mutex);
>  	return ret ? ret : count;
>  }
> @@ -2072,12 +2110,15 @@ static int zram_remove_cb(int id, void *ptr, void *data)
>  
>  static void destroy_devices(void)
>  {
> +	mutex_lock(&zram_index_mutex);
> +	zram_up = false;
>  	class_unregister(&zram_control_class);
>  	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
>  	zram_debugfs_destroy();
>  	idr_destroy(&zram_index_idr);
>  	unregister_blkdev(zram_major, "zram");
>  	cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> +	mutex_unlock(&zram_index_mutex);
>  }
>  
>  static int __init zram_init(void)
> @@ -2105,15 +2146,21 @@ static int __init zram_init(void)
>  		return -EBUSY;
>  	}
>  
> +	mutex_lock(&zram_index_mutex);
> +
>  	while (num_devices != 0) {
> -		mutex_lock(&zram_index_mutex);
>  		ret = zram_add();
> -		mutex_unlock(&zram_index_mutex);
> -		if (ret < 0)
> +		if (ret < 0) {
> +			mutex_unlock(&zram_index_mutex);
>  			goto out_error;
> +		}
>  		num_devices--;
>  	}
>  
> +	zram_up = true;
> +
> +	mutex_unlock(&zram_index_mutex);
> +
>  	return 0;
>  
>  out_error:
> -- 
> 2.30.2
> 

-- 
Kees Cook
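
The guard described in the quoted commit log (an "up" flag checked under the
driver-wide mutex by every attribute handler) can be sketched in userspace as
follows. This is a simplified model under stated assumptions, not the zram
code: driver_up, instances, index_mutex, attr_store(), driver_init() and
driver_exit() are hypothetical stand-ins for zram_up, the per-CPU struct
zcomp instances, zram_index_mutex, the sysfs store handlers, zram_init() and
destroy_devices().

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t index_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool driver_up;	/* plays the role of zram_up */
static int instances;	/* plays the role of the per-CPU zcomp instances */

/* Init side: only flip the flag once setup is complete. */
static void driver_init(void)
{
	pthread_mutex_lock(&index_mutex);
	driver_up = true;
	pthread_mutex_unlock(&index_mutex);
}

/* Attribute handler: refuse to create state unless the driver is up. */
static int attr_store(void)
{
	int err = 0;

	pthread_mutex_lock(&index_mutex);
	if (!driver_up) {
		err = -ENODEV;
		goto out;
	}
	instances++;	/* e.g. (re)initializing a device */
out:
	pthread_mutex_unlock(&index_mutex);
	return err;
}

/* Exit side: clear the flag first, then tear down under the same lock. */
static void driver_exit(void)
{
	pthread_mutex_lock(&index_mutex);
	driver_up = false;
	instances = 0;
	/* Models the "has instances left" check at callback removal. */
	assert(instances == 0);
	pthread_mutex_unlock(&index_mutex);
}
```

Because the flag and the teardown share one lock, a handler that runs after
driver_exit() sees driver_up == false and cannot leave a new instance behind.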


* Re: [PATCH v8 12/12] zram: use ATTRIBUTE_GROUPS to fix sysfs deadlock module removal
  2021-09-27 16:38 ` [PATCH v8 12/12] zram: use ATTRIBUTE_GROUPS to fix sysfs deadlock module removal Luis Chamberlain
@ 2021-10-05 20:57   ` Kees Cook
  2021-10-11 18:28     ` Luis Chamberlain
  0 siblings, 1 reply; 94+ messages in thread
From: Kees Cook @ 2021-10-05 20:57 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, rostedt, linux-spdx, linux-doc,
	linux-block, linux-fsdevel, linux-kselftest, linux-kernel

On Mon, Sep 27, 2021 at 09:38:05AM -0700, Luis Chamberlain wrote:
> ATTRIBUTE_GROUPS is typically used to avoid boilerplate
> code which is used in many drivers. Embracing ATTRIBUTE_GROUPS was
> long overdue on the zram driver, however a recent fix for sysfs allows
> users of ATTRIBUTE_GROUPS to also associate a module to the group
> attribute.

Does this mean that other modules using sysfs but _not_
ATTRIBUTE_GROUPS() are still vulnerable to potential use-after-free of
the kernfs fops?

-Kees

> 
> In zram's case this also means it allows us to fix a race which triggers
> a deadlock on the zram driver. This deadlock happens when a sysfs attribute
> uses a lock also used on module removal. This happens when, for instance, a
> sysfs file on a driver is used while module removal is triggered at the
> same time. The module removal code holds a lock, and then
> the sysfs file entry waits for the same lock. While holding the lock the
> module removal tries to remove the sysfs entries, but these cannot be
> removed yet as one is waiting for a lock. This won't complete as the lock
> is already held. Likewise module removal cannot complete, and so we
> deadlock.
> 
> Sysfs fixes this when the group attributes have a module associated with
> them: sysfs will *try* to get a refcount to the module when a shared
> lock is used, prior to mucking with a sysfs attribute. If this fails we
> just give up right away.
> 
> This deadlock was first reported with the zram driver, a sketch of how
> this can happen follows:
> 
> CPU A                              CPU B
>                                    whatever_store()
> module_unload
>   mutex_lock(foo)
>                                    mutex_lock(foo)
>    del_gendisk(zram->disk);
>      device_del()
>        device_remove_groups()
> 
> In this situation whatever_store() is waiting for the mutex foo to
> become unlocked, but that won't happen until module removal is complete.
> But module removal won't complete until the sysfs file being poked
> completes which is waiting for a lock already held.
> 
> This issue can be reproduced easily on the zram driver as follows:
> 
> Loop 1 on one terminal:
> 
> while true;
> 	do modprobe zram;
> 	modprobe -r zram;
> done
> 
> Loop 2 on a second terminal:
> while true; do
> 	echo 1024 >  /sys/block/zram0/disksize;
> 	echo 1 > /sys/block/zram0/reset;
> done
> 
> Without this patch we end up in a deadlock, and the following
> stack trace is produced which hints to us what the issue was:
> 
> INFO: task bash:888 blocked for more than 120 seconds.
>       Tainted: G            E 5.12.0-rc1-next-20210304+ #4
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:bash            state:D stack:    0 pid:  888 ppid: 887 flags:<etc>
> Call Trace:
>  __schedule+0x2e4/0x900
>  schedule+0x46/0xb0
>  schedule_preempt_disabled+0xa/0x10
>  __mutex_lock.constprop.0+0x2c3/0x490
>  ? _kstrtoull+0x35/0xd0
>  reset_store+0x6c/0x160 [zram]
>  kernfs_fop_write_iter+0x124/0x1b0
>  new_sync_write+0x11c/0x1b0
>  vfs_write+0x1c2/0x260
>  ksys_write+0x5f/0xe0
>  do_syscall_64+0x33/0x80
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> RIP: 0033:0x7f34f2c3df33
> RSP: 002b:00007ffe751df6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f34f2c3df33
> RDX: 0000000000000002 RSI: 0000561ccb06ec10 RDI: 0000000000000001
> RBP: 0000561ccb06ec10 R08: 000000000000000a R09: 0000000000000001
> R10: 0000561ccb157590 R11: 0000000000000246 R12: 0000000000000002
> R13: 00007f34f2d0e6a0 R14: 0000000000000002 R15: 00007f34f2d0e8a0
> INFO: task modprobe:1104 can't die for more than 120 seconds.
> task:modprobe        state:D stack:    0 pid: 1104 ppid: 916 flags:<etc>
> Call Trace:
>  __schedule+0x2e4/0x900
>  schedule+0x46/0xb0
>  __kernfs_remove.part.0+0x228/0x2b0
>  ? finish_wait+0x80/0x80
>  kernfs_remove_by_name_ns+0x50/0x90
>  remove_files+0x2b/0x60
>  sysfs_remove_group+0x38/0x80
>  sysfs_remove_groups+0x29/0x40
>  device_remove_attrs+0x4a/0x80
>  device_del+0x183/0x3e0
>  ? mutex_lock+0xe/0x30
>  del_gendisk+0x27a/0x2d0
>  zram_remove+0x8a/0xb0 [zram]
>  ? hot_remove_store+0xf0/0xf0 [zram]
>  zram_remove_cb+0xd/0x10 [zram]
>  idr_for_each+0x5e/0xd0
>  destroy_devices+0x39/0x6f [zram]
>  __do_sys_delete_module+0x190/0x2a0
>  do_syscall_64+0x33/0x80
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> RIP: 0033:0x7f32adf727d7
> RSP: 002b:00007ffc08bb38a8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> RAX: ffffffffffffffda RBX: 000055eea23cbb10 RCX: 00007f32adf727d7
> RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055eea23cbb78
> RBP: 000055eea23cbb10 R08: 0000000000000000 R09: 0000000000000000
> R10: 00007f32adfe5ac0 R11: 0000000000000206 R12: 000055eea23cbb78
> R13: 0000000000000000 R14: 0000000000000000 R15: 000055eea23cbc20
> 
> [0] https://lkml.kernel.org/r/20210401235925.GR4332@42.do-not-panic.com
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  drivers/block/zram/zram_drv.c | 11 ++---------
>  1 file changed, 2 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index b26abcb955cc..60a55ae8cd91 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1902,14 +1902,7 @@ static struct attribute *zram_disk_attrs[] = {
>  	NULL,
>  };
>  
> -static const struct attribute_group zram_disk_attr_group = {
> -	.attrs = zram_disk_attrs,
> -};
> -
> -static const struct attribute_group *zram_disk_attr_groups[] = {
> -	&zram_disk_attr_group,
> -	NULL,
> -};
> +ATTRIBUTE_GROUPS(zram_disk);
>  
>  /*
>   * Allocate and initialize new zram device. the function returns
> @@ -1981,7 +1974,7 @@ static int zram_add(void)
>  		blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
>  
>  	blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, zram->disk->queue);
> -	device_add_disk(NULL, zram->disk, zram_disk_attr_groups);
> +	device_add_disk(NULL, zram->disk, zram_disk_groups);
>  
>  	strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
>  
> -- 
> 2.30.2
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 03/12] selftests: add tests_sysfs module
  2021-09-27 16:37 ` [PATCH v8 03/12] selftests: add tests_sysfs module Luis Chamberlain
  2021-10-05 14:16   ` Greg KH
@ 2021-10-07 14:23   ` Miroslav Benes
  2021-10-11 19:11     ` Luis Chamberlain
       [not found]   ` <202110050912.3DF681ED@keescook>
  2 siblings, 1 reply; 94+ messages in thread
From: Miroslav Benes @ 2021-10-07 14:23 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel

On Mon, 27 Sep 2021, Luis Chamberlain wrote:

> This adds a new selftest module which can be used to test sysfs, which
> would otherwise require using an existing driver. This lets us muck
> with a template driver to test breaking things without affecting
> system behaviour or requiring the dependencies of a real device
> driver.
> 
> A series of 28 tests are added. Support for using two device types are
> supported:
> 
>   * misc
>   * block

I suppose the selftests will run for more than 45 seconds (default 
kselftest timeout), so you probably also want to set timeout to something 
sensible in tools/testing/selftests/sysfs/settings file (0 would disable 
it).

Miroslav

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 03/12] selftests: add tests_sysfs module
  2021-10-05 14:16   ` Greg KH
  2021-10-05 16:57     ` Tim.Bird
@ 2021-10-11 17:38     ` Luis Chamberlain
  1 sibling, 0 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-11 17:38 UTC (permalink / raw)
  To: Greg KH
  Cc: tj, akpm, minchan, jeyu, shuah, bvanassche, dan.j.williams, joe,
	tglx, keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel

On Tue, Oct 05, 2021 at 04:16:46PM +0200, Greg KH wrote:
> On Mon, Sep 27, 2021 at 09:37:56AM -0700, Luis Chamberlain wrote:
> > --- /dev/null
> > +++ b/lib/test_sysfs.c
> > @@ -0,0 +1,921 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
> > +/*
> > + * sysfs test driver
> > + *
> > + * Copyright (C) 2021 Luis Chamberlain <mcgrof@kernel.org>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms of the GNU General Public License as published by the Free
> > + * Software Foundation; either version 2 of the License, or at your option any
> > + * later version; or, when distributed separately from the Linux kernel or
> > + * when incorporated into other software packages, subject to the following
> > + * license:
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms of copyleft-next (version 0.3.1 or later) as published
> > + * at http://copyleft-next.org/.
> 
> Independent of the fact that I don't like sysfs code attempting to be
> accessed in the kernel with licenses other than GPLv2, you do not need
> the license "boilerplate" text at all in files.  That's what the SPDX
> line is for.

Sure, I'll remove the boilerplate, sorry for missing that again, I
thought I had removed it.

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 03/12] selftests: add tests_sysfs module
  2021-10-05 16:57     ` Tim.Bird
@ 2021-10-11 17:40       ` Luis Chamberlain
  0 siblings, 0 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-11 17:40 UTC (permalink / raw)
  To: Tim.Bird
  Cc: gregkh, tj, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel

On Tue, Oct 05, 2021 at 04:57:55PM +0000, Tim.Bird@sony.com wrote:
> 
> 
> > -----Original Message-----
> > From: Greg KH <gregkh@linuxfoundation.org>
> > Sent: Tuesday, October 5, 2021 8:17 AM
> > To: Luis Chamberlain <mcgrof@kernel.org>
> > Cc: tj@kernel.org; akpm@linux-foundation.org; minchan@kernel.org; jeyu@kernel.org; shuah@kernel.org; bvanassche@acm.org;
> > dan.j.williams@intel.com; joe@perches.com; tglx@linutronix.de; keescook@chromium.org; rostedt@goodmis.org; linux-
> > spdx@vger.kernel.org; linux-doc@vger.kernel.org; linux-block@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-
> > kselftest@vger.kernel.org; linux-kernel@vger.kernel.org
> > Subject: Re: [PATCH v8 03/12] selftests: add tests_sysfs module
> > 
> > On Mon, Sep 27, 2021 at 09:37:56AM -0700, Luis Chamberlain wrote:
> > > --- /dev/null
> > > +++ b/lib/test_sysfs.c
> > > @@ -0,0 +1,921 @@
> > > +// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
> > > +/*
> > > + * sysfs test driver
> > > + *
> > > + * Copyright (C) 2021 Luis Chamberlain <mcgrof@kernel.org>
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify it
> > > + * under the terms of the GNU General Public License as published by the Free
> > > + * Software Foundation; either version 2 of the License, or at your option any
> > > + * later version; or, when distributed separately from the Linux kernel or
> > > + * when incorporated into other software packages, subject to the following
> > > + * license:
> 
> This is a very strange license grant, which I'm not sure is covered by any
> current SPDX syntax.
> " when distributed separately from the Linux kernel or when incorporated into
> other software packages, subject to the following license:"

drivers/xen/events/events_fifo.c has that same language.

> Why would we care about the license used when the code is used in a non-kernel
> project?  If it is desired for the code to be available outside the kernel under a
> different license, then surely the easiest thing is to make it available separately
> under that license.  I'm not sure why the kernel needs to carry this license for
> non-kernel use of the code.
> 
> I would recommend giving this a GPLv2 SPDX header, and maybe in the comment
> at the top of the file put a reference to a git repository where the code can be
> obtained under a different license.

Keeping the dual license lets the code benefit directly from updates made
in the kernel. A fork would stagnate and would require separate updates.

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 01/12] LICENSES: Add the copyleft-next-0.3.1 license
       [not found]     ` <YWR2ZrtzChamY1y4@bombadil.infradead.org>
@ 2021-10-11 17:57       ` Kees Cook
  0 siblings, 0 replies; 94+ messages in thread
From: Kees Cook @ 2021-10-11 17:57 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, rostedt, linux-spdx, linux-doc,
	linux-block, linux-fsdevel, linux-kselftest, linux-kernel,
	Goldwyn Rodrigues, Kuno Woudt, Richard Fontana, copyleft-next,
	Ciaran Farrell, Christopher De Nicolo, Christoph Hellwig,
	Jonathan Corbet, Thorsten Leemhuis

On Mon, Oct 11, 2021 at 10:37:42AM -0700, Luis Chamberlain wrote:
> On Tue, Oct 05, 2021 at 09:08:59AM -0700, Kees Cook wrote:
> > On Mon, Sep 27, 2021 at 09:37:54AM -0700, Luis Chamberlain wrote:
> > I can confirm that LICENSES/dual/copyleft-next-0.3.1 matches
> > https://github.com/copyleft-next/copyleft-next/blob/master/Releases/copyleft-next-0.3.1
> > 
> > Reviewed-by: Kees Cook <keescook@chromium.org>
> > 
> > > +   If the Derived Work includes material licensed under the GPL, You may
> > > +   instead license the Derived Work under the GPL.
> > > +   
> > 
> > nit: needless whitespace, though technically the original license
> > includes this too. :)
> 
> Indeed, I decided to leave the white space as the original had it too.
> Should I really get rid of the space or keep it?

Probably keep it for 0 diff with original. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-05 20:55   ` Kees Cook
@ 2021-10-11 18:27     ` Luis Chamberlain
  0 siblings, 0 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-11 18:27 UTC (permalink / raw)
  To: Kees Cook, akpm
  Cc: tj, gregkh, minchan, jeyu, shuah, bvanassche, dan.j.williams,
	joe, tglx, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel

On Tue, Oct 05, 2021 at 01:55:35PM -0700, Kees Cook wrote:
> On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
> > Provide a simple state machine to fix races with driver exit where we
> > remove the CPU multistate callbacks and re-initialization / creation of
> > new per CPU instances which should be managed by these callbacks.
> > 
> > The zram driver makes use of cpu hotplug multistate support, whereby it
> > associates a struct zcomp per CPU. Each struct zcomp represents a
> > compression algorithm in charge of managing compression streams per
> > CPU. Although a compiled zram driver only supports a fixed set of
> > compression algorithms, each zram device gets a struct zcomp allocated
> > per CPU. The "multi" in CPU hotplug multistate refers to these per
> > cpu struct zcomp instances. Each of these will have the CPU hotplug
> > callback called for it on CPU plug / unplug. The kernel's CPU hotplug
> > multistate keeps a linked list of these different structures so that
> > it will iterate over them on CPU transitions.
> > 
> > By default at driver initialization we will create just one zram device
> > (num_devices=1) and a zcomp structure then set for the now default
> > lzo-rle compression algorithm. At driver removal we first remove each
> > zram device, and so we destroy the associated struct zcomp per CPU. But
> > since we expose sysfs attributes to create new devices or reset /
> > initialize existing zram devices, we can easily end up re-initializing
> > a struct zcomp for a zram device before the exit routine of the module
> > removes the cpu hotplug callback. When this happens the kernel's CPU
> > hotplug will detect that at least one instance (struct zcomp for us)
> > exists. This can happen in the following situation:
> > 
> > CPU 1                            CPU 2
> > 
> >                                 disksize_store(...);
> > class_unregister(...);
> > idr_for_each(...);
> > zram_debugfs_destroy();
> > 
> > idr_destroy(...);
> > unregister_blkdev(...);
> > cpuhp_remove_multi_state(...);
> 
> So this is strictly separate from the sysfs/module unloading race?

It is only related in the sense that the sysfs/module unloading race
happened *after* this other issue, but addressing these through
separate threads created a break in conversation and focus. For
instance, a theoretical race was mentioned in one thread, which
I worked to prove or disprove, and in the end I showed it was not
possible.

But at this point, yes, this is a purely separate issue, and this
patch *should* be picked up already.

Andrew, can you merge this? It already has the respective maintainer
Ack, and I can continue to work on the rest of the patches. The only
issue I can think of would be a conflict with the last patch, but
that's a one-liner. I think chances are low that it would create a
conflict if it's all merged separately, and if it does, it should be an
easy merge conflict to fix.

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 12/12] zram: use ATTRIBUTE_GROUPS to fix sysfs deadlock module removal
  2021-10-05 20:57   ` Kees Cook
@ 2021-10-11 18:28     ` Luis Chamberlain
  0 siblings, 0 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-11 18:28 UTC (permalink / raw)
  To: Kees Cook
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, rostedt, linux-spdx, linux-doc,
	linux-block, linux-fsdevel, linux-kselftest, linux-kernel

On Tue, Oct 05, 2021 at 01:57:00PM -0700, Kees Cook wrote:
> On Mon, Sep 27, 2021 at 09:38:05AM -0700, Luis Chamberlain wrote:
> > The ATTRIBUTE_GROUPS is typically used to avoid boiler plate
> > code which is used in many drivers. Embracing ATTRIBUTE_GROUPS was
> > long due on the zram driver, however a recent fix for sysfs allows
> > users of ATTRIBUTE_GROUPS to also associate a module to the group
> > attribute.
> 
> Does this mean that other modules using sysfs but _not_
> ATTRIBUTE_GROUPS() are still vulnerable to potential use-after-free of
> the kernfs fops?

The issue is not UAF, it's the possible deadlock, but in that sense, yes.
If they don't use ATTRIBUTE_GROUPS() then there is no information being
provided to sysfs about the module owner.
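
To make the boilerplate concrete, here is a rough userspace sketch of
what ATTRIBUTE_GROUPS(zram_disk) generates. The two structs are
stripped-down stand-ins for the kernel's struct attribute and struct
attribute_group, the two attributes are only examples, and
count_attrs() is a hypothetical helper. The macro body is modeled on
the one in include/linux/sysfs.h and does not include the module
association discussed in this series.

```c
#include <assert.h>
#include <stddef.h>

/* Stripped-down stand-ins for the kernel structures (assumption). */
struct attribute {
	const char *name;
};

struct attribute_group {
	struct attribute **attrs;
};

/* Modeled on the macro in include/linux/sysfs.h. */
#define ATTRIBUTE_GROUPS(_name)					\
static const struct attribute_group _name##_group = {		\
	.attrs = _name##_attrs,					\
};								\
static const struct attribute_group *_name##_groups[] = {	\
	&_name##_group,						\
	NULL,							\
}

static struct attribute attr_disksize = { .name = "disksize" };
static struct attribute attr_reset = { .name = "reset" };

static struct attribute *zram_disk_attrs[] = {
	&attr_disksize,
	&attr_reset,
	NULL,
};

/* Generates zram_disk_group and zram_disk_groups[]. */
ATTRIBUTE_GROUPS(zram_disk);

/* Hypothetical helper: count attributes reachable through a groups array. */
static size_t count_attrs(const struct attribute_group **groups)
{
	size_t n = 0;

	assert(groups != NULL);
	for (; *groups; groups++)
		for (struct attribute **a = (*groups)->attrs; *a; a++)
			n++;
	return n;
}
```

The generated zram_disk_groups is the array that the patch passes to
device_add_disk() in place of the hand-written zram_disk_attr_groups.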

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 03/12] selftests: add tests_sysfs module
       [not found]   ` <202110050912.3DF681ED@keescook>
@ 2021-10-11 19:03     ` Luis Chamberlain
  0 siblings, 0 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-11 19:03 UTC (permalink / raw)
  To: Kees Cook
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, rostedt, linux-spdx, linux-doc,
	linux-block, linux-fsdevel, linux-kselftest, linux-kernel

On Tue, Oct 05, 2021 at 09:30:10AM -0700, Kees Cook wrote:
> On Mon, Sep 27, 2021 at 09:37:56AM -0700, Luis Chamberlain wrote:
> > --- /dev/null
> > +++ b/lib/test_sysfs.c
> > @@ -0,0 +1,921 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
> > +/*
> > + * sysfs test driver
> > + *
> > + * Copyright (C) 2021 Luis Chamberlain <mcgrof@kernel.org>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms of the GNU General Public License as published by the Free
> > + * Software Foundation; either version 2 of the License, or at your option any
> > + * later version; or, when distributed separately from the Linux kernel or
> > + * when incorporated into other software packages, subject to the following
> > + * license:
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms of copyleft-next (version 0.3.1 or later) as published
> > + * at http://copyleft-next.org/.
> 
> As Greg suggested, please drop the boilerplate here.

Sure, sorry for missing that. Fixed.

> > +static ssize_t config_show(struct device *dev,
> > +			   struct device_attribute *attr,
> > +			   char *buf)
> > +{
> > +	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
> > +	struct test_config *config = &test_dev->config;
> > +	int len = 0;
> > +
> > +	test_dev_config_lock(test_dev);
> > +
> > +	len += snprintf(buf, PAGE_SIZE,
> > +			"Configuration for: %s\n",
> > +			dev_name(dev));
> 
> Please use sysfs_emit() instead of snprintf().

Oh nice, done and fixed also in the other places.

> > +static int sysfs_test_dev_alloc_blockdev(struct sysfs_test_device *test_dev)
> > +{
> > +	int ret = -ENOMEM;
> > +
> > +	test_dev->disk = blk_alloc_disk(NUMA_NO_NODE);
> > +	if (!test_dev->disk) {
> > +		pr_err("Error allocating disk structure for device %d\n",
> > +		       test_dev->dev_idx);
> > +		goto out;
> > +	}
> > +
> > +	test_dev->disk->major = sysfs_test_major;
> > +	test_dev->disk->first_minor = test_dev->dev_idx + 1;
> > +	test_dev->disk->fops = &sysfs_testdev_ops;
> > +	test_dev->disk->private_data = test_dev;
> > +	snprintf(test_dev->disk->disk_name, 16, "test_sysfs%d",
> > +		 test_dev->dev_idx);
> 
> Prefer sizeof(test_dev->disk->disk_name) over open-coded "16".

Sure.
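
A small userspace sketch of the sizeof() form, for reference. struct
fake_disk and set_disk_name() are hypothetical stand-ins, and
DISK_NAME_LEN is set to 32 on the assumption that it matches the kernel
headers. The point is only that the bound follows the array declaration
instead of a separate literal.

```c
#include <assert.h>
#include <stdio.h>

#define DISK_NAME_LEN 32	/* assumed to match the kernel's value */

/* Hypothetical stand-in for the relevant part of struct gendisk. */
struct fake_disk {
	char disk_name[DISK_NAME_LEN];
};

/* Format the name; the bound tracks the array's declared size. */
static int set_disk_name(struct fake_disk *disk, int idx)
{
	assert(disk != NULL);
	return snprintf(disk->disk_name, sizeof(disk->disk_name),
			"test_sysfs%d", idx);
}
```

If the array is ever resized, the call keeps using the right bound with
no further edits.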

> > +static ssize_t read_reset_first_test_dev(struct file *file,
> > +					 char __user *user_buf,
> > +					 size_t count, loff_t *ppos)
> > +{
> > +	ssize_t len;
> > +	char buf[32];
> > +
> > +	reset_first_test_dev++;
> > +	len = sprintf(buf, "%d\n", reset_first_test_dev);
> 
> Even though it's safe as-is, I was going to suggest scnprintf() here
> (i.e. explicit bounds and a bounds-checked "len"). However, scnprintf()
> returns ssize_t, and there's no bounds checking in
> simple_read_from_buffer. That needs fixing (I'll send a patch).

OK we can later change it to scnprintf() once your patch gets merged.
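
For anyone following along, the difference that matters is the return
value. Below is a userspace model of scnprintf(), written from my
understanding of its semantics and not copied from the kernel;
model_scnprintf() is a made-up name. snprintf() reports the length the
output would have had, while this reports what was actually stored.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Userspace model of scnprintf(): returns the number of characters
 * actually stored in buf, not counting the trailing NUL.
 */
static int model_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int i;

	assert(buf != NULL);
	if (size == 0)
		return 0;

	va_start(args, fmt);
	i = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (i < 0)
		return 0;		/* encoding error: nothing usable stored */
	if ((size_t)i < size)
		return i;		/* everything fit */
	return (int)(size - 1);		/* truncated: report what was kept */
}
```

With a 4-byte buffer and the value 123456, snprintf() returns 6 while
the model returns 3, so a length taken from it can never point past the
end of the buffer.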

> > --- /dev/null
> > +++ b/tools/testing/selftests/sysfs/sysfs.sh
> > @@ -0,0 +1,1208 @@
> > +#!/bin/bash
> > +# SPDX-License-Identifier: GPL-2.0-or-later
> > +# Copyright (C) 2021 Luis Chamberlain <mcgrof@kernel.org>
> > +#
> > +# This program is free software; you can redistribute it and/or modify it
> > +# under the terms of the GNU General Public License as published by the Free
> > +# Software Foundation; either version 2 of the License, or at your option any
> > +# later version; or, when distributed separately from the Linux kernel or
> > +# when incorporated into other software packages, subject to the following
> > +# license:
> > +#
> > +# This program is free software; you can redistribute it and/or modify it
> > +# under the terms of copyleft-next (version 0.3.1 or later) as published
> > +# at http://copyleft-next.org/.
> > +
> > +# This performs a series tests against the sysfs filesystem.
> 
> -boilerplate

Nuked.

> > +check_dmesg()
> > +{
> > +	# filter out intentional WARNINGs or Oopses
> > +	local filter=${1:-_check_dmesg_filter}
> > +
> > +	_dmesg_since_test_start | $filter >$seqres.dmesg
> > +	egrep -q -e "kernel BUG at" \
> > +	     -e "WARNING:" \
> > +	     -e "\bBUG:" \
> > +	     -e "Oops:" \
> > +	     -e "possible recursive locking detected" \
> > +	     -e "Internal error" \
> > +	     -e "(INFO|ERR): suspicious RCU usage" \
> > +	     -e "INFO: possible circular locking dependency detected" \
> > +	     -e "general protection fault:" \
> > +	     -e "BUG .* remaining" \
> > +	     -e "UBSAN:" \
> > +	     $seqres.dmesg
> 
> Is just looking for "call trace" sufficient here?

So far from my testing, yes. This strategy is borrowed from fstests,
where it has seen quite a lot of testing. If we consider an enhancement
here, the same enhancement would be welcome in fstests.

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 03/12] selftests: add tests_sysfs module
  2021-10-07 14:23   ` Miroslav Benes
@ 2021-10-11 19:11     ` Luis Chamberlain
  0 siblings, 0 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-11 19:11 UTC (permalink / raw)
  To: Miroslav Benes
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel

On Thu, Oct 07, 2021 at 04:23:22PM +0200, Miroslav Benes wrote:
> On Mon, 27 Sep 2021, Luis Chamberlain wrote:
> 
> > This adds a new selftest module which can be used to test sysfs, which
> > would otherwise require using an existing driver. This lets us muck
> > with a template driver to test breaking things without affecting
> > system behaviour or requiring the dependencies of a real device
> > driver.
> > 
> > A series of 28 tests are added. Support for using two device types are
> > supported:
> > 
> >   * misc
> >   * block
> 
> I suppose the selftests will run for more than 45 seconds (default 
> kselftest timeout), so you probably also want to set timeout to something 
> sensible in tools/testing/selftests/sysfs/settings file (0 would disable 
> it).

Good catch, I'll use a default of 200. In practice this runs in much
less than that for me, about 110 seconds, so 200 should leave good
wiggle room.

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 04/12] kernfs: add initial failure injection support
  2021-10-05 19:47   ` Kees Cook
@ 2021-10-11 20:44     ` Luis Chamberlain
  0 siblings, 0 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-11 20:44 UTC (permalink / raw)
  To: Kees Cook
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, rostedt, linux-spdx, linux-doc,
	linux-block, linux-fsdevel, linux-kselftest, linux-kernel

On Tue, Oct 05, 2021 at 12:47:22PM -0700, Kees Cook wrote:
> On Mon, Sep 27, 2021 at 09:37:57AM -0700, Luis Chamberlain wrote:
> > This adds initial failure injection support to kernfs. We start
> > off with debug knobs which when enabled allow test drivers, such as
> > test_sysfs, to then make use of these to try to force certain
> > difficult races to take place with a high degree of certainty.
> > 
> > This only adds runtime code *iff* the new bool CONFIG_FAIL_KERNFS_KNOBS is
> > enabled in your kernel. If you don't have this enabled this provides
> > no new functionality. When CONFIG_FAIL_KERNFS_KNOBS is disabled the new
> > routine kernfs_debug_should_wait() ends up being transformed to if
> > (false), and so the compiler should optimize these out as dead code
> > producing no new effective binary changes.
> > 
> > We start off with enabling failure injections in kernfs by allowing us to
> > alter the way kernfs_fop_write_iter() behaves. We allow for the routine
> > kernfs_fop_write_iter() to wait for a certain condition in the kernel to
> > occur, after which it will sleep a predefined amount of time. This lets
> > kernfs users to time exactly when it want kernfs_fop_write_iter() to
> > complete, allowing for developing race conditions and test for correctness
> > in kernfs.
> > 
> > You'd boot with this enabled on your kernel command line:
> > 
> > fail_kernfs_fop_write_iter=1,100,0,1
> > 
> > The values are <interval,probability,size,times>, we don't care for
> > size, so for now we ignore it. The above ensures a failure will trigger
> > only once.
> > 
> > *How* we allow for this routine to change behaviour is left to knobs we
> > expose under debugfs:
> > 
> >  # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
> 
> I'd expect this to live under /sys/kernel/debug/fail_kernfs, like the
> other fault injectors.

Yes I see, thanks will fix up!

> > diff --git a/Documentation/fault-injection/fault-injection.rst b/Documentation/fault-injection/fault-injection.rst
> > index 4a25c5eb6f07..d4d34b082f47 100644
> > --- a/Documentation/fault-injection/fault-injection.rst
> > +++ b/Documentation/fault-injection/fault-injection.rst
> > @@ -28,6 +28,28 @@ Available fault injection capabilities
> >  
> >    injects kernel RPC client and server failures.
> >  
> > +- fail_kernfs_fop_write_iter
> > +
> > +  Allows for failures to be enabled inside kernfs_fop_write_iter(). Enabling
> > +  this does not immediately enable any errors to occur. You must configure
> > +  how you want this routine to fail or change behaviour by using the debugfs
> > +  knobs for it:
> > +
> > +  # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
> > +  wait_after_active
> > +  wait_after_mutex
> > +  wait_at_start
> > +  wait_before_mutex
> 
> This should be split up and detailed in the "debugfs entries" section
> below here.

Done!

> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 1b4cefcb064c..fadfd961ad80 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -10384,7 +10384,7 @@ M:	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >  M:	Tejun Heo <tj@kernel.org>
> >  S:	Supported
> >  T:	git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git
> > -F:	fs/kernfs/
> > +F:	fs/kernfs/*
> >  F:	include/linux/kernfs.h
> >  
> >  KEXEC
> > diff --git a/fs/kernfs/Makefile b/fs/kernfs/Makefile
> > index 4ca54ff54c98..bc5b32ca39f9 100644
> > --- a/fs/kernfs/Makefile
> > +++ b/fs/kernfs/Makefile
> > @@ -4,3 +4,4 @@
> >  #
> >  
> >  obj-y		:= mount.o inode.o dir.o file.o symlink.o
> > +obj-$(CONFIG_FAIL_KERNFS_KNOBS)    += failure-injection.o
> > diff --git a/fs/kernfs/failure-injection.c b/fs/kernfs/failure-injection.c
> > new file mode 100644
> > index 000000000000..4130d202c13b
> > --- /dev/null
> > +++ b/fs/kernfs/failure-injection.c
> 
> I'd name this fault_inject.c, which matches the more common case:
> 
> $ find . -type f -name '*fault*inject*.c'
> ./fs/nfsd/fault_inject.c
> ./drivers/nvme/host/fault_inject.c
> ./drivers/scsi/ufs/ufs-fault-injection.c
> ./lib/fault-inject.c
> ./lib/fault-inject-usercopy.c

Sure, done.

> > +int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate)
> > +{
> > +	if (!evaluate)
> > +		return 0;
> > +
> > +	return should_fail(&fail_kernfs_fop_write_iter, 0);
> > +}
> 
> Every caller ends up doing the wait, so how about just including that
> here instead? It should make things much less intrusive and more readable.
> 
> And for the naming, other fault injectors use "should_fail_$topic", so
> maybe better here would be something like may_wait_kernfs(...).

In case anyone is reading Project Hail Mary by Andy Weir: "Yes yes yes!"

Indeed, that's a great idea. Changed!
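
Roughly, the shape of the change is the following. This is a userspace
sketch only: the names are made up, fault_armed stands in for
should_fail() returning true, and the wait is reduced to a counter where
the real code blocks on a completion.

```c
#include <assert.h>
#include <stdbool.h>

static int wait_calls;		/* counts how often the wait ran */
static bool fault_armed;	/* stand-in for should_fail() returning true */

/* Stand-in for kernfs_debug_wait(); the real one blocks on a completion. */
static void model_debug_wait(void)
{
	wait_calls++;
}

/* Check and wait in one call, so each call site shrinks to one line. */
static void model_may_wait(bool evaluate)
{
	if (!evaluate || !fault_armed)
		return;
	model_debug_wait();
}
```

Callers no longer need an if statement of their own, which is what
makes the hooks in kernfs_fop_write_iter() less intrusive.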

> > +
> > +DECLARE_COMPLETION(kernfs_debug_wait_completion);
> > +EXPORT_SYMBOL_NS_GPL(kernfs_debug_wait_completion, KERNFS_DEBUG_PRIVATE);
> > +
> > +void kernfs_debug_wait(void)
> > +{
> > +	unsigned long timeout;
> > +
> > +	timeout = wait_for_completion_timeout(&kernfs_debug_wait_completion,
> > +					      msecs_to_jiffies(3000));
> > +	if (!timeout)
> > +		pr_info("%s waiting for kernfs_debug_wait_completion timed out\n",
> > +			__func__);
> > +	else
> > +		pr_info("%s received completion with time left on timeout %u ms\n",
> > +			__func__, jiffies_to_msecs(timeout));
> > +
> > +	/**
> > +	 * The goal is wait for an event, and *then* once we have
> > +	 * reached it, the other side will try to do something which
> > +	 * it thinks will break. So we must give it some time to do
> > +	 * that. The amount of time is configurable.
> > +	 */
> > +	msleep(kernfs_config_fail.sleep_after_wait_ms);
> > +	pr_info("%s ended\n", __func__);
> > +}
> 
> All the uses of "__func__" here seems redundant; I would drop them.

Alright, and I also added the pr_fmt define which I forgot.

> > diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
> > index 60e2a86c535e..4479c6580333 100644
> > --- a/fs/kernfs/file.c
> > +++ b/fs/kernfs/file.c
> > @@ -259,6 +259,9 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
> >  	const struct kernfs_ops *ops;
> >  	char *buf;
> >  
> > +	if (kernfs_debug_should_wait(kernfs_fop_write_iter, at_start))
> > +		kernfs_debug_wait();
> 
> So this could just be:
> 
> 	may_wait_kernfs(kernfs_fop_write_iter, at_start);

Yup! Thanks!

> > diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> > index f9cc912c31e1..9e3abf597e2d 100644
> > --- a/fs/kernfs/kernfs-internal.h
> > +++ b/fs/kernfs/kernfs-internal.h
> > +#define __kernfs_config_wait_var(func, when) \
> > +	(kernfs_config_fail.  func  ## _fail.wait_  ## when)
>                             ^^     ^               ^
> nit: needless spaces

Trimmed.
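
As a quick sanity check that the spacing is purely cosmetic, here is a
userspace sketch of the trimmed macro. The struct layout is hypothetical
and only follows the naming scheme of the debugfs knobs listed in the
patch; the macro is renamed to model_config_wait_var() to keep it
separate from the real one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical layout; only the field naming scheme matters here. */
struct model_fail_knobs {
	bool wait_at_start;
	bool wait_before_mutex;
	bool wait_after_mutex;
	bool wait_after_active;
};

struct model_config_fail {
	struct model_fail_knobs kernfs_fop_write_iter_fail;
	unsigned int sleep_after_wait_ms;
};

static struct model_config_fail kernfs_config_fail = {
	.kernfs_fop_write_iter_fail = { .wait_at_start = true },
};

/* ## pastes tokens, so whitespace around it does not change the result. */
#define model_config_wait_var(func, when) \
	(kernfs_config_fail.func##_fail.wait_##when)
```

model_config_wait_var(kernfs_fop_write_iter, at_start) expands to
kernfs_config_fail.kernfs_fop_write_iter_fail.wait_at_start, with or
without the extra spaces.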

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 05/12] test_sysfs: add support to use kernfs failure injection
  2021-10-05 19:51   ` Kees Cook
@ 2021-10-11 20:56     ` Luis Chamberlain
  0 siblings, 0 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-11 20:56 UTC (permalink / raw)
  To: Kees Cook
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, rostedt, linux-spdx, linux-doc,
	linux-block, linux-fsdevel, linux-kselftest, linux-kernel

On Tue, Oct 05, 2021 at 12:51:33PM -0700, Kees Cook wrote:
> On Mon, Sep 27, 2021 at 09:37:58AM -0700, Luis Chamberlain wrote:
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index a29b7d398c4e..176b822654e5 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -2358,6 +2358,9 @@ config TEST_SYSFS
> >  	depends on SYSFS
> >  	depends on NET
> >  	depends on BLOCK
> > +	select FAULT_INJECTION
> > +	select FAULT_INJECTION_DEBUG_FS
> > +	select FAIL_KERNFS_KNOBS
> 
> I don't like seeing "select" for user-configurable CONFIGs -- things
> tend to end up weird. This should simply be:
> 
> 	depends on FAIL_KERNFS_KNOBS

Sure.

> > diff --git a/lib/test_sysfs.c b/lib/test_sysfs.c
> > index 2043ca494af8..c6e62de61403 100644
> > --- a/lib/test_sysfs.c
> > +++ b/lib/test_sysfs.c
> > @@ -38,6 +38,11 @@
> >  #include <linux/rtnetlink.h>
> >  #include <linux/genhd.h>
> >  #include <linux/blkdev.h>
> > +#include <linux/kernfs.h>
> > +
> > +#ifdef CONFIG_FAIL_KERNFS_KNOBS
> 
> This isn't an optional config here (and following)?

Sure, with the above change this is no longer needed. Removed all that
ifdef'ery.

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 06/12] kernel/module: add documentation for try_module_get()
  2021-10-05 19:58   ` Kees Cook
@ 2021-10-11 21:16     ` Luis Chamberlain
  0 siblings, 0 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-11 21:16 UTC (permalink / raw)
  To: Kees Cook
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, rostedt, linux-spdx, linux-doc,
	linux-block, linux-fsdevel, linux-kselftest, linux-kernel

On Tue, Oct 05, 2021 at 12:58:47PM -0700, Kees Cook wrote:
> On Mon, Sep 27, 2021 at 09:37:59AM -0700, Luis Chamberlain wrote:
> > diff --git a/include/linux/module.h b/include/linux/module.h
> > index c9f1200b2312..22eacd5e1e85 100644
> > --- a/include/linux/module.h
> > +++ b/include/linux/module.h
> > @@ -609,10 +609,40 @@ void symbol_put_addr(void *addr);
> >     to handle the error case (which only happens with rmmod --wait). */
> >  extern void __module_get(struct module *module);
> >  
> > -/* This is the Right Way to get a module: if it fails, it's being removed,
> > - * so pretend it's not there. */
> > +/**
> > + * try_module_get() - yields to module removal and bumps refcnt otherwise
> 
> I find this hard to parse. How about:
> 	"Take module refcount unless module is being removed"

Sure.

> > + * @module: the module we should check for
> > + *
> > + * This can be used to try to bump the reference count of a module, so to
> > + * prevent module removal. The reference count of a module is not allowed
> > + * to be incremented if the module is already being removed.
> 
> This I understand.
> 
> > + *
> > + * Care must be taken to ensure the module cannot be removed during the call to
> > + * try_module_get(). This can be done by having another entity other than the
> > + * module itself increment the module reference count, or through some other
> > + * means which guarantees the module could not be removed during an operation.
> > + * An example of this later case is using try_module_get() in a sysfs file
> > + * which the module created. The sysfs store / read file operations are
> > + * gauranteed to exist through the use of kernfs's active reference (see
> > + * kernfs_active()). If a sysfs file operation is being run, the module which
> > + * created it must still exist as the module is in charge of removing the same
> > + * sysfs file being read. Also, a sysfs / kernfs file removal cannot happen
> > + * unless the same file is not active.
> 
> I can't understand this paragraph at all. "Care must be taken ..."? Why?

Because the routine try_module_get() assumes the struct module pointer
is valid for the entire call. That can only be true if at least one
reference is held prior to this call.

> Shouldn't callers of try_module_get() be satisfied with the results?

Yes but only with the above care addressed.

> I don't follow the example at all. It seems to just say "sysfs store/read
> functions don't need try_module_get() because whatever opened the sysfs
> file is already keeping the module referenced." ?

That is exactly what I intended to clarify with that example, yes, a
reference is held but this is done implicitly. *If* a kernfs op is
active module removal waits for that active reference to go down. So
while a kernfs file is being used it is simply not possible for the
module to disappear underneath us. And the reason is that the module
that created the sysfs file must obviously destroy that same sysfs file.
But since kernfs ensures that sysfs file cannot be removed if a sysfs
file is being used, this implicitly holds a module reference.
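
To make the liveness check concrete, here is a minimal userspace sketch of
the idea. It is only a model under stated assumptions: the model_* names
are made up for illustration, the reference count starts at a base of 1,
and C11 atomics stand in for the kernel's. It is not the kernel's
implementation.

```c
/* Userspace model only. All model_* names are hypothetical. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum model_module_state { MODEL_MODULE_LIVE, MODEL_MODULE_GOING };

struct model_module {
	atomic_int state;	/* one of enum model_module_state */
	atomic_int refcnt;	/* starts at a base of 1 in this model */
};

static void model_module_init(struct model_module *mod)
{
	atomic_init(&mod->state, MODEL_MODULE_LIVE);
	atomic_init(&mod->refcnt, 1);
}

/* A module which is being removed is no longer considered live. */
static bool model_module_is_live(struct model_module *mod)
{
	return atomic_load(&mod->state) != MODEL_MODULE_GOING;
}

/*
 * The caller must guarantee that *mod stays valid for the whole call.
 * This only decides whether a new reference may be taken.
 */
static bool model_try_module_get(struct model_module *mod)
{
	int old;

	if (!mod)
		return true;	/* no owner, nothing to pin */
	if (!model_module_is_live(mod))
		return false;	/* yield to the pending removal */

	/* Increment unless the count already dropped to zero. */
	old = atomic_load(&mod->refcnt);
	while (old != 0) {
		if (atomic_compare_exchange_weak(&mod->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

static void model_module_put(struct model_module *mod)
{
	if (mod)
		atomic_fetch_sub(&mod->refcnt, 1);
}
```

In this model a caller which gets false back simply fails its operation,
which is the "yield to module removal" behaviour described in the comment
under review.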

Let me know if you can think of a better way to phrase this.

> > + *
> > + * One of the real values to try_module_get() is the module_is_live() check
> > + * which ensures the caller of try_module_get() can yield to userspace
> > + * module removal requests and fail whatever it was about to process.
> 
> Please document the return value explicitly.

Sure thing.

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 09/12] sysfs: fix deadlock race with module removal
  2021-10-05  9:24   ` Ming Lei
@ 2021-10-11 21:25     ` Luis Chamberlain
  2021-10-12  0:20       ` Ming Lei
  0 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-11 21:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel

On Tue, Oct 05, 2021 at 05:24:18PM +0800, Ming Lei wrote:
> On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> > When driver sysfs attributes use a lock also used on module removal we
> > can race to deadlock. This happens when for instance a sysfs file on
> > a driver is used, then at the same time we have module removal call
> > trigger. The module removal call code holds a lock, and then the
> > driver's sysfs file entry waits for the same lock. While holding the
> > lock the module removal tries to remove the sysfs entries, but these
> > cannot be removed yet as one is waiting for a lock. This won't complete
> > as the lock is already held. Likewise module removal cannot complete,
> > and so we deadlock.
> > 
> > This can now be easily reproducible with our sysfs selftest as follows:
> > 
> > ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
> > 
> > This uses a local driver lock. Test 0028 can also be used, that uses
> > the rtnl_lock():
> > 
> > ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
> > 
> > To fix this we extend the struct kernfs_node with a module reference
> > and use the try_module_get() after kernfs_get_active() is called. As
> > documented in the prior patch, we now know that once kernfs_get_active()
> > is called the module is implicitly guarded to exist and cannot be removed.
> > This is because the module is the one in charge of removing the same
> > sysfs file it created, and removal of sysfs files on module exit will wait
> > until they don't have any active references. By using a try_module_get()
> > after kernfs_get_active() we yield to let module removal trump calls to
> > process a sysfs operation, while also preventing module removal if a sysfs
> > operation is already in progress. This prevents the deadlock.
> > 
> > This deadlock was first reported with the zram driver, however the live
> 
> I don't see the lock pattern you mentioned in the zram driver, can you
> share the related zram code?

I recommend to not look at the zram driver, instead look at the
test_sysfs driver as that abstracts the issue more clearly and uses
two different locks as an example. The point is that if on module
removal *any* lock is used which is *also* used on the sysfs file
created by the module, you can deadlock.

> > And this can lead to this condition:
> > 
> > CPU A                              CPU B
> >                                    foo_store()
> > foo_exit()
> >   mutex_lock(&foo)
> >                                    mutex_lock(&foo)
> >    del_gendisk(some_struct->disk);
> >      device_del()
> >        device_remove_groups()
> 
> I guess the deadlock exists if foo_exit() is called anywhere. If yes,
> it looks like the issue may not be directly related to module removal, right?

No, the reason this can deadlock is that the module exit routine will
patiently wait for the sysfs / kernfs files to stop being used,
but clearly they cannot if the exit routine took the mutex also used
by the sysfs ops. That is, the special condition here is the removal of
the sysfs files, and the sysfs files using a lock also used on module
exit.
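
The interleaving can be walked through with a small userspace state model.
This is only a sketch: struct model_state and every model_* helper are
hypothetical names which track who holds the lock and who is waiting, and
nothing here is kernel code.

```c
/* Userspace model of the interleaving. All names are hypothetical. */
#include <assert.h>
#include <stdbool.h>

struct model_state {
	bool foo_held_by_exit;	/* foo_exit() did mutex_lock(&foo) */
	bool exit_removing;	/* foo_exit() is removing the sysfs files */
	int active;		/* sysfs ops holding an active reference */
	int blocked_on_foo;	/* sysfs ops stuck in mutex_lock(&foo) */
};

/* CPU B: foo_store() is entered, so an active reference is held. */
static void model_store_begin(struct model_state *s)
{
	s->active++;
}

/* CPU B: foo_store() reaches mutex_lock(&foo). Returns false if it blocks. */
static bool model_store_lock(struct model_state *s)
{
	if (s->foo_held_by_exit) {
		s->blocked_on_foo++;
		return false;
	}
	return true;
}

/* CPU B: foo_store() returns and drops its active reference. */
static void model_store_end(struct model_state *s)
{
	s->active--;
}

/* CPU A: foo_exit() takes foo. */
static void model_exit_lock(struct model_state *s)
{
	s->foo_held_by_exit = true;
}

/* CPU A: foo_exit() starts removing the sysfs files while holding foo. */
static void model_exit_remove_files(struct model_state *s)
{
	s->exit_removing = true;
}

/*
 * Removal only finishes once no op is active. An op blocked on foo can
 * never finish while exit holds foo, so neither side can make progress.
 */
static bool model_deadlocked(const struct model_state *s)
{
	return s->exit_removing && s->foo_held_by_exit &&
	       s->active > 0 && s->blocked_on_foo > 0;
}
```

If the store op runs to completion before the exit routine takes the lock,
the same model reports no deadlock, which is why this only shows up as a
race.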

 Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 09/12] sysfs: fix deadlock race with module removal
  2021-10-05 20:50   ` Kees Cook
@ 2021-10-11 22:26     ` Luis Chamberlain
  2021-10-13 12:41       ` Luis Chamberlain
  0 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-11 22:26 UTC (permalink / raw)
  To: Kees Cook
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, rostedt, linux-spdx, linux-doc,
	linux-block, linux-fsdevel, linux-kselftest, linux-kernel

On Tue, Oct 05, 2021 at 01:50:31PM -0700, Kees Cook wrote:
> On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> > A sketch of how this can happen follows, consider foo a local mutex
> > part of a driver, and used on the driver's module exit routine and
> > on one of its sysfs ops:
> > 
> > foo.c:
> > static DEFINE_MUTEX(foo);
> > static ssize_t foo_store(struct device *dev,
> > 			 struct device_attribute *attr,
> > 			 const char *buf, size_t count)
> > {
> > 	...
> > 	mutex_lock(&foo);
> > 	...
> > 	mutex_unlock(&foo);
> > 	...
> > }
> > static DEVICE_ATTR_RW(foo);
> > ...
> > void foo_exit(void)
> > {
> > 	mutex_lock(&foo);
> > 	...
> > 	mutex_unlock(&foo);
> > }
> > module_exit(foo_exit);
> > 
> > And this can lead to this condition:
> > 
> > CPU A                              CPU B
> >                                    foo_store()
> > foo_exit()
> >   mutex_lock(&foo)
> >                                    mutex_lock(&foo)
> >    del_gendisk(some_struct->disk);
> >      device_del()
> >        device_remove_groups()
> 
> Please expand this further, where does device_remove_groups() end up
> waiting for that never happens?

Sure. How about:

Furthermore, device_remove_groups() will just go on trying to remove
the sysfs files, which are kernfs entries. The way kernfs deals with
removal is that it will wait until all active references for the files
being removed are done. The active reference is obtained through
kernfs_get_active(). Removal ends up waiting through kernfs_drain()
for the active references to be done, and that only happens if the
kernfs file ops can complete. If these kernfs ops / sysfs files
are waiting for a mutex which is taken by the module's exit routine
prior to trying to remove the sysfs files we deadlock.
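
The active reference accounting described above can be sketched in
userspace as follows. This is a simplified model, not the kernfs code: the
model_* names are hypothetical, and MODEL_DEACTIVATED_BIAS is an assumed
stand-in for KN_DEACTIVATED_BIAS.

```c
/* Userspace model of active references. All names are hypothetical. */
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed stand-in for KN_DEACTIVATED_BIAS: a large negative bias. */
#define MODEL_DEACTIVATED_BIAS (INT_MIN + 1)

struct model_node {
	atomic_int active;
};

/* Take an active reference unless the counter is already negative. */
static bool model_get_active(struct model_node *kn)
{
	int v = atomic_load(&kn->active);

	while (v >= 0) {
		if (atomic_compare_exchange_weak(&kn->active, &v, v + 1))
			return true;
	}
	return false;
}

static void model_put_active(struct model_node *kn)
{
	atomic_fetch_sub(&kn->active, 1);
}

/* Removal adds the bias first so that no new active reference is granted. */
static void model_deactivate(struct model_node *kn)
{
	atomic_fetch_add(&kn->active, MODEL_DEACTIVATED_BIAS);
}

/* The drain wait is over only once every earlier reference was dropped. */
static bool model_drained(struct model_node *kn)
{
	return atomic_load(&kn->active) == MODEL_DEACTIVATED_BIAS;
}
```

In the deadlock case the op which holds the active reference never reaches
its put, so the drained condition in this model never becomes true.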

> > In this situation foo_store() is waiting for the mutex foo to
> > become unlocked, but that won't happen until module removal is complete.
> > But module removal won't complete until the sysfs file being poked at
> > completes which is waiting for a lock already held.
> > 
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > ---
> >  arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +-
> >  fs/kernfs/dir.c                        | 44 ++++++++++++++++++----
> >  fs/kernfs/file.c                       |  6 ++-
> >  fs/kernfs/kernfs-internal.h            |  3 +-
> >  fs/kernfs/symlink.c                    |  3 +-
> >  fs/sysfs/dir.c                         |  2 +-
> >  fs/sysfs/file.c                        |  6 ++-
> >  fs/sysfs/group.c                       |  3 +-
> >  include/linux/kernfs.h                 | 14 ++++---
> >  include/linux/sysfs.h                  | 52 ++++++++++++++++++++------
> >  kernel/cgroup/cgroup.c                 |  2 +-
> >  11 files changed, 105 insertions(+), 34 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > index b57b3db9a6a7..4edf3b37fd2c 100644
> > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > @@ -209,7 +209,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
> >  
> >  	kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
> >  				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
> > -				  0, rft->kf_ops, rft, NULL, NULL);
> > +				  0, rft->kf_ops, rft, NULL, NULL, NULL);
> >  	if (IS_ERR(kn))
> >  		return PTR_ERR(kn);
> >  
> > @@ -2482,7 +2482,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
> >  
> >  	kn = __kernfs_create_file(parent_kn, name, 0444,
> >  				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
> > -				  &kf_mondata_ops, priv, NULL, NULL);
> > +				  &kf_mondata_ops, priv, NULL, NULL, NULL);
> >  	if (IS_ERR(kn))
> >  		return PTR_ERR(kn);
> >  
> > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > index ba581429bf7b..e841201fd11b 100644
> > --- a/fs/kernfs/dir.c
> > +++ b/fs/kernfs/dir.c
> > @@ -14,6 +14,7 @@
> >  #include <linux/slab.h>
> >  #include <linux/security.h>
> >  #include <linux/hash.h>
> > +#include <linux/module.h>
> >  
> >  #include "kernfs-internal.h"
> >  
> > @@ -414,12 +415,29 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
> >   */
> >  struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
> >  {
> > +	int v;
> > +
> >  	if (unlikely(!kn))
> >  		return NULL;
> >  
> >  	if (!atomic_inc_unless_negative(&kn->active))
> >  		return NULL;
> >  
> > +	/*
> > +	 * If a module created the kernfs_node, the module cannot possibly be
> > +	 * removed if the above atomic_inc_unless_negative() succeeded. So the
> > +	 * try_module_get() below is not to protect the lifetime of the module
> > +	 * as that is already guaranteed. The try_module_get() below is used
> > +	 * to ensure that we don't deadlock in case a kernfs operation and
> > +	 * module removal used a shared lock.
> > +	 */
> > +	if (!try_module_get(kn->owner)) {
> > +		v = atomic_dec_return(&kn->active);
> > +		if (unlikely(v == KN_DEACTIVATED_BIAS))
> > +			wake_up_all(&kernfs_root(kn)->deactivate_waitq);
> > +		return NULL;
> > +	}
> 
> The special casing in here makes me think this isn't happening the right
> place. (i.e this looks like an open-coded version of kernfs_put_active())

No, well you see, in effect the special care taken in
kernfs_put_active() *is* the right way to inform a waiter that
the *taken* reference right above *also* is no longer active.

The special casing here is because we took the active reference
before the try_module_get() in the above atomic_inc_unless_negative()
call. Outside callers deal with this through kernfs_put_active().

We are special casing to deal with the deadlock case.
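
A userspace sketch of that failure path, under simplifying assumptions:
the model_* names are hypothetical, the module side is reduced to a single
"going" flag with no reference count, and a wakeups counter stands in for
wake_up_all().

```c
/* Userspace model of the failure path. All names are hypothetical. */
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_DEACTIVATED_BIAS (INT_MIN + 1)

struct model_module {
	atomic_bool going;	/* set once removal of the module started */
};

struct model_node {
	atomic_int active;
	struct model_module *owner;	/* NULL if no module created the node */
	int wakeups;			/* stand-in for waking a drain waiter */
};

/* Reduced to the liveness check, the count is not modelled here. */
static bool model_try_module_get(struct model_module *mod)
{
	return !mod || !atomic_load(&mod->going);
}

static bool model_get_active(struct model_node *kn)
{
	int v = atomic_load(&kn->active);

	/* Take the active reference unless the node is deactivated. */
	do {
		if (v < 0)
			return false;
	} while (!atomic_compare_exchange_weak(&kn->active, &v, v + 1));

	if (!model_try_module_get(kn->owner)) {
		/* Undo the active reference taken just above ... */
		v = atomic_fetch_sub(&kn->active, 1) - 1;
		/* ... and wake a drain waiter if ours was the last one. */
		if (v == MODEL_DEACTIVATED_BIAS)
			kn->wakeups++;
		return false;
	}
	return true;
}
```

The point of the sketch is the ordering: the active reference is taken
first, so a failed module get has to drop it again and check for a waiter
the same way a regular put would.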

> > +
> >  	if (kernfs_lockdep(kn))
> >  		rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
> >  	return kn;
> > @@ -442,6 +460,13 @@ void kernfs_put_active(struct kernfs_node *kn)
> >  	if (kernfs_lockdep(kn))
> >  		rwsem_release(&kn->dep_map, _RET_IP_);
> >  	v = atomic_dec_return(&kn->active);
> > +
> > +	/*
> > +	 * We prevent module exit *until* we know for sure all possible
> > +	 * kernfs ops are done.
> > +	 */
> > +	module_put(kn->owner);
> > +
> >  	if (likely(v != KN_DEACTIVATED_BIAS))
> >  		return;
> 
> What I don't understand, however, is what kernfs_get/put_active() is
> intending to do -- it looks like it's trying to provide an interruption
> point for open kernfs file operations?

It is essentially ensuring that removal does not happen if any ops
are being used.

> This all seems extremely complex for what seems like it should just be a
> global "am I being removed?" bool?

It used to be worse :) And Tejun has cleaned this up over time. Yes,
perhaps we can improve that more but, given how sensitive this code
is, I think such improvements should be made separately.

> Regardless, while I do see the logic of associating the module get/put
> with get/put of kernfs "active", why is it not better tied to strictly
> kernfs open/close? 

It's not just files, consider kernfs_iop_mkdir() which also calls
kernfs_get_active(). How about kernfs_fop_mmap()? And so, the common
denominator is actually kernfs_get_active().

> That would seem to be much simpler and not require
> any special handling?

Yes true, but I think this would still leave open some other possible
deadlocks.

> For example, why does this not work?

It does for the write case for sure. I haven't written tests for the
other odd cases, but I suspect those would deadlock as well.

 Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 09/12] sysfs: fix deadlock race with module removal
  2021-10-11 21:25     ` Luis Chamberlain
@ 2021-10-12  0:20       ` Ming Lei
  2021-10-12 21:18         ` Luis Chamberlain
  0 siblings, 1 reply; 94+ messages in thread
From: Ming Lei @ 2021-10-12  0:20 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel

On Mon, Oct 11, 2021 at 02:25:46PM -0700, Luis Chamberlain wrote:
> On Tue, Oct 05, 2021 at 05:24:18PM +0800, Ming Lei wrote:
> > On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> > > When driver sysfs attributes use a lock also used on module removal we
> > > can race to deadlock. This happens when for instance a sysfs file on
> > > a driver is used, then at the same time we have module removal call
> > > trigger. The module removal call code holds a lock, and then the
> > > driver's sysfs file entry waits for the same lock. While holding the
> > > lock the module removal tries to remove the sysfs entries, but these
> > > cannot be removed yet as one is waiting for a lock. This won't complete
> > > as the lock is already held. Likewise module removal cannot complete,
> > > and so we deadlock.
> > > 
> > > This can now be easily reproducible with our sysfs selftest as follows:
> > > 
> > > ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
> > > 
> > > This uses a local driver lock. Test 0028 can also be used, that uses
> > > the rtnl_lock():
> > > 
> > > ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
> > > 
> > > To fix this we extend the struct kernfs_node with a module reference
> > > and use the try_module_get() after kernfs_get_active() is called. As
> > > documented in the prior patch, we now know that once kernfs_get_active()
> > > is called the module is implicitly guarded to exist and cannot be removed.
> > > This is because the module is the one in charge of removing the same
> > > sysfs file it created, and removal of sysfs files on module exit will wait
> > > until they don't have any active references. By using a try_module_get()
> > > after kernfs_get_active() we yield to let module removal trump calls to
> > > process a sysfs operation, while also preventing module removal if a sysfs
> > > > operation is already in progress. This prevents the deadlock.
> > > 
> > > This deadlock was first reported with the zram driver, however the live
> > 
> > > I don't see the lock pattern you mentioned in the zram driver, can you
> > > share the related zram code?
> 
> I recommend to not look at the zram driver, instead look at the
> test_sysfs driver as that abstracts the issue more clearly and uses

It looks like test_sysfs isn't in Linus' tree, where can I find it? Also please
update your commit log about this wrong info if it can't be applied on
zram.

> two different locks as an example. The point is that if on module
> removal *any* lock is used which is *also* used on the sysfs file
> created by the module, you can deadlock.
> 
> > > And this can lead to this condition:
> > > 
> > > CPU A                              CPU B
> > >                                    foo_store()
> > > foo_exit()
> > >   mutex_lock(&foo)
> > >                                    mutex_lock(&foo)
> > >    del_gendisk(some_struct->disk);
> > >      device_del()
> > >        device_remove_groups()
> > 
> > I guess the deadlock exists if foo_exit() is called anywhere. If yes,
> > > it looks like the issue may not be directly related to module removal, right?
> 
> No, the reason this can deadlock is that the module exit routine will
> patiently wait for the sysfs / kernfs files to stop being used,

Can you share the code which waits for the sysfs / kernfs files to
stop being used? And why does it make a difference when it is
called from module_exit()?



Thanks,
Ming


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 09/12] sysfs: fix deadlock race with module removal
  2021-10-12  0:20       ` Ming Lei
@ 2021-10-12 21:18         ` Luis Chamberlain
  2021-10-13  1:07           ` Ming Lei
  0 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-12 21:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel

On Tue, Oct 12, 2021 at 08:20:46AM +0800, Ming Lei wrote:
> On Mon, Oct 11, 2021 at 02:25:46PM -0700, Luis Chamberlain wrote:
> > On Tue, Oct 05, 2021 at 05:24:18PM +0800, Ming Lei wrote:
> > > On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> > > > When driver sysfs attributes use a lock also used on module removal we
> > > > can race to deadlock. This happens when for instance a sysfs file on
> > > > a driver is used, then at the same time we have module removal call
> > > > trigger. The module removal call code holds a lock, and then the
> > > > driver's sysfs file entry waits for the same lock. While holding the
> > > > lock the module removal tries to remove the sysfs entries, but these
> > > > cannot be removed yet as one is waiting for a lock. This won't complete
> > > > as the lock is already held. Likewise module removal cannot complete,
> > > > and so we deadlock.
> > > > 
> > > > This can now be easily reproducible with our sysfs selftest as follows:
> > > > 
> > > > ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
> > > > 
> > > > This uses a local driver lock. Test 0028 can also be used, that uses
> > > > the rtnl_lock():
> > > > 
> > > > ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
> > > > 
> > > > To fix this we extend the struct kernfs_node with a module reference
> > > > and use the try_module_get() after kernfs_get_active() is called. As
> > > > documented in the prior patch, we now know that once kernfs_get_active()
> > > > is called the module is implicitly guarded to exist and cannot be removed.
> > > > This is because the module is the one in charge of removing the same
> > > > sysfs file it created, and removal of sysfs files on module exit will wait
> > > > until they don't have any active references. By using a try_module_get()
> > > > after kernfs_get_active() we yield to let module removal trump calls to
> > > > process a sysfs operation, while also preventing module removal if a sysfs
> > > > > operation is already in progress. This prevents the deadlock.
> > > > 
> > > > This deadlock was first reported with the zram driver, however the live
> > > 
> > > > I don't see the lock pattern you mentioned in the zram driver, can you
> > > > share the related zram code?
> > 
> > I recommend to not look at the zram driver, instead look at the
> > test_sysfs driver as that abstracts the issue more clearly and uses
> 
> > It looks like test_sysfs isn't in Linus' tree, where can I find it?

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210927-sysfs-generic-deadlock-fix

> Also please
> update your commit log about this wrong info if it can't be applied on
> zram.

It does apply to zram, it is just that I have other fixes for zram in
my pipeline which will change the zram driver further, and so what makes
more sense is to abstract the issue into a selftest driver to
demonstrate the issue more clearly.

To reproduce the deadlock revert the patch in this thread and then run
either of these two tests as root:

./tools/testing/selftests/sysfs/sysfs.sh -w 0027
./tools/testing/selftests/sysfs/sysfs.sh -w 0028

You will need to enable the test_sysfs driver.

> > two different locks as an example. The point is that if on module
> > removal *any* lock is used which is *also* used on the sysfs file
> > created by the module, you can deadlock.
> > 
> > > > And this can lead to this condition:
> > > > 
> > > > CPU A                              CPU B
> > > >                                    foo_store()
> > > > foo_exit()
> > > >   mutex_lock(&foo)
> > > >                                    mutex_lock(&foo)
> > > >    del_gendisk(some_struct->disk);
> > > >      device_del()
> > > >        device_remove_groups()
> > > 
> > > I guess the deadlock exists if foo_exit() is called anywhere. If yes,
> > > > it looks like the issue may not be directly related to module removal, right?
> > 
> > No, the reason this can deadlock is that the module exit routine will
> > > patiently wait for the sysfs / kernfs files to stop being used,
> 
> > Can you share the code which waits for the sysfs / kernfs files to
> > stop being used?

How about a call trace of the two tasks which deadlock? Here is one from
running test 0027:

kdevops login: [  363.875459] INFO: task sysfs.sh:1271 blocked for more than 120 seconds.
[  363.878341]       Tainted: G            E 5.15.0-rc3-next-20210927+ #83
[  363.881218] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.882255] task:sysfs.sh        state:D stack:    0 pid: 1271 ppid: 1 flags:0x00000004
[  363.882894] Call Trace:
[  363.883091]  <TASK>
[  363.883259]  __schedule+0x2fd/0x990
[  363.883551]  schedule+0x43/0xe0
[  363.883800]  schedule_preempt_disabled+0x14/0x20
[  363.884160]  __mutex_lock.constprop.0+0x249/0x470
[  363.884524]  test_dev_x_store+0xa5/0xc0 [test_sysfs]
[  363.884915]  kernfs_fop_write_iter+0x177/0x220
[  363.885257]  new_sync_write+0x11c/0x1b0
[  363.885556]  vfs_write+0x20d/0x2a0
[  363.885821]  ksys_write+0x5f/0xe0
[  363.886081]  do_syscall_64+0x38/0xc0
[  363.886359]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  363.886748] RIP: 0033:0x7fee00f8bf33
[  363.887029] RSP: 002b:00007ffd372c5d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  363.887633] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fee00f8bf33
[  363.888217] RDX: 0000000000000003 RSI: 000055a4d14a0db0 RDI: 0000000000000001
[  363.888761] RBP: 000055a4d14a0db0 R08: 000000000000000a R09: 0000000000000002
[  363.889267] R10: 000055a4d1554ac0 R11: 0000000000000246 R12: 0000000000000003
[  363.889983] R13: 00007fee0105c6a0 R14: 0000000000000003 R15: 00007fee0105c8a0
[  363.890513]  </TASK>
[  363.890709] INFO: task modprobe:1276 blocked for more than 120 seconds.
[  363.891185]       Tainted: G            E 5.15.0-rc3-next-20210927+ #83
[  363.891781] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.892353] task:modprobe        state:D stack:    0 pid: 1276 ppid: 1 flags:0x00004000
[  363.892955] Call Trace:
[  363.893141]  <TASK>
[  363.893457]  __schedule+0x2fd/0x990
[  363.893865]  schedule+0x43/0xe0
[  363.894246]  __kernfs_remove.part.0+0x21e/0x2a0
[  363.894704]  ? do_wait_intr_irq+0xa0/0xa0
[  363.895142]  kernfs_remove_by_name_ns+0x50/0x90
[  363.895632]  remove_files+0x2b/0x60
[  363.896035]  sysfs_remove_group+0x38/0x80
[  363.896470]  sysfs_remove_groups+0x29/0x40
[  363.896912]  device_remove_attrs+0x5b/0x90
[  363.897352]  device_del+0x183/0x400
[  363.897758]  unregister_test_dev_sysfs+0x5b/0xaa [test_sysfs]
[  363.898317]  test_sysfs_exit+0x45/0xfb0 [test_sysfs]
[  363.898833]  __do_sys_delete_module+0x18d/0x2a0
[  363.899329]  ? fpregs_assert_state_consistent+0x1e/0x40
[  363.899868]  ? exit_to_user_mode_prepare+0x3a/0x180
[  363.900390]  do_syscall_64+0x38/0xc0
[  363.900810]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  363.901330] RIP: 0033:0x7f21915c57d7
[  363.901747] RSP: 002b:00007ffd90869fe8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  363.902442] RAX: ffffffffffffffda RBX: 000055ce676ffc30 RCX: 00007f21915c57d7
[  363.903104] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055ce676ffc98
[  363.903782] RBP: 000055ce676ffc30 R08: 0000000000000000 R09: 0000000000000000
[  363.904462] R10: 00007f2191638ac0 R11: 0000000000000206 R12: 000055ce676ffc98
[  363.905128] R13: 0000000000000000 R14: 0000000000000000 R15: 000055ce676ffdf0
[  363.905797]  </TASK>


And gdb:

(gdb) l *(__kernfs_remove+0x21e)
0xffffffff8139288e is in __kernfs_remove (fs/kernfs/dir.c:476).
471                     if (atomic_read(&kn->active) != KN_DEACTIVATED_BIAS)
472                             lock_contended(&kn->dep_map, _RET_IP_);
473             }
474
475             /* but everyone should wait for draining */
476             wait_event(root->deactivate_waitq,
477                        atomic_read(&kn->active) == KN_DEACTIVATED_BIAS);
478
479             if (kernfs_lockdep(kn)) {
480                     lock_acquired(&kn->dep_map, _RET_IP_);

(gdb) l *(kernfs_remove_by_name_ns+0x50)
0xffffffff813938d0 is in kernfs_remove_by_name_ns (fs/kernfs/dir.c:1534).
1529
1530            kn = kernfs_find_ns(parent, name, ns);
1531            if (kn)
1532                    __kernfs_remove(kn);
1533
1534            up_write(&kernfs_rwsem);
1535
1536            if (kn)
1537                    return 0;
1538            else

The same happens for test 0028 except instead of a mutex
lock an rtnl_lock() is used.

Would this be better for the commit log?

> And why does it make a difference in case of being
> called from module_exit()?

Well because that is where we remove the sysfs files. *If*
a developer happens to use a lock on a sysfs op but it is
also used on module exit, this deadlock is bound to happen.

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 09/12] sysfs: fix deadlock race with module removal
  2021-10-12 21:18         ` Luis Chamberlain
@ 2021-10-13  1:07           ` Ming Lei
  2021-10-13 12:35             ` Luis Chamberlain
  0 siblings, 1 reply; 94+ messages in thread
From: Ming Lei @ 2021-10-13  1:07 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel

On Tue, Oct 12, 2021 at 02:18:28PM -0700, Luis Chamberlain wrote:
> On Tue, Oct 12, 2021 at 08:20:46AM +0800, Ming Lei wrote:
> > On Mon, Oct 11, 2021 at 02:25:46PM -0700, Luis Chamberlain wrote:
> > > On Tue, Oct 05, 2021 at 05:24:18PM +0800, Ming Lei wrote:
> > > > On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> > > > > When driver sysfs attributes use a lock also used on module removal we
> > > > > can race to deadlock. This happens when for instance a sysfs file on
> > > > > a driver is used, then at the same time we have module removal call
> > > > > trigger. The module removal call code holds a lock, and then the
> > > > > driver's sysfs file entry waits for the same lock. While holding the
> > > > > lock the module removal tries to remove the sysfs entries, but these
> > > > > cannot be removed yet as one is waiting for a lock. This won't complete
> > > > > as the lock is already held. Likewise module removal cannot complete,
> > > > > and so we deadlock.
> > > > > 
> > > > > This can now be easily reproducible with our sysfs selftest as follows:
> > > > > 
> > > > > ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
> > > > > 
> > > > > This uses a local driver lock. Test 0028 can also be used, that uses
> > > > > the rtnl_lock():
> > > > > 
> > > > > ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
> > > > > 
> > > > > To fix this we extend the struct kernfs_node with a module reference
> > > > > and use the try_module_get() after kernfs_get_active() is called. As
> > > > > documented in the prior patch, we now know that once kernfs_get_active()
> > > > > is called the module is implicitly guarded to exist and cannot be removed.
> > > > > This is because the module is the one in charge of removing the same
> > > > > sysfs file it created, and removal of sysfs files on module exit will wait
> > > > > until they don't have any active references. By using a try_module_get()
> > > > > after kernfs_get_active() we yield to let module removal trump calls to
> > > > > process a sysfs operation, while also preventing module removal if a sysfs
> > > > > > operation is already in progress. This prevents the deadlock.
> > > > > 
> > > > > This deadlock was first reported with the zram driver, however the live
> > > > 
> > > > > I don't see the lock pattern you mentioned in the zram driver, can you
> > > > > share the related zram code?
> > > 
> > > I recommend to not look at the zram driver, instead look at the
> > > test_sysfs driver as that abstracts the issue more clearly and uses
> > 
> > > It looks like test_sysfs isn't in Linus' tree, where can I find it?
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210927-sysfs-generic-deadlock-fix
> 
> > Also please
> > update your commit log about this wrong info if it can't be applied on
> > zram.
> 
> It does apply to zram, it is just that I have other fixes for zram in
> my pipeline which will change the zram driver further, and so what makes
> more sense is to abstract the issue into a selftest driver to
> demonstrate the issue more clearly.
> 
> To reproduce the deadlock revert the patch in this thread and then run
> either of these two tests as root:
> 
> ./tools/testing/selftests/sysfs/sysfs.sh -w 0027
> ./tools/testing/selftests/sysfs/sysfs.sh -w 0028
> 
> You will need to enable the test_sysfs driver.
> 
> > > two different locks as an example. The point is that if on module
> > > removal *any* lock is used which is *also* used on the sysfs file
> > > created by the module, you can deadlock.
> > > 
> > > > > And this can lead to this condition:
> > > > > 
> > > > > CPU A                              CPU B
> > > > >                                    foo_store()
> > > > > foo_exit()
> > > > >   mutex_lock(&foo)
> > > > >                                    mutex_lock(&foo)
> > > > >    del_gendisk(some_struct->disk);
> > > > >      device_del()
> > > > >        device_remove_groups()
> > > > 
> > > > I guess the deadlock exists if foo_exit() is called anywhere. If yes,
> > > > > it looks like the issue may not be directly related to module removal, right?
> > > 
> > > No, the reason this can deadlock is that the module exit routine will
> > > > patiently wait for the sysfs / kernfs files to stop being used,
> > 
> > > Can you share the code which waits for the sysfs / kernfs files to
> > > stop being used?
> 
> How about a call trace of the two tasks which deadlock, here is one of
> running test 0027:
> 
> kdevops login: [  363.875459] INFO: task sysfs.sh:1271 blocked for more than 120 seconds.
> [  363.878341]       Tainted: G            E 5.15.0-rc3-next-20210927+ #83
> [  363.881218] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.882255] task:sysfs.sh        state:D stack:    0 pid: 1271 ppid: 1 flags:0x00000004
> [  363.882894] Call Trace:
> [  363.883091]  <TASK>
> [  363.883259]  __schedule+0x2fd/0x990
> [  363.883551]  schedule+0x43/0xe0
> [  363.883800]  schedule_preempt_disabled+0x14/0x20
> [  363.884160]  __mutex_lock.constprop.0+0x249/0x470
> [  363.884524]  test_dev_x_store+0xa5/0xc0 [test_sysfs]
> [  363.884915]  kernfs_fop_write_iter+0x177/0x220
> [  363.885257]  new_sync_write+0x11c/0x1b0
> [  363.885556]  vfs_write+0x20d/0x2a0
> [  363.885821]  ksys_write+0x5f/0xe0
> [  363.886081]  do_syscall_64+0x38/0xc0
> [  363.886359]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  363.886748] RIP: 0033:0x7fee00f8bf33
> [  363.887029] RSP: 002b:00007ffd372c5d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [  363.887633] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fee00f8bf33
> [  363.888217] RDX: 0000000000000003 RSI: 000055a4d14a0db0 RDI: 0000000000000001
> [  363.888761] RBP: 000055a4d14a0db0 R08: 000000000000000a R09: 0000000000000002
> [  363.889267] R10: 000055a4d1554ac0 R11: 0000000000000246 R12: 0000000000000003
> [  363.889983] R13: 00007fee0105c6a0 R14: 0000000000000003 R15: 00007fee0105c8a0
> [  363.890513]  </TASK>
> [  363.890709] INFO: task modprobe:1276 blocked for more than 120 seconds.
> [  363.891185]       Tainted: G            E 5.15.0-rc3-next-20210927+ #83
> [  363.891781] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.892353] task:modprobe        state:D stack:    0 pid: 1276 ppid: 1 flags:0x00004000
> [  363.892955] Call Trace:
> [  363.893141]  <TASK>
> [  363.893457]  __schedule+0x2fd/0x990
> [  363.893865]  schedule+0x43/0xe0
> [  363.894246]  __kernfs_remove.part.0+0x21e/0x2a0
> [  363.894704]  ? do_wait_intr_irq+0xa0/0xa0
> [  363.895142]  kernfs_remove_by_name_ns+0x50/0x90
> [  363.895632]  remove_files+0x2b/0x60
> [  363.896035]  sysfs_remove_group+0x38/0x80
> [  363.896470]  sysfs_remove_groups+0x29/0x40
> [  363.896912]  device_remove_attrs+0x5b/0x90
> [  363.897352]  device_del+0x183/0x400
> [  363.897758]  unregister_test_dev_sysfs+0x5b/0xaa [test_sysfs]
> [  363.898317]  test_sysfs_exit+0x45/0xfb0 [test_sysfs]
> [  363.898833]  __do_sys_delete_module+0x18d/0x2a0
> [  363.899329]  ? fpregs_assert_state_consistent+0x1e/0x40
> [  363.899868]  ? exit_to_user_mode_prepare+0x3a/0x180
> [  363.900390]  do_syscall_64+0x38/0xc0
> [  363.900810]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  363.901330] RIP: 0033:0x7f21915c57d7
> [  363.901747] RSP: 002b:00007ffd90869fe8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> [  363.902442] RAX: ffffffffffffffda RBX: 000055ce676ffc30 RCX: 00007f21915c57d7
> [  363.903104] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055ce676ffc98
> [  363.903782] RBP: 000055ce676ffc30 R08: 0000000000000000 R09: 0000000000000000
> [  363.904462] R10: 00007f2191638ac0 R11: 0000000000000206 R12: 000055ce676ffc98
> [  363.905128] R13: 0000000000000000 R14: 0000000000000000 R15: 000055ce676ffdf0
> [  363.905797]  </TASK>

That doesn't show the deadlock is related to module_exit().

> 
> 
> And gdb:
> 
> (gdb) l *(__kernfs_remove+0x21e)
> 0xffffffff8139288e is in __kernfs_remove (fs/kernfs/dir.c:476).
> 471                     if (atomic_read(&kn->active) != KN_DEACTIVATED_BIAS)
> 472                             lock_contended(&kn->dep_map, _RET_IP_);
> 473             }
> 474
> 475             /* but everyone should wait for draining */
> 476             wait_event(root->deactivate_waitq,
> 477                        atomic_read(&kn->active) == KN_DEACTIVATED_BIAS);
> 478
> 479             if (kernfs_lockdep(kn)) {
> 480                     lock_acquired(&kn->dep_map, _RET_IP_);
> 
> (gdb) l *(kernfs_remove_by_name_ns+0x50)
> 0xffffffff813938d0 is in kernfs_remove_by_name_ns (fs/kernfs/dir.c:1534).
> 1529
> 1530            kn = kernfs_find_ns(parent, name, ns);
> 1531            if (kn)
> 1532                    __kernfs_remove(kn);
> 1533
> 1534            up_write(&kernfs_rwsem);
> 1535
> 1536            if (kn)
> 1537                    return 0;
> 1538            else
> 
> The same happens for test 0028 except instead of a mutex
> lock an rtnl_lock() is used.
> 
> Would this be better for the commit log?
> 
> > And why does it make a difference in case of being
> > called from module_exit()?
> 
> Well because that is where we remove the sysfs files. *If*
> a developer happens to use a lock on a sysfs op but it is
> also used on module exit, this deadlock is bound to happen.

It is clearly an AA deadlock. What I meant was that it isn't related to
module exit, because the lock and device_del() aren't always used in module
exit, so I doubt your fix of grabbing the module refcnt is good or generic
enough.

Except for your cooked test_sysfs module, how many real drivers suffer from
the problem? What are they? Why can't we fix the exact driver?


Thanks,
Ming


* Re: [PATCH v8 09/12] sysfs: fix deadlock race with module removal
  2021-10-13  1:07           ` Ming Lei
@ 2021-10-13 12:35             ` Luis Chamberlain
  2021-10-13 15:04               ` Ming Lei
  0 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-13 12:35 UTC (permalink / raw)
  To: Ming Lei, Miroslav Benes
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel

On Wed, Oct 13, 2021 at 09:07:03AM +0800, Ming Lei wrote:
> On Tue, Oct 12, 2021 at 02:18:28PM -0700, Luis Chamberlain wrote:
> > > Looks test_sysfs isn't in linus tree, where can I find it?
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210927-sysfs-generic-deadlock-fix
> > 
> > To reproduce the deadlock revert the patch in this thread and then run
> > either of these two tests as root:
> > 
> > ./tools/testing/selftests/sysfs/sysfs.sh -w 0027
> > ./tools/testing/selftests/sysfs/sysfs.sh -w 0028
> > 
> > You will need to enable the test_sysfs driver.
> > > Can you share the code which waits for the sysfs / kernfs files to be
> > > stop being used?
> > 
> > How about a call trace of the two tasks which deadlock, here is one of
> > running test 0027:
> > 
> > kdevops login: [  363.875459] INFO: task sysfs.sh:1271 blocked for more
> > than 120 seconds.

<-- snip -->

> That doesn't show the deadlock is related with module_exit().

Not directly no.

> It is clearly one AA deadlock, what I meant was that it isn't related with
> module exit cause lock & device_del() isn't always done in module exit, so
> I doubt your fix with grabbing module refcnt is good or generic enough.

A device_del() *can* happen in areas other than module exit, sure,
but the issue arises if a shared lock is used *before* device_del() and is
also used on a sysfs op. Typically this can happen on module exit, and the
other common use case in my experience is on sysfs ops, as is the case
with the zram driver. Both cases are then covered by this fix.

If there are other areas, that is still driver specific, but of the
things we *can* generalize, definitely module exit is a common path.
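
To make the condition above concrete, here is a minimal single-threaded
userspace model of it. This is a sketch only, not kernel code: struct
attr_model and all model_* helpers are hypothetical names, the "lock" is a
plain flag, and KN_DEACTIVATED_BIAS is assumed to mirror the kernfs
definition. It shows why removal cannot finish while a sysfs op holds an
active reference, and why that becomes a deadlock once the removal path
also holds the lock the op is waiting for.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Assumed to mirror kernfs: a large negative bias added when removal starts. */
#define KN_DEACTIVATED_BIAS (INT_MIN + 1)

/* Hypothetical model of one sysfs attribute and the driver's shared lock. */
struct attr_model {
	int  active;     /* in-flight ops, plus the bias once removal starts */
	bool lock_held;  /* shared lock, here taken by the removal path */
};

/* Analog of kernfs_get_active(): refused once the count is negative. */
static bool model_get_active(struct attr_model *a)
{
	if (a->active < 0)
		return false;
	a->active++;
	return true;
}

/* Analog of kernfs_put_active(): the op is done with the file. */
static void model_put_active(struct attr_model *a)
{
	/* There must be a reference to drop. */
	assert(a->active != 0 && a->active != KN_DEACTIVATED_BIAS);
	a->active--;
}

/* Analog of the deactivation done when removal of the file starts. */
static void model_deactivate(struct attr_model *a)
{
	a->active += KN_DEACTIVATED_BIAS;
}

/* The condition the removal path sleeps on in wait_event(). */
static bool model_drained(const struct attr_model *a)
{
	return a->active == KN_DEACTIVATED_BIAS;
}

/*
 * Removal holds the lock and waits for the drain, while an op holds an
 * active reference and waits for the lock: neither side can progress.
 */
static bool model_deadlocked(const struct attr_model *a)
{
	return a->lock_held && a->active < 0 && !model_drained(a);
}
```

Note that nothing in the model refers to module exit: the cycle only needs
the lock to be held across the removal while an op is in flight.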

> Except for your cooked test_sys module, how many real drivers do suffer the
> problem? What are they?

I only really seriously considered trying to generalize this after it
was hinted to me live patching was also affected, and so clearly
something generic was desirable.

There may be other drivers for sure, but hunting for them semantically
would require a somewhat complex Coccinelle patch with iteration support.

> Why can't we fix the exact driver?

You can try, but the way the lock is used in zram is correct, especially
after my other fix in this series which addresses another unrelated bug
with cpu hotplug multistate support. So we then can proceed to either
take the position to say: "Thou shalt not use a shared lock on module
exit and a sysfs op" and try to fix all places, or we generalize a fix
for this. A generic fix seems more desirable.

  Luis

* Re: [PATCH v8 09/12] sysfs: fix deadlock race with module removal
  2021-10-11 22:26     ` Luis Chamberlain
@ 2021-10-13 12:41       ` Luis Chamberlain
  0 siblings, 0 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-13 12:41 UTC (permalink / raw)
  To: Kees Cook
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, rostedt, linux-spdx, linux-doc,
	linux-block, linux-fsdevel, linux-kselftest, linux-kernel

On Mon, Oct 11, 2021 at 03:26:02PM -0700, Luis Chamberlain wrote:
> On Tue, Oct 05, 2021 at 01:50:31PM -0700, Kees Cook wrote:
> > For example, why does this not work?
> 
> It does for the write case for sure,

I misspoke, just for the record, the changes you mentioned actually don't
suffice for the test cases in question for test_sysfs, the deadlock
still occurs with those changes. At first I thought it did but I had failed
to remove my own fix first on fs/kernfs/dir.c. After removing that and
just trying the proposed changes I confirm it does not fix the deadlock.

  Luis

* Re: [PATCH v8 09/12] sysfs: fix deadlock race with module removal
  2021-10-13 12:35             ` Luis Chamberlain
@ 2021-10-13 15:04               ` Ming Lei
  2021-10-13 21:16                 ` Luis Chamberlain
  0 siblings, 1 reply; 94+ messages in thread
From: Ming Lei @ 2021-10-13 15:04 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Miroslav Benes, tj, gregkh, akpm, minchan, jeyu, shuah,
	bvanassche, dan.j.williams, joe, tglx, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel, ming.lei

On Wed, Oct 13, 2021 at 05:35:31AM -0700, Luis Chamberlain wrote:
> On Wed, Oct 13, 2021 at 09:07:03AM +0800, Ming Lei wrote:
> > On Tue, Oct 12, 2021 at 02:18:28PM -0700, Luis Chamberlain wrote:
> > > > Looks test_sysfs isn't in linus tree, where can I find it?
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210927-sysfs-generic-deadlock-fix
> > > 
> > > To reproduce the deadlock revert the patch in this thread and then run
> > > either of these two tests as root:
> > > 
> > > ./tools/testing/selftests/sysfs/sysfs.sh -w 0027
> > > ./tools/testing/selftests/sysfs/sysfs.sh -w 0028
> > > 
> > > You will need to enable the test_sysfs driver.
> > > > Can you share the code which waits for the sysfs / kernfs files to be
> > > > stop being used?
> > > 
> > > How about a call trace of the two tasks which deadlock, here is one of
> > > running test 0027:
> > > 
> > > kdevops login: [  363.875459] INFO: task sysfs.sh:1271 blocked for more
> > > than 120 seconds.
> 
> <-- snip -->
> 
> > That doesn't show the deadlock is related with module_exit().
> 
> Not directly no.

Then the patch title of 'sysfs: fix deadlock race with module removal'
is wrong.

> 
> > It is clearly one AA deadlock, what I meant was that it isn't related with
> > module exit cause lock & device_del() isn't always done in module exit, so
> > I doubt your fix with grabbing module refcnt is good or generic enough.
> 
> A device_del() *can* happen in other areas other than module exit sure,
> but the issue is if a shared lock is used *before* device_del() and also
> used on a sysfs op. Typically this can happen on module exit, and the
> other common use case in my experience is on sysfs ops, such is the case
> with the zram driver. Both cases are covered then by this fix.

Again, can you share the related zram code about the issue? In
zram_drv.c of the linus or next tree, I don't see any lock held before
calling del_gendisk().

> 
> If there are other areas, that is still driver specific, but of the
> things we *can* generalize, definitely module exit is a common path.
> 
> > Except for your cooked test_sys module, how many real drivers do suffer the
> > problem? What are they?
> 
> I only really seriously considered trying to generalize this after it

IMO your generalization isn't good or correct because this kind of issue
is _not_ related to module exit at all. What matters is just that one lock is
held before calling device_del() while the same lock is required
in the device's attribute show()/store() functions.

There are many cases in which we call device_del() outside of module_exit(),
such as scsi scan, scsi sysfs store(), or even handling events from
the device side, nvme error handling, usb hotplug, ...

> was hinted to me live patching was also affected, and so clearly
> something generic was desirable.

It might be that only two drivers (zram and live patch) have this bug, and
it is simply an AA bug in the driver. Not to mention I don't see such usage in
zram_drv.c.

> 
> There may be other drivers for sure, but a hunt for that with semantics
> would require a bit complex coccinelle patch with iteration support.
> 
> > Why can't we fix the exact driver?
> 
> You can try, the way the lock is used in zram is correct, specially

What is the lock in zram? Again, can you share the related functions?

> after my other fix in this series which addresses another unrelated bug
> with cpu hotplug multistate support. So we then can proceed to either
> take the position to say: "Thou shalt not use a shared lock on module
> exit and a sysfs op" and try to fix all places, or we generalize a fix
> for this. A generic fix seems more desirable.

What matters is that the lock is held before calling device_del()
instead of being held in module_exit().



Thanks,
Ming


* Re: [PATCH v8 09/12] sysfs: fix deadlock race with module removal
  2021-10-13 15:04               ` Ming Lei
@ 2021-10-13 21:16                 ` Luis Chamberlain
  0 siblings, 0 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-13 21:16 UTC (permalink / raw)
  To: Ming Lei
  Cc: Miroslav Benes, tj, gregkh, akpm, minchan, jeyu, shuah,
	bvanassche, dan.j.williams, joe, tglx, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel, mcgrof

On Wed, Oct 13, 2021 at 11:04:07PM +0800, Ming Lei wrote:
> On Wed, Oct 13, 2021 at 05:35:31AM -0700, Luis Chamberlain wrote:
> > On Wed, Oct 13, 2021 at 09:07:03AM +0800, Ming Lei wrote:
> > > On Tue, Oct 12, 2021 at 02:18:28PM -0700, Luis Chamberlain wrote:
> > > > > Looks test_sysfs isn't in linus tree, where can I find it?
> > > > 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210927-sysfs-generic-deadlock-fix
> > > > 
> > > > To reproduce the deadlock revert the patch in this thread and then run
> > > > either of these two tests as root:
> > > > 
> > > > ./tools/testing/selftests/sysfs/sysfs.sh -w 0027
> > > > ./tools/testing/selftests/sysfs/sysfs.sh -w 0028
> > > > 
> > > > You will need to enable the test_sysfs driver.
> > > > > Can you share the code which waits for the sysfs / kernfs files to be
> > > > > stop being used?
> > > > 
> > > > How about a call trace of the two tasks which deadlock, here is one of
> > > > running test 0027:
> > > > 
> > > > kdevops login: [  363.875459] INFO: task sysfs.sh:1271 blocked for more
> > > > than 120 seconds.
> > 
> > <-- snip -->
> > 
> > > That doesn't show the deadlock is related with module_exit().
> > 
> > Not directly no.
> 
> Then the patch title of 'sysfs: fix deadlock race with module removal'
> is wrong.

Well that is what it does though. The scope of the issue you are raising
is beyond module removal, but I do agree such races can exist outside of
module removal.

> > > It is clearly one AA deadlock, what I meant was that it isn't related with
> > > module exit cause lock & device_del() isn't always done in module exit, so
> > > I doubt your fix with grabbing module refcnt is good or generic enough.
> > 
> > A device_del() *can* happen in other areas other than module exit sure,
> > but the issue is if a shared lock is used *before* device_del() and also
> > used on a sysfs op. Typically this can happen on module exit, and the
> > other common use case in my experience is on sysfs ops, such is the case
> > with the zram driver. Both cases are covered then by this fix.
> 
> Again, can you share the related zram code about the issue? In
> zram_drv.c of linus or next tree, I don't see any lock is held before
> calling del_gendisk().

There is another bug with CPU hotplug multistate support in the zram
driver which a patch in this series fixes, refer to the patch titled
"zram: fix crashes with cpu hotplug multistate". In zram's case we need
to contend a generic lock on certain sysfs attributes due to the way CPU
hotplug is used.

If we tried to generalize this in the block layer the closest we get is
the disk->fops->owner, however zram is an example driver where the
disk->fops can actually even be changed *after* module load, and so the
original disk->fops->owner can be dynamic. In zram's case the
fops->owner is the same, however we have no semantics to ensure this is
the case for all block drivers.

In the case for live patching, refer to the use of klp_mutex. The way
that was solved there was a combination of completions and deferred
works to solve it, so that all kobject_put calls are outside of the
critical sections, refer to commit 3ec24776bfd0 ("livepatch:
allow removal of a disabled patch").

And so seeking a generic solution was encouraged.

> > If there are other areas, that is still driver specific, but of the
> > things we *can* generalize, definitely module exit is a common path.
> > 
> > > Except for your cooked test_sys module, how many real drivers do suffer the
> > > problem? What are they?
> > 
> > I only really seriously considered trying to generalize this after it
> 
> IMO your generalization isn't good or correct because this kind of issue
> is _not_ related with module exit at all. What matters is just that one lock is
> held before calling device_del(), meantime the same lock is required
> in the device's attribute show/store function().

Your point that a race for a deadlock still can exist beyond module
removal is valid but unfortunately there are no possible semantics I can
see to fix that generically at this time.

> There are many cases in which we call device_del() not from module_exit(),
> such as scsi scan, scsi sysfs store(), or even handling event from
> device side, nvme error handling, usb hotplug, ...

These are really good points.

> > was hinted to me live patching was also affected, and so clearly
> > something generic was desirable.
> 
> It might be just the only two drivers(zram and live patch) with this bug, and
> it is one simply AA bug in driver. Not mention I don't see such usage in
> zram_drv.c.

Well... given what you say above about use cases other than
module removal which can remove sysfs files while they are in use,
the likelihood of this deadlock existing elsewhere should increase,
not decrease.

> > There may be other drivers for sure, but a hunt for that with semantics
> > would require a bit complex coccinelle patch with iteration support.
> > 
> > > Why can't we fix the exact driver?
> > 
> > You can try, the way the lock is used in zram is correct, specially
> 
> What is the lock in zram? Again can you share the related functions?

If you git checked out the tree I mentioned try looking at the code
there with the fix for CPU hotplug multistate in mind.

> > after my other fix in this series which addresses another unrelated bug
> > with cpu hotplug multistate support. So we then can proceed to either
> > take the position to say: "Thou shalt not use a shared lock on module
> > exit and a sysfs op" and try to fix all places, or we generalize a fix
> > for this. A generic fix seems more desirable.
> 
> What matters is that the lock is held before calling device_del()
> instead of being held in module_exit().

I agree the possibilities can include more than just module exit.
Unfortunately I can't see a way to generalize this further. I tried,
see below, and this moves the ideas from a module to the kobject, but
even with that, it does not get us any closer to fixing this
generically. The reason a fix works for module removal is the
try_module_get() call when getting the kernfs active reference
will trump the module exit call completely, and so we *do* prevent
the context which will issue the lock in this case if a sysfs
operation is in progress.

Outside of that call sequence I am afraid we'd need separate solutions
or side with the 'thou shalt not use a shared lock on a sysfs op
and when issuing a device_del(), other than module exit'.
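
The reason the module removal case can be closed off, as described above,
can be sketched with a small single-threaded userspace model. This is an
illustration under stated assumptions, not the patch itself: struct
module_model and the model_* helpers are hypothetical names, the counter is
a plain int rather than the kernel's atomic refcount, forced unload is
ignored, and the error codes are only assumed to resemble what the kernel
returns.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum mod_state { MOD_LIVE, MOD_GOING };

/* Hypothetical model of the module owning the sysfs attribute. */
struct module_model {
	enum mod_state state;
	int refcnt;  /* references pinned by in-flight sysfs ops */
};

/* Analog of try_module_get(): refused once the module is going away. */
static bool model_try_module_get(struct module_model *mod)
{
	if (mod->state != MOD_LIVE)
		return false;
	mod->refcnt++;
	return true;
}

/* Analog of module_put(). */
static void model_module_put(struct module_model *mod)
{
	assert(mod->refcnt > 0);
	mod->refcnt--;
}

/*
 * Analog of the delete_module() path: while any reference is held the
 * request is refused, so the exit routine (and the lock it takes) never
 * runs concurrently with a sysfs op.
 */
static int model_delete_module(struct module_model *mod)
{
	if (mod->refcnt != 0)
		return -EWOULDBLOCK;
	mod->state = MOD_GOING;
	return 0;
}

/* A sysfs op pins the module when it takes its active reference. */
static int model_store_begin(struct module_model *mod)
{
	if (!model_try_module_get(mod))
		return -ENODEV;  /* module is on its way out: never take the lock */
	return 0;
}

/* The op drops the module reference once the store callback returns. */
static void model_store_end(struct module_model *mod)
{
	model_module_put(mod);
}
```

Either the op wins and the removal request is refused, or removal wins and
the op bails out before touching the shared lock. A device_del() issued from
any other context has no equivalent gate, which is the limitation discussed
above.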

Below is an attempt to generalize this further, but it does not work,
let me know if you have further ideas.

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b57b3db9a6a7..4edf3b37fd2c 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -209,7 +209,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
 
 	kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
-				  0, rft->kf_ops, rft, NULL, NULL);
+				  0, rft->kf_ops, rft, NULL, NULL, NULL);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 
@@ -2482,7 +2482,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
 
 	kn = __kernfs_create_file(parent_kn, name, 0444,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
-				  &kf_mondata_ops, priv, NULL, NULL);
+				  &kf_mondata_ops, priv, NULL, NULL, NULL);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 7758223f040c..38f07072ab44 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -3507,6 +3507,7 @@ bool kill_device(struct device *dev)
 	if (dev->p->dead)
 		return false;
 	dev->p->dead = true;
+	kobject_set_being_removed(&dev->kobj);
 	return true;
 }
 EXPORT_SYMBOL_GPL(kill_device);
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index ba581429bf7b..7d14f6b2c12d 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -14,6 +14,7 @@
 #include <linux/slab.h>
 #include <linux/security.h>
 #include <linux/hash.h>
+#include <linux/kobject.h>
 
 #include "kernfs-internal.h"
 
@@ -414,15 +415,38 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
  */
 struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
 {
+	int v;
+
 	if (unlikely(!kn))
 		return NULL;
 
 	if (!atomic_inc_unless_negative(&kn->active))
 		return NULL;
 
+	/*
+	 * If a kobject created the kernfs_node, the kobject cannot possibly be
+	 * removed if the above atomic_inc_unless_negative() succeeded. But we
+	 * need to inspect if its on its way out to ensure that we don't
+	 * deadlock in case a kernfs operation and the code responsible for
+	 * the kobject removal used a shared lock.
+	 */
+	if (kn->kobj) {
+		if (WARN_ON(!kobject_get_unless_zero(kn->kobj))) {
+			goto fail;
+		} else if (kobject_being_removed(kn->kobj)) {
+			kobject_put(kn->kobj);
+			goto fail;
+		}
+	}
+
 	if (kernfs_lockdep(kn))
 		rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
 	return kn;
+fail:
+	v = atomic_dec_return(&kn->active);
+	if (unlikely(v == KN_DEACTIVATED_BIAS))
+		wake_up_all(&kernfs_root(kn)->deactivate_waitq);
+	return NULL;
 }
 
 /**
@@ -442,6 +466,7 @@ void kernfs_put_active(struct kernfs_node *kn)
 	if (kernfs_lockdep(kn))
 		rwsem_release(&kn->dep_map, _RET_IP_);
 	v = atomic_dec_return(&kn->active);
+	kobject_put(kn->kobj);
 	if (likely(v != KN_DEACTIVATED_BIAS))
 		return;
 
@@ -572,7 +597,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 					     struct kernfs_node *parent,
 					     const char *name, umode_t mode,
 					     kuid_t uid, kgid_t gid,
-					     unsigned flags)
+					     unsigned flags,
+					     struct kobject *kobj)
 {
 	struct kernfs_node *kn;
 	u32 id_highbits;
@@ -607,6 +633,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 	kn->name = name;
 	kn->mode = mode;
 	kn->flags = flags;
+	kn->kobj = kobj;
 
 	if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) {
 		struct iattr iattr = {
@@ -640,12 +667,13 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
 				    const char *name, umode_t mode,
 				    kuid_t uid, kgid_t gid,
-				    unsigned flags)
+				    unsigned flags,
+				    struct kobject *kobj)
 {
 	struct kernfs_node *kn;
 
 	kn = __kernfs_new_node(kernfs_root(parent), parent,
-			       name, mode, uid, gid, flags);
+			       name, mode, uid, gid, flags, kobj);
 	if (kn) {
 		kernfs_get(parent);
 		kn->parent = parent;
@@ -927,7 +955,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 
 	kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO,
 			       GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
-			       KERNFS_DIR);
+			       KERNFS_DIR, NULL);
 	if (!kn) {
 		idr_destroy(&root->ino_idr);
 		kfree(root);
@@ -969,20 +997,22 @@ void kernfs_destroy_root(struct kernfs_root *root)
  * @gid: gid of the new directory
  * @priv: opaque data associated with the new directory
  * @ns: optional namespace tag of the directory
+ * @kobj: if set, the kobject responsible for this directory
  *
  * Returns the created node on success, ERR_PTR() value on failure.
  */
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
 					 kuid_t uid, kgid_t gid,
-					 void *priv, const void *ns)
+					 void *priv, const void *ns,
+					 struct kobject *kobj)
 {
 	struct kernfs_node *kn;
 	int rc;
 
 	/* allocate */
 	kn = kernfs_new_node(parent, name, mode | S_IFDIR,
-			     uid, gid, KERNFS_DIR);
+			     uid, gid, KERNFS_DIR, kobj);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
@@ -1014,7 +1044,8 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 
 	/* allocate */
 	kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR,
-			     GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
+			     GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR,
+			     parent->kobj);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 4479c6580333..1b02f3e69c81 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -978,6 +978,7 @@ const struct file_operations kernfs_file_fops = {
  * @priv: private data for the file
  * @ns: optional namespace tag of the file
  * @key: lockdep key for the file's active_ref, %NULL to disable lockdep
+ * @kobj: if set, the kobject responsible for the file
  *
  * Returns the created node on success, ERR_PTR() value on error.
  */
@@ -987,7 +988,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 loff_t size,
 					 const struct kernfs_ops *ops,
 					 void *priv, const void *ns,
-					 struct lock_class_key *key)
+					 struct lock_class_key *key,
+					 struct kobject *kobj)
 {
 	struct kernfs_node *kn;
 	unsigned flags;
@@ -996,7 +998,7 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 	flags = KERNFS_FILE;
 
 	kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG,
-			     uid, gid, flags);
+			     uid, gid, flags, kobj);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 9e3abf597e2d..44983720d292 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -134,7 +134,8 @@ int kernfs_add_one(struct kernfs_node *kn);
 struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
 				    const char *name, umode_t mode,
 				    kuid_t uid, kgid_t gid,
-				    unsigned flags);
+				    unsigned flags,
+				    struct kobject *kobj);
 
 /*
  * file.c
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 19a6c71c6ff5..c877de06e53a 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,7 +36,8 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
 		gid = target->iattr->ia_gid;
 	}
 
-	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
+	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK,
+			     target->kobj);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index b6b6796e1616..9cc159e9fb55 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -57,7 +57,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
 	kobject_get_ownership(kobj, &uid, &gid);
 
 	kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
-				  kobj, ns);
+				  kobj, ns, kobj);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, kobject_name(kobj));
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 42dcf96881b6..e1a3315dba35 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -292,7 +292,7 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent,
 #endif
 
 	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
-				  PAGE_SIZE, ops, (void *)attr, ns, key);
+				  PAGE_SIZE, ops, (void *)attr, ns, key, kobj);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, attr->name);
@@ -309,6 +309,7 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent,
 	struct lock_class_key *key = NULL;
 	const struct kernfs_ops *ops;
 	struct kernfs_node *kn;
+	struct kobject *kobj = parent->priv;
 
 	if (battr->mmap)
 		ops = &sysfs_bin_kfops_mmap;
@@ -327,7 +328,8 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent,
 #endif
 
 	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
-				  battr->size, ops, (void *)attr, ns, key);
+				  battr->size, ops, (void *)attr, ns, key,
+				  kobj);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, attr->name);
diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index eeb0e3099421..36022fe2b21d 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -135,7 +135,8 @@ static int internal_create_group(struct kobject *kobj, int update,
 		} else {
 			kn = kernfs_create_dir_ns(kobj->sd, grp->name,
 						  S_IRWXU | S_IRUGO | S_IXUGO,
-						  uid, gid, kobj, NULL);
+						  uid, gid, kobj, NULL,
+						  kobj);
 			if (IS_ERR(kn)) {
 				if (PTR_ERR(kn) == -EEXIST)
 					sysfs_warn_dup(kobj->sd, grp->name);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index cd968ee2b503..38155414e6e5 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -161,6 +161,7 @@ struct kernfs_node {
 	unsigned short		flags;
 	umode_t			mode;
 	struct kernfs_iattrs	*iattr;
+	struct kobject		*kobj;
 };
 
 /*
@@ -370,7 +371,8 @@ void kernfs_destroy_root(struct kernfs_root *root);
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
 					 kuid_t uid, kgid_t gid,
-					 void *priv, const void *ns);
+					 void *priv, const void *ns,
+					 struct kobject *kobj);
 struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 					    const char *name);
 struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
@@ -379,7 +381,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 loff_t size,
 					 const struct kernfs_ops *ops,
 					 void *priv, const void *ns,
-					 struct lock_class_key *key);
+					 struct lock_class_key *key,
+					 struct kobject *kobj);
 struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
 				       const char *name,
 				       struct kernfs_node *target);
@@ -472,14 +475,15 @@ static inline void kernfs_destroy_root(struct kernfs_root *root) { }
 static inline struct kernfs_node *
 kernfs_create_dir_ns(struct kernfs_node *parent, const char *name,
 		     umode_t mode, kuid_t uid, kgid_t gid,
-		     void *priv, const void *ns)
+		     void *priv, const void *ns, struct kobject *kobj)
 { return ERR_PTR(-ENOSYS); }
 
 static inline struct kernfs_node *
 __kernfs_create_file(struct kernfs_node *parent, const char *name,
 		     umode_t mode, kuid_t uid, kgid_t gid,
 		     loff_t size, const struct kernfs_ops *ops,
-		     void *priv, const void *ns, struct lock_class_key *key)
+		     void *priv, const void *ns, struct lock_class_key *key,
+		     struct kobject *kobj)
 { return ERR_PTR(-ENOSYS); }
 
 static inline struct kernfs_node *
@@ -566,7 +570,7 @@ kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
 {
 	return kernfs_create_dir_ns(parent, name, mode,
 				    GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
-				    priv, NULL);
+				    priv, NULL, parent->kobj);
 }
 
 static inline int kernfs_remove_by_name(struct kernfs_node *parent,
diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index efd56f990a46..cb26ebeb7cf1 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -77,6 +77,7 @@ struct kobject {
 	unsigned int state_add_uevent_sent:1;
 	unsigned int state_remove_uevent_sent:1;
 	unsigned int uevent_suppress:1;
+	unsigned int being_removed:1;
 };
 
 extern __printf(2, 3)
@@ -117,6 +118,15 @@ extern void kobject_get_ownership(struct kobject *kobj,
 				  kuid_t *uid, kgid_t *gid);
 extern char *kobject_get_path(struct kobject *kobj, gfp_t flag);
 
+static inline bool kobject_being_removed(const struct kobject *kobj)
+{
+	if (!kobj)
+		return false;
+	return !!kobj->being_removed;
+}
+
+void kobject_set_being_removed(struct kobject *kobj);
+
 /**
  * kobject_has_children - Returns whether a kobject has children.
  * @kobj: the object to test
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9e0390000025..c6b0a28f599c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3975,7 +3975,7 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
 				  cgroup_file_mode(cft),
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
 				  0, cft->kf_ops, cft,
-				  NULL, key);
+				  NULL, key, NULL);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 
diff --git a/lib/kobject.c b/lib/kobject.c
index 4a56f519139d..ef89bf2ac218 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -221,6 +221,12 @@ static void kobject_init_internal(struct kobject *kobj)
 	kobj->state_initialized = 1;
 }
 
+void kobject_set_being_removed(struct kobject *kobj)
+{
+	if (!kobj)
+		return;
+	kobj->being_removed = 1;
+}
 
 static int kobject_add_internal(struct kobject *kobj)
 {


* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-09-27 16:38 ` [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate Luis Chamberlain
  2021-10-05 20:55   ` Kees Cook
@ 2021-10-14  1:55   ` Ming Lei
  2021-10-14  2:11     ` Ming Lei
  1 sibling, 1 reply; 94+ messages in thread
From: Ming Lei @ 2021-10-14  1:55 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel, ming.lei

On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
> Provide a simple state machine to fix races with driver exit where we
> remove the CPU multistate callbacks and re-initialization / creation of
> new per CPU instances which should be managed by these callbacks.
> 
> The zram driver makes use of cpu hotplug multistate support, whereby it
> associates a struct zcomp per CPU. Each struct zcomp represents a
> compression algorithm in charge of managing compression streams per
> CPU. Although a compiled zram driver only supports a fixed set of
> compression algorithms, each zram device gets a struct zcomp allocated
> per CPU. The "multi" in CPU hotplug multistate refers to these per
> cpu struct zcomp instances. Each of these will have the CPU hotplug
> callback called for it on CPU plug / unplug. The kernel's CPU hotplug
> multistate keeps a linked list of these different structures so that
> it will iterate over them on CPU transitions.
> 
> By default at driver initialization we will create just one zram device
> (num_devices=1) and a zcomp structure is then set for the now default
> lzo-rle compression algorithm. At driver removal we first remove each
> zram device, and so we destroy the associated struct zcomp per CPU. But
> since we expose sysfs attributes to create new devices or reset /
> initialize existing zram devices, we can easily end up re-initializing
> a struct zcomp for a zram device before the exit routine of the module
> removes the cpu hotplug callback. When this happens the kernel's CPU
> hotplug will detect that at least one instance (struct zcomp for us)
> exists. This can happen in the following situation:
> 
> CPU 1                            CPU 2
> 
>                                 disksize_store(...);
> class_unregister(...);
> idr_for_each(...);
> zram_debugfs_destroy();
> 
> idr_destroy(...);
> unregister_blkdev(...);
> cpuhp_remove_multi_state(...);
> 
> The warning comes up on cpuhp_remove_multi_state() when it sees that the
> state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list.
> In this case a struct zcomp still exists: the driver allowed its
> creation per CPU even though we could have just freed them per CPU
> through a call on another CPU, and we are then later trying to remove
> the hotplug callback.
> 
> Fix all this by providing a zram initialization boolean
> protected by the driver's shared zram_index_mutex, which we
> can use to annotate when sysfs attributes are safe to use or
> not -- once the driver is properly initialized. When the driver
> is going down we also are sure to not let userspace muck with
> attributes which may affect each per cpu struct zcomp.
> 
> This also fixes a series of possible memory leaks. The
> crashes and memory leaks can easily be caused by issuing
> the zram02.sh script from the LTP project [0] in a loop
> in two separate windows:
> 
>   cd testcases/kernel/device-drivers/zram
>   while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done
> 
> You end up with a splat as follows:
> 
> kernel: zram: Removed device: zram0
> kernel: zram: Added device: zram0
> kernel: zram0: detected capacity change from 0 to 209715200
> kernel: Adding 104857596k swap on /dev/zram0.  <etc>
> kernel: zram0: detected capacity change from 209715200 to 0
> kernel: zram0: detected capacity change from 0 to 209715200
> kernel: ------------[ cut here ]------------
> kernel: Error: Removing state 63 which has instances left.
> kernel: WARNING: CPU: 7 PID: 70457 at \
> 	kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
> kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
> kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G            \
> 	E     5.12.0-rc1-next-20210304 #3
> kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), \
> 	BIOS 1.14.0-2 04/01/2014
> kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
> kernel: Code: <etc>
> kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
> kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
> kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
> kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: ffffa800c139bcb8
> kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
> kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
> kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:<etc>
> kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
> kernel: Call Trace:
> kernel: __cpuhp_remove_state+0x2e/0x80
> kernel: __do_sys_delete_module+0x190/0x2a0
> kernel:  do_syscall_64+0x33/0x80
> kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> The "Error: Removing state 63 which has instances left" refers
> to the zram per CPU struct zcomp instances left.
> 
> [0] https://github.com/linux-test-project/ltp.git
> 
> Acked-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---

Hello Luis,

Can you test the following patch and see if the issue can be addressed?

Please see the idea from the inline comment.

Also, unlike your patch, this does not need zram_index_mutex in the zram
disk's store() handlers, so the deadlock issue you are addressing in this
series can be avoided.


diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fcaf2750f68f..3c17927d23a7 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
 
 	/* Make sure all the pending I/O are finished */
 	fsync_bdev(bdev);
-	zram_reset_device(zram);
 
 	pr_info("Removed device: %s\n", zram->disk->disk_name);
 
 	del_gendisk(zram->disk);
+
+	/*
+	 * reset device after gendisk is removed, so any change from sysfs
+	 * store won't come in, then we can really reset device here
+	 */
+	zram_reset_device(zram);
+
 	blk_cleanup_disk(zram->disk);
 	kfree(zram);
 	return 0;
@@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
 static void destroy_devices(void)
 {
 	class_unregister(&zram_control_class);
+
+	/* hold the global lock so new device can't be added */
+	mutex_lock(&zram_index_mutex);
 	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
+	mutex_unlock(&zram_index_mutex);
+
 	zram_debugfs_destroy();
 	idr_destroy(&zram_index_idr);
 	unregister_blkdev(zram_major, "zram");

Thanks,
Ming



* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-14  1:55   ` Ming Lei
@ 2021-10-14  2:11     ` Ming Lei
  2021-10-14 20:24       ` Luis Chamberlain
  0 siblings, 1 reply; 94+ messages in thread
From: Ming Lei @ 2021-10-14  2:11 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel

On Thu, Oct 14, 2021 at 09:55:48AM +0800, Ming Lei wrote:
> On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:

...

> 
> Hello Luis,
> 
> Can you test the following patch and see if the issue can be addressed?
> 
> Please see the idea from the inline comment.
> 
> Also zram_index_mutex isn't needed in zram disk's store() compared with
> your patch, then the deadlock issue you are addressing in this series can
> be avoided.
> 
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index fcaf2750f68f..3c17927d23a7 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
>  
>  	/* Make sure all the pending I/O are finished */
>  	fsync_bdev(bdev);
> -	zram_reset_device(zram);
>  
>  	pr_info("Removed device: %s\n", zram->disk->disk_name);
>  
>  	del_gendisk(zram->disk);
> +
> +	/*
> +	 * reset device after gendisk is removed, so any change from sysfs
> +	 * store won't come in, then we can really reset device here
> +	 */
> +	zram_reset_device(zram);
> +
>  	blk_cleanup_disk(zram->disk);
>  	kfree(zram);
>  	return 0;
> @@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
>  static void destroy_devices(void)
>  {
>  	class_unregister(&zram_control_class);
> +
> +	/* hold the global lock so new device can't be added */
> +	mutex_lock(&zram_index_mutex);
>  	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> +	mutex_unlock(&zram_index_mutex);
> +

Actually zram_index_mutex isn't needed when calling zram_remove_cb(),
since the zram-control sysfs interface has already been removed and
userspace can't add a new device any more. So the issue is supposed to be
fixed by the following change, which just moves one line; please test it:

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fcaf2750f68f..96dd641de233 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
 
 	/* Make sure all the pending I/O are finished */
 	fsync_bdev(bdev);
-	zram_reset_device(zram);
 
 	pr_info("Removed device: %s\n", zram->disk->disk_name);
 
 	del_gendisk(zram->disk);
+
+	/*
+	 * reset device after gendisk is removed, so any change from sysfs
+	 * store won't come in, then we can really reset device here
+	 */
+	zram_reset_device(zram);
+
 	blk_cleanup_disk(zram->disk);
 	kfree(zram);
 	return 0;



Thanks, 
Ming



* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-14  2:11     ` Ming Lei
@ 2021-10-14 20:24       ` Luis Chamberlain
  2021-10-14 23:52         ` Ming Lei
  0 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-14 20:24 UTC (permalink / raw)
  To: Ming Lei
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel

On Thu, Oct 14, 2021 at 10:11:46AM +0800, Ming Lei wrote:
> On Thu, Oct 14, 2021 at 09:55:48AM +0800, Ming Lei wrote:
> > On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
> 
> ...
> 
> > 
> > Hello Luis,
> > 
> > Can you test the following patch and see if the issue can be addressed?
> > 
> > Please see the idea from the inline comment.
> > 
> > Also zram_index_mutex isn't needed in zram disk's store() compared with
> > your patch, then the deadlock issue you are addressing in this series can
> > be avoided.
> > 
> > 
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index fcaf2750f68f..3c17927d23a7 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
> >  
> >  	/* Make sure all the pending I/O are finished */
> >  	fsync_bdev(bdev);
> > -	zram_reset_device(zram);
> >  
> >  	pr_info("Removed device: %s\n", zram->disk->disk_name);
> >  
> >  	del_gendisk(zram->disk);
> > +
> > +	/*
> > +	 * reset device after gendisk is removed, so any change from sysfs
> > +	 * store won't come in, then we can really reset device here
> > +	 */
> > +	zram_reset_device(zram);
> > +
> >  	blk_cleanup_disk(zram->disk);
> >  	kfree(zram);
> >  	return 0;
> > @@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
> >  static void destroy_devices(void)
> >  {
> >  	class_unregister(&zram_control_class);
> > +
> > +	/* hold the global lock so new device can't be added */
> > +	mutex_lock(&zram_index_mutex);
> >  	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> > +	mutex_unlock(&zram_index_mutex);
> > +
> 
> Actually zram_index_mutex isn't needed when calling zram_remove_cb()
> since the zram-control sysfs interface has been removed, so userspace
> can't add new device any more, then the issue is supposed to be fixed
> by the following one line change, please test it:
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index fcaf2750f68f..96dd641de233 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
>  
>  	/* Make sure all the pending I/O are finished */
>  	fsync_bdev(bdev);
> -	zram_reset_device(zram);
>  
>  	pr_info("Removed device: %s\n", zram->disk->disk_name);
>  
>  	del_gendisk(zram->disk);
> +
> +	/*
> +	 * reset device after gendisk is removed, so any change from sysfs
> +	 * store won't come in, then we can really reset device here
> +	 */
> +	zram_reset_device(zram);
> +
>  	blk_cleanup_disk(zram->disk);
>  	kfree(zram);
>  	return 0;

Sorry, but nope: the cpu multistate issue is still present and we
eventually end up with page faults. I tried with both patches.

Oct 14 20:21:34 kdevops kernel: ------------[ cut here ]------------
Oct 14 20:21:34 kdevops kernel: Error: Removing state 65 which has
instances left.
Oct 14 20:21:34 kdevops kernel: WARNING: CPU: 4 PID: 3358 at
kernel/cpu.c:2151 __cpuhp_remove_state_cpuslocked+0xf9/0x100
Oct 14 20:21:34 kdevops kernel: Modules linked in: zram(E-) zstd(E)
zsmalloc(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E)
crc32_pclmul(E) ghash_clmulni_intel(E) >
Oct 14 20:21:34 kdevops kernel: CPU: 4 PID: 3358 Comm: rmmod Tainted: G
E     5.15.0-rc3-next-20210927+ #89
Oct 14 20:21:34 kdevops kernel: Hardware name: QEMU Standard PC (i440FX
+ PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Oct 14 20:21:34 kdevops kernel: RIP:
0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
Oct 14 20:21:34 kdevops kernel: Code: 21 00 48 c7 43 18 00 00 00 00 5b
5d 41 5c 41 5d 41 5e 41 5f e9 d8 17 84 00 0f 0b 44 89 e6 48 c7 c7 78 0c
8b ad e8 56 92 7f 00 <0f> 0b >
Oct 14 20:21:34 kdevops kernel: RSP: 0018:ffffaac980a1fe90 EFLAGS:
00010286
Oct 14 20:21:34 kdevops kernel: RAX: 0000000000000000 RBX:
ffffffffada3e208 RCX: 0000000000000000
Oct 14 20:21:34 kdevops kernel: RDX: 0000000000000001 RSI:
ffffffffad8efdb6 RDI: 00000000ffffffff
Oct 14 20:21:34 kdevops kernel: RBP: 0000000000000000 R08:
0000000000000000 R09: ffffaac980a1fcc0
Oct 14 20:21:34 kdevops kernel: R10: ffffaac980a1fcb8 R11:
ffffffffadac3c68 R12: 0000000000000041
Oct 14 20:21:34 kdevops kernel: R13: 0000000000000a28 R14:
0000000000000000 R15: 0000000000000000
Oct 14 20:21:34 kdevops kernel: FS:  00007fc0c2882580(0000)
GS:ffff9ed6f7d00000(0000) knlGS:0000000000000000
Oct 14 20:21:34 kdevops kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Oct 14 20:21:34 kdevops kernel: CR2: 00005621b0490b78 CR3:
000000011a538005 CR4: 0000000000370ee0
Oct 14 20:21:34 kdevops kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Oct 14 20:21:34 kdevops kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
Oct 14 20:21:34 kdevops kernel: Call Trace:
Oct 14 20:21:34 kdevops kernel:  <TASK>
Oct 14 20:21:34 kdevops kernel:  __cpuhp_remove_state+0x4d/0xc0
Oct 14 20:21:34 kdevops kernel:  __do_sys_delete_module+0x18d/0x2a0
Oct 14 20:21:34 kdevops kernel:  ?
fpregs_assert_state_consistent+0x1e/0x40
Oct 14 20:21:34 kdevops kernel:  ? exit_to_user_mode_prepare+0x3a/0x180
Oct 14 20:21:34 kdevops kernel:  do_syscall_64+0x38/0xc0
Oct 14 20:21:34 kdevops kernel:
entry_SYSCALL_64_after_hwframe+0x44/0xae
Oct 14 20:21:34 kdevops kernel: RIP: 0033:0x7fc0c29a84a7
<etc>
Oct 14 20:21:35 kdevops kernel: sysfs: cannot create duplicate filename
'/devices/virtual/block/zram0'
Oct 14 20:21:35 kdevops kernel: CPU: 5 PID: 3388 Comm: modprobe Tainted:
G        W   E     5.15.0-rc3-next-20210927+ #89
Oct 14 20:21:35 kdevops kernel: Hardware name: QEMU Standard PC (i440FX
+ PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Oct 14 20:21:35 kdevops kernel: Call Trace:
Oct 14 20:21:35 kdevops kernel:  <TASK>
Oct 14 20:21:35 kdevops kernel:  dump_stack_lvl+0x48/0x5e
Oct 14 20:21:35 kdevops kernel:  sysfs_warn_dup.cold+0x17/0x24
Oct 14 20:21:35 kdevops kernel:  sysfs_create_dir_ns+0xbc/0xd0
Oct 14 20:21:35 kdevops kernel:  kobject_add_internal+0xbd/0x2b0
Oct 14 20:21:35 kdevops kernel:  kobject_add+0x7e/0xb0
Oct 14 20:21:35 kdevops kernel:  ? _raw_spin_unlock_irqrestore+0x25/0x40
Oct 14 20:21:35 kdevops kernel:  ? preempt_count_add+0x68/0xa0
Oct 14 20:21:35 kdevops kernel:  device_add+0x11a/0x980
Oct 14 20:21:35 kdevops kernel:  ? dev_set_name+0x53/0x70
Oct 14 20:21:35 kdevops kernel:  device_add_disk+0x9d/0x3a0
Oct 14 20:21:35 kdevops kernel:  zram_add+0x1ad/0x200 [zram]
Oct 14 20:21:35 kdevops kernel:  ? 0xffffffffc0c10000
Oct 14 20:21:35 kdevops kernel:  zram_init+0xd7/0x1000 [zram]
Oct 14 20:21:35 kdevops kernel:  do_one_initcall+0x41/0x200
Oct 14 20:21:35 kdevops kernel:  ? _raw_spin_unlock_irqrestore+0x25/0x40
Oct 14 20:21:35 kdevops kernel:  ? kmem_cache_alloc_trace+0x2ab/0x420
Oct 14 20:21:35 kdevops kernel:  do_init_module+0x5c/0x270
Oct 14 20:21:35 kdevops kernel:  __do_sys_finit_module+0xae/0x110
Oct 14 20:21:35 kdevops kernel:  do_syscall_64+0x38/0xc0
Oct 14 20:21:35 kdevops kernel:
entry_SYSCALL_64_after_hwframe+0x44/0xae
Oct 14 20:21:35 kdevops kernel: RIP: 0033:0x7fca3aa555e9
Oct 14 20:21:35 kdevops kernel: Code: 00 c3 66 2e 0f 1f 84 00 00 00 00
00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8
4c 8b 4c 24 08 0f 05 <48> 3d >
Oct 14 20:21:35 kdevops kernel: RSP: 002b:00007fff142417b8 EFLAGS:
00000246 ORIG_RAX: 0000000000000139
Oct 14 20:21:35 kdevops kernel: RAX: ffffffffffffffda RBX:
0000558ba9491bd0 RCX: 00007fca3aa555e9
Oct 14 20:21:35 kdevops kernel: RDX: 0000000000000000 RSI:
0000558ba9491f60 RDI: 0000000000000003
Oct 14 20:21:35 kdevops kernel: RBP: 0000000000040000 R08:
0000000000000000 R09: 0000558ba9491db0
Oct 14 20:21:35 kdevops kernel: R10: 0000000000000003 R11:
0000000000000246 R12: 0000558ba9491f60
Oct 14 20:21:35 kdevops kernel: R13: 0000000000000000 R14:
0000558ba9491d00 R15: 0000558ba9491bd0
Oct 14 20:21:35 kdevops kernel:  </TASK>
<etc>
Oct 14 20:21:35 kdevops kernel: kobject_add_internal failed for zram0
with -EEXIST, don't try to register things with the same name in the
same directory.
Oct 14 20:21:35 kdevops kernel: ------------[ cut here ]------------
Oct 14 20:21:35 kdevops kernel: WARNING: CPU: 5 PID: 3388 at
block/genhd.c:537 device_add_disk+0x1b9/0x3a0
Oct 14 20:21:35 kdevops kernel: Modules linked in: zram(E+) zstd(E)
zsmalloc(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E)
crc32_pclmul(E) ghash_clmulni_intel(E) >
Oct 14 20:21:35 kdevops kernel: CPU: 5 PID: 3388 Comm: modprobe Tainted:
G        W   E     5.15.0-rc3-next-20210927+ #89
Oct 14 20:21:35 kdevops kernel: Hardware name: QEMU Standard PC (i440FX
+ PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Oct 14 20:21:35 kdevops kernel: RIP: 0010:device_add_disk+0x1b9/0x3a0
Oct 14 20:21:35 kdevops kernel: Code: 00 03 01 00 00 0f 85 32 ff ff ff
e9 1e ff ff ff 0f 0b 41 bc ea ff ff ff e9 29 ff ff ff 4c 89 ff e8 5c 45
1c 00 e9 ef fe ff ff <0f> 0b >
Oct 14 20:21:35 kdevops kernel: RSP: 0018:ffffaac980607d90 EFLAGS:
00010287
Oct 14 20:21:35 kdevops kernel: RAX: 0000000000000000 RBX:
0000000000000000 RCX: 0000000000023005
Oct 14 20:21:35 kdevops kernel: RDX: 0000000000022e05 RSI:
ffffffffacc4b710 RDI: 0000000000000000
Oct 14 20:21:35 kdevops kernel: RBP: ffff9ed5d788a600 R08:
0000000000000000 R09: ffffaac980607a98
Oct 14 20:21:35 kdevops kernel: R10: ffff9ed5c795ef00 R11:
ffffffffadac3c68 R12: 00000000ffffffef
Oct 14 20:21:35 kdevops kernel: R13: ffff9ed5d5600000 R14:
ffffffffc0a52100 R15: ffff9ed5d5600040
Oct 14 20:21:35 kdevops kernel: FS:  00007fca3a935580(0000)
GS:ffff9ed6f7d40000(0000) knlGS:0000000000000000
Oct 14 20:21:35 kdevops kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Oct 14 20:21:35 kdevops kernel: CR2: 00007fff1423e6d8 CR3:
0000000136752002 CR4: 0000000000370ee0
Oct 14 20:21:35 kdevops kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Oct 14 20:21:35 kdevops kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
Oct 14 20:21:35 kdevops kernel: Call Trace:
Oct 14 20:21:35 kdevops kernel:  <TASK>
Oct 14 20:21:35 kdevops kernel:  zram_add+0x1ad/0x200 [zram]
Oct 14 20:21:35 kdevops kernel:  ? 0xffffffffc0c10000
Oct 14 20:21:35 kdevops kernel:  zram_init+0xd7/0x1000 [zram]
Oct 14 20:21:35 kdevops kernel:  do_one_initcall+0x41/0x200
Oct 14 20:21:35 kdevops kernel:  ? _raw_spin_unlock_irqrestore+0x25/0x40
Oct 14 20:21:35 kdevops kernel:  ? kmem_cache_alloc_trace+0x2ab/0x420
Oct 14 20:21:35 kdevops kernel:  do_init_module+0x5c/0x270
Oct 14 20:21:35 kdevops kernel:  __do_sys_finit_module+0xae/0x110
Oct 14 20:21:35 kdevops kernel:  do_syscall_64+0x38/0xc0
Oct 14 20:21:35 kdevops kernel:
entry_SYSCALL_64_after_hwframe+0x44/0xae
Oct 14 20:21:35 kdevops kernel: RIP: 0033:0x7fca3aa555e9
<etc>
Oct 14 20:21:35 kdevops kernel: ------------[ cut here ]------------
Oct 14 20:21:35 kdevops kernel: WARNING: CPU: 2 PID: 3457 at
block/genhd.c:564 del_gendisk+0x1a2/0x1d0
Oct 14 20:21:35 kdevops kernel: Modules linked in: 842(E)
842_decompress(E) 842_compress(E) zram(E-) zstd(E) zsmalloc(E)
kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E>
Oct 14 20:21:35 kdevops kernel: CPU: 2 PID: 3457 Comm: rmmod Tainted: G
W   E     5.15.0-rc3-next-20210927+ #89
Oct 14 20:21:35 kdevops kernel: Hardware name: QEMU Standard PC (i440FX
+ PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Oct 14 20:21:35 kdevops kernel: RIP: 0010:del_gendisk+0x1a2/0x1d0
Oct 14 20:21:35 kdevops kernel: Code: 48 8d 78 40 e8 8f 87 1d 00 48 8b
7b 40 5b 5d 41 5c 48 83 c7 40 e9 4e 47 1c 00 48 8b 70 40 eb ce f6 43 61
04 0f 85 85 fe ff ff <0f> 0b >
Oct 14 20:21:35 kdevops kernel: RSP: 0018:ffffaac9807cfe30 EFLAGS:
00010246
Oct 14 20:21:35 kdevops kernel: RAX: ffff9ed5d5600380 RBX:
ffff9ed5d788a600 RCX: 0000000000000000
Oct 14 20:21:35 kdevops kernel: RDX: 0000000000000000 RSI:
ffffffffad8efdb6 RDI: ffff9ed5d788a600
Oct 14 20:21:35 kdevops kernel: RBP: ffff9ed5d788b600 R08:
0000000000000000 R09: ffffaac9807cfc88
Oct 14 20:21:35 kdevops kernel: R10: ffffaac9807cfc80 R11:
ffffffffadac3c68 R12: ffff9ed5d5600000
Oct 14 20:21:35 kdevops kernel: R13: 0000000000000000 R14:
ffffffffc0a52360 R15: ffff9ed5c4a87b78
Oct 14 20:21:35 kdevops kernel: FS:  00007f292a2bb580(0000)
GS:ffff9ed6f7c80000(0000) knlGS:0000000000000000
Oct 14 20:21:35 kdevops kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Oct 14 20:21:35 kdevops kernel: CR2: 000056161b453b78 CR3:
000000013213e002 CR4: 0000000000370ee0
Oct 14 20:21:35 kdevops kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Oct 14 20:21:35 kdevops kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
Oct 14 20:21:35 kdevops kernel: Call Trace:
Oct 14 20:21:35 kdevops kernel:  <TASK>
Oct 14 20:21:35 kdevops kernel:  zram_remove+0x96/0xc0 [zram]
Oct 14 20:21:35 kdevops kernel:  ? hot_remove_store+0xe0/0xe0 [zram]
Oct 14 20:21:35 kdevops kernel:  zram_remove_cb+0xd/0x10 [zram]
Oct 14 20:21:35 kdevops kernel:  idr_for_each+0x5b/0xd0
Oct 14 20:21:35 kdevops kernel:  destroy_devices+0x32/0x68 [zram]
Oct 14 20:21:35 kdevops kernel:  __do_sys_delete_module+0x18d/0x2a0
Oct 14 20:21:35 kdevops kernel:  ?
fpregs_assert_state_consistent+0x1e/0x40
Oct 14 20:21:35 kdevops kernel:  ? exit_to_user_mode_prepare+0x3a/0x180
Oct 14 20:21:35 kdevops kernel:  do_syscall_64+0x38/0xc0
Oct 14 20:21:35 kdevops kernel:
entry_SYSCALL_64_after_hwframe+0x44/0xae
Oct 14 20:21:35 kdevops kernel: RIP: 0033:0x7f292a3e14a7
<etc>
Oct 14 20:21:35 kdevops kernel: BUG: unable to handle page fault for
address: ffffffffc0a4e0ae
Oct 14 20:21:35 kdevops kernel: #PF: supervisor instruction fetch in
kernel mode
Oct 14 20:21:35 kdevops kernel: #PF: error_code(0x0010) - not-present
page
Oct 14 20:21:35 kdevops kernel: PGD 3ba0e067 P4D 3ba0e067 PUD 3ba10067
PMD 10526c067 PTE 0
Oct 14 20:21:35 kdevops kernel: Oops: 0010 [#1] PREEMPT SMP NOPTI
Oct 14 20:21:35 kdevops kernel: CPU: 6 PID: 3655 Comm: zram02.sh
Tainted: G        W   E     5.15.0-rc3-next-20210927+ #89
Oct 14 20:21:35 kdevops kernel: Hardware name: QEMU Standard PC (i440FX
+ PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Oct 14 20:21:35 kdevops kernel: RIP: 0010:0xffffffffc0a4e0ae
Oct 14 20:21:35 kdevops kernel: Code: Unable to access opcode bytes at
RIP 0xffffffffc0a4e084.
Oct 14 20:21:35 kdevops kernel: RSP: 0018:ffffaac980687da8 EFLAGS:
00010286
Oct 14 20:21:35 kdevops kernel: RAX: 0000000000000000 RBX:
ffff9ed5c40be400 RCX: 0000000080400035
Oct 14 20:21:35 kdevops kernel: RDX: 0000000080400036 RSI:
fffffa3544561080 RDI: 0000000040000000
Oct 14 20:21:35 kdevops kernel: RBP: 0000000001900000 R08:
ffff9ed5d5842cc0 R09: 0000000080400035
Oct 14 20:21:35 kdevops kernel: R10: ffff9ed5d5842c00 R11:
ffff9ed5f1341350 R12: 0000000001900000
Oct 14 20:21:35 kdevops kernel: R13: ffff9ed5d5666c00 R14:
ffff9ed5c40be420 R15: ffff9ed5dfa8c8c0
Oct 14 20:21:35 kdevops kernel: FS:  00007f978fe2d5c0(0000)
GS:ffff9ed6f7d80000(0000) knlGS:0000000000000000
Oct 14 20:21:35 kdevops kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Oct 14 20:21:35 kdevops kernel: CR2: ffffffffc0a4e084 CR3:
0000000133fd4006 CR4: 0000000000370ee0
Oct 14 20:21:35 kdevops kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Oct 14 20:21:35 kdevops kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
Oct 14 20:21:35 kdevops kernel: Call Trace:
Oct 14 20:21:35 kdevops kernel:  <TASK>
Oct 14 20:21:35 kdevops kernel:  ? kernfs_fop_write_iter+0x177/0x220
Oct 14 20:21:35 kdevops kernel:  ? new_sync_write+0x11c/0x1b0
Oct 14 20:21:35 kdevops kernel:  ? vfs_write+0x20d/0x2a0
Oct 14 20:21:35 kdevops kernel:  ? ksys_write+0x5f/0xe0
Oct 14 20:21:35 kdevops kernel:  ? do_syscall_64+0x38/0xc0
Oct 14 20:21:35 kdevops kernel:  ?
entry_SYSCALL_64_after_hwframe+0x44/0xae
Oct 14 20:21:35 kdevops kernel:  </TASK>
<etc, etc, etc, this goes on and on>

  Luis


* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-14 20:24       ` Luis Chamberlain
@ 2021-10-14 23:52         ` Ming Lei
  2021-10-15  0:22           ` Luis Chamberlain
  0 siblings, 1 reply; 94+ messages in thread
From: Ming Lei @ 2021-10-14 23:52 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel

On Thu, Oct 14, 2021 at 01:24:32PM -0700, Luis Chamberlain wrote:
> On Thu, Oct 14, 2021 at 10:11:46AM +0800, Ming Lei wrote:
> > On Thu, Oct 14, 2021 at 09:55:48AM +0800, Ming Lei wrote:
> > > On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
> > 
> > ...
> > 
> > > 
> > > Hello Luis,
> > > 
> > > Can you test the following patch and see if the issue can be addressed?
> > > 
> > > Please see the idea from the inline comment.
> > > 
> > > Also zram_index_mutex isn't needed in zram disk's store() compared with
> > > your patch, then the deadlock issue you are addressing in this series can
> > > be avoided.
> > > 
> > > 
> > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > index fcaf2750f68f..3c17927d23a7 100644
> > > --- a/drivers/block/zram/zram_drv.c
> > > +++ b/drivers/block/zram/zram_drv.c
> > > @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
> > >  
> > >  	/* Make sure all the pending I/O are finished */
> > >  	fsync_bdev(bdev);
> > > -	zram_reset_device(zram);
> > >  
> > >  	pr_info("Removed device: %s\n", zram->disk->disk_name);
> > >  
> > >  	del_gendisk(zram->disk);
> > > +
> > > +	/*
> > > +	 * reset device after gendisk is removed, so any change from sysfs
> > > +	 * store won't come in, then we can really reset device here
> > > +	 */
> > > +	zram_reset_device(zram);
> > > +
> > >  	blk_cleanup_disk(zram->disk);
> > >  	kfree(zram);
> > >  	return 0;
> > > @@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
> > >  static void destroy_devices(void)
> > >  {
> > >  	class_unregister(&zram_control_class);
> > > +
> > > +	/* hold the global lock so new device can't be added */
> > > +	mutex_lock(&zram_index_mutex);
> > >  	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> > > +	mutex_unlock(&zram_index_mutex);
> > > +
> > 
> > Actually zram_index_mutex isn't needed when calling zram_remove_cb()
> > since the zram-control sysfs interface has been removed, so userspace
> > can't add new device any more, then the issue is supposed to be fixed
> > by the following one line change, please test it:
> > 
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index fcaf2750f68f..96dd641de233 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
> >  
> >  	/* Make sure all the pending I/O are finished */
> >  	fsync_bdev(bdev);
> > -	zram_reset_device(zram);
> >  
> >  	pr_info("Removed device: %s\n", zram->disk->disk_name);
> >  
> >  	del_gendisk(zram->disk);
> > +
> > +	/*
> > +	 * reset device after gendisk is removed, so any change from sysfs
> > +	 * store won't come in, then we can really reset device here
> > +	 */
> > +	zram_reset_device(zram);
> > +
> >  	blk_cleanup_disk(zram->disk);
> >  	kfree(zram);
> >  	return 0;
> 
> Sorry but nope, the cpu multistate issue is still present and we end up
> eventually with page faults. I tried with both patches.

In theory disksize_store() can't come in after del_gendisk() returns,
so zram_reset_device() should clean up everything; that is the issue
you described in the commit log.
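
As a rough illustration of the ordering argument above, here is a
single-threaded userspace model. It is only a sketch: every name in it
(model_dev, model_store() and so on) is hypothetical and none of it is
kernel code. It assumes, as stated above, that no store can be accepted
once the disk is deleted, and it replays a late store at the point where
the race window would be.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; these names do not exist in zram_drv.c. */
struct model_dev {
	bool sysfs_visible;	/* models the disk's sysfs attributes being reachable */
	int instances;		/* models zcomp instances still registered with cpuhp */
};

static void model_init(struct model_dev *d)
{
	d->sysfs_visible = true;
	d->instances = 0;
}

/* Models disksize_store(): only succeeds while the attribute is reachable. */
static int model_store(struct model_dev *d)
{
	if (!d->sysfs_visible)
		return -1;
	d->instances++;
	return 0;
}

/* Models del_gendisk(): assumed to make the attributes unreachable. */
static void model_del(struct model_dev *d)
{
	d->sysfs_visible = false;
}

/* Models zram_reset_device(): drops every instance. */
static void model_reset(struct model_dev *d)
{
	d->instances = 0;
}

/* Old order: reset, (late store), delete. Returns instances left over. */
static int model_remove_reset_first(struct model_dev *d, bool late_store)
{
	model_reset(d);
	if (late_store)
		model_store(d);	/* accepted: attribute is still visible */
	model_del(d);
	return d->instances;
}

/* Proposed order: delete, (late store), reset. Returns instances left over. */
static int model_remove_del_first(struct model_dev *d, bool late_store)
{
	model_del(d);
	if (late_store) {
		int ret = model_store(d);

		assert(ret == -1);	/* rejected under the stated assumption */
	}
	model_reset(d);
	return d->instances;
}
```

With a late store, model_remove_reset_first() leaves one instance behind
while model_remove_del_first() leaves none. The model only restates the
reasoning; it does not explain why the warning was still seen in the
test reported above.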

We need to understand the exact reason why there is still a cpuhp node
left. Can you share with us the exact steps for reproducing the issue?
Otherwise we may have to trace and narrow down the reason.



thanks,
Ming



* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-14 23:52         ` Ming Lei
@ 2021-10-15  0:22           ` Luis Chamberlain
  2021-10-15  8:36             ` Ming Lei
  0 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-15  0:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel

On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> On Thu, Oct 14, 2021 at 01:24:32PM -0700, Luis Chamberlain wrote:
> > On Thu, Oct 14, 2021 at 10:11:46AM +0800, Ming Lei wrote:
> > > On Thu, Oct 14, 2021 at 09:55:48AM +0800, Ming Lei wrote:
> > > > On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
> > > 
> > > ...
> > > 
> > > > 
> > > > Hello Luis,
> > > > 
> > > > Can you test the following patch and see if the issue can be addressed?
> > > > 
> > > > Please see the idea from the inline comment.
> > > > 
> > > > Also zram_index_mutex isn't needed in zram disk's store() compared with
> > > > your patch, then the deadlock issue you are addressing in this series can
> > > > be avoided.
> > > > 
> > > > 
> > > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > > index fcaf2750f68f..3c17927d23a7 100644
> > > > --- a/drivers/block/zram/zram_drv.c
> > > > +++ b/drivers/block/zram/zram_drv.c
> > > > @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
> > > >  
> > > >  	/* Make sure all the pending I/O are finished */
> > > >  	fsync_bdev(bdev);
> > > > -	zram_reset_device(zram);
> > > >  
> > > >  	pr_info("Removed device: %s\n", zram->disk->disk_name);
> > > >  
> > > >  	del_gendisk(zram->disk);
> > > > +
> > > > +	/*
> > > > +	 * reset device after gendisk is removed, so any change from sysfs
> > > > +	 * store won't come in, then we can really reset device here
> > > > +	 */
> > > > +	zram_reset_device(zram);
> > > > +
> > > >  	blk_cleanup_disk(zram->disk);
> > > >  	kfree(zram);
> > > >  	return 0;
> > > > @@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
> > > >  static void destroy_devices(void)
> > > >  {
> > > >  	class_unregister(&zram_control_class);
> > > > +
> > > > +	/* hold the global lock so new device can't be added */
> > > > +	mutex_lock(&zram_index_mutex);
> > > >  	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> > > > +	mutex_unlock(&zram_index_mutex);
> > > > +
> > > 
> > > Actually zram_index_mutex isn't needed when calling zram_remove_cb()
> > > since the zram-control sysfs interface has been removed, so userspace
> > > can't add new device any more, then the issue is supposed to be fixed
> > > by the following one line change, please test it:
> > > 
> > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > index fcaf2750f68f..96dd641de233 100644
> > > --- a/drivers/block/zram/zram_drv.c
> > > +++ b/drivers/block/zram/zram_drv.c
> > > @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
> > >  
> > >  	/* Make sure all the pending I/O are finished */
> > >  	fsync_bdev(bdev);
> > > -	zram_reset_device(zram);
> > >  
> > >  	pr_info("Removed device: %s\n", zram->disk->disk_name);
> > >  
> > >  	del_gendisk(zram->disk);
> > > +
> > > +	/*
> > > +	 * reset device after gendisk is removed, so any change from sysfs
> > > +	 * store won't come in, then we can really reset device here
> > > +	 */
> > > +	zram_reset_device(zram);
> > > +
> > >  	blk_cleanup_disk(zram->disk);
> > >  	kfree(zram);
> > >  	return 0;
> > 
> > Sorry but nope, the cpu multistate issue is still present and we end up
> > eventually with page faults. I tried with both patches.
> 
> In theory disksize_store() can't come in after del_gendisk() returns,
> then zram_reset_device() should cleanup everything, that is the issue
> you described in commit log.
> 
> We need to understand the exact reason why there is still cpuhp node
> left, can you share us the exact steps for reproducing the issue?
> Otherwise we may have to trace and narrow down the reason.

See my commit log for my own fix for this issue.

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-15  0:22           ` Luis Chamberlain
@ 2021-10-15  8:36             ` Ming Lei
  2021-10-15  8:52               ` Greg KH
  2021-10-15 17:31               ` Luis Chamberlain
  0 siblings, 2 replies; 94+ messages in thread
From: Ming Lei @ 2021-10-15  8:36 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel

On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
...
> > 
> > We need to understand the exact reason why there is still cpuhp node
> > left, can you share us the exact steps for reproducing the issue?
> > Otherwise we may have to trace and narrow down the reason.
> 
> See my commit log for my own fix for this issue.

OK, thanks!

I can reproduce the issue. The reason is that reset_store() makes
zram_remove() fail when unloading the module, which causes the warning.

The top 3 patches in the following tree can fix the issue:

https://github.com/ming1/linux/commits/my_v5.15-blk-dev


Thanks,
Ming


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-15  8:36             ` Ming Lei
@ 2021-10-15  8:52               ` Greg KH
  2021-10-15 17:31               ` Luis Chamberlain
  1 sibling, 0 replies; 94+ messages in thread
From: Greg KH @ 2021-10-15  8:52 UTC (permalink / raw)
  To: Ming Lei
  Cc: Luis Chamberlain, tj, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel

On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> ...
> > > 
> > > We need to understand the exact reason why there is still cpuhp node
> > > left, can you share us the exact steps for reproducing the issue?
> > > Otherwise we may have to trace and narrow down the reason.
> > 
> > See my commit log for my own fix for this issue.
> 
> OK, thanks!
> 
> I can reproduce the issue, and the reason is that reset_store fails
> zram_remove() when unloading module, then the warning is caused.
> 
> The top 3 patches in the following tree can fix the issue:
> 
> https://github.com/ming1/linux/commits/my_v5.15-blk-dev

At a quick glance, those look sane to me, nice work.

greg k-h

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-15  8:36             ` Ming Lei
  2021-10-15  8:52               ` Greg KH
@ 2021-10-15 17:31               ` Luis Chamberlain
  2021-10-16 11:28                 ` Ming Lei
  1 sibling, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-15 17:31 UTC (permalink / raw)
  To: Ming Lei
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel

On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> ...
> > > 
> > > We need to understand the exact reason why there is still cpuhp node
> > > left, can you share us the exact steps for reproducing the issue?
> > > Otherwise we may have to trace and narrow down the reason.
> > 
> > See my commit log for my own fix for this issue.
> 
> OK, thanks!
> 
> I can reproduce the issue, and the reason is that reset_store fails
> zram_remove() when unloading module, then the warning is caused.
> 
> The top 3 patches in the following tree can fix the issue:
> 
> https://github.com/ming1/linux/commits/my_v5.15-blk-dev

Thanks for trying an alternative fix! The crash stops, yes; however this
also ends up leaving the driver in an unrecoverable state after a few
tries. I.e., you CTRL-C the scripts and try again over and over, and
the driver ends up in a situation where it just says:

zram: Can't change algorithm for initialized device

And the zram module can't be removed at that point.

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-15 17:31               ` Luis Chamberlain
@ 2021-10-16 11:28                 ` Ming Lei
  2021-10-18 19:32                   ` Luis Chamberlain
  0 siblings, 1 reply; 94+ messages in thread
From: Ming Lei @ 2021-10-16 11:28 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel

On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
> On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> > On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> > ...
> > > > 
> > > > We need to understand the exact reason why there is still cpuhp node
> > > > left, can you share us the exact steps for reproducing the issue?
> > > > Otherwise we may have to trace and narrow down the reason.
> > > 
> > > See my commit log for my own fix for this issue.
> > 
> > OK, thanks!
> > 
> > I can reproduce the issue, and the reason is that reset_store fails
> > zram_remove() when unloading module, then the warning is caused.
> > 
> > The top 3 patches in the following tree can fix the issue:
> > 
> > https://github.com/ming1/linux/commits/my_v5.15-blk-dev
> 
> Thanks for trying an alternative fix! A crash stops yes, however this

I doubt it is an alternative, since your patchset doesn't mention the
exact reason for 'Error: Removing state 63 which has instances left.'.
That is simply caused by failing to remove zram because ->claim is set
while the module is being unloaded.

Yeah, you mentioned the race between disksize_store() and zram_remove(),
however I don't think it is reproduced easily in the test because the race
window is pretty small. It can also be fixed easily in my 3rd patch
without any complicated tricks.

I haven't dug into the details of your approach of grabbing a module
reference count during kernfs attribute show/store, which is done in your
patch 9, but IMO this isn't necessary:

1) any driver module has to clean up, in its module_exit(), anything
which may refer to symbols or data defined in this driver

2) device_del() is often done in module_exit(); once device_del()
returns, no new show/store on the device's kobject attributes
is possible.

3) it is _not_ a must or a pattern for fixing bugs to hold a lock before
calling device_del() while the same lock is also required in the device's
attribute show()/store(); that causes an AA deadlock easily. Your approach
just avoids the issue by not releasing the module until all show/store
calls are done.

Also, the usual model for the module refcount is that anyone who will
use the module grabs one extra ref, and releases it once the use is
done. For example, for a block device, the driver's module refcount is
grabbed when the disk/part is opened, and released when the disk/part
is closed.
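
For illustration only, the refcount model described above can be sketched
as a small userspace model. This is not kernel code: struct module_model
and the model_*() helpers are hypothetical names that only mimic the
roles of try_module_get() and module_put() on open and close.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a module: only its use count and liveness. */
struct module_model {
	int refcnt;      /* number of current users, e.g. open disks */
	bool unloading;  /* set once an unload has been accepted */
};

/* Plays the role of try_module_get(): refuse new users during unload. */
static bool model_get(struct module_model *m)
{
	assert(m);
	if (m->unloading)
		return false;
	m->refcnt++;
	return true;
}

/* Plays the role of module_put(): drop a reference taken by model_get(). */
static void model_put(struct module_model *m)
{
	assert(m && m->refcnt > 0);
	m->refcnt--;
}

/* Models an unload request: it only succeeds when no user is left. */
static bool model_unload(struct module_model *m)
{
	assert(m);
	if (m->refcnt > 0)
		return false;
	m->unloading = true;
	return true;
}
```

In this model an open() maps to model_get() and a close() to
model_put(), so an unload request is refused for as long as the disk
stays open.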


> also ends up leaving the driver in an unrecoverable state after a few
> tries. Ie, you CTRL-C the scripts and try again over and over again and
> the driver ends up in a situation where it just says:
> 
> zram: Can't change algorithm for initialized device

It means the algorithm can't be changed for an initialized device
at that exact time. That is understandable because two zram02.sh are
running concurrently.

Your test script just runs two ./zram02.sh tasks concurrently forever,
so what is your expected result for the test? Of course, it never
finishes.

I can't reproduce the 'unrecoverable' state in my test, can you share the
stack trace log after that happens?

Is zram02.sh still running, or sleeping somewhere, in the 'unrecoverable'
state? If it is still running, it means the current sleep point isn't
interruptible by 'CTRL-C'. In my test, after several 'CTRL-C', both
zram02.sh instances started from two terminals can be terminated. If
it sleeps somewhere forever, that can be a problem.

> 
> And the zram module can't be removed at that point.

It is just that systemd has opened the zram device, or the disk is opened
as a swap disk; once systemd closes it or after you run swapoff, the
module can be unloaded.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-16 11:28                 ` Ming Lei
@ 2021-10-18 19:32                   ` Luis Chamberlain
  2021-10-19  2:34                     ` Ming Lei
  0 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-18 19:32 UTC (permalink / raw)
  To: Ming Lei, Benjamin Herrenschmidt, Paul Mackerras
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel

On Sat, Oct 16, 2021 at 07:28:39PM +0800, Ming Lei wrote:
> On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
> > On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> > > On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > > > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> > > ...
> > > > > 
> > > > > We need to understand the exact reason why there is still cpuhp node
> > > > > left, can you share us the exact steps for reproducing the issue?
> > > > > Otherwise we may have to trace and narrow down the reason.
> > > > 
> > > > See my commit log for my own fix for this issue.
> > > 
> > > OK, thanks!
> > > 
> > > I can reproduce the issue, and the reason is that reset_store fails
> > > zram_remove() when unloading module, then the warning is caused.
> > > 
> > > The top 3 patches in the following tree can fix the issue:
> > > 
> > > https://github.com/ming1/linux/commits/my_v5.15-blk-dev
> > 
> > Thanks for trying an alternative fix! A crash stops yes, however this
> 
> I doubt it is alternative since your patchset doesn't mention the exact
> reason of 'Error: Removing state 63 which has instances left.', that is
> simply caused by failing to remove zram because ->claim is set during
> unloading module.

Well I disagree because it does explain how the race can happen, and it
also explains how since the sysfs interface is exposed until module
removal completes, it leaves exposed knobs to allow re-initializing of a
struct zcomp for a zram device before the exit.

> Yeah, you mentioned the race between disksize_store() vs. zram_remove(),
> however I don't think it is reproduced easily in the test because the race
> window is pretty small, also it can be fixed easily in my 3rd path
> without any complicated tricks.

Reproducing for me is... extremely easy.

> Not dig into details of your patchset via grabbing module reference
> count during show/store attribute of kernfs which is done in your patch
> 9, but IMO this way isn't necessary:

That's to address the deadlock only.

> 1) any driver module has to cleanup anything which may refer to symbols
> or data defined in module_exit of this driver

Yes, and the cpu multistate hotplug documentation warns (although
such documentation is kind of hidden) that driver authors need to be
careful with module removal too; refer to the warning at the end of
__cpuhp_remove_state_cpuslocked() about module removal.

> 2) device_del() is often done in module_exit(), once device_del()
> returns, no any new show/store on the device's kobject attribute
> is possible.

Right, and if a sysfs knob is exposed before device_del() completes
and is allowed to do things, the driver should take care to prevent
races for CPU multistate support. The small state machine I added ensures
we don't run over any expectations from cpu hotplug multistate support.

I've *never* suggested there cannot be alternatives to my solution with
the small state machine, but for you to say it is incorrect is simply
not right either.

> 3) it is _not_ a must or pattern for fixing bugs to hold one lock before
> calling device_del(), meantime the lock is required in the device's
> attribute show()/store(), which causes AA deadlock easily. Your approach
> just avoids the issue by not releasing module until all show/store are
> done.

Right, there are two approaches here:

a) Your approach is to accept the deadlock as a requirement and so
you would prefer to implement an alternative to using a shared lock
on module exit and sysfs op.

b) While I address such a deadlock head on, as I think this sort of locking
should be allowed for two reasons:
   b1) as we never documented such requirement otherwise.
   b2) There is a possibility that other drivers already exist too
       which *do* use a shared lock on module removal and sysfs ops
       (and I just confirmed this to be true)

By you only addressing the deadlock as a requirement on approach a) you are
forgetting that there *may* already be present drivers which *do* implement
such patterns in the kernel. I worked on addressing the deadlock because
I was informed livepatching *did* have that issue as well and so very
likely a generic solution to the deadlock could be beneficial to other
random drivers.

So I *really* don't think it is wise for us to simply accept this newly
found deadlock as a *new* requirement, especially if we can fix it easily.

A cursory review using Coccinelle for potential issues with a mutex lock
directly used on module exit (so this doesn't cover drivers like zram
which use a routine and then grab the lock through indirection) and in a
sysfs op shows these drivers are also affected by this deadlock:

  * arch/powerpc/sysdev/fsl_mpic_timer_wakeup.c
  * lib/test_firmware.c

Note that this cursory review does not cover spin_lock uses, and other
forms of locks. Consider the case where a routine is used and then that
routine grabs a lock, so one level of indirection. There are many levels
of indirection possible here. And likewise there are different types
of locks.

> > also ends up leaving the driver in an unrecoverable state after a few
> > tries. Ie, you CTRL-C the scripts and try again over and over again and
> > the driver ends up in a situation where it just says:
> > 
> > zram: Can't change algorithm for initialized device
> 
> It means the algorithm can't be changed for one initialized device
> at the exact time. That is understandable because two zram02.sh are
> running concurrently.

Indeed but with your patch it can get stuck and cannot be taken out of this
state.

> Your test script just runs two ./zram02.sh tasks concurrently forever,
> so what is your expected result for the test? Of course, it can't be
> over.
>
> I can't reproduce the 'unrecoverable' state in my test, can you share the
> stack trace log after that happens?

Try a bit harder, cancel the scripts after running for a while randomly
(CTRL C a few times until the script finishes) and have them race again.
Do this a few times.

> > And the zram module can't be removed at that point.
> 
> It is just that systemd opens the zram or the disk is opened as swap
> disk, and once systemd closes it or after you run swapoff, it can be
> unloaded.

With my patch this issue does not happen.

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-18 19:32                   ` Luis Chamberlain
@ 2021-10-19  2:34                     ` Ming Lei
  2021-10-19  6:23                       ` Miroslav Benes
                                         ` (2 more replies)
  0 siblings, 3 replies; 94+ messages in thread
From: Ming Lei @ 2021-10-19  2:34 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Benjamin Herrenschmidt, Paul Mackerras, tj, gregkh, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel, ming.lei

On Mon, Oct 18, 2021 at 12:32:11PM -0700, Luis Chamberlain wrote:
> On Sat, Oct 16, 2021 at 07:28:39PM +0800, Ming Lei wrote:
> > On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
> > > On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> > > > On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > > > > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> > > > ...
> > > > > > 
> > > > > > We need to understand the exact reason why there is still cpuhp node
> > > > > > left, can you share us the exact steps for reproducing the issue?
> > > > > > Otherwise we may have to trace and narrow down the reason.
> > > > > 
> > > > > See my commit log for my own fix for this issue.
> > > > 
> > > > OK, thanks!
> > > > 
> > > > I can reproduce the issue, and the reason is that reset_store fails
> > > > zram_remove() when unloading module, then the warning is caused.
> > > > 
> > > > The top 3 patches in the following tree can fix the issue:
> > > > 
> > > > https://github.com/ming1/linux/commits/my_v5.15-blk-dev
> > > 
> > > Thanks for trying an alternative fix! A crash stops yes, however this
> > 
> > I doubt it is alternative since your patchset doesn't mention the exact
> > reason of 'Error: Removing state 63 which has instances left.', that is
> > simply caused by failing to remove zram because ->claim is set during
> > unloading module.
> 
> Well I disagree because it does explain how the race can happen, and it
> also explains how since the sysfs interface is exposed until module
> removal completes, it leaves exposed knobs to allow re-initializing of a
> struct zcomp for a zram device before the exit.
> 
> > Yeah, you mentioned the race between disksize_store() vs. zram_remove(),
> > however I don't think it is reproduced easily in the test because the race
> > window is pretty small, also it can be fixed easily in my 3rd path
> > without any complicated tricks.
> 
> Reproducing for me is... extremely easy.

In my observation, a failing zram_remove() is extremely easy to trigger.
It is caused by reset_store() setting ->claim to true, so zram_remove()
fails and zram_reset_device() is bypassed, and then the failure
'Error: Removing state 63 which has instances left.' follows.

Are we on the same page?

> 
> > Not dig into details of your patchset via grabbing module reference
> > count during show/store attribute of kernfs which is done in your patch
> > 9, but IMO this way isn't necessary:
> 
> That's to address the deadlock only.
> 
> > 1) any driver module has to cleanup anything which may refer to symbols
> > or data defined in module_exit of this driver
> 
> Yes, and as the cpu multistate hotplug documentation warns (although
> such documentation is kind of hidden) that driver authors need to be
> careful with module removal too, refer to the warning at the end of
> __cpuhp_remove_state_cpuslocked() about module removal.

It is zram's bug. zram has to clean up everything in module_exit().
Unfortunately zram_remove() can fail when called from module_exit()
because ->claim is set to true by reset_store(); then
zram_reset_device() (->zcomp_destroy) isn't called. This failure should
not happen when unloading the module, should it?
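
To make the failure mode concrete, here is a minimal userspace model of
the claim check. It is only a sketch: struct zram_model, remove_old()
and remove_new() are hypothetical names, and comp_alive merely stands in
for the zcomp (and so the cpuhp instance) that zram_reset_device() would
tear down. remove_new() follows the revised check in the patch at the
end of this mail.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few pieces of state involved. */
struct zram_model {
	int openers;      /* models bdev->bd_openers */
	bool claim;       /* models zram->claim, set by reset_store() */
	bool comp_alive;  /* models the zcomp / cpuhp instance */
};

/* Old check: a claim held by reset_store() makes removal fail. */
static int remove_old(struct zram_model *z)
{
	assert(z);
	if (z->openers || z->claim)
		return -EBUSY;       /* reset is skipped, comp_alive stays set */
	z->claim = true;
	z->comp_alive = false;   /* zram_reset_device() -> zcomp_destroy() */
	return 0;
}

/* Revised check: only openers block removal, the final reset always runs. */
static int remove_new(struct zram_model *z)
{
	assert(z);
	if (z->openers)
		return -EBUSY;
	z->claim = true;
	z->comp_alive = false;   /* last zram_reset_device() after del_gendisk() */
	return 0;
}
```

With claim already set, remove_old() returns -EBUSY and leaves
comp_alive set, which corresponds to the leftover instance behind the
'Removing state 63' error, while remove_new() still completes the reset.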

> 
> > 2) device_del() is often done in module_exit(), once device_del()
> > returns, no any new show/store on the device's kobject attribute
> > is possible.
> 
> Right and if a syfs knob is exposed before device_del() completely
> and is allowed to do things, the driver should take care to prevent
> races for CPU multistate support. The small state machine I added ensures

What is the race for CPU multistate support? If you mean 'Error: Removing
state 63 which has instances left.', it is zram's bug since zram has to
clean up everything in module_exit().

> we don't run over any expectations from cpu hotplug multistate support.
> 
> I've *never* suggested there cannot be alternatives to my solution with
> the small state machine, but for you to say it is incorrect is simply
> not right either.
> 
> > 3) it is _not_ a must or pattern for fixing bugs to hold one lock before
> > calling device_del(), meantime the lock is required in the device's
> > attribute show()/store(), which causes AA deadlock easily. Your approach
> > just avoids the issue by not releasing module until all show/store are
> > done.
> 
> Right, there are two approaches here:
> 
> a) Your approach is to accept the deadlock as a requirement and so
> you would prefer to implement an alternative to using a shared lock
> on module exit and sysfs op.

Wrt. in-tree zram, there is no deadlock in Linus' tree, nor after
applying my 3 patches. If you think there is, please share with us the
code or the lockdep warning.

> 
> b) While I address such a deadlock head on as I think this sort of locking
> be allowed for two reasons:
>    b1) as we never documented such requirement otherwise.
>    b2) There is a possibility that other drivers already exist too
>        which *do* use a shared lock on module removal and sysfs ops
>        (and I just confirmed this to be true)

The 'deadlock' is actually caused by your out-of-tree patch of 'zram: fix
crashes with cpu hotplug multistate' which adds mutex_lock(zram_index_mutex)
in destroy_devices().

We can fix this issue easily without needing the global lock, please see the
attached (pre-V2) patch.

> 
> By you only addressing the deadlock as a requirement on approach a) you are
> forgetting that there *may* already be present drivers which *do* implement
> such patterns in the kernel. I worked on addressing the deadlock because
> I was informed livepatching *did* have that issue as well and so very
> likely a generic solution to the deadlock could be beneficial to other
> random drivers.

In-tree zram doesn't have such a deadlock. If livepatching has such an AA
deadlock, just fix it; it seems it has already been fixed by 3ec24776bfd0.

> 
> So I *really* don't think it is wise for us to simply accept this new
> found deadlock as a *new* requirement, specially if we can fix it easily.
> 
> A cursory review using Coccinelle potential issues with mutex lock
> directly used on module exit (so this doesn't cover drivers like zram
> which uses a routine and then grabs the lock through indirection) and a
> sysfs op shows these drivers are also affected by this deadlock:
> 
>   * arch/powerpc/sysdev/fsl_mpic_timer_wakeup.c

In fsl_wakeup_sys_exit(), device_remove_file() is called before
acquiring &sysfs_lock, so there shouldn't be such AA deadlock.

>   * lib/test_firmware.c

Yeah, there is the AA deadlock risk, but it should be fixed by moving
misc_deregister() out of &test_fw_mutex.
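
The ordering argument can be sketched as a tiny deterministic model. It
is a simplification under two stated assumptions: deregistering waits
for in-flight attribute handlers, and a handler needs the same mutex as
the exit path. All names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct exit_model {
	bool lock_held_by_exit;  /* exit path currently owns the shared mutex */
	bool store_in_flight;    /* a store() is running and needs that mutex */
};

/* A store() can finish only if the exit path does not hold the mutex. */
static bool store_can_finish(const struct exit_model *m)
{
	assert(m);
	return !m->lock_held_by_exit;
}

/* Deregistering returns only once every in-flight handler can finish. */
static bool deregister_returns(const struct exit_model *m)
{
	assert(m);
	return !m->store_in_flight || store_can_finish(m);
}

/* Risky ordering: take the mutex, then deregister while holding it. */
static bool exit_locked_deregister(bool store_in_flight)
{
	struct exit_model m = { true, store_in_flight };
	return deregister_returns(&m);
}

/* Suggested ordering: deregister first, take the mutex afterwards. */
static bool exit_deregister_first(bool store_in_flight)
{
	struct exit_model m = { false, store_in_flight };
	return deregister_returns(&m);
}
```

In this model exit_locked_deregister(true) never returns, which is the
deadlock, whereas exit_deregister_first() returns whether or not a
store() is in flight.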

> 
> Note that this cursory review does not cover spin_lock uses, and other
> forms locks. Consider the case where a routine is used and then that
> routine grabs a lock, so one level indirection. There are many levels
> of indirections possible here. And likewise there are different types
> of locks.
> 
> > > also ends up leaving the driver in an unrecoverable state after a few
> > > tries. Ie, you CTRL-C the scripts and try again over and over again and
> > > the driver ends up in a situation where it just says:
> > > 
> > > zram: Can't change algorithm for initialized device
> > 
> > It means the algorithm can't be changed for one initialized device
> > at the exact time. That is understandable because two zram02.sh are
> > running concurrently.
> 
> Indeed but with your patch it can get stuck and cannot be taken out of this
> state.

OK, I can keep the current behavior: fail open() while removing or
resetting, and meanwhile not hold open_mutex when syncing the bdev and
resetting the device, see the attached patch.

> 
> > Your test script just runs two ./zram02.sh tasks concurrently forever,
> > so what is your expected result for the test? Of course, it can't be
> > over.
> >
> > I can't reproduce the 'unrecoverable' state in my test, can you share the
> > stack trace log after that happens?
> 
> Try a bit harder, cancel the scripts after running for a while randomly
> (CTRL C a few times until the script finishes) and have them race again.
> Do this a few times.
> 
> > > And the zram module can't be removed at that point.
> > 
> > It is just that systemd opens the zram or the disk is opened as swap
> > disk, and once systemd closes it or after you run swapoff, it can be
> > unloaded.
> 
> With my patch this issues does not happen.

It is because patch 2 holds ->open_mutex while syncing the bdev and
resetting zram, so several 'CTRL-C's are needed to terminate the test
script, and then zram02.sh's cleanup handler can be interrupted too. We
can keep the current behavior easily.

Please try the following patch against the upstream (linus or next) tree.
It basically folds the revised patches 2 and 3 of V1 and covers two
issues: not failing zram_remove() in module_exit(), and the race between
zram_remove() and disksize_store(). See if everything is fine for you:


diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index a68297fb51a2..320822a80b64 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1967,25 +1967,45 @@ static int zram_add(void)
 static int zram_remove(struct zram *zram)
 {
 	struct block_device *bdev = zram->disk->part0;
+	bool claimed;
 
 	mutex_lock(&bdev->bd_disk->open_mutex);
-	if (bdev->bd_openers || zram->claim) {
+	if (bdev->bd_openers) {
 		mutex_unlock(&bdev->bd_disk->open_mutex);
 		return -EBUSY;
 	}
 
-	zram->claim = true;
+	claimed = zram->claim;
+	if (!claimed)
+		zram->claim = true;
 	mutex_unlock(&bdev->bd_disk->open_mutex);
 
 	zram_debugfs_unregister(zram);
 
-	/* Make sure all the pending I/O are finished */
-	fsync_bdev(bdev);
-	zram_reset_device(zram);
+	if (claimed) {
+		/*
+		 * If we were claimed by reset_store(), del_gendisk() will
+		 * wait until sync & reset is completed, so do nothing here.
+		 */
+		;
+	} else {
+		/* Make sure all the pending I/O are finished */
+		sync_blockdev(bdev);
+		zram_reset_device(zram);
+	}
 
 	pr_info("Removed device: %s\n", zram->disk->disk_name);
 
 	del_gendisk(zram->disk);
+
+	WARN_ON_ONCE(claimed && zram->claim);
+
+	/*
+	 * disksize store may come after the above zram_reset_device
+	 * returns, so run the last reset to avoid the race
+	 */
+	zram_reset_device(zram);
+
 	blk_cleanup_disk(zram->disk);
 	kfree(zram);
 	return 0;


Thanks,
Ming


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-19  2:34                     ` Ming Lei
@ 2021-10-19  6:23                       ` Miroslav Benes
  2021-10-19  9:23                         ` Ming Lei
  2021-10-19 15:28                       ` Luis Chamberlain
  2021-10-19 15:50                       ` Luis Chamberlain
  2 siblings, 1 reply; 94+ messages in thread
From: Miroslav Benes @ 2021-10-19  6:23 UTC (permalink / raw)
  To: Ming Lei
  Cc: Luis Chamberlain, Benjamin Herrenschmidt, Paul Mackerras, tj,
	gregkh, akpm, minchan, jeyu, shuah, bvanassche, dan.j.williams,
	joe, tglx, keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel

> > By you only addressing the deadlock as a requirement on approach a) you are
> > forgetting that there *may* already be present drivers which *do* implement
> > such patterns in the kernel. I worked on addressing the deadlock because
> > I was informed livepatching *did* have that issue as well and so very
> > likely a generic solution to the deadlock could be beneficial to other
> > random drivers.
> 
> In-tree zram doesn't have such a deadlock. If livepatching has such an AA
> deadlock, just fix it; it seems to have been fixed by 3ec24776bfd0.

I would not call it a fix. It is a kind of ugly workaround because the 
generic infrastructure lacked (lacks) the proper support in my opinion. 
Luis is trying to fix that.

Just my two cents.

Miroslav

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-19  6:23                       ` Miroslav Benes
@ 2021-10-19  9:23                         ` Ming Lei
  2021-10-20  6:43                           ` Miroslav Benes
  0 siblings, 1 reply; 94+ messages in thread
From: Ming Lei @ 2021-10-19  9:23 UTC (permalink / raw)
  To: Miroslav Benes
  Cc: Luis Chamberlain, Benjamin Herrenschmidt, Paul Mackerras, tj,
	gregkh, akpm, minchan, jeyu, shuah, bvanassche, dan.j.williams,
	joe, tglx, keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel, ming.lei

On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > By you only addressing the deadlock as a requirement on approach a) you are
> > > forgetting that there *may* already be present drivers which *do* implement
> > > such patterns in the kernel. I worked on addressing the deadlock because
> > > I was informed livepatching *did* have that issue as well and so very
> > > likely a generic solution to the deadlock could be beneficial to other
> > > random drivers.
> > 
> > In-tree zram doesn't have such a deadlock. If livepatching has such an AA
> > deadlock, just fix it; it seems to have been fixed by 3ec24776bfd0.
> 
> I would not call it a fix. It is a kind of ugly workaround because the 
> generic infrastructure lacked (lacks) the proper support in my opinion. 
> Luis is trying to fix that.

What would proper support in the generic infrastructure look like? I am not
familiar with livepatching's model (especially around module unload). Do you
mean livepatching has to do the following with sysfs:

1) during module exit:
	
	mutex_lock(lp_lock);
	kobject_put(lp_kobj);
	mutex_unlock(lp_lock);
	
2) show()/store() method of attributes of lp_kobj
	
	mutex_lock(lp_lock);
	...
	mutex_unlock(lp_lock);

IMO, the above usage simply causes an AA deadlock. Even Luis's patch
'zram: fix crashes with cpu hotplug multistate' adds a new instance of the
same AA deadlock (hot_remove_store() vs. disksize_store() or reset_store()),
because hot_remove_store() isn't called from module_exit().

Luis tries to delay unloading the module until all show()/store() calls are
done. But that can be achieved simply by doing the following during
module_exit():

	kobject_del(lp_kobj); //all pending store()/show() from lp_kobj are done,
						  //no new store()/show() can come after
						  //kobject_del() returns	
	mutex_lock(lp_lock);
	kobject_put(lp_kobj);
	mutex_unlock(lp_lock);

Or can you explain your requirements on kobject/module unload in a bit more
detail?
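
A minimal userspace sketch of the two orderings above, not kernel code: a
plain flag stands in for lp_lock, and a counter stands in for the
show()/store() calls that removal has to wait for. The struct and every
function name here are hypothetical, and "blocked" is modelled as a failed
try rather than an actual sleep.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; none of these names exist in the kernel. */
struct model {
	bool lock_held;   /* stands in for lp_lock */
	int  active_ops;  /* show()/store() calls currently running */
	bool removed;     /* attribute no longer reachable from sysfs */
};

static bool try_lock(struct model *m)
{
	if (m->lock_held)
		return false;
	m->lock_held = true;
	return true;
}

static void unlock(struct model *m)
{
	m->lock_held = false;
}

/* A store() call enters the attribute; refused once it has been removed. */
static bool op_enter(struct model *m)
{
	if (m->removed)
		return false;
	m->active_ops++;
	return true;
}

static void op_leave(struct model *m)
{
	assert(m->active_ops > 0);
	m->active_ops--;
}

/* Removal can only finish once no call is still inside the attribute. */
static bool try_remove(struct model *m)
{
	if (m->active_ops > 0)
		return false;
	m->removed = true;
	return true;
}

/* Exit path takes the lock first, then removes. Returns true if stuck. */
bool lock_then_remove_gets_stuck(void)
{
	struct model m = {0};

	op_enter(&m);                         /* store() is already running */
	try_lock(&m);                         /* exit path: mutex_lock(lp_lock) */
	bool exit_blocked = !try_remove(&m);  /* exit waits for store() */
	bool store_blocked = !try_lock(&m);   /* store() waits for lp_lock */
	return exit_blocked && store_blocked;
}

/* Exit path removes first, then takes the lock. Returns true if stuck. */
bool remove_then_lock_gets_stuck(void)
{
	struct model m = {0};

	op_enter(&m);                         /* store() is already running */
	bool exit_waiting = !try_remove(&m);  /* exit waits, holding nothing */
	bool store_ok = try_lock(&m);         /* store() can take lp_lock */
	unlock(&m);
	op_leave(&m);                         /* store() finishes */
	bool removed = try_remove(&m);        /* removal now completes */
	bool late_store = op_enter(&m);       /* new store() is refused */
	bool exit_locked = try_lock(&m);      /* exit path takes lp_lock */
	unlock(&m);
	return !(exit_waiting && store_ok && removed &&
		 !late_store && exit_locked);
}
```

In this model the first ordering leaves both sides waiting on each other,
while the second lets the running store() finish before the exit path ever
touches the lock.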


Thanks,
Ming


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-19  2:34                     ` Ming Lei
  2021-10-19  6:23                       ` Miroslav Benes
@ 2021-10-19 15:28                       ` Luis Chamberlain
  2021-10-19 16:29                         ` Ming Lei
  2021-10-19 15:50                       ` Luis Chamberlain
  2 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-19 15:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: Benjamin Herrenschmidt, Paul Mackerras, tj, gregkh, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel

On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
> Please try the following patch against upstream(linus or next) tree(basically
> fold revised 2 and 3 of V1, and cover two issues: not fail zram_remove in
> module_exit(), race between zram_remove() and disksize_store()), and see if
> everything is fine for you:

Page fault ...

[   18.284256] zram: Removed device: zram0
[   18.312974] BUG: unable to handle page fault for address: ffffad86de903008
[   18.313707] #PF: supervisor read access in kernel mode
[   18.314248] #PF: error_code(0x0000) - not-present page
[   18.314797] PGD 100000067 P4D 100000067 PUD 10031e067 PMD 136a28067 PTE 0
[   18.315538] Oops: 0000 [#1] PREEMPT SMP NOPTI
[   18.316012] CPU: 3 PID: 1198 Comm: rmmod Tainted: G            E 5.15.0-rc3-next-20210927+ #89
[   18.316979] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[   18.317876] RIP: 0010:zram_free_page+0x1b/0xf0 [zram]
[   18.318430] Code: 1f 44 00 00 48 89 c8 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 41 54 49 89 f4 55 89 f5 53 48 8b 17 48 c1 e5 04 48 89 fb 48 01 ea <48> 8b 42 08 a9 00 00 00 20 74 14 48 25 ff ff ff df 48 89 42 08 48
[   18.320412] RSP: 0018:ffffad86f8013df8 EFLAGS: 00010286
[   18.320978] RAX: 0000000000000001 RBX: ffff9b7b435c7800 RCX: 0000000000000200
[   18.321758] RDX: ffffad86de903000 RSI: 0000000000000000 RDI: ffff9b7b435c7800
[   18.322524] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000000
[   18.323299] R10: 0000000000000200 R11: 0000000000000000 R12: 0000000000000000
[   18.324030] R13: ffff9b7b55191800 R14: ffff9b7b435c7820 R15: ffff9b7b4677f960
[   18.324784] FS:  00007fc8e4c90580(0000) GS:ffff9b7c77cc0000(0000) knlGS:0000000000000000
[   18.325651] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   18.326272] CR2: ffffad86de903008 CR3: 000000014f1de003 CR4: 0000000000370ee0
[   18.327047] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   18.327818] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   18.328586] Call Trace:
[   18.328852]  <TASK>
[   18.329284]  zram_reset_device+0xd8/0x140 [zram]
[   18.329983]  zram_remove.cold+0xa/0x20 [zram]
[   18.330644]  ? hot_remove_store+0xe0/0xe0 [zram]
[   18.331367]  zram_remove_cb+0xd/0x10 [zram]
[   18.332010]  idr_for_each+0x5b/0xd0
[   18.332578]  destroy_devices+0x26/0x50 [zram]
[   18.333238]  __do_sys_delete_module+0x18d/0x2a0
[   18.333913]  ? fpregs_assert_state_consistent+0x1e/0x40
[   18.334665]  ? exit_to_user_mode_prepare+0x3a/0x180
[   18.335395]  do_syscall_64+0x38/0xc0
[   18.335966]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   18.336681] RIP: 0033:0x7fc8e4db64a7


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-19  2:34                     ` Ming Lei
  2021-10-19  6:23                       ` Miroslav Benes
  2021-10-19 15:28                       ` Luis Chamberlain
@ 2021-10-19 15:50                       ` Luis Chamberlain
  2021-10-19 16:25                         ` Greg KH
  2021-10-19 16:39                         ` Ming Lei
  2 siblings, 2 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-19 15:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Benjamin Herrenschmidt, Paul Mackerras, tj, gregkh, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel

On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
> On Mon, Oct 18, 2021 at 12:32:11PM -0700, Luis Chamberlain wrote:
> > On Sat, Oct 16, 2021 at 07:28:39PM +0800, Ming Lei wrote:
> > > On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
> > > > On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> > > > > On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > > > > > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> > > > > ...
> > > > > > > 
> > > > > > > We need to understand the exact reason why there is still cpuhp node
> > > > > > > left, can you share us the exact steps for reproducing the issue?
> > > > > > > Otherwise we may have to trace and narrow down the reason.
> > > > > > 
> > > > > > See my commit log for my own fix for this issue.
> > > > > 
> > > > > OK, thanks!
> > > > > 
> > > > > I can reproduce the issue, and the reason is that reset_store fails
> > > > > zram_remove() when unloading module, then the warning is caused.
> > > > > 
> > > > > The top 3 patches in the following tree can fix the issue:
> > > > > 
> > > > > https://github.com/ming1/linux/commits/my_v5.15-blk-dev
> > > > 
> > > > Thanks for trying an alternative fix! A crash stops yes, however this
> > > 
> > > I doubt it is alternative since your patchset doesn't mention the exact
> > > reason of 'Error: Removing state 63 which has instances left.', that is
> > > simply caused by failing to remove zram because ->claim is set during
> > > unloading module.
> > 
> > Well I disagree because it does explain how the race can happen, and it
> > also explains how since the sysfs interface is exposed until module
> > removal completes, it leaves exposed knobs to allow re-initializing of a
> > struct zcomp for a zram device before the exit.
> > 
> > > Yeah, you mentioned the race between disksize_store() vs. zram_remove(),
> > > however I don't think it is reproduced easily in the test because the race
> > > window is pretty small, also it can be fixed easily in my 3rd patch
> > > without any complicated tricks.
> > 
> > Reproducing for me is... extremely easy.
> 
> In my observation, failing zram_remove() is extremely easy to trigger. It
> is caused by reset_store(), which sets ->claim to true, so zram_remove()
> fails and zram_reset_device() is bypassed; that in turn causes the failure
> 'Error: Removing state 63 which has instances left.'
> 
> Are we on the same page?

The actual first issue is that the CPU hotplug remove callback is long gone,
and in the meantime we allow a race to add a new "instance", in the zram
driver's case a cpu struct zcomp instance, through the sysfs interface.

Regardless of whether zram_remove() can fail or not, the above race needs to
be addressed.

> > > Not dig into details of your patchset via grabbing module reference
> > > count during show/store attribute of kernfs which is done in your patch
> > > 9, but IMO this way isn't necessary:
> > 
> > That's to address the deadlock only.
> > 
> > > 1) any driver module has to cleanup anything which may refer to symbols
> > > or data defined in module_exit of this driver
> > 
> > Yes, and as the cpu multistate hotplug documentation warns (although
> > such documentation is kind of hidden) that driver authors need to be
> > careful with module removal too, refer to the warning at the end of
> > __cpuhp_remove_state_cpuslocked() about module removal.
> 
> It is zram's bug. zram has to clean everything in module_exit(),
> unfortunately zram_remove() can be failed when calling from
> module_exit() because ->claim is set as true by reset_store(), then
> zram_reset_device()(->zcomp_destroy) isn't called, and this failure should
> not happen when unloading module, should it?

You're addressing a possibly failing zram_remove(), while I address not
allowing entry to muck with the zram driver at all once we're bailing
on module removal.
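
As a rough illustration of refusing entry once removal has started, here is
a userspace sketch of a try-get style gate, in the spirit of taking a module
reference around each show()/store() call. It is a model built on
assumptions, not the code from this series: fake_module, fake_try_get(),
fake_begin_removal(), guarded_store(), sample_op() and the -1 return value
are all hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>

enum mod_state { MOD_LIVE, MOD_GOING };

/* Hypothetical stand-in for a module with a reference count. */
struct fake_module {
	enum mod_state state;
	int refs;
};

/* Take a reference unless removal has already started. */
bool fake_try_get(struct fake_module *mod)
{
	if (mod->state == MOD_GOING)
		return false;
	mod->refs++;
	return true;
}

void fake_put(struct fake_module *mod)
{
	assert(mod->refs > 0);
	mod->refs--;
}

/* Removal may only begin while nobody holds a reference. */
bool fake_begin_removal(struct fake_module *mod)
{
	if (mod->refs > 0)
		return false;
	mod->state = MOD_GOING;
	return true;
}

/* Example attribute body; a real store() would do driver work here. */
int sample_op(void)
{
	return 0;
}

/*
 * Wrapper around an attribute op: the op only runs while a reference is
 * held, so it can never overlap with teardown in this model. Returns -1
 * (an arbitrary error value for this sketch) when entry is refused.
 */
int guarded_store(struct fake_module *mod, int (*op)(void))
{
	int ret;

	if (!fake_try_get(mod))
		return -1;
	ret = op();
	fake_put(mod);
	return ret;
}
```

The point of the gate is that removal and attribute ops exclude each other
without sharing a mutex: a running op holds removal off, and once removal
has begun every later op is turned away at the door.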

> > > 2) device_del() is often done in module_exit(), once device_del()
> > > returns, no any new show/store on the device's kobject attribute
> > > is possible.
> > 
> > Right, and if a sysfs knob is exposed before device_del() completes
> > and is allowed to do things, the driver should take care to prevent
> > races for CPU multistate support. The small state machine I added ensures
> 
> What is the race for CPU multistate support? If you mean 'Error: Removing
> state 63 which has instances left.', it is zram's bug since zram has to
> cleanup everything in module_exit().

Yes. And that is what my out-of-tree yet Acked patch, 'zram: fix
crashes with cpu hotplug multistate', does.

> > we don't run over any expectations from cpu hotplug multistate support.
> > 
> > I've *never* suggested there cannot be alternatives to my solution with
> > the small state machine, but for you to say it is incorrect is simply
> > not right either.
> > 
> > > 3) it is _not_ a must or pattern for fixing bugs to hold one lock before
> > > calling device_del(), meantime the lock is required in the device's
> > > attribute show()/store(), which causes AA deadlock easily. Your approach
> > > just avoids the issue by not releasing module until all show/store are
> > > done.
> > 
> > Right, there are two approaches here:
> > 
> > a) Your approach is to accept the deadlock as a requirement and so
> > you would prefer to implement an alternative to using a shared lock
> > on module exit and sysfs op.
> 
> wrt. in-tree zram, there is neither any deadlock in linus tree, nor after
> applying my 3 patches. If you think there is, please share us the code
> or lockdep warning.

Right, 'zram: fix crashes with cpu hotplug multistate' is not yet
merged. My approach to fixing that does add a lock use on module removal,
which does introduce a possible deadlock with sysfs; that is later addressed
generically between sysfs and module removal for all drivers.

> > b) While I address such a deadlock head on as I think this sort of locking
> > be allowed for two reasons:
> >    b1) as we never documented such requirement otherwise.
> >    b2) There is a possibility that other drivers already exist too
> >        which *do* use a shared lock on module removal and sysfs ops
> >        (and I just confirmed this to be true)
> 
> The 'deadlock' is actually caused by your out-of-tree patch of 'zram: fix
> crashes with cpu hotplug multistate' which adds mutex_lock(zram_index_mutex)
> in destroy_devices().

Yes yes, but you are completely throwing out the window that other
possible deadlocks can exist in the kernel *and* that *new* cases of
the deadlock can easily also be added!

> We can fix this issue easily without needing the global lock, please see the
> attached(pre-V2) patch.

So far your patches do not fix the issues though...

> > So I *really* don't think it is wise for us to simply accept this newly
> > found deadlock as a *new* requirement, especially if we can fix it easily.
> > 
> > A cursory review using Coccinelle for potential issues with a mutex lock
> > directly used on module exit (so this doesn't cover drivers like zram,
> > which uses a routine and then grabs the lock through indirection) and a
> > sysfs op shows these drivers are also affected by this deadlock:
> > 
> >   * arch/powerpc/sysdev/fsl_mpic_timer_wakeup.c
> 
> In fsl_wakeup_sys_exit(), device_remove_file() is called before
> acquiring &sysfs_lock, so there shouldn't be such AA deadlock.
> 
> >   * lib/test_firmware.c
> 
> Yeah, there is the AA deadlock risk, but it should be fixed by moving
> misc_deregister() out of &test_fw_mutex.

And just like that you are ignoring other possible uses in the kernel
which might have similar deadlocks.

So do you want to take the position:

Hey driver authors: you cannot use any shared lock on module removal and
on sysfs ops?

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-19 15:50                       ` Luis Chamberlain
@ 2021-10-19 16:25                         ` Greg KH
  2021-10-19 16:30                           ` Luis Chamberlain
  2021-10-19 16:39                         ` Ming Lei
  1 sibling, 1 reply; 94+ messages in thread
From: Greg KH @ 2021-10-19 16:25 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Ming Lei, Benjamin Herrenschmidt, Paul Mackerras, tj, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel

On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> So do you want to take the position:
> 
> Hey driver authors: you cannot use any shared lock on module removal and
> on sysfs ops?

Yes, I would not recommend using such a lock at all.  sysfs operations
happen on a per-device basis, so you can lock the device structure.
Module removal happens on a driver basis, and I have no idea what you
want to lock there, but odds are it is NOT shared with your per-device
structures either, right?

If so, then yes, that is a bug, but a very rare one as drivers should do
almost nothing except register/unregister_driver() in their module
init/exit calls.

zram is not a "normal" driver at all here, so fixing this type of
problem up should be done in the zram code, it is not a generic
module/sysfs issue at all.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-19 15:28                       ` Luis Chamberlain
@ 2021-10-19 16:29                         ` Ming Lei
  2021-10-19 19:36                           ` Luis Chamberlain
  0 siblings, 1 reply; 94+ messages in thread
From: Ming Lei @ 2021-10-19 16:29 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Benjamin Herrenschmidt, Paul Mackerras, tj, gregkh, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel

On Tue, Oct 19, 2021 at 08:28:21AM -0700, Luis Chamberlain wrote:
> On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
> > Please try the following patch against upstream(linus or next) tree(basically
> > fold revised 2 and 3 of V1, and cover two issues: not fail zram_remove in
> > module_exit(), race between zram_remove() and disksize_store()), and see if
> > everything is fine for you:
> 
> Page fault ...
> 
> [   18.284256] zram: Removed device: zram0
> [   18.312974] BUG: unable to handle page fault for address: ffffad86de903008
> [   18.313707] #PF: supervisor read access in kernel mode
> [   18.314248] #PF: error_code(0x0000) - not-present page
> [   18.314797] PGD 100000067 P4D 100000067 PUD 10031e067 PMD 136a28067

That is another race between zram_reset_device() and disksize_store(),
which is supposed to be covered by ->init_lock. The delta fix against the
last patch I posted follows, and the whole patch can be found at the
github link:

https://github.com/ming1/linux/commit/fa6045b1371eb301f392ac84adaf3ad53bb16894


diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d0cae7a42f4d..a14ba3d350ea 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
 	set_capacity_and_notify(zram->disk, 0);
 	part_stat_set_all(zram->disk->part0, 0);
 
-	up_write(&zram->init_lock);
 	/* I/O operation under all of CPU are done so let's free */
 	zram_meta_free(zram, disksize);
 	memset(&zram->stats, 0, sizeof(zram->stats));
 	zcomp_destroy(comp);
 	reset_bdev(zram);
+	up_write(&zram->init_lock);
 }
 
 static ssize_t disksize_store(struct device *dev,

-- 
Ming


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-19 16:25                         ` Greg KH
@ 2021-10-19 16:30                           ` Luis Chamberlain
  2021-10-19 17:28                             ` Greg KH
  0 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-19 16:30 UTC (permalink / raw)
  To: Greg KH
  Cc: Ming Lei, Benjamin Herrenschmidt, Paul Mackerras, tj, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel

On Tue, Oct 19, 2021 at 06:25:18PM +0200, Greg KH wrote:
> On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> > So do you want to take the position:
> > 
> > Hey driver authors: you cannot use any shared lock on module removal and
> > on sysfs ops?
> 
> Yes, I would not recommend using such a lock at all.  sysfs operations
> happen on a per-device basis, so you can lock the device structure.

All devices are going to be removed on module removal and so cannot be locked.

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-19 15:50                       ` Luis Chamberlain
  2021-10-19 16:25                         ` Greg KH
@ 2021-10-19 16:39                         ` Ming Lei
  2021-10-19 19:38                           ` Luis Chamberlain
  1 sibling, 1 reply; 94+ messages in thread
From: Ming Lei @ 2021-10-19 16:39 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Benjamin Herrenschmidt, Paul Mackerras, tj, gregkh, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel

On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
> > On Mon, Oct 18, 2021 at 12:32:11PM -0700, Luis Chamberlain wrote:
> > > On Sat, Oct 16, 2021 at 07:28:39PM +0800, Ming Lei wrote:
> > > > On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
> > > > > On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> > > > > > On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > > > > > > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> > > > > > ...
> > > > > > > > 
> > > > > > > > We need to understand the exact reason why there is still cpuhp node
> > > > > > > > left, can you share us the exact steps for reproducing the issue?
> > > > > > > > Otherwise we may have to trace and narrow down the reason.
> > > > > > > 
> > > > > > > See my commit log for my own fix for this issue.
> > > > > > 
> > > > > > OK, thanks!
> > > > > > 
> > > > > > I can reproduce the issue, and the reason is that reset_store fails
> > > > > > zram_remove() when unloading module, then the warning is caused.
> > > > > > 
> > > > > > The top 3 patches in the following tree can fix the issue:
> > > > > > 
> > > > > > https://github.com/ming1/linux/commits/my_v5.15-blk-dev
> > > > > 
> > > > > Thanks for trying an alternative fix! A crash stops yes, however this
> > > > 
> > > > I doubt it is alternative since your patchset doesn't mention the exact
> > > > reason of 'Error: Removing state 63 which has instances left.', that is
> > > > simply caused by failing to remove zram because ->claim is set during
> > > > unloading module.
> > > 
> > > Well I disagree because it does explain how the race can happen, and it
> > > also explains how since the sysfs interface is exposed until module
> > > removal completes, it leaves exposed knobs to allow re-initializing of a
> > > struct zcomp for a zram device before the exit.
> > > 
> > > > Yeah, you mentioned the race between disksize_store() vs. zram_remove(),
> > > > however I don't think it is reproduced easily in the test because the race
> > > > window is pretty small, also it can be fixed easily in my 3rd patch
> > > > without any complicated tricks.
> > > 
> > > Reproducing for me is... extremely easy.
> > 
> > In my observation, failing zram_remove() is extremely easy to trigger. It
> > is caused by reset_store(), which sets ->claim to true, so zram_remove()
> > fails and zram_reset_device() is bypassed; that in turn causes the failure
> > 'Error: Removing state 63 which has instances left.'
> > 
> > Are we on the same page?
> 
> The actual first issue is that the CPU hotplug remove callback is long gone,
> and in the meantime we allow a race to add a new "instance", in the zram
> driver's case a cpu struct zcomp instance, through the sysfs interface.
> 
> Regardless of whether zram_remove() can fail or not, the above race needs to
> be addressed.
> 
> > > > Not dig into details of your patchset via grabbing module reference
> > > > count during show/store attribute of kernfs which is done in your patch
> > > > 9, but IMO this way isn't necessary:
> > > 
> > > That's to address the deadlock only.
> > > 
> > > > 1) any driver module has to cleanup anything which may refer to symbols
> > > > or data defined in module_exit of this driver
> > > 
> > > Yes, and as the cpu multistate hotplug documentation warns (although
> > > such documentation is kind of hidden) that driver authors need to be
> > > careful with module removal too, refer to the warning at the end of
> > > __cpuhp_remove_state_cpuslocked() about module removal.
> > 
> > It is zram's bug. zram has to clean everything in module_exit(),
> > unfortunately zram_remove() can be failed when calling from
> > module_exit() because ->claim is set as true by reset_store(), then
> > zram_reset_device()(->zcomp_destroy) isn't called, and this failure should
> > not happen when unloading module, should it?
> 
> You're addressing a possibly failing zram_remove(), while I address not
> allowing entry to muck with the zram driver at all once we're bailing
> on module removal.
> 
> > > > 2) device_del() is often done in module_exit(), once device_del()
> > > > returns, no any new show/store on the device's kobject attribute
> > > > is possible.
> > > 
> > > Right, and if a sysfs knob is exposed before device_del() completes
> > > and is allowed to do things, the driver should take care to prevent
> > > races for CPU multistate support. The small state machine I added ensures
> > 
> > What is the race for CPU multistate support? If you mean 'Error: Removing
> > state 63 which has instances left.', it is zram's bug since zram has to
> > cleanup everything in module_exit().
> 
> Yes. And that is what my out-of-tree yet Acked patch, 'zram: fix
> crashes with cpu hotplug multistate', does.

Unfortunately, that patch adds a new deadlock between hot_remove_store() and
disksize_store() and others; see my comment below.

> 
> > > we don't run over any expectations from cpu hotplug multistate support.
> > > 
> > > I've *never* suggested there cannot be alternatives to my solution with
> > > the small state machine, but for you to say it is incorrect is simply
> > > not right either.
> > > 
> > > > 3) it is _not_ a must or pattern for fixing bugs to hold one lock before
> > > > calling device_del(), meantime the lock is required in the device's
> > > > attribute show()/store(), which causes AA deadlock easily. Your approach
> > > > just avoids the issue by not releasing module until all show/store are
> > > > done.
> > > 
> > > Right, there are two approaches here:
> > > 
> > > a) Your approach is to accept the deadlock as a requirement and so
> > > you would prefer to implement an alternative to using a shared lock
> > > on module exit and sysfs op.
> > 
> > wrt. in-tree zram, there is neither any deadlock in linus tree, nor after
> > applying my 3 patches. If you think there is, please share us the code
> > or lockdep warning.
> 
> Right, 'zram: fix crashes with cpu hotplug multistate' is not yet
> merged. My approach to fixing that does add a lock use on module removal,
> which does introduce a possible deadlock with sysfs; that is later addressed
> generically between sysfs and module removal for all drivers.
> 
> > > b) While I address such a deadlock head on as I think this sort of locking
> > > be allowed for two reasons:
> > >    b1) as we never documented such requirement otherwise.
> > >    b2) There is a possibility that other drivers already exist too
> > >        which *do* use a shared lock on module removal and sysfs ops
> > >        (and I just confirmed this to be true)
> > 
> > The 'deadlock' is actually caused by your out-of-tree patch of 'zram: fix
> > crashes with cpu hotplug multistate' which adds mutex_lock(zram_index_mutex)
> > in destroy_devices().
> 
> Yes yes, but you are completely throwing out the window that other
> possible deadlocks can exist in the kernel *and* that *new* cases of
> the deadlock can easily also be added!
> 
> > We can fix this issue easily without needing the global lock, please see the
> > attached(pre-V2) patch.
> 
> So far your patches do not fix the issues though...
> 
> > > So I *really* don't think it is wise for us to simply accept this newly
> > > found deadlock as a *new* requirement, especially if we can fix it easily.
> > > 
> > > A cursory review using Coccinelle for potential issues with a mutex lock
> > > directly used on module exit (so this doesn't cover drivers like zram,
> > > which uses a routine and then grabs the lock through indirection) and a
> > > sysfs op shows these drivers are also affected by this deadlock:
> > > 
> > >   * arch/powerpc/sysdev/fsl_mpic_timer_wakeup.c
> > 
> > In fsl_wakeup_sys_exit(), device_remove_file() is called before
> > acquiring &sysfs_lock, so there shouldn't be such AA deadlock.
> > 
> > >   * lib/test_firmware.c
> > 
> > Yeah, there is the AA deadlock risk, but it should be fixed by moving
> > misc_deregister() out of &test_fw_mutex.
> 
> And just like that you are ignoring other possible uses in the kernel
> which might have similar deadlocks.
> 
> So do you want to take the position:
> 
> Hey driver authors: you cannot use any shared lock on module removal and
> on sysfs ops?

IMO, yes. In your patch 'zram: fix crashes with cpu hotplug multistate',
you added mutex_lock(zram_index_mutex) to disksize_store() and the other
attribute show()/store() methods. That adds a new deadlock between
hot_remove_store() and disksize_store() and others, which can't be
addressed by your approach of holding the module refcnt.

So far I don't see that the ltp tests cover the hot add/remove interface.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-19 16:30                           ` Luis Chamberlain
@ 2021-10-19 17:28                             ` Greg KH
  2021-10-19 19:46                               ` Luis Chamberlain
  0 siblings, 1 reply; 94+ messages in thread
From: Greg KH @ 2021-10-19 17:28 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Ming Lei, Benjamin Herrenschmidt, Paul Mackerras, tj, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel

On Tue, Oct 19, 2021 at 09:30:05AM -0700, Luis Chamberlain wrote:
> On Tue, Oct 19, 2021 at 06:25:18PM +0200, Greg KH wrote:
> > On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> > > So do you want to take the position:
> > > 
> > > Hey driver authors: you cannot use any shared lock on module removal and
> > > on sysfs ops?
> > 
> > Yes, I would not recommend using such a lock at all.  sysfs operations
> > happen on a per-device basis, so you can lock the device structure.
> 
> All devices are going to be removed on module removal and so cannot be locked.

devices are not normally created by a driver, that is up to the bus
controller logic.  A module will just disconnect itself from the device,
the device does not go away.

But yes, there are exceptions, and if you are doing something odd like
that, then you need to be aware of crazy things like this, so be
careful.  But for all normal drivers, they do not have to worry about
this.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-19 16:29                         ` Ming Lei
@ 2021-10-19 19:36                           ` Luis Chamberlain
  2021-10-20  1:15                             ` Ming Lei
  0 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-19 19:36 UTC (permalink / raw)
  To: Ming Lei
  Cc: Benjamin Herrenschmidt, Paul Mackerras, tj, gregkh, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel

On Wed, Oct 20, 2021 at 12:29:53AM +0800, Ming Lei wrote:
> On Tue, Oct 19, 2021 at 08:28:21AM -0700, Luis Chamberlain wrote:
> > On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
> > > Please try the following patch against upstream(linus or next) tree(basically
> > > fold revised 2 and 3 of V1, and cover two issues: not fail zram_remove in
> > > module_exit(), race between zram_remove() and disksize_store()), and see if
> > > everything is fine for you:
> > 
> > Page fault ...
> > 
> > [   18.284256] zram: Removed device: zram0
> > [   18.312974] BUG: unable to handle page fault for address:
> > ffffad86de903008
> > [   18.313707] #PF: supervisor read access in kernel mode
> > [   18.314248] #PF: error_code(0x0000) - not-present page
> > [   18.314797] PGD 100000067 P4D 100000067 PUD 10031e067 PMD 136a28067
> 
> That is another race between zram_reset_device() and disksize_store(),
> which is supposed to be covered by ->init_lock, and follows the delta fix
> against the last patch I posted, and the whole patch can be found in the
> github link:
> 
> https://github.com/ming1/linux/commit/fa6045b1371eb301f392ac84adaf3ad53bb16894
> 
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index d0cae7a42f4d..a14ba3d350ea 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
>  	set_capacity_and_notify(zram->disk, 0);
>  	part_stat_set_all(zram->disk->part0, 0);
>  
> -	up_write(&zram->init_lock);
>  	/* I/O operation under all of CPU are done so let's free */
>  	zram_meta_free(zram, disksize);
>  	memset(&zram->stats, 0, sizeof(zram->stats));
>  	zcomp_destroy(comp);
>  	reset_bdev(zram);
> +	up_write(&zram->init_lock);
>  }
>  
>  static ssize_t disksize_store(struct device *dev,

With this, it still ends up in a state where we loop and can't get out:

zram: Can't change algorithm for initialized device

  Luis
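
For reference, the effect of the delta quoted above (moving the unlock
after the frees) can be modelled in userspace. This is only a sketch, not
the zram code: struct model_zram and the model_*() helpers are made-up
names, and a pthread mutex stands in for the write side of ->init_lock.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct zram; only what the sketch needs. */
struct model_zram {
	pthread_mutex_t init_lock; /* models the write side of ->init_lock */
	size_t disksize;           /* 0 means "not initialized" */
	char *meta;                /* models what zram_meta_free() releases */
};

static void model_init(struct model_zram *z)
{
	int err = pthread_mutex_init(&z->init_lock, NULL);

	assert(err == 0);
	(void)err;
	z->disksize = 0;
	z->meta = NULL;
}

/* Models disksize_store(): refuse if already initialized, else allocate. */
static int model_disksize_store(struct model_zram *z, size_t size)
{
	int ret = 0;

	if (size == 0)
		return -EINVAL;

	pthread_mutex_lock(&z->init_lock);
	if (z->disksize) {
		ret = -EBUSY;
	} else {
		z->meta = calloc(1, size);
		if (!z->meta)
			ret = -ENOMEM;
		else
			z->disksize = size;
	}
	pthread_mutex_unlock(&z->init_lock);
	return ret;
}

/* Models zram_reset_device() with the unlock moved after the teardown. */
static void model_reset(struct model_zram *z)
{
	pthread_mutex_lock(&z->init_lock);
	z->disksize = 0;
	free(z->meta);
	z->meta = NULL;
	/* Only now may a concurrent model_disksize_store() proceed. */
	pthread_mutex_unlock(&z->init_lock);
}
```

With the unlock placed before free(), a store could slip in between and
install new state that the tail of the reset then frees, which is the
kind of window the delta is meant to close.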

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-19 16:39                         ` Ming Lei
@ 2021-10-19 19:38                           ` Luis Chamberlain
  2021-10-20  0:55                             ` Ming Lei
  0 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-19 19:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Benjamin Herrenschmidt, Paul Mackerras, tj, gregkh, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel

On Wed, Oct 20, 2021 at 12:39:22AM +0800, Ming Lei wrote:
> On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> > So do you want to take the position:
> > 
> > Hey driver authors: you cannot use any shared lock on module removal and
> > on sysfs ops?
> 
> IMO, yes, in your patch of 'zram: fix crashes with cpu hotplug multistate',
> when you added mutex_lock(zram_index_mutex) to disksize_store() and
> other attribute show() or store() method. You have added new deadlock
> between hot_remove_store() and disksize_store() & others, which can't be
> addressed by your approach of holding module refcnt.
> 
> So far not see ltp tests covers hot add/remove interface yet.

Care to show what commands to use to cause this deadlock with my patches?

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-19 17:28                             ` Greg KH
@ 2021-10-19 19:46                               ` Luis Chamberlain
  0 siblings, 0 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-19 19:46 UTC (permalink / raw)
  To: Greg KH
  Cc: Ming Lei, Benjamin Herrenschmidt, Paul Mackerras, tj, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel

On Tue, Oct 19, 2021 at 07:28:35PM +0200, Greg KH wrote:
> On Tue, Oct 19, 2021 at 09:30:05AM -0700, Luis Chamberlain wrote:
> > On Tue, Oct 19, 2021 at 06:25:18PM +0200, Greg KH wrote:
> > > On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> > > > So do you want to take the position:
> > > > 
> > > > Hey driver authors: you cannot use any shared lock on module removal and
> > > > on sysfs ops?
> > > 
> > > Yes, I would not recommend using such a lock at all.  sysfs operations
> > > happen on a per-device basis, so you can lock the device structure.
> > 
> > All devices are going to be removed on module removal and so cannot be locked.
> 
> devices are not normally created by a driver, that is up to the bus
> controller logic.  A module will just disconnect itself from the device,
> the device does not go away.
> 
> But yes, there are exceptions, and if you are doing something odd like
> that, then you need to be aware of crazy things like this, so be
> careful.  But for all normal drivers, they do not have to worry about
> this.

"Recommend" is a weak position to take given a possible deadlock with sysfs.

Do we want to at the very least document this is not a supported scheme?

If so, I can also add a simple one-level-indirection coccinelle patch to
detect these schemes and complain about them as well, if we are going to
take this position.

But to simply disregard this as "not an issue", or to say we won't do
anything, seems pretty counterproductive given we *did* have drivers with
this issue before *and* still have them upstream, and can end up with more
drivers like this later.

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-19 19:38                           ` Luis Chamberlain
@ 2021-10-20  0:55                             ` Ming Lei
  0 siblings, 0 replies; 94+ messages in thread
From: Ming Lei @ 2021-10-20  0:55 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Benjamin Herrenschmidt, Paul Mackerras, tj, gregkh, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel, ming.lei

On Tue, Oct 19, 2021 at 12:38:42PM -0700, Luis Chamberlain wrote:
> On Wed, Oct 20, 2021 at 12:39:22AM +0800, Ming Lei wrote:
> > On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> > > So do you want to take the position:
> > > 
> > > Hey driver authors: you cannot use any shared lock on module removal and
> > > on sysfs ops?
> > 
> > IMO, yes, in your patch of 'zram: fix crashes with cpu hotplug multistate',
> > when you added mutex_lock(zram_index_mutex) to disksize_store() and
> > other attribute show() or store() method. You have added new deadlock
> > between hot_remove_store() and disksize_store() & others, which can't be
> > addressed by your approach of holding module refcnt.
> > 
> > So far not see ltp tests covers hot add/remove interface yet.
> 
> Care to show what commands to use to cause this deadlock with my patches?

Build a kernel with your patches 4, 7, 8, 9, 11 and 12 (all others are test
module or documentation changes) with lockdep enabled, then run the following
commands. You will see the warning below, and it is a real deadlock, not a
false positive.

BTW, your patch 9 doesn't apply cleanly against either the linus or the next
tree, so I edited it manually, but that makes no difference wrt. this issue.


[root@ktest-09 ~]# lsblk | grep zram
zram0   253:0    0    0B  0 disk 
[root@ktest-09 ~]# cat /sys/class/zram-control/hot_add
[root@ktest-09 ~]# lsblk | grep zram
zram0   253:0    0    0B  0 disk 
zram1   253:1    0    0B  0 disk 
[root@ktest-09 ~]# echo 256M > /sys/block/zram1/disksize 
[root@ktest-09 ~]# echo 1 >  /sys/class/zram-control/hot_remove 
[root@ktest-09 ~]# dmesg
...
[   75.599882] ======================================================
[   75.601355] WARNING: possible circular locking dependency detected
[   75.602818] 5.15.0-rc3_zram_fix_luis+ #24 Not tainted
[   75.604038] ------------------------------------------------------
[   75.605512] bash/1154 is trying to acquire lock:
[   75.606634] ffff91ce026cd428 (kn->active#237){++++}-{0:0}, at: __kernfs_remove+0x1ab/0x1e0
[   75.608570]
               but task is already holding lock:
[   75.609955] ffffffff839e3ef0 (zram_index_mutex){+.+.}-{3:3}, at: hot_remove_store+0x52/0xf0
[   75.611910]
               which lock already depends on the new lock.

[   75.613896]
               the existing dependency chain (in reverse order) is:
[   75.615830]
               -> #1 (zram_index_mutex){+.+.}-{3:3}:
[   75.617483]        __lock_acquire+0x4d2/0x930
[   75.618650]        lock_acquire+0xbb/0x2d0
[   75.619748]        __mutex_lock+0x8e/0x8a0
[   75.620854]        disksize_store+0x38/0x180
[   75.621996]        kernfs_fop_write_iter+0x134/0x1d0
[   75.623287]        new_sync_write+0x122/0x1b0
[   75.624442]        vfs_write+0x23e/0x350
[   75.625506]        ksys_write+0x68/0xe0
[   75.626550]        do_syscall_64+0x3b/0x90
[   75.627649]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[   75.629070]
               -> #0 (kn->active#237){++++}-{0:0}:
[   75.630677]        check_prev_add+0x91/0xc10
[   75.631816]        validate_chain+0x474/0x500
[   75.632972]        __lock_acquire+0x4d2/0x930
[   75.634131]        lock_acquire+0xbb/0x2d0
[   75.635234]        kernfs_drain+0x139/0x190
[   75.636355]        __kernfs_remove+0x1ab/0x1e0
[   75.637532]        kernfs_remove_by_name_ns+0x3f/0x80
[   75.638843]        remove_files+0x2b/0x60
[   75.639926]        sysfs_remove_group+0x38/0x80
[   75.641120]        sysfs_remove_groups+0x29/0x40
[   75.642334]        device_remove_attrs+0x5b/0x90
[   75.643552]        device_del+0x184/0x400
[   75.644635]        zram_remove+0xac/0xc0
[   75.645700]        hot_remove_store+0xa3/0xf0
[   75.646856]        kernfs_fop_write_iter+0x134/0x1d0
[   75.648147]        new_sync_write+0x122/0x1b0
[   75.649311]        vfs_write+0x23e/0x350
[   75.650372]        ksys_write+0x68/0xe0
[   75.651412]        do_syscall_64+0x3b/0x90
[   75.652512]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[   75.653929]
               other info that might help us debug this:

[   75.656054]  Possible unsafe locking scenario:

[   75.657637]        CPU0                    CPU1
[   75.658833]        ----                    ----
[   75.660020]   lock(zram_index_mutex);
[   75.661024]                                lock(kn->active#237);
[   75.662549]                                lock(zram_index_mutex);
[   75.664103]   lock(kn->active#237);
[   75.665072]
                *** DEADLOCK ***

[   75.666736] 4 locks held by bash/1154:
[   75.667767]  #0: ffff91ce06983470 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x68/0xe0
[   75.669802]  #1: ffff91ce4123d290 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x100/0x1d0
[   75.672050]  #2: ffff91ce05a7ac40 (kn->active#238){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x108/0x1d0
[   75.674383]  #3: ffffffff839e3ef0 (zram_index_mutex){+.+.}-{3:3}, at: hot_remove_store+0x52/0xf0
[   75.676595]
               stack backtrace:
[   75.677835] CPU: 2 PID: 1154 Comm: bash Not tainted 5.15.0-rc3_zram_fix_luis+ #24
[   75.679768] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1.fc33 04/01/2014
[   75.681927] Call Trace:
[   75.682674]  dump_stack_lvl+0x57/0x7d
[   75.683680]  check_noncircular+0xff/0x110
[   75.684758]  ? stack_trace_save+0x4b/0x70
[   75.685843]  check_prev_add+0x91/0xc10
[   75.686867]  ? add_chain_cache+0x112/0x2d0
[   75.687965]  validate_chain+0x474/0x500
[   75.689005]  __lock_acquire+0x4d2/0x930
[   75.690054]  lock_acquire+0xbb/0x2d0
[   75.691038]  ? __kernfs_remove+0x1ab/0x1e0
[   75.692131]  ? __lock_release+0x179/0x2c0
[   75.693212]  ? kernfs_drain+0x5b/0x190
[   75.694239]  kernfs_drain+0x139/0x190
[   75.695240]  ? __kernfs_remove+0x1ab/0x1e0
[   75.696341]  __kernfs_remove+0x1ab/0x1e0
[   75.697408]  kernfs_remove_by_name_ns+0x3f/0x80
[   75.698607]  remove_files+0x2b/0x60
[   75.699576]  sysfs_remove_group+0x38/0x80
[   75.700661]  sysfs_remove_groups+0x29/0x40
[   75.701770]  device_remove_attrs+0x5b/0x90
[   75.702870]  device_del+0x184/0x400
[   75.703835]  zram_remove+0xac/0xc0
[   75.704785]  hot_remove_store+0xa3/0xf0
[   75.705831]  kernfs_fop_write_iter+0x134/0x1d0
[   75.707004]  new_sync_write+0x122/0x1b0
[   75.708048]  ? __do_fast_syscall_32+0xe0/0xf0
[   75.709214]  vfs_write+0x23e/0x350
[   75.710161]  ksys_write+0x68/0xe0
[   75.711088]  do_syscall_64+0x3b/0x90
[   75.712078]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   75.713389] RIP: 0033:0x7fcc1893f927
[   75.714381] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[   75.718879] RSP: 002b:00007ffcd56d91a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   75.720832] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fcc1893f927
[   75.722592] RDX: 0000000000000002 RSI: 000055d7d33f78c0 RDI: 0000000000000001
[   75.724352] RBP: 000055d7d33f78c0 R08: 0000000000000000 R09: 00007fcc189f44e0
[   75.726123] R10: 00007fcc189f43e0 R11: 0000000000000246 R12: 0000000000000002
[   75.727884] R13: 00007fcc18a395a0 R14: 0000000000000002 R15: 00007fcc18a397a0



Thanks, 
Ming


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-19 19:36                           ` Luis Chamberlain
@ 2021-10-20  1:15                             ` Ming Lei
  2021-10-20 15:48                               ` Luis Chamberlain
  0 siblings, 1 reply; 94+ messages in thread
From: Ming Lei @ 2021-10-20  1:15 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Benjamin Herrenschmidt, Paul Mackerras, tj, gregkh, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel

On Tue, Oct 19, 2021 at 12:36:42PM -0700, Luis Chamberlain wrote:
> On Wed, Oct 20, 2021 at 12:29:53AM +0800, Ming Lei wrote:
> > On Tue, Oct 19, 2021 at 08:28:21AM -0700, Luis Chamberlain wrote:
> > > On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
> > > > Please try the following patch against upstream(linus or next) tree(basically
> > > > fold revised 2 and 3 of V1, and cover two issues: not fail zram_remove in
> > > > module_exit(), race between zram_remove() and disksize_store()), and see if
> > > > everything is fine for you:
> > > 
> > > Page fault ...
> > > 
> > > [   18.284256] zram: Removed device: zram0
> > > [   18.312974] BUG: unable to handle page fault for address:
> > > ffffad86de903008
> > > [   18.313707] #PF: supervisor read access in kernel mode
> > > [   18.314248] #PF: error_code(0x0000) - not-present page
> > > [   18.314797] PGD 100000067 P4D 100000067 PUD 10031e067 PMD 136a28067
> > 
> > That is another race between zram_reset_device() and disksize_store(),
> > which is supposed to be covered by ->init_lock, and follows the delta fix
> > against the last patch I posted, and the whole patch can be found in the
> > github link:
> > 
> > https://github.com/ming1/linux/commit/fa6045b1371eb301f392ac84adaf3ad53bb16894
> > 
> > 
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index d0cae7a42f4d..a14ba3d350ea 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
> >  	set_capacity_and_notify(zram->disk, 0);
> >  	part_stat_set_all(zram->disk->part0, 0);
> >  
> > -	up_write(&zram->init_lock);
> >  	/* I/O operation under all of CPU are done so let's free */
> >  	zram_meta_free(zram, disksize);
> >  	memset(&zram->stats, 0, sizeof(zram->stats));
> >  	zcomp_destroy(comp);
> >  	reset_bdev(zram);
> > +	up_write(&zram->init_lock);
> >  }
> >  
> >  static ssize_t disksize_store(struct device *dev,
> 
> With this, it still ends up in a state where we loop and can't get out of:
> 
> zram: Can't change algorithm for initialized device

Again, you are running two instances of zram02.sh[1] on /dev/zram0, so that
isn't unexpected behavior. The only difference here is timing. In my test VM,
this message is shown for a while by one task, then it may switch to the
other task.

Just running your patches for a while shows no real difference here; the
following message can be printed by one task for a long time:

	can't set '107374182400' to /sys/block/zram0/disksize

Also, you did not answer my question about the result you expect from your
test when the following script is run from two terminals concurrently:

	while true; do
		PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh;
	done




Thanks,
Ming


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-19  9:23                         ` Ming Lei
@ 2021-10-20  6:43                           ` Miroslav Benes
  2021-10-20  7:49                             ` Ming Lei
  0 siblings, 1 reply; 94+ messages in thread
From: Miroslav Benes @ 2021-10-20  6:43 UTC (permalink / raw)
  To: Ming Lei
  Cc: Luis Chamberlain, Benjamin Herrenschmidt, Paul Mackerras, tj,
	gregkh, akpm, minchan, jeyu, shuah, bvanassche, dan.j.williams,
	joe, tglx, keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel, live-patching

On Tue, 19 Oct 2021, Ming Lei wrote:

> On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > forgetting that there *may* already be present drivers which *do* implement
> > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > I was informed livepatching *did* have that issue as well and so very
> > > > likely a generic solution to the deadlock could be beneficial to other
> > > > random drivers.
> > > 
> > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > 
> > I would not call it a fix. It is a kind of ugly workaround because the 
> > generic infrastructure lacked (lacks) the proper support in my opinion. 
> > Luis is trying to fix that.
> 
> What is the proper support of the generic infrastructure? I am not
> familiar with livepatching's model(especially with module unload), you mean
> livepatching have to do the following way from sysfs:
> 
> 1) during module exit:
> 	
> 	mutex_lock(lp_lock);
> 	kobject_put(lp_kobj);
> 	mutex_unlock(lp_lock);
> 	
> 2) show()/store() method of attributes of lp_kobj
> 	
> 	mutex_lock(lp_lock)
> 	...
> 	mutex_unlock(lp_lock)

Yes, this was exactly the case. We then reworked it a lot (see 
958ef1e39d24 ("livepatch: Simplify API by removing registration step")), so 
now the call sequence is different. kobject_put() is basically offloaded 
to a workqueue scheduled right from the store() method. Meaning that 
Luis's work would probably not help us currently, but on the other hand 
the issues with AA deadlock were one of the main drivers of the redesign 
(if I remember correctly). There were other reasons too as the changelog 
of the commit describes.
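
The idea of offloading the final put to a workqueue can be sketched in
plain userspace C. This is only a model under stated assumptions, not the
livepatch code: struct model_work, model_schedule_work(),
model_run_pending() and struct model_patch are hypothetical names, and the
"workqueue" is just a list that the caller drains later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical deferred-work item, loosely modelled on a work_struct. */
struct model_work {
	void (*fn)(struct model_work *);
	struct model_work *next;
};

static struct model_work *pending; /* list of work not yet run */

static void model_schedule_work(struct model_work *w)
{
	w->next = pending;
	pending = w;
}

/* Runs outside of any store() call, like a worker thread would. */
static void model_run_pending(void)
{
	while (pending) {
		struct model_work *w = pending;

		pending = w->next;
		w->fn(w);
	}
}

struct model_patch {
	struct model_work free_work; /* first member, so the cast is valid */
	bool in_store;               /* true while the store() method runs */
	bool freed;
};

static void model_free_fn(struct model_work *w)
{
	struct model_patch *p = (struct model_patch *)w;

	/* Teardown from inside store() would mean waiting for ourselves. */
	assert(!p->in_store);
	p->freed = true;
}

/* Models a store() method that triggers removal of its own object. */
static void model_enabled_store(struct model_patch *p)
{
	p->in_store = true;
	p->free_work.fn = model_free_fn;
	model_schedule_work(&p->free_work); /* defer, do not free here */
	p->in_store = false;
}
```

The store() method only queues the work and returns; the teardown runs
later from a context that does not hold the attribute's active reference.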

So, from my perspective, if there was a way to easily synchronize between 
a data cleanup from module_exit callback and sysfs/kernfs operations, it 
could spare people many headaches.
 
> IMO, the above usage simply caused AA deadlock. Even in Luis's patch
> 'zram: fix crashes with cpu hotplug multistate', new/same AA deadlock
> (hot_remove_store() vs. disksize_store() or reset_store()) is added
> because hot_remove_store() isn't called from module_exit().
> 
> Luis tries to delay unloading module until all show()/store() are done. But
> that can be obtained by the following way simply during module_exit():
> 
> 	kobject_del(lp_kobj); //all pending store()/show() from lp_kobj are done,
> 						  //no new store()/show() can come after
> 						  //kobject_del() returns	
> 	mutex_lock(lp_lock);
> 	kobject_put(lp_kobj);
> 	mutex_unlock(lp_lock);

kobject_del() already calls kobject_put(). Did you mean __kobject_del()? 
That one is internal, though.
 
> Or can you explain your requirement on kobject/module unload in a bit
> details?

Does the above make sense?

Thanks

Miroslav

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-20  6:43                           ` Miroslav Benes
@ 2021-10-20  7:49                             ` Ming Lei
  2021-10-20  8:19                               ` Miroslav Benes
  0 siblings, 1 reply; 94+ messages in thread
From: Ming Lei @ 2021-10-20  7:49 UTC (permalink / raw)
  To: Miroslav Benes
  Cc: Luis Chamberlain, Benjamin Herrenschmidt, Paul Mackerras, tj,
	gregkh, akpm, minchan, jeyu, shuah, bvanassche, dan.j.williams,
	joe, tglx, keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel, live-patching,
	ming.lei

On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> On Tue, 19 Oct 2021, Ming Lei wrote:
> 
> > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > random drivers.
> > > > 
> > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > 
> > > I would not call it a fix. It is a kind of ugly workaround because the 
> > > generic infrastructure lacked (lacks) the proper support in my opinion. 
> > > Luis is trying to fix that.
> > 
> > What is the proper support of the generic infrastructure? I am not
> > familiar with livepatching's model(especially with module unload), you mean
> > livepatching have to do the following way from sysfs:
> > 
> > 1) during module exit:
> > 	
> > 	mutex_lock(lp_lock);
> > 	kobject_put(lp_kobj);
> > 	mutex_unlock(lp_lock);
> > 	
> > 2) show()/store() method of attributes of lp_kobj
> > 	
> > 	mutex_lock(lp_lock)
> > 	...
> > 	mutex_unlock(lp_lock)
> 
> Yes, this was exactly the case. We then reworked it a lot (see 
> 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so 
> now the call sequence is different. kobject_put() is basically offloaded 
> to a workqueue scheduled right from the store() method. Meaning that 
> Luis's work would probably not help us currently, but on the other hand 
> the issues with AA deadlock were one of the main drivers of the redesign 
> (if I remember correctly). There were other reasons too as the changelog 
> of the commit describes.
> 
> So, from my perspective, if there was a way to easily synchronize between 
> a data cleanup from module_exit callback and sysfs/kernfs operations, it 
> could spare people many headaches.

kobject_del() is supposed to do that, but while calling it you can't hold
a shared lock that is also taken in the show()/store() methods. Once
kobject_del() returns, there are no pending show()/store() calls any more.

The question is why a shared lock is required for livepatching to
delete the kobject. What are you protecting when you delete a kobject?

>  
> > IMO, the above usage simply caused AA deadlock. Even in Luis's patch
> > 'zram: fix crashes with cpu hotplug multistate', new/same AA deadlock
> > (hot_remove_store() vs. disksize_store() or reset_store()) is added
> > because hot_remove_store() isn't called from module_exit().
> > 
> > Luis tries to delay unloading module until all show()/store() are done. But
> > that can be obtained by the following way simply during module_exit():
> > 
> > 	kobject_del(lp_kobj); //all pending store()/show() from lp_kobj are done,
> > 						  //no new store()/show() can come after
> > 						  //kobject_del() returns	
> > 	mutex_lock(lp_lock);
> > 	kobject_put(lp_kobj);
> > 	mutex_unlock(lp_lock);
> 
> kobject_del() already calls kobject_put(). Did you mean __kobject_del(). 
> That one is internal though.

kobject_del() is the counterpart of kobject_add(), and kobject_put() will
call kobject_del() automatically if the kobject hasn't been deleted yet, but
usually kobject_put() is for releasing the object only. It is more common to
release a kobject by calling kobject_del() and then kobject_put().

>  
> > Or can you explain your requirement on kobject/module unload in a bit
> > details?
> 
> Does the above makes sense?

I think the focus now is the shared lock between kobject_del() and
show()/store() of the kobject's attributes.
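
The ordering described above (delete first without the shared lock, then
take the lock for teardown) can be modelled in userspace. This is only a
sketch with pthreads: struct model_kobj, model_store(), model_del() and
model_exit() are hypothetical names, and the active counter is a
simplified stand-in for what kernfs does when it drains an attribute.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

struct model_kobj {
	pthread_mutex_t active_mtx; /* protects 'active' and 'deleted' */
	pthread_cond_t drained;     /* signalled when 'active' drops to 0 */
	int active;                 /* store() calls currently running */
	bool deleted;               /* no new store() may start */
};

static pthread_mutex_t lp_lock = PTHREAD_MUTEX_INITIALIZER; /* shared lock */
static int lp_value;                                        /* shared data */

static void model_kobj_init(struct model_kobj *k)
{
	int err = pthread_mutex_init(&k->active_mtx, NULL);

	assert(err == 0);
	err = pthread_cond_init(&k->drained, NULL);
	assert(err == 0);
	(void)err;
	k->active = 0;
	k->deleted = false;
}

/* Models a store() method that takes the shared lock. */
static int model_store(struct model_kobj *k, int v)
{
	pthread_mutex_lock(&k->active_mtx);
	if (k->deleted) {
		pthread_mutex_unlock(&k->active_mtx);
		return -ENODEV;
	}
	k->active++;
	pthread_mutex_unlock(&k->active_mtx);

	pthread_mutex_lock(&lp_lock);
	lp_value = v;
	pthread_mutex_unlock(&lp_lock);

	pthread_mutex_lock(&k->active_mtx);
	if (--k->active == 0)
		pthread_cond_broadcast(&k->drained);
	pthread_mutex_unlock(&k->active_mtx);
	return 0;
}

/* Models kobject_del(): block new callers, wait for running ones. */
static void model_del(struct model_kobj *k)
{
	pthread_mutex_lock(&k->active_mtx);
	k->deleted = true;
	while (k->active > 0)
		pthread_cond_wait(&k->drained, &k->active_mtx);
	pthread_mutex_unlock(&k->active_mtx);
}

/* Models module_exit(): drain first, lock afterwards. */
static void model_exit(struct model_kobj *k)
{
	model_del(k); /* lp_lock must NOT be held here */

	pthread_mutex_lock(&lp_lock);
	lp_value = 0; /* teardown of the shared data */
	pthread_mutex_unlock(&lp_lock);
}
```

If model_exit() took lp_lock before model_del(), a store() already
counted in 'active' would block on lp_lock while model_del() waits for
'active' to reach zero, which is the deadlock under discussion.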


Thanks,
Ming


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-20  7:49                             ` Ming Lei
@ 2021-10-20  8:19                               ` Miroslav Benes
  2021-10-20  8:28                                 ` Greg KH
  2021-10-20 10:09                                 ` Ming Lei
  0 siblings, 2 replies; 94+ messages in thread
From: Miroslav Benes @ 2021-10-20  8:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Luis Chamberlain, Benjamin Herrenschmidt, Paul Mackerras, tj,
	gregkh, akpm, minchan, jeyu, shuah, bvanassche, dan.j.williams,
	joe, tglx, keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel, live-patching

On Wed, 20 Oct 2021, Ming Lei wrote:

> On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> > On Tue, 19 Oct 2021, Ming Lei wrote:
> > 
> > > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > > random drivers.
> > > > > 
> > > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > > 
> > > > I would not call it a fix. It is a kind of ugly workaround because the 
> > > > generic infrastructure lacked (lacks) the proper support in my opinion. 
> > > > Luis is trying to fix that.
> > > 
> > > What is the proper support of the generic infrastructure? I am not
> > > familiar with livepatching's model(especially with module unload), you mean
> > > livepatching have to do the following way from sysfs:
> > > 
> > > 1) during module exit:
> > > 	
> > > 	mutex_lock(lp_lock);
> > > 	kobject_put(lp_kobj);
> > > 	mutex_unlock(lp_lock);
> > > 	
> > > 2) show()/store() method of attributes of lp_kobj
> > > 	
> > > 	mutex_lock(lp_lock)
> > > 	...
> > > 	mutex_unlock(lp_lock)
> > 
> > Yes, this was exactly the case. We then reworked it a lot (see 
> > 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so 
> > now the call sequence is different. kobject_put() is basically offloaded 
> > to a workqueue scheduled right from the store() method. Meaning that 
> > Luis's work would probably not help us currently, but on the other hand 
> > the issues with AA deadlock were one of the main drivers of the redesign 
> > (if I remember correctly). There were other reasons too as the changelog 
> > of the commit describes.
> > 
> > So, from my perspective, if there was a way to easily synchronize between 
> > a data cleanup from module_exit callback and sysfs/kernfs operations, it 
> > could spare people many headaches.
> 
> kobject_del() is supposed to do so, but you can't hold a shared lock
> which is required in show()/store() method. Once kobject_del() returns,
> no pending show()/store() any more.
> 
> The question is that why one shared lock is required for livepatching to
> delete the kobject. What are you protecting when you delete one kobject?

I think it boils down to the fact that we embed the kobject statically in 
the structures which livepatch uses to maintain data. That is discouraged 
generally, but all the attempts to implement it correctly were utter 
failures.

Miroslav

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-20  8:19                               ` Miroslav Benes
@ 2021-10-20  8:28                                 ` Greg KH
  2021-10-25  9:58                                   ` Miroslav Benes
  2021-10-20 10:09                                 ` Ming Lei
  1 sibling, 1 reply; 94+ messages in thread
From: Greg KH @ 2021-10-20  8:28 UTC (permalink / raw)
  To: Miroslav Benes
  Cc: Ming Lei, Luis Chamberlain, Benjamin Herrenschmidt,
	Paul Mackerras, tj, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel, live-patching

On Wed, Oct 20, 2021 at 10:19:27AM +0200, Miroslav Benes wrote:
> On Wed, 20 Oct 2021, Ming Lei wrote:
> 
> > On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> > > On Tue, 19 Oct 2021, Ming Lei wrote:
> > > 
> > > > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > > > random drivers.
> > > > > > 
> > > > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > > > 
> > > > > I would not call it a fix. It is a kind of ugly workaround because the 
> > > > > generic infrastructure lacked (lacks) the proper support in my opinion. 
> > > > > Luis is trying to fix that.
> > > > 
> > > > What is the proper support of the generic infrastructure? I am not
> > > > familiar with livepatching's model(especially with module unload), you mean
> > > > livepatching have to do the following way from sysfs:
> > > > 
> > > > 1) during module exit:
> > > > 	
> > > > 	mutex_lock(lp_lock);
> > > > 	kobject_put(lp_kobj);
> > > > 	mutex_unlock(lp_lock);
> > > > 	
> > > > 2) show()/store() method of attributes of lp_kobj
> > > > 	
> > > > 	mutex_lock(lp_lock)
> > > > 	...
> > > > 	mutex_unlock(lp_lock)
> > > 
> > > Yes, this was exactly the case. We then reworked it a lot (see 
> > > 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so 
> > > now the call sequence is different. kobject_put() is basically offloaded 
> > > to a workqueue scheduled right from the store() method. Meaning that 
> > > Luis's work would probably not help us currently, but on the other hand 
> > > the issues with AA deadlock were one of the main drivers of the redesign 
> > > (if I remember correctly). There were other reasons too as the changelog 
> > > of the commit describes.
> > > 
> > > So, from my perspective, if there was a way to easily synchronize between 
> > > a data cleanup from module_exit callback and sysfs/kernfs operations, it 
> > > could spare people many headaches.
> > 
> > kobject_del() is supposed to do so, but you can't hold a shared lock
> > which is required in show()/store() method. Once kobject_del() returns,
> > no pending show()/store() any more.
> > 
> > The question is that why one shared lock is required for livepatching to
> > delete the kobject. What are you protecting when you delete one kobject?
> 
> I think it boils down to the fact that we embed kobject statically to 
> structures which livepatch uses to maintain data. That is discouraged 
> generally, but all the attempts to implement it correctly were utter 
> failures.

Sounds like this is the real problem that needs to be fixed.  kobjects
should always control the lifespan of the structure they are embedded
in.  If not, then that is a design flaw of the user of the kobject :(

Where in the kernel is this happening?  And where have been the attempts
to fix this up?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-20  8:19                               ` Miroslav Benes
  2021-10-20  8:28                                 ` Greg KH
@ 2021-10-20 10:09                                 ` Ming Lei
  2021-10-26  8:48                                   ` Petr Mladek
  1 sibling, 1 reply; 94+ messages in thread
From: Ming Lei @ 2021-10-20 10:09 UTC (permalink / raw)
  To: Miroslav Benes
  Cc: Luis Chamberlain, Benjamin Herrenschmidt, Paul Mackerras, tj,
	gregkh, akpm, minchan, jeyu, shuah, bvanassche, dan.j.williams,
	joe, tglx, keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel, live-patching

On Wed, Oct 20, 2021 at 10:19:27AM +0200, Miroslav Benes wrote:
> On Wed, 20 Oct 2021, Ming Lei wrote:
> 
> > On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> > > On Tue, 19 Oct 2021, Ming Lei wrote:
> > > 
> > > > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > > > random drivers.
> > > > > > 
> > > > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > > > 
> > > > > I would not call it a fix. It is a kind of ugly workaround because the 
> > > > > generic infrastructure lacked (lacks) the proper support in my opinion. 
> > > > > Luis is trying to fix that.
> > > > 
> > > > What is the proper support of the generic infrastructure? I am not
> > > > familiar with livepatching's model(especially with module unload), you mean
> > > > livepatching have to do the following way from sysfs:
> > > > 
> > > > 1) during module exit:
> > > > 	
> > > > 	mutex_lock(lp_lock);
> > > > 	kobject_put(lp_kobj);
> > > > 	mutex_unlock(lp_lock);
> > > > 	
> > > > 2) show()/store() method of attributes of lp_kobj
> > > > 	
> > > > 	mutex_lock(lp_lock)
> > > > 	...
> > > > 	mutex_unlock(lp_lock)
> > > 
> > > Yes, this was exactly the case. We then reworked it a lot (see 
> > > 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so 
> > > now the call sequence is different. kobject_put() is basically offloaded 
> > > to a workqueue scheduled right from the store() method. Meaning that 
> > > Luis's work would probably not help us currently, but on the other hand 
> > > the issues with AA deadlock were one of the main drivers of the redesign 
> > > (if I remember correctly). There were other reasons too as the changelog 
> > > of the commit describes.
> > > 
> > > So, from my perspective, if there was a way to easily synchronize between 
> > > a data cleanup from module_exit callback and sysfs/kernfs operations, it 
> > > could spare people many headaches.
> > 
> > kobject_del() is supposed to do so, but you can't hold a shared lock
> > which is required in show()/store() method. Once kobject_del() returns,
> > no pending show()/store() any more.
> > 
> > The question is that why one shared lock is required for livepatching to
> > delete the kobject. What are you protecting when you delete one kobject?
> 
> I think it boils down to the fact that we embed kobject statically to 
> structures which livepatch uses to maintain data. That is discouraged 
> generally, but all the attempts to implement it correctly were utter 
> failures.

OK, then it isn't the common usage, in which the kobject covers the release
of the external object. What is the exact kobject in livepatching?

But kobject_del() won't release the kobject, so you shouldn't need the lock
to delete the kobject first. After the kobject is deleted, there won't be any
show() or store() calls any more. Isn't that the sync[1] you expected?


Thanks,
Ming


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-20  1:15                             ` Ming Lei
@ 2021-10-20 15:48                               ` Luis Chamberlain
  2021-10-21  0:39                                 ` Ming Lei
  0 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-20 15:48 UTC (permalink / raw)
  To: Ming Lei
  Cc: Benjamin Herrenschmidt, Paul Mackerras, tj, gregkh, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel

On Wed, Oct 20, 2021 at 09:15:20AM +0800, Ming Lei wrote:
> On Tue, Oct 19, 2021 at 12:36:42PM -0700, Luis Chamberlain wrote:
> > On Wed, Oct 20, 2021 at 12:29:53AM +0800, Ming Lei wrote:
> > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > index d0cae7a42f4d..a14ba3d350ea 100644
> > > --- a/drivers/block/zram/zram_drv.c
> > > +++ b/drivers/block/zram/zram_drv.c
> > > @@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
> > >  	set_capacity_and_notify(zram->disk, 0);
> > >  	part_stat_set_all(zram->disk->part0, 0);
> > >  
> > > -	up_write(&zram->init_lock);
> > >  	/* I/O operation under all of CPU are done so let's free */
> > >  	zram_meta_free(zram, disksize);
> > >  	memset(&zram->stats, 0, sizeof(zram->stats));
> > >  	zcomp_destroy(comp);
> > >  	reset_bdev(zram);
> > > +	up_write(&zram->init_lock);
> > >  }
> > >  
> > >  static ssize_t disksize_store(struct device *dev,
> > 
> > With this, it still ends up in a state where we loop and can't get out of:
> > 
> > zram: Can't change algorithm for initialized device
> 
> Again, you are running two zram02.sh[1] on /dev/zram0, that isn't unexpected

You mean that it is not expected? If so then yes, of course.

> behavior. Here the difference is just timing.

Right, but that is what helped reproduce a difficult-to-reproduce customer
bug. Once you find an easy way to reproduce a reported issue you stick
with it and try to make the situation worse to ensure no more bugs are
present.

> Also you did not answer my question about your test expected result when
> running the following script from two terminal concurrently:
> 
> 	while true; do
> 		PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh;
> 	done

If you run this, you should see no failures.

Once you start a second script that one should cause odd issues on both
sides but never crash or stall the module.

A second series of tests is hitting CTRL-C on either one randomly and
restarting testing once again randomly.

Again, neither should crash the kernel or stall the module.

At the end of these tests you should be able to run the script alone
just once and not see issues.
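
As a rough sketch of that procedure, with bounded loops standing in for
manual CTRL-C: stress_then_verify is a made-up name, not part of LTP, and
the command passed to it stands in for ./zram02.sh. It can only report the
result of the final solo run; a kernel crash or a stalled module still has
to be spotted by other means.

```bash
#!/bin/bash
# stress_then_verify ROUNDS CMD [ARGS...]
# Hypothetical harness: run CMD ROUNDS times in each of two concurrent
# loops (standing in for two terminals), then run CMD once on its own.
# Returns the exit status of that final solo run.
stress_then_verify() {
	local rounds=$1 runner
	shift
	for runner in 1 2; do
		# Failures in the concurrent phase are tolerated: the two
		# runners are expected to trip over each other.
		( i=0
		  while [ "$i" -lt "$rounds" ]; do
			"$@" || true
			i=$((i + 1))
		  done ) &
	done
	# Wait for both background runners to finish.
	wait
	# The solo run is the actual check.
	"$@"
}

# Example (assumption: run from the LTP zram test directory):
#	stress_then_verify 100 ./zram02.sh
```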

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-20 15:48                               ` Luis Chamberlain
@ 2021-10-21  0:39                                 ` Ming Lei
  2021-10-21 17:18                                   ` Luis Chamberlain
  0 siblings, 1 reply; 94+ messages in thread
From: Ming Lei @ 2021-10-21  0:39 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Benjamin Herrenschmidt, Paul Mackerras, tj, gregkh, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel, ming.lei

On Wed, Oct 20, 2021 at 08:48:04AM -0700, Luis Chamberlain wrote:
> On Wed, Oct 20, 2021 at 09:15:20AM +0800, Ming Lei wrote:
> > On Tue, Oct 19, 2021 at 12:36:42PM -0700, Luis Chamberlain wrote:
> > > On Wed, Oct 20, 2021 at 12:29:53AM +0800, Ming Lei wrote:
> > > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > > index d0cae7a42f4d..a14ba3d350ea 100644
> > > > --- a/drivers/block/zram/zram_drv.c
> > > > +++ b/drivers/block/zram/zram_drv.c
> > > > @@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
> > > >  	set_capacity_and_notify(zram->disk, 0);
> > > >  	part_stat_set_all(zram->disk->part0, 0);
> > > >  
> > > > -	up_write(&zram->init_lock);
> > > >  	/* I/O operation under all of CPU are done so let's free */
> > > >  	zram_meta_free(zram, disksize);
> > > >  	memset(&zram->stats, 0, sizeof(zram->stats));
> > > >  	zcomp_destroy(comp);
> > > >  	reset_bdev(zram);
> > > > +	up_write(&zram->init_lock);
> > > >  }
> > > >  
> > > >  static ssize_t disksize_store(struct device *dev,
> > > 
> > > With this, it still ends up in a state where we loop and can't get out of:
> > > 
> > > zram: Can't change algorithm for initialized device
> > 
> > Again, you are running two zram02.sh[1] on /dev/zram0, that isn't unexpected
> 
> You mean that it is not expected? If so then yes, of course.

My meaning is clear: it is not unexpected, so it is expected.

> 
> > behavior. Here the difference is just timing.
> 
> Right, but that is what helped reproduce a difficutl to re-produce customer
> bug. Once you find an easy way to reproduce a reported issue you stick
> with it and try to make the situation worse to ensure no more bugs are
> present.
> 
> > Also you did not answer my question about your test expected result when
> > running the following script from two terminal concurrently:
> > 
> > 	while true; do
> > 		PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh;
> > 	done
> 
> If you run this, you should see no failures.

OK, I don't see any failure when running a single zram02.sh after applying my
patch V2.

> 
> Once you start a second script that one should cause odd issues on both
> sides but never crash or stall the module.

No crash can be observed with my patch V2. What do you mean by 'stall'
the module? Is it that 'zram' can't be unloaded after the test is
terminated via multiple 'ctrl-c'?

> 
> A second series of tests is hitting CTRL-C on either randonly and
> restarting testing once again randomly.

ltp/zram02.sh has a cleanup handler, installed via trap, to clean up everything
(swapoff/umount/reset/rmmod). ctrl-c will terminate the current foreground task
and cause the shell to run the cleanup handler first, but a further 'ctrl-c'
will terminate the cleanup handler, so the cleanup won't be done completely:
for example, the zram disk is left as a swap device and zram can't be unloaded.
The idea can be observed via the following script:

	#!/bin/bash
	trap 'echo "enter trap"; sleep 20; echo "exit trap";' INT
	sleep 30

After the above script is run in the foreground, when the 1st ctrl-c is pressed,
'sleep 30' is terminated, then the trap command is run, so you can see "enter trap"
printed. Then if you press ctrl-c a 2nd time, 'sleep 20' is terminated immediately.
So 'swapoff' from zram02.sh's trap function can be terminated in this way.

The zram disk being left as a swap device can be observed with your patch too
after terminating via multiple ctrl-c, which has to be done this way because
the test is an infinite loop.

So it is hard to clean up everything completely once multiple 'CTRL-C' presses
are involved, and it is probably impossible. It takes multiple forceful ctrl-c
presses to terminate the dead-loop test.

So it isn't reasonable to expect that zram can always be unloaded successfully
after the test script is terminated via multiple ctrl-c.

But zram can be unloaded after running swapoff manually, so from the driver's
viewpoint nothing is wrong.
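
A minimal sketch of that manual recovery, assuming the usual /proc/swaps
layout (one header line, device path in the first column). The helper name
zram_swaps is made up for illustration and is not part of LTP or zram02.sh:

```bash
#!/bin/bash
# zram_swaps [FILE]: print the zram devices still in use as swap.
# Hypothetical helper; FILE defaults to /proc/swaps and must use the same
# layout (one header line, device path in the first column).
zram_swaps() {
	awk 'NR > 1 && $1 ~ /^\/dev\/zram[0-9]+$/ { print $1 }' "${1:-/proc/swaps}"
}

# Manual cleanup after an interrupted test run (needs root), left commented
# out so that sourcing this file has no side effects:
#	for dev in $(zram_swaps); do swapoff "$dev"; done
#	rmmod zram
```

If zram_swaps prints nothing, no zram device is left as swap, so a failing
rmmod would have to be blocked by something else (a leftover mount, for
instance).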

> 
> Again, neither should crash the kernel or stall the module.
> 
> In the end of these tests you should be able to run the script alone
> just once and not see issues.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-21  0:39                                 ` Ming Lei
@ 2021-10-21 17:18                                   ` Luis Chamberlain
  2021-10-22  0:05                                     ` Ming Lei
  0 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-21 17:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: Benjamin Herrenschmidt, Paul Mackerras, tj, gregkh, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel

On Thu, Oct 21, 2021 at 08:39:05AM +0800, Ming Lei wrote:
> On Wed, Oct 20, 2021 at 08:48:04AM -0700, Luis Chamberlain wrote:
> > A second series of tests is hitting CTRL-C on either randonly and
> > restarting testing once again randomly.
> 
> ltp/zram02.sh has cleanup handler via trap to clean everything(swapoff/umount/reset/
> rmmod), ctrl-c will terminate current forground task and cause shell to run the
> cleanup handler first, but further 'ctrl-c' will terminate the cleanup handler,
> then the cleanup won't be done completely, such as zram disk is left as swap
> device and zram can't be unloaded. The idea can be observed via the following
> script:
> 
> 	#!/bin/bash
> 	trap 'echo "enter trap"; sleep 20; echo "exit trap";' INT
> 	sleep 30
> 
> After the above script is run foreground, when 1st ctrl-c is pressed, 'sleep 30'
> is terminated, then the trap command is run, so you can see "enter trap"
> dumped. Then if you pressed 2nd ctrl-c, 'sleep 20' is terminated immediately.
> So 'swapoff' from zram02.sh's trap function can be terminated in this way.
> 
> zram disk being left as swap disk can be observed with your patch too
> after terminating via multiple ctrl-c which has to be done this way because
> the test is dead loop.
> 
> So it is hard to cleanup everything completely after multiple 'CTRL-C' is
> involved, and it should be impossible. It needs violent multiple ctrl-c to
> terminate the dealoop test.
> 
> So it isn't reasonable to expect that zram can be always unloaded successfully
> after the test script is terminated via multiple ctrl-c.

For the life of me, I do not run into these issues with my patch. But
with yours I did.

To be clear, I run zram02.sh on two terminals. Then to interrupt I just leave
CTRL-C pressed to issue multiple terminations until the script is done,
one terminal at a time, until I see both have completed.

I repeat the same test, always noting that when I start on one terminal
the test succeeds. Also, when I completely cancel one script, the
test continues fine without issue.

> But zram can be unloaded after running swapoff manually, from driver
> viewpoint, nothing is wrong.

I have not run into that issue with my patch FWIW.

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-21 17:18                                   ` Luis Chamberlain
@ 2021-10-22  0:05                                     ` Ming Lei
  0 siblings, 0 replies; 94+ messages in thread
From: Ming Lei @ 2021-10-22  0:05 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Benjamin Herrenschmidt, Paul Mackerras, tj, gregkh, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel

On Thu, Oct 21, 2021 at 10:18:47AM -0700, Luis Chamberlain wrote:
> On Thu, Oct 21, 2021 at 08:39:05AM +0800, Ming Lei wrote:
> > On Wed, Oct 20, 2021 at 08:48:04AM -0700, Luis Chamberlain wrote:
> > > A second series of tests is hitting CTRL-C on either randonly and
> > > restarting testing once again randomly.
> > 
> > ltp/zram02.sh has cleanup handler via trap to clean everything(swapoff/umount/reset/
> > rmmod), ctrl-c will terminate current forground task and cause shell to run the
> > cleanup handler first, but further 'ctrl-c' will terminate the cleanup handler,
> > then the cleanup won't be done completely, such as zram disk is left as swap
> > device and zram can't be unloaded. The idea can be observed via the following
> > script:
> > 
> > 	#!/bin/bash
> > 	trap 'echo "enter trap"; sleep 20; echo "exit trap";' INT
> > 	sleep 30
> > 
> > After the above script is run foreground, when 1st ctrl-c is pressed, 'sleep 30'
> > is terminated, then the trap command is run, so you can see "enter trap"
> > dumped. Then if you pressed 2nd ctrl-c, 'sleep 20' is terminated immediately.
> > So 'swapoff' from zram02.sh's trap function can be terminated in this way.
> > 
> > zram disk being left as swap disk can be observed with your patch too
> > after terminating via multiple ctrl-c which has to be done this way because
> > the test is dead loop.
> > 
> > So it is hard to cleanup everything completely after multiple 'CTRL-C' is
> > involved, and it should be impossible. It needs violent multiple ctrl-c to
> > terminate the dealoop test.
> > 
> > So it isn't reasonable to expect that zram can be always unloaded successfully
> > after the test script is terminated via multiple ctrl-c.
> 
> For the life of me, I do not run into these issue with my patch. But
> with yours I had.
> 
> To be clear, I run zram02.sh on two terminals. Then to interrupt I just leave
> CTRL-C pressed to issue multiple terminations until the script is done
> on each terminal at a time, until I see both have completed.
> 
> I repeat the same test, noting always that when I start one one terminal
> the test is succeeding. And also when I cancel completely one script the
> test continue fine without issue.

As I explained wrt. the shell's trap, this issue can't be avoided from
userspace because the trap function can be terminated by ctrl-c too;
otherwise a shell script might not be terminable at all.

The unclean shutdown can be observed in a single 'while true; do zram02.sh; done'
too, with both your patches and mine.

Also it is insane to write a test as an infinite loop, and people seldom
do that; I don't see that approach in either blktests or xfstests.

If you limit the completion time of this test to a long enough period (one or
several hours) or a big enough number of loops, I believe it can be done
cleanly, such as:

cnt=0
MAX=10000
while [ $cnt -lt $MAX ]; do
	PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh;
	cnt=$((cnt + 1))
done
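
The time-limited variant mentioned above could look roughly like the sketch
below. run_bounded and its arguments are hypothetical names chosen for
illustration; it stops at whichever limit is reached first and counts
failures instead of aborting on them.

```bash
#!/bin/bash
# run_bounded MAX_SECONDS MAX_LOOPS CMD [ARGS...]
# Hypothetical helper: rerun CMD until either limit is reached, then print
# "<iterations> <failures>". Failures of CMD are counted, not fatal.
run_bounded() {
	local max_seconds=$1 max_loops=$2
	local start now cnt=0 failed=0
	shift 2
	start=$(date +%s)
	while [ "$cnt" -lt "$max_loops" ]; do
		# Stop once the time budget is used up.
		now=$(date +%s)
		[ $((now - start)) -lt "$max_seconds" ] || break
		"$@" || failed=$((failed + 1))
		cnt=$((cnt + 1))
	done
	echo "$cnt $failed"
}

# Example: at most one hour or 10000 iterations, whichever comes first:
#	run_bounded 3600 10000 ./zram02.sh
```

Because the loop ends on its own, no ctrl-c is needed and the cleanup handler
of the last zram02.sh run gets a chance to finish.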


Thanks,
Ming


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-20  8:28                                 ` Greg KH
@ 2021-10-25  9:58                                   ` Miroslav Benes
  0 siblings, 0 replies; 94+ messages in thread
From: Miroslav Benes @ 2021-10-25  9:58 UTC (permalink / raw)
  To: Greg KH
  Cc: Ming Lei, Luis Chamberlain, Benjamin Herrenschmidt,
	Paul Mackerras, tj, akpm, minchan, jeyu, shuah, bvanassche,
	dan.j.williams, joe, tglx, keescook, rostedt, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest,
	linux-kernel, live-patching, pmladek

On Wed, 20 Oct 2021, Greg KH wrote:

> On Wed, Oct 20, 2021 at 10:19:27AM +0200, Miroslav Benes wrote:
> > On Wed, 20 Oct 2021, Ming Lei wrote:
> > 
> > > On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> > > > On Tue, 19 Oct 2021, Ming Lei wrote:
> > > > 
> > > > > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > > > > random drivers.
> > > > > > > 
> > > > > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > > > > 
> > > > > > I would not call it a fix. It is a kind of ugly workaround because the 
> > > > > > generic infrastructure lacked (lacks) the proper support in my opinion. 
> > > > > > Luis is trying to fix that.
> > > > > 
> > > > > What is the proper support of the generic infrastructure? I am not
> > > > > familiar with livepatching's model(especially with module unload), you mean
> > > > > livepatching have to do the following way from sysfs:
> > > > > 
> > > > > 1) during module exit:
> > > > > 	
> > > > > 	mutex_lock(lp_lock);
> > > > > 	kobject_put(lp_kobj);
> > > > > 	mutex_unlock(lp_lock);
> > > > > 	
> > > > > 2) show()/store() method of attributes of lp_kobj
> > > > > 	
> > > > > 	mutex_lock(lp_lock)
> > > > > 	...
> > > > > 	mutex_unlock(lp_lock)
> > > > 
> > > > Yes, this was exactly the case. We then reworked it a lot (see 
> > > > 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so 
> > > > now the call sequence is different. kobject_put() is basically offloaded 
> > > > to a workqueue scheduled right from the store() method. Meaning that 
> > > > Luis's work would probably not help us currently, but on the other hand 
> > > > the issues with AA deadlock were one of the main drivers of the redesign 
> > > > (if I remember correctly). There were other reasons too as the changelog 
> > > > of the commit describes.
> > > > 
> > > > So, from my perspective, if there was a way to easily synchronize between 
> > > > a data cleanup from module_exit callback and sysfs/kernfs operations, it 
> > > > could spare people many headaches.
> > > 
> > > kobject_del() is supposed to do so, but you can't hold a shared lock
> > > which is required in show()/store() method. Once kobject_del() returns,
> > > no pending show()/store() any more.
> > > 
> > > The question is that why one shared lock is required for livepatching to
> > > delete the kobject. What are you protecting when you delete one kobject?
> > 
> > I think it boils down to the fact that we embed kobject statically to 
> > structures which livepatch uses to maintain data. That is discouraged 
> > generally, but all the attempts to implement it correctly were utter 
> > failures.
> 
> Sounds like this is the real problem that needs to be fixed.  kobjects
> should always control the lifespan of the structure they are embedded
> in.  If not, then that is a design flaw of the user of the kobject :(

Right, and you've already told us. A couple of times.

For example 
here https://lore.kernel.org/all/20190502074230.GA27847@kroah.com/

:)
 
> Where in the kernel is this happening?  And where have been the attempts
> to fix this up?

include/linux/livepatch.h and kernel/livepatch/core.c. See 
klp_{patch,object,func}.

It took some archeology, but I think 
https://lore.kernel.org/all/1464018848-4303-1-git-send-email-pmladek@suse.com/ 
is it. Petr might correct me.

It was long before we added some important features to the code, so it 
might be even more difficult today.

It resurfaced later when Tobin tried to fix some of the kobject call sites in 
the kernel...

https://lore.kernel.org/all/20190430001534.26246-1-tobin@kernel.org/
https://lore.kernel.org/all/20190430233803.GB10777@eros.localdomain/
https://lore.kernel.org/all/20190502023142.20139-6-tobin@kernel.org/

There are probably more references.

Anyway, the current code works fine (well, one could argue about that). If 
someone wants to take a (another) stab at this, then why not, but it 
seemed like a rabbit hole without a substantial gain in the past. On the 
other hand, we currently misuse the API to some extent.

/me scratches head

Miroslav

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-20 10:09                                 ` Ming Lei
@ 2021-10-26  8:48                                   ` Petr Mladek
  2021-10-26 15:37                                     ` Ming Lei
  0 siblings, 1 reply; 94+ messages in thread
From: Petr Mladek @ 2021-10-26  8:48 UTC (permalink / raw)
  To: Ming Lei
  Cc: Miroslav Benes, Luis Chamberlain, Benjamin Herrenschmidt,
	Paul Mackerras, tj, gregkh, akpm, minchan, jeyu, shuah,
	bvanassche, dan.j.williams, joe, tglx, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel, live-patching

On Wed 2021-10-20 18:09:51, Ming Lei wrote:
> On Wed, Oct 20, 2021 at 10:19:27AM +0200, Miroslav Benes wrote:
> > On Wed, 20 Oct 2021, Ming Lei wrote:
> > 
> > > On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> > > > On Tue, 19 Oct 2021, Ming Lei wrote:
> > > > 
> > > > > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > > > > random drivers.
> > > > > > > 
> > > > > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > > > > 
> > > > > > I would not call it a fix. It is a kind of ugly workaround because the 
> > > > > > generic infrastructure lacked (lacks) the proper support in my opinion. 
> > > > > > Luis is trying to fix that.
> > > > > 
> > > > > What is the proper support of the generic infrastructure? I am not
> > > > > familiar with livepatching's model(especially with module unload), you mean
> > > > > livepatching have to do the following way from sysfs:
> > > > > 
> > > > > 1) during module exit:
> > > > > 	
> > > > > 	mutex_lock(lp_lock);
> > > > > 	kobject_put(lp_kobj);
> > > > > 	mutex_unlock(lp_lock);
> > > > > 	
> > > > > 2) show()/store() method of attributes of lp_kobj
> > > > > 	
> > > > > 	mutex_lock(lp_lock)
> > > > > 	...
> > > > > 	mutex_unlock(lp_lock)
> > > > 
> > > > Yes, this was exactly the case. We then reworked it a lot (see 
> > > > 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so 
> > > > now the call sequence is different. kobject_put() is basically offloaded 
> > > > to a workqueue scheduled right from the store() method. Meaning that 
> > > > Luis's work would probably not help us currently, but on the other hand 
> > > > the issues with AA deadlock were one of the main drivers of the redesign 
> > > > (if I remember correctly). There were other reasons too as the changelog 
> > > > of the commit describes.
> > > > 
> > > > So, from my perspective, if there was a way to easily synchronize between 
> > > > a data cleanup from module_exit callback and sysfs/kernfs operations, it 
> > > > could spare people many headaches.
> > > 
> > > kobject_del() is supposed to do so, but you can't hold a shared lock
> > > which is required in show()/store() method. Once kobject_del() returns,
> > > no pending show()/store() any more.
> > > 
> > > The question is that why one shared lock is required for livepatching to
> > > delete the kobject. What are you protecting when you delete one kobject?
> > 
> > I think it boils down to the fact that we embed kobject statically to 
> > structures which livepatch uses to maintain data. That is discouraged 
> > generally, but all the attempts to implement it correctly were utter 
> > failures.
> 
> OK, then it isn't one common usage, in which kobject covers the release
> of the external object. What is the exact kobject in livepatching?

Below are more details about the livepatch code. I hope that it will
help you to see if zram has similar problems or not.

We have kobject in three structures: klp_func, klp_object, and
klp_patch, see include/linux/livepatch.h.

These structures have to be statically defined in the module sources
because they define what is livepatched, see
samples/livepatch/livepatch-sample.c

The kobject is used there to show information about the patch, patched
objects, and patched functions, in sysfs. And most importantly,
the sysfs interface can be used to disable the livepatch.

The problem with static structures is that the module must stay
in memory as long as the sysfs interface exists. It can be
solved in the module_exit() callback, which could wait until the sysfs
interface is destroyed.

The kobject API does not support this scenario. The release() callbacks
are called asynchronously. The API expects the kobject to be embedded
in a dynamically allocated structure.  As a result, the sysfs
interface can be removed even after the module removal.

Livepatching might create the dynamic structures by duplicating
the structures defined statically in the module. It might save us
some headaches with kobject release. But it would also need extra code
that would have to be maintained. The structures contain strings
that would need to be duplicated and later freed...


> But kobject_del() won't release the kobject, you shouldn't need the lock
> to delete kobject first. After the kobject is deleted, no any show() and
> store() any more, isn't such sync[1] you expected?

Livepatch code never called kobject_del() under a lock. It would cause
the obvious deadlock. The historic code only waited in the
module_exit() callback until the sysfs interface was removed.

It has changed in commit 958ef1e39d24d6cb8bf2a740 ("livepatch:
Simplify API by removing registration step"). Now a livepatch can
never be enabled again after it has been disabled. The sysfs interface
is removed when the livepatch gets disabled. The module can
be removed only after the sysfs interface is destroyed, see
the module_put() in klp_free_patch_finish().

The livepatch code uses workqueue because the livepatch can be
disabled via sysfs interface. It obviously could not wait until
the sysfs interface is removed in the sysfs write() callback
that triggered the removal.

HTH,
Petr

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-26  8:48                                   ` Petr Mladek
@ 2021-10-26 15:37                                     ` Ming Lei
  2021-10-26 17:01                                       ` Luis Chamberlain
                                                         ` (2 more replies)
  0 siblings, 3 replies; 94+ messages in thread
From: Ming Lei @ 2021-10-26 15:37 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Miroslav Benes, Luis Chamberlain, Benjamin Herrenschmidt,
	Paul Mackerras, tj, gregkh, akpm, minchan, jeyu, shuah,
	bvanassche, dan.j.williams, joe, tglx, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel, live-patching, ming.lei

On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> On Wed 2021-10-20 18:09:51, Ming Lei wrote:
> > On Wed, Oct 20, 2021 at 10:19:27AM +0200, Miroslav Benes wrote:
> > > On Wed, 20 Oct 2021, Ming Lei wrote:
> > > 
> > > > On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> > > > > On Tue, 19 Oct 2021, Ming Lei wrote:
> > > > > 
> > > > > > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > > > > > random drivers.
> > > > > > > > 
> > > > > > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > > > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > > > > > 
> > > > > > > I would not call it a fix. It is a kind of ugly workaround because the 
> > > > > > > generic infrastructure lacked (lacks) the proper support in my opinion. 
> > > > > > > Luis is trying to fix that.
> > > > > > 
> > > > > > What is the proper support of the generic infrastructure? I am not
> > > > > > familiar with livepatching's model(especially with module unload), you mean
> > > > > > livepatching have to do the following way from sysfs:
> > > > > > 
> > > > > > 1) during module exit:
> > > > > > 	
> > > > > > 	mutex_lock(lp_lock);
> > > > > > 	kobject_put(lp_kobj);
> > > > > > 	mutex_unlock(lp_lock);
> > > > > > 	
> > > > > > 2) show()/store() method of attributes of lp_kobj
> > > > > > 	
> > > > > > 	mutex_lock(lp_lock)
> > > > > > 	...
> > > > > > 	mutex_unlock(lp_lock)
> > > > > 
> > > > > Yes, this was exactly the case. We then reworked it a lot (see 
> > > > > 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so 
> > > > > now the call sequence is different. kobject_put() is basically offloaded 
> > > > > to a workqueue scheduled right from the store() method. Meaning that 
> > > > > Luis's work would probably not help us currently, but on the other hand 
> > > > > the issues with AA deadlock were one of the main drivers of the redesign 
> > > > > (if I remember correctly). There were other reasons too as the changelog 
> > > > > of the commit describes.
> > > > > 
> > > > > So, from my perspective, if there was a way to easily synchronize between 
> > > > > a data cleanup from module_exit callback and sysfs/kernfs operations, it 
> > > > > could spare people many headaches.
> > > > 
> > > > kobject_del() is supposed to do so, but you can't hold a shared lock
> > > > which is required in show()/store() method. Once kobject_del() returns,
> > > > no pending show()/store() any more.
> > > > 
> > > > The question is that why one shared lock is required for livepatching to
> > > > delete the kobject. What are you protecting when you delete one kobject?
> > > 
> > > I think it boils down to the fact that we embed kobject statically to 
> > > structures which livepatch uses to maintain data. That is discouraged 
> > > generally, but all the attempts to implement it correctly were utter 
> > > failures.
> > 
> > OK, then it isn't one common usage, in which kobject covers the release
> > of the external object. What is the exact kobject in livepatching?
> 
> Below are more details about the livepatch code. I hope that it will
> help you to see if zram has similar problems or not.
> 
> We have kobject in three structures: klp_func, klp_object, and
> klp_patch, see include/linux/livepatch.h.
> 
> These structures have to be statically defined in the module sources
> because they define what is livepatched, see
> samples/livepatch/livepatch-sample.c
> 
> The kobject is used there to show information about the patch, patched
> objects, and patched functions, in sysfs. And most importantly,
> the sysfs interface can be used to disable the livepatch.
> 
> The problem with static structures is that the module must stay
> in the memory as long as the sysfs interface exists. It can be
> solved in module_exit() callback. It could wait until the sysfs
> interface is destroyed.
> 
> kobject API does not support this scenario. The release() callbacks

kobject_delete() is for supporting this scenario, that is why we don't
need to grab module refcnt before calling show()/store() of the
kobject's attributes.

kobject_delete() can be called in module_exit(), then any show()/store()
will be done after kobject_delete() returns.

> are called asynchronously. It expects that the structure is bundled
> in a dynamically allocated structure.  As a result, the sysfs
> interface can be removed even after the module removal.

That would be a bug; otherwise the store()/show() methods could be called
after the module is unloaded.

> 
> The livepatching might create the dynamic structures by duplicating
> the structures defined in the module statically. It might save us
> some headaches with kobject release. But it would also need extra code
> that would need to be maintained. The structure contains strings
> that need to be duplicated and later freed...
> 
> 
> > But kobject_del() won't release the kobject, you shouldn't need the lock
> > to delete kobject first. After the kobject is deleted, no any show() and
> > store() any more, isn't such sync[1] you expected?
> 
> Livepatch code never called kobject_del() under a lock. It would cause
> the obvious deadlock. The historic code only waited in the
> module_exit() callback until the sysfs interface was removed.

OK, then Luis shouldn't consider livepatching as one such issue to solve
with one generic solution.

> 
> It has changed in the commit 958ef1e39d24d6cb8bf2a740 ("livepatch:
> Simplify API by removing registration step"). The livepatch could
> never get enabled again after it was disabled now. The sysfs interface
> is removed when the livepatch gets disabled. The module could
> be removed only after the sysfs interface is destroyed, see
> the module_put() in klp_free_patch_finish().

OK, that is livepatching's implementation: all the kobjects are deleted and
freed after the livepatch module is disabled. That looks like a kill-me
operation rather than a disable, so it isn't normal usage; SCSI has a
similar sysfs delete interface. Also, the kobjects can't be removed directly
in enable's store(), since that could deadlock, so it looks like a
workqueue has to be used here to avoid the deadlock.

BTW, what is the livepatching module usage model? try_module_get() is
called in klp_init_patch_early()<-klp_enable_patch()<-module_init(), and
module_put() is called in klp_free_patch_finish(), which seems to be called
only after 'echo 0 > /sys/kernel/livepatch/$lp_mod/enabled'.

Usually, when the module isn't in use, module_exit() gets a chance to be
called by userspace rmmod, and then all kobjects created in this module can
be deleted in module_exit().

> 
> The livepatch code uses workqueue because the livepatch can be
> disabled via sysfs interface. It obviously could not wait until
> the sysfs interface is removed in the sysfs write() callback
> that triggered the removal.

If klp_free_patch_* is moved into module_exit() and not let enable
store() to kill kobjects, all kobjects can be deleted in module_exit(),
then wait_for_completion(patch->finish) may be removed, also wq isn't
required for the async cleanup.



Thanks, 
Ming


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-26 15:37                                     ` Ming Lei
@ 2021-10-26 17:01                                       ` Luis Chamberlain
  2021-10-27 11:57                                         ` Miroslav Benes
  2021-10-27 11:42                                       ` Miroslav Benes
  2021-11-02 14:15                                       ` Petr Mladek
  2 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-26 17:01 UTC (permalink / raw)
  To: Ming Lei, Julia Lawall
  Cc: Petr Mladek, Miroslav Benes, Benjamin Herrenschmidt,
	Paul Mackerras, tj, gregkh, akpm, minchan, jeyu, shuah,
	bvanassche, dan.j.williams, joe, tglx, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel, live-patching

On Tue, Oct 26, 2021 at 11:37:30PM +0800, Ming Lei wrote:
> On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > Livepatch code never called kobject_del() under a lock. It would cause
> > the obvious deadlock.

Never?

> > The historic code only waited in the
> > module_exit() callback until the sysfs interface was removed.
> 
> OK, then Luis shouldn't consider livepatching as one such issue to solve
> with one generic solution.

It's not what I was told when the deadlock was found with zram, so I was
informed quite the contrary.

I'm working on a generic coccinelle patch which hunts for actual cases
using iteration (a feature of coccinelle for complex searches). The
search is pretty involved, so I don't think I'll have an answer to this
soon.

Since the question of how generic this deadlock is remains questionable,
I think it makes sense to put the generic deadlock fix off the table for
now, and we address this once we have a more concrete search with
coccinelle.

But to say we *don't* have drivers which can cause this is obviously
wrong as well, from a cursory search so far. But let's wait and see how
big this list actually is.

I'll drop the deadlock generic fixes and move on with at least a starter
kernfs / sysfs tests.

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-26 15:37                                     ` Ming Lei
  2021-10-26 17:01                                       ` Luis Chamberlain
@ 2021-10-27 11:42                                       ` Miroslav Benes
  2021-11-02 14:15                                       ` Petr Mladek
  2 siblings, 0 replies; 94+ messages in thread
From: Miroslav Benes @ 2021-10-27 11:42 UTC (permalink / raw)
  To: Ming Lei
  Cc: Petr Mladek, Luis Chamberlain, Benjamin Herrenschmidt,
	Paul Mackerras, tj, gregkh, akpm, minchan, jeyu, shuah,
	bvanassche, dan.j.williams, joe, tglx, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel, live-patching

> > 
> > The livepatch code uses workqueue because the livepatch can be
> > disabled via sysfs interface. It obviously could not wait until
> > the sysfs interface is removed in the sysfs write() callback
> > that triggered the removal.
> 
> If klp_free_patch_* is moved into module_exit() and not let enable
> store() to kill kobjects, all kobjects can be deleted in module_exit(),
> then wait_for_completion(patch->finish) may be removed, also wq isn't
> required for the async cleanup.

It sounds like a nice cleanup. If we combine kobject_del() to prevent any 
show()/store() accesses and free everything later in module_exit(), it 
could work. If I am not missing something around how we maintain internal 
lists of live patches and their modules.

Thanks

Miroslav

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-26 17:01                                       ` Luis Chamberlain
@ 2021-10-27 11:57                                         ` Miroslav Benes
  2021-10-27 14:27                                           ` Luis Chamberlain
  2021-11-02 15:24                                           ` Petr Mladek
  0 siblings, 2 replies; 94+ messages in thread
From: Miroslav Benes @ 2021-10-27 11:57 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Ming Lei, Julia Lawall, Petr Mladek, Benjamin Herrenschmidt,
	Paul Mackerras, tj, gregkh, akpm, minchan, jeyu, shuah,
	bvanassche, dan.j.williams, joe, tglx, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel, live-patching

On Tue, 26 Oct 2021, Luis Chamberlain wrote:

> On Tue, Oct 26, 2021 at 11:37:30PM +0800, Ming Lei wrote:
> > On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > > Livepatch code never called kobject_del() under a lock. It would cause
> > > the obvious deadlock.
> 
> Never?

kobject_put() to be precise.

When I started working on the support for module/live patches removal, 
calling kobject_put() under our klp_mutex lock was the obvious first 
choice given how the code was structured, but I ran into problems with 
deadlocks immediately. So it was changed to async approach with the 
workqueue. Thus the mainline code has never suffered from this, but we 
knew about the issues.
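
The deadlock described above can be sketched with a small userspace model.
This is only an illustration under stated assumptions: struct model,
model_deadlocks() and the helper names are hypothetical, not kernel APIs,
and the model only captures "removal drains sysfs users while holding a
lock that store() also needs".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model {
	bool lock_held;   /* models a driver mutex held by the removal path */
	int  active_ops;  /* models show()/store() calls already in flight */
};

/* A store() that is already inside sysfs now needs the driver lock. */
static bool store_can_progress(const struct model *m)
{
	return !m->lock_held;
}

/* The removal path waits until no show()/store() call is in flight. */
static bool removal_can_progress(const struct model *m)
{
	return m->active_ops == 0;
}

/* Returns true when neither the store() nor the removal can progress. */
bool model_deadlocks(bool removal_holds_lock, int stores_in_flight)
{
	struct model m = {
		.lock_held = removal_holds_lock,
		.active_ops = stores_in_flight,
	};

	assert(stores_in_flight >= 0);
	return m.active_ops > 0 &&
	       !store_can_progress(&m) &&
	       !removal_can_progress(&m);
}
```

In this model, model_deadlocks(true, 1) reports a deadlock, while dropping
the lock before the removal (model_deadlocks(false, 1)) does not.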
 
> > > The historic code only waited in the
> > > module_exit() callback until the sysfs interface was removed.
> > 
> > OK, then Luis shouldn't consider livepatching as one such issue to solve
> > with one generic solution.
> 
> It's not what I was told when the deadlock was found with zram, so I was
> informed quite the contrary.

From my perspective, it is quite easy to get it wrong due to either a lack 
of generic support, or missing rules/documentation. So if this thread 
leads to "do not share locks between a module removal and a sysfs 
operation" strict rule, it would be at least something. In the same 
manner as Luis proposed to document try_module_get() expectations.
 
> I'm working on a generic coccinelle patch which hunts for actual cases
> using iteration (a feature of coccinelle for complex searches). The
> search is pretty involved, so I don't think I'll have an answer to this
> soon.
> 
> Since the question of how generic this deadlock is remains questionable,
> I think it makes sense to put the generic deadlock fix off the table for
> now, and we address this once we have a more concrete search with
> coccinelle.
> 
> But to say we *don't* have drivers which can cause this is obviously
> wrong as well, from a cursory search so far. But let's wait and see how
> big this list actually is.
> 
> I'll drop the deadlock generic fixes and move on with at least a starter
> kernfs / sysfs tests.

It makes sense to me.

Thanks, Luis, for pursuing it.

Miroslav

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-27 11:57                                         ` Miroslav Benes
@ 2021-10-27 14:27                                           ` Luis Chamberlain
  2021-11-02 15:24                                           ` Petr Mladek
  1 sibling, 0 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-10-27 14:27 UTC (permalink / raw)
  To: Miroslav Benes
  Cc: Ming Lei, Julia Lawall, Petr Mladek, Benjamin Herrenschmidt,
	Paul Mackerras, tj, gregkh, akpm, minchan, jeyu, shuah,
	bvanassche, dan.j.williams, joe, tglx, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel, live-patching

On Wed, Oct 27, 2021 at 01:57:40PM +0200, Miroslav Benes wrote:
> On Tue, 26 Oct 2021, Luis Chamberlain wrote:
> 
> > On Tue, Oct 26, 2021 at 11:37:30PM +0800, Ming Lei wrote:
> > > OK, then Luis shouldn't consider livepatching as one such issue to solve
> > > with one generic solution.
> > 
> > It's not what I was told when the deadlock was found with zram, so I was
> > informed quite the contrary.
> 
> From my perspective, it is quite easy to get it wrong due to either a lack 
> of generic support, or missing rules/documentation.

Indeed. I agree some level of guidance is needed, even if subtle, rather
than tribal knowledge. I'll start off with the test_sysfs demo'ing what
not to do and documenting this there. I don't think it makes sense to
formalize documentation for "thou shalt not do this" generically yet,
until a full depth search is done with Coccinelle.

> So if this thread 
> leads to "do not share locks between a module removal and a sysfs 
> operation" strict rule, it would be at least something.

I think that's where we are at. I'll wait until my coccinelle deadlock
hunt patch is done to complete the full search, and that could be
useful to *warn* about new use cases, so as to prevent this deadlock
in the future. Until then I agree that the complexity introduced is
not worth it given the evidence of users, but the full evidence of
actual users still remains to be determined. A perfect job left to
advances with Coccinelle.

> In the same 
> manner as Luis proposed to document try_module_get() expectations.

Right, and so sysfs ops using try_module_get() *still* remain safe,
so I will keep that patch in my next iteration because there *are*
*many* use cases for that.
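
A minimal userspace sketch of why a sysfs op that takes a module reference
first stays safe. It assumes only the commonly described behaviour that a
reference cannot be taken once unloading has begun and that a non-forced
unload does not start while references are held. struct mod_model and the
model_*() functions are hypothetical names, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these are not the kernel's structures. */
struct mod_model {
	int  refs;   /* models the module reference count */
	bool going;  /* models a module that has started unloading */
};

/* Models try_module_get(): refuse a new reference once unloading began. */
bool model_try_get(struct mod_model *m)
{
	if (m->going)
		return false;
	m->refs++;
	return true;
}

/* Models module_put(): drop a reference taken by model_try_get(). */
void model_put(struct mod_model *m)
{
	assert(m->refs > 0);
	m->refs--;
}

/* Models a non-forced rmmod: unloading starts only with no users left. */
bool model_begin_unload(struct mod_model *m)
{
	if (m->refs > 0)
		return false;
	m->going = true;
	return true;
}
```

A store() modelled this way either gets its reference, which keeps the
unload from starting until it calls model_put(), or fails early and returns
without touching driver state.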

  Luis

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-26 15:37                                     ` Ming Lei
  2021-10-26 17:01                                       ` Luis Chamberlain
  2021-10-27 11:42                                       ` Miroslav Benes
@ 2021-11-02 14:15                                       ` Petr Mladek
  2021-11-02 14:51                                         ` Petr Mladek
  2021-11-02 14:56                                         ` Ming Lei
  2 siblings, 2 replies; 94+ messages in thread
From: Petr Mladek @ 2021-11-02 14:15 UTC (permalink / raw)
  To: Ming Lei
  Cc: Miroslav Benes, Luis Chamberlain, Benjamin Herrenschmidt,
	Paul Mackerras, tj, gregkh, akpm, minchan, jeyu, shuah,
	bvanassche, dan.j.williams, joe, tglx, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel, live-patching

On Tue 2021-10-26 23:37:30, Ming Lei wrote:
> On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > Below are more details about the livepatch code. I hope that it will
> > help you to see if zram has similar problems or not.
> > 
> > We have kobject in three structures: klp_func, klp_object, and
> > klp_patch, see include/linux/livepatch.h.
> > 
> > These structures have to be statically defined in the module sources
> > because they define what is livepatched, see
> > samples/livepatch/livepatch-sample.c
> > 
> > The kobject is used there to show information about the patch, patched
> > objects, and patched functions, in sysfs. And most importantly,
> > the sysfs interface can be used to disable the livepatch.
> > 
> > The problem with static structures is that the module must stay
> > in the memory as long as the sysfs interface exists. It can be
> > solved in module_exit() callback. It could wait until the sysfs
> > interface is destroyed.
> > 
> > kobject API does not support this scenario. The release() callbacks
> 
> kobject_delete() is for supporting this scenario, that is why we don't
> need to grab module refcnt before calling show()/store() of the
> kobject's attributes.
> 
> kobject_delete() can be called in module_exit(), then any show()/store()
> will be done after kobject_delete() returns.

I am a bit confused. I do not see kobject_delete() anywhere in kernel
sources.

I see only kobject_del() and kobject_put(). AFAIK, they do _not_
guarantee that either the sysfs interface was destroyed or
the release callbacks were called. For example, see
schedule_delayed_work(&kobj->release, delay) in kobject_release().

In other words, anyone could still be using either the sysfs interface
or the related structures after kobject_del() or kobject_put()
returns.

IMHO, kobject API does not support static structures and module
removal.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-11-02 14:15                                       ` Petr Mladek
@ 2021-11-02 14:51                                         ` Petr Mladek
  2021-11-02 15:17                                           ` Ming Lei
  2021-11-02 14:56                                         ` Ming Lei
  1 sibling, 1 reply; 94+ messages in thread
From: Petr Mladek @ 2021-11-02 14:51 UTC (permalink / raw)
  To: Ming Lei
  Cc: Miroslav Benes, Luis Chamberlain, Benjamin Herrenschmidt,
	Paul Mackerras, tj, gregkh, akpm, minchan, jeyu, shuah,
	bvanassche, dan.j.williams, joe, tglx, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel, live-patching

On Tue 2021-11-02 15:15:19, Petr Mladek wrote:
> On Tue 2021-10-26 23:37:30, Ming Lei wrote:
> > On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > > Below are more details about the livepatch code. I hope that it will
> > > help you to see if zram has similar problems or not.
> > > 
> > > We have kobject in three structures: klp_func, klp_object, and
> > > klp_patch, see include/linux/livepatch.h.
> > > 
> > > These structures have to be statically defined in the module sources
> > > because they define what is livepatched, see
> > > samples/livepatch/livepatch-sample.c
> > > 
> > > The kobject is used there to show information about the patch, patched
> > > objects, and patched functions, in sysfs. And most importantly,
> > > the sysfs interface can be used to disable the livepatch.
> > > 
> > > The problem with static structures is that the module must stay
> > > in the memory as long as the sysfs interface exists. It can be
> > > solved in module_exit() callback. It could wait until the sysfs
> > > interface is destroyed.
> > > 
> > > kobject API does not support this scenario. The release() callbacks
> > 
> > kobject_delete() is for supporting this scenario, that is why we don't
> > need to grab module refcnt before calling show()/store() of the
> > kobject's attributes.
> > 
> > kobject_delete() can be called in module_exit(), then any show()/store()
> > will be done after kobject_delete() returns.
> 
> I am a bit confused. I do not see kobject_delete() anywhere in kernel
> sources.
> 
> I see only kobject_del() and kobject_put(). AFAIK, they do _not_
> guarantee that either the sysfs interface was destroyed or
> the release callbacks were called. For example, see
> schedule_delayed_work(&kobj->release, delay) in kobject_release().

Grr, I always get confused by the code. kobject_del() actually waits
until the sysfs interface gets destroyed. This is why there is
the deadlock.

But kobject_put() is _not_ synchronous. And the comment above
kobject_add() repeats 3 times that kobject_put() must be called
on success:

 * Return: If this function returns an error, kobject_put() must be
 *         called to properly clean up the memory associated with the
 *         object.  Under no instance should the kobject that is passed
 *         to this function be directly freed with a call to kfree(),
 *         that can leak memory.
 *
 *         If this function returns success, kobject_put() must also be called
 *         in order to properly clean up the memory associated with the object.
 *
 *         In short, once this function is called, kobject_put() MUST be called
 *         when the use of the object is finished in order to properly free
 *         everything.

and similar text in Documentation/core-api/kobject.rst

  After a kobject has been registered with the kobject core successfully, it
  must be cleaned up when the code is finished with it.  To do that, call
  kobject_put().


If I read the code correctly then kobject_put() calls kref_put()
that might call kobject_delayed_cleanup(). This function does a lot
of things and needs to access struct kobject.

> IMHO, kobject API does not support static structures and module
> removal.

If kobject_put() has to be called also for static structures then
module_exit() must explicitly wait until the clean up is finished.
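
The concern about a release that runs after the put can be sketched with a
small userspace model. This is only an illustration: struct obj, obj_put(),
run_deferred_release() and safe_to_unload() are hypothetical names, not
kernel APIs, and the model assumes the release is always deferred, which in
the kernel depends on the configuration and on who drops the last reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: a refcounted object whose release is deferred. */
struct obj {
	int  refs;
	bool release_pending;  /* release queued but not yet run */
	bool released;         /* release callback has finished */
};

/* Models the put: dropping the last reference only queues the release. */
void obj_put(struct obj *o)
{
	assert(o->refs > 0);
	if (--o->refs == 0)
		o->release_pending = true;
}

/* Models the deferred work item running at some later point. */
void run_deferred_release(struct obj *o)
{
	if (o->release_pending) {
		o->release_pending = false;
		o->released = true;
	}
}

/* Static storage may go away only once the release has actually run. */
bool safe_to_unload(const struct obj *o)
{
	return o->released;
}
```

In this model the exit path has to wait until safe_to_unload() is true; a
put alone is not enough when the object lives in the module's static data.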

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-11-02 14:15                                       ` Petr Mladek
  2021-11-02 14:51                                         ` Petr Mladek
@ 2021-11-02 14:56                                         ` Ming Lei
  1 sibling, 0 replies; 94+ messages in thread
From: Ming Lei @ 2021-11-02 14:56 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Miroslav Benes, Luis Chamberlain, Benjamin Herrenschmidt,
	Paul Mackerras, tj, gregkh, akpm, minchan, jeyu, shuah,
	bvanassche, dan.j.williams, joe, tglx, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel, live-patching, ming.lei

On Tue, Nov 02, 2021 at 03:15:15PM +0100, Petr Mladek wrote:
> On Tue 2021-10-26 23:37:30, Ming Lei wrote:
> > On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > > Below are more details about the livepatch code. I hope that it will
> > > help you to see if zram has similar problems or not.
> > > 
> > > We have kobject in three structures: klp_func, klp_object, and
> > > klp_patch, see include/linux/livepatch.h.
> > > 
> > > These structures have to be statically defined in the module sources
> > > because they define what is livepatched, see
> > > samples/livepatch/livepatch-sample.c
> > > 
> > > The kobject is used there to show information about the patch, patched
> > > objects, and patched functions, in sysfs. And most importantly,
> > > the sysfs interface can be used to disable the livepatch.
> > > 
> > > The problem with static structures is that the module must stay
> > > in the memory as long as the sysfs interface exists. It can be
> > > solved in module_exit() callback. It could wait until the sysfs
> > > interface is destroyed.
> > > 
> > > kobject API does not support this scenario. The release() callbacks
> > 
> > kobject_delete() is for supporting this scenario, that is why we don't
> > need to grab module refcnt before calling show()/store() of the
> > kobject's attributes.
> > 
> > kobject_delete() can be called in module_exit(), then any show()/store()
> > will be done after kobject_delete() returns.
> 
> I am a bit confused. I do not see kobject_delete() anywhere in kernel
> sources.
> 
> I see only kobject_del() and kobject_put(). AFAIK, they do _not_
> guarantee that either the sysfs interface was destroyed or
> the release callbacks were called. For example, see
> schedule_delayed_work(&kobj->release, delay) in kobject_release().

After kobject_del() returns, no one can run into show()/store(), and
all pending show()/store() calls have been drained in the meantime. But yes,
the release handler may still be called later, and the kobject has to be
freed during or before module_exit().

https://lore.kernel.org/lkml/20211101112548.3364086-2-ming.lei@redhat.com/

> 
> In other words, anyone could still be using either the sysfs interface
> or the related structures after kobject_del() or kobject_put()
> returns.

No, no one can do that after kobject_del() returns.

> 
> IMHO, kobject API does not support static structures and module
> removal.

But so far klp_patch can only be defined as static instance, and it
depends on the implementation, especially the release handler.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-11-02 14:51                                         ` Petr Mladek
@ 2021-11-02 15:17                                           ` Ming Lei
  0 siblings, 0 replies; 94+ messages in thread
From: Ming Lei @ 2021-11-02 15:17 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Miroslav Benes, Luis Chamberlain, Benjamin Herrenschmidt,
	Paul Mackerras, tj, gregkh, akpm, minchan, jeyu, shuah,
	bvanassche, dan.j.williams, joe, tglx, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel, live-patching, ming.lei

On Tue, Nov 02, 2021 at 03:51:33PM +0100, Petr Mladek wrote:
> On Tue 2021-11-02 15:15:19, Petr Mladek wrote:
> > On Tue 2021-10-26 23:37:30, Ming Lei wrote:
> > > On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > > > Below are more details about the livepatch code. I hope that it will
> > > > help you to see if zram has similar problems or not.
> > > > 
> > > > We have kobject in three structures: klp_func, klp_object, and
> > > > klp_patch, see include/linux/livepatch.h.
> > > > 
> > > > These structures have to be statically defined in the module sources
> > > > because they define what is livepatched, see
> > > > samples/livepatch/livepatch-sample.c
> > > > 
> > > > The kobject is used there to show information about the patch, patched
> > > > objects, and patched functions, in sysfs. And most importantly,
> > > > the sysfs interface can be used to disable the livepatch.
> > > > 
> > > > The problem with static structures is that the module must stay
> > > > in the memory as long as the sysfs interface exists. It can be
> > > > solved in module_exit() callback. It could wait until the sysfs
> > > > interface is destroyed.
> > > > 
> > > > kobject API does not support this scenario. The release() callbacks
> > > 
> > > kobject_delete() is for supporting this scenario, that is why we don't
> > > need to grab module refcnt before calling show()/store() of the
> > > kobject's attributes.
> > > 
> > > kobject_delete() can be called in module_exit(), then any show()/store()
> > > will be done after kobject_delete() returns.
> > 
> > I am a bit confused. I do not see kobject_delete() anywhere in kernel
> > sources.
> > 
> > I see only kobject_del() and kobject_put(). AFAIK, they do _not_
> > guarantee that either the sysfs interface was destroyed or
> > the release callbacks were called. For example, see
> > schedule_delayed_work(&kobj->release, delay) in kobject_release().
> 
> Grr, I always get confused by the code. kobject_del() actually waits
> until the sysfs interface gets destroyed. This is why there is
> the deadlock.

Right.

> 
> But kobject_put() is _not_ synchronous. And the comment above
> kobject_add() repeats 3 times that kobject_put() must be called
> on success:
> 
>  * Return: If this function returns an error, kobject_put() must be
>  *         called to properly clean up the memory associated with the
>  *         object.  Under no instance should the kobject that is passed
>  *         to this function be directly freed with a call to kfree(),
>  *         that can leak memory.
>  *
>  *         If this function returns success, kobject_put() must also be called
>  *         in order to properly clean up the memory associated with the object.
>  *
>  *         In short, once this function is called, kobject_put() MUST be called
>  *         when the use of the object is finished in order to properly free
>  *         everything.
> 
> and similar text in Documentation/core-api/kobject.rst
> 
>   After a kobject has been registered with the kobject core successfully, it
>   must be cleaned up when the code is finished with it.  To do that, call
>   kobject_put().
> 
> 
> If I read the code correctly then kobject_put() calls kref_put()
> that might call kobject_delayed_cleanup(). This function does a lot
> of things and needs to access struct kobject.

Yes, then what is the problem here wrt. kobject_put() which may not be
synchronous?

> 
> > IMHO, kobject API does not support static structures and module
> > removal.
> 
> If kobject_put() has to be called also for static structures then
> module_exit() must explicitly wait until the clean up is finished.

Right, that is exactly how klp_patch kobject is implemented. klp_patch
kobject has to be disabled first, then module refcnt can be dropped after
the klp_patch kobject is released. Then module_exit() is possible.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-10-27 11:57                                         ` Miroslav Benes
  2021-10-27 14:27                                           ` Luis Chamberlain
@ 2021-11-02 15:24                                           ` Petr Mladek
  2021-11-02 16:25                                             ` Luis Chamberlain
  1 sibling, 1 reply; 94+ messages in thread
From: Petr Mladek @ 2021-11-02 15:24 UTC (permalink / raw)
  To: Miroslav Benes
  Cc: Luis Chamberlain, Ming Lei, Julia Lawall, Benjamin Herrenschmidt,
	Paul Mackerras, tj, gregkh, akpm, minchan, jeyu, shuah,
	bvanassche, dan.j.williams, joe, tglx, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel, live-patching

On Wed 2021-10-27 13:57:40, Miroslav Benes wrote:
> On Tue, 26 Oct 2021, Luis Chamberlain wrote:
> 
> > On Tue, Oct 26, 2021 at 11:37:30PM +0800, Ming Lei wrote:
> > > On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > > > Livepatch code never called kobject_del() under a lock. It would cause
> > > > the obvious deadlock.

I have to correct myself. IMHO, the deadlock is far from obvious. I
always get lost in the code and the documentation is not clear.

> >
> > Never?
> 
> kobject_put() to be precise.

IMHO, the problem is actually with kobject_del() that gets blocked
until the sysfs interface gets removed. kobject_put() will have
the same problem only when the clean up is not delayed.


> When I started working on the support for module/live patches removal, 
> calling kobject_put() under our klp_mutex lock was the obvious first 
> choice given how the code was structured, but I ran into problems with 
> deadlocks immediately. So it was changed to async approach with the 
> workqueue. Thus the mainline code has never suffered from this, but we 
> knew about the issues.
>  
> > > > The historic code only waited in the
> > > > module_exit() callback until the sysfs interface was removed.
> > > 
> > > OK, then Luis shouldn't consider livepatching as one such issue to solve
> > > with one generic solution.
> > 
> > It's not what I was told when the deadlock was found with zram, so I was
> > informed quite the contrary.
> 
> From my perspective, it is quite easy to get it wrong due to either a lack
> of generic support, or missing rules/documentation. So if this thread 
> leads to "do not share locks between a module removal and a sysfs 
> operation" strict rule, it would be at least something. In the same 
> manner as Luis proposed to document try_module_get() expectations.

The rule "do not share locks between a module removal and a sysfs
operation" is not clear to me.

IMHO, there are the following rules:

1. rule: kobject_del() or kobject_put() must not be called under a lock that
	 is used by store()/show() callbacks.

   reason: kobject_del() waits until the sysfs interface is destroyed.
	 It has to wait until all store()/show() callbacks are finished.


2. rule: kobject_del()/kobject_put() must not be called from the
	related store() callbacks.

   reason: same as in 1st rule.


3. rule: module_exit() must wait until all release() callbacks are called
	 when kobjects are static.

   reason: kobject_put() must be called to clean up internal
	dependencies. The clean up might be done asynchronously
	and needs access to the kobject structure.
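
To illustrate rule 3, here is a minimal userspace sketch, assuming
pthreads stand in for the kernel primitives. The names fake_kobj,
fake_kobj_put() and fake_module_exit() are hypothetical and are not
kernel API. The final put runs on another thread to model a delayed
cleanup, and the exit path only returns once the release step has run:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a static kobject; not kernel API. */
struct fake_kobj {
	pthread_mutex_t lock;
	pthread_cond_t released_cv;
	int refcount;
	bool released;
};

static void fake_kobj_init(struct fake_kobj *k)
{
	pthread_mutex_init(&k->lock, NULL);
	pthread_cond_init(&k->released_cv, NULL);
	k->refcount = 1;
	k->released = false;
}

/* Drop one reference; the final put runs the "release()" step. */
static void fake_kobj_put(struct fake_kobj *k)
{
	pthread_mutex_lock(&k->lock);
	assert(k->refcount > 0);
	if (--k->refcount == 0) {
		k->released = true;
		pthread_cond_broadcast(&k->released_cv);
	}
	pthread_mutex_unlock(&k->lock);
}

/* Models a delayed cleanup: the final put happens on another thread. */
static void *delayed_put(void *arg)
{
	fake_kobj_put(arg);
	return NULL;
}

/*
 * Models module_exit() for a static object: it must not return (and let
 * the object's memory go away) before the release step has finished.
 */
static bool fake_module_exit(struct fake_kobj *k)
{
	pthread_t worker;

	if (pthread_create(&worker, NULL, delayed_put, k) != 0)
		return false;

	pthread_mutex_lock(&k->lock);
	while (!k->released)
		pthread_cond_wait(&k->released_cv, &k->lock);
	pthread_mutex_unlock(&k->lock);

	/* The worker may still be unlocking; join before the object dies. */
	pthread_join(worker, NULL);
	return k->released;
}
```

In this model the wait in fake_module_exit() plays the role of waiting
for the release() callback before the module text and data go away.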


Best Regards,
Petr

PS: I am sorry if I am messing things up. I want to be sure that we are
    all talking about the same thing and understand it the same way.
    


* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-11-02 15:24                                           ` Petr Mladek
@ 2021-11-02 16:25                                             ` Luis Chamberlain
  2021-11-03  0:01                                               ` Ming Lei
  0 siblings, 1 reply; 94+ messages in thread
From: Luis Chamberlain @ 2021-11-02 16:25 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Miroslav Benes, Ming Lei, Julia Lawall, Benjamin Herrenschmidt,
	Paul Mackerras, tj, gregkh, akpm, minchan, jeyu, shuah,
	bvanassche, dan.j.williams, joe, tglx, keescook, rostedt,
	linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, linux-kernel, live-patching

On Tue, Nov 02, 2021 at 04:24:06PM +0100, Petr Mladek wrote:
> On Wed 2021-10-27 13:57:40, Miroslav Benes wrote:
> > From my perspective, it is quite easy to get it wrong due to either a lack
> > of generic support, or missing rules/documentation. So if this thread 
> > leads to "do not share locks between a module removal and a sysfs 
> > operation" strict rule, it would be at least something. In the same 
> > manner as Luis proposed to document try_module_get() expectations.
> 
> The rule "do not share locks between a module removal and a sysfs
> operation" is not clear to me.

That's exactly it. It *is* not. The test_sysfs selftest will hopefully
help with this. But I'll wait to take a final position on whether or not
a generic fix should be merged until the Coccinelle patch which looks
for all use cases completes.

So I think that once that Coccinelle hunt is done for the deadlock, we
should also remind folks of the potential deadlock and some of the rules
you mentioned below so that if we take a position that we don't support
this, we at least inform developers why and what to avoid. If Coccinelle
finds quite a few cases, then perhaps the generic fix might be worth
evaluating.

> IMHO, there are the following rules:
> 
> 1. rule: kobject_del() or kobject_put() must not be called under a lock that
> 	 is used by store()/show() callbacks.
> 
>    reason: kobject_del() waits until the sysfs interface is destroyed.
> 	 It has to wait until all store()/show() callbacks are finished.

Right, this is what actually started this entire conversation.

Note that as Ming pointed out, the generic kernfs fix I proposed would
only cover the case when kobject_del() ends up being called on module
exit, so it would not cover the cases where perhaps kobject_del() might
be called outside of module exit, and so the scope of the possible
deadlock then increases.

Likewise, the Coccinelle hunt I'm trying would only cover the module
exit case. I'm a bit afraid of the complexity of a generic hunt
as expressed in rule 1.
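
For reference, the shape of rule 1 can be modelled in userspace. This
is only a sketch, assuming pthreads in place of kernfs active
references; fake_attr, fake_store(), fake_del() and fake_teardown() are
made-up names, not kernel API. fake_del() waits for in-flight store()
calls the way kobject_del() does, so the teardown path drops the driver
lock before calling it:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of one sysfs attribute plus a driver lock. */
struct fake_attr {
	pthread_mutex_t active_lock;	/* protects active and deactivated */
	pthread_cond_t drained;
	int active;			/* number of store() calls in flight */
	bool deactivated;
	pthread_mutex_t drv_lock;	/* driver lock, also taken by store() */
	int value;
};

static void fake_attr_init(struct fake_attr *a)
{
	pthread_mutex_init(&a->active_lock, NULL);
	pthread_cond_init(&a->drained, NULL);
	pthread_mutex_init(&a->drv_lock, NULL);
	a->active = 0;
	a->deactivated = false;
	a->value = 0;
}

/* Models a store() callback: takes an active ref, then the driver lock. */
static int fake_store(struct fake_attr *a, int v)
{
	pthread_mutex_lock(&a->active_lock);
	if (a->deactivated) {
		pthread_mutex_unlock(&a->active_lock);
		return -ENODEV;
	}
	a->active++;
	pthread_mutex_unlock(&a->active_lock);

	pthread_mutex_lock(&a->drv_lock);
	a->value = v;
	pthread_mutex_unlock(&a->drv_lock);

	pthread_mutex_lock(&a->active_lock);
	assert(a->active > 0);
	if (--a->active == 0)
		pthread_cond_broadcast(&a->drained);
	pthread_mutex_unlock(&a->active_lock);
	return 0;
}

/* Models kobject_del(): blocks until every in-flight store() is done. */
static void fake_del(struct fake_attr *a)
{
	pthread_mutex_lock(&a->active_lock);
	a->deactivated = true;
	while (a->active > 0)
		pthread_cond_wait(&a->drained, &a->active_lock);
	pthread_mutex_unlock(&a->active_lock);
}

/*
 * Teardown that follows rule 1: drv_lock is released before fake_del().
 * Calling fake_del() with drv_lock held could deadlock against a
 * store() that already holds an active ref and is waiting on drv_lock.
 */
static void fake_teardown(struct fake_attr *a)
{
	pthread_mutex_lock(&a->drv_lock);
	a->value = -1;			/* driver-private shutdown work */
	pthread_mutex_unlock(&a->drv_lock);

	fake_del(a);
}
```

A static checker would have to prove the same property for every lock
reachable from a show()/store() callback, which is where the complexity
of the generic hunt comes from.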

> 
> 2. rule: kobject_del()/kobject_put() must not be called from the
> 	related store() callbacks.
> 
>    reason: same as in 1st rule.

Sensible corollary.

Given that the exact kobject_del() / kobject_put() which must not be
called from the respective sysfs ops depends on which kobject is
underneath the device for which the sysfs ops is being created,
it would make this hunt in Coccinelle a bit tricky. My current iteration
of a Coccinelle hunt cheats and looks at any sysfs-looking op and
ensures a module exit exists.

> 3. rule: module_exit() must wait until all release() callbacks are called
> 	 when kobjects are static.
> 
>    reason: kobject_put() must be called to clean up internal
> 	dependencies. The clean up might be done asynchronously
> 	and needs access to the kobject structure.

This might be an easier rule to implement a respective Coccinelle rule
for.

  Luis


* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-11-02 16:25                                             ` Luis Chamberlain
@ 2021-11-03  0:01                                               ` Ming Lei
  2021-11-03 12:44                                                 ` Luis Chamberlain
  0 siblings, 1 reply; 94+ messages in thread
From: Ming Lei @ 2021-11-03  0:01 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Petr Mladek, Miroslav Benes, Julia Lawall,
	Benjamin Herrenschmidt, Paul Mackerras, tj, gregkh, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel, live-patching,
	ming.lei

On Tue, Nov 02, 2021 at 09:25:44AM -0700, Luis Chamberlain wrote:
> On Tue, Nov 02, 2021 at 04:24:06PM +0100, Petr Mladek wrote:
> > On Wed 2021-10-27 13:57:40, Miroslav Benes wrote:
> > > From my perspective, it is quite easy to get it wrong due to either a lack
> > > of generic support, or missing rules/documentation. So if this thread 
> > > leads to "do not share locks between a module removal and a sysfs 
> > > operation" strict rule, it would be at least something. In the same 
> > > manner as Luis proposed to document try_module_get() expectations.
> > 
> > The rule "do not share locks between a module removal and a sysfs
> > operation" is not clear to me.
> 
> That's exactly it. It *is* not. The test_sysfs selftest will hopefully
> help with this. But I'll wait to take a final position on whether or not
> a generic fix should be merged until the Coccinelle patch which looks
> for all use cases completes.
> 
> So I think that once that Coccinelle hunt is done for the deadlock, we
> should also remind folks of the potential deadlock and some of the rules
> you mentioned below so that if we take a position that we don't support
> this, we at least inform developers why and what to avoid. If Coccinelle
> finds quite a few cases, then perhaps the generic fix might be worth
> evaluating.
> 
> > IMHO, there are the following rules:
> > 
> > 1. rule: kobject_del() or kobject_put() must not be called under a lock that
> > 	 is used by store()/show() callbacks.
> > 
> >    reason: kobject_del() waits until the sysfs interface is destroyed.
> > 	 It has to wait until all store()/show() callbacks are finished.
> 
> Right, this is what actually started this entire conversation.
> 
> Note that as Ming pointed out, the generic kernfs fix I proposed would
> only cover the case when kobject_del() ends up being called on module
> exit, so it would not cover the cases where perhaps kobject_del() might
> be called outside of module exit, and so the scope of the possible
> deadlock then increases.
> 
> Likewise, the Coccinelle hunt I'm trying would only cover the module
> exit case. I'm a bit afraid of the complexity of a generic hunt
> as expressed in rule 1.

The question is why a shared lock is required between kobject_del()
and its show()/store() at all; neither zram nor livepatch needs that.
Is it a common usage?

> 
> > 
> > 2. rule: kobject_del()/kobject_put() must not be called from the
> > 	related store() callbacks.
> > 
> >    reason: same as in 1st rule.
> 
> Sensible corollary.
> 
> Given that the exact kobject_del() / kobject_put() which must not be
> called from the respective sysfs ops depends on which kobject is
> underneath the device for which the sysfs ops is being created,
> it would make this hunt in Coccinelle a bit tricky. My current iteration
> of a Coccinelle hunt cheats and looks at any sysfs-looking op and
> ensures a module exit exists.

Actually kernfs/sysfs provides an interface for deleting a
kobject/attr from the attr's own show()/store(); see
sdev_store_delete() for an example, and the livepatch example:

https://lore.kernel.org/lkml/20211102145932.3623108-4-ming.lei@redhat.com/

> 
> > 3. rule: module_exit() must wait until all release() callbacks are called
> > 	 when kobjects are static.
> > 
> >    reason: kobject_put() must be called to clean up internal
> > 	dependencies. The clean up might be done asynchronously
> > 	and needs access to the kobject structure.
> 
> This might be an easier rule to implement a respective Coccinelle rule
> for.

If kobject_del() is done in module_exit() or before module_exit(),
kobject should have been freed in module_exit() via kobject_put().

But yes, it can be asynchronous because of CONFIG_DEBUG_KOBJECT_RELEASE,
which seems like a real issue.


Thanks,
Ming



* Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate
  2021-11-03  0:01                                               ` Ming Lei
@ 2021-11-03 12:44                                                 ` Luis Chamberlain
  0 siblings, 0 replies; 94+ messages in thread
From: Luis Chamberlain @ 2021-11-03 12:44 UTC (permalink / raw)
  To: Ming Lei
  Cc: Petr Mladek, Miroslav Benes, Julia Lawall,
	Benjamin Herrenschmidt, Paul Mackerras, tj, gregkh, akpm,
	minchan, jeyu, shuah, bvanassche, dan.j.williams, joe, tglx,
	keescook, rostedt, linux-spdx, linux-doc, linux-block,
	linux-fsdevel, linux-kselftest, linux-kernel, live-patching

On Wed, Nov 03, 2021 at 08:01:45AM +0800, Ming Lei wrote:
> On Tue, Nov 02, 2021 at 09:25:44AM -0700, Luis Chamberlain wrote:
> > On Tue, Nov 02, 2021 at 04:24:06PM +0100, Petr Mladek wrote:
> > > On Wed 2021-10-27 13:57:40, Miroslav Benes wrote:
> > > > From my perspective, it is quite easy to get it wrong due to either a lack
> > > > of generic support, or missing rules/documentation. So if this thread 
> > > > leads to "do not share locks between a module removal and a sysfs 
> > > > operation" strict rule, it would be at least something. In the same 
> > > > manner as Luis proposed to document try_module_get() expectations.
> > > 
> > > The rule "do not share locks between a module removal and a sysfs
> > > operation" is not clear to me.
> > 
> > That's exactly it. It *is* not. The test_sysfs selftest will hopefully
> > help with this. But I'll wait to take a final position on whether or not
> > a generic fix should be merged until the Coccinelle patch which looks
> > for all use cases completes.
> > 
> > So I think that once that Coccinelle hunt is done for the deadlock, we
> > should also remind folks of the potential deadlock and some of the rules
> > you mentioned below so that if we take a position that we don't support
> > this, we at least inform developers why and what to avoid. If Coccinelle
> > finds quite a few cases, then perhaps the generic fix might be worth
> > evaluating.
> > 
> > > IMHO, there are the following rules:
> > > 
> > > 1. rule: kobject_del() or kobject_put() must not be called under a lock that
> > > 	 is used by store()/show() callbacks.
> > > 
> > >    reason: kobject_del() waits until the sysfs interface is destroyed.
> > > 	 It has to wait until all store()/show() callbacks are finished.
> > 
> > Right, this is what actually started this entire conversation.
> > 
> > Note that as Ming pointed out, the generic kernfs fix I proposed would
> > only cover the case when kobject_del() ends up being called on module
> > exit, so it would not cover the cases where perhaps kobject_del() might
> > be called outside of module exit, and so the scope of the possible
> > deadlock then increases.
> > 
> > Likewise, the Coccinelle hunt I'm trying would only cover the module
> > exit case. I'm a bit afraid of the complexity of a generic hunt
> > as expressed in rule 1.
> 
> The question is why a shared lock is required between kobject_del()
> and its show()/store() at all; neither zram nor livepatch needs that.
> Is it a common usage?

That is the question the coccinelle hunt is aimed at finding. Answering
that in the context of module removal is easier than the generic case.

But also note that I had mentioned before that we have semantics to
check *when* we're in the module removal case, and as such can address
that case. For the other cases we have no possible semantics to be able to
address a generic fix. I tried though, refer to my reply in this
thread and refer to the new kobject_being_removed() I'm adding:

https://lkml.kernel.org/r/YWdMpv8lAFYtc18c@bombadil.infradead.org

So we have semantics for knowing when we are about to remove a module,
but my attempt with kobject_being_removed() isn't sufficient to address
this generically.

In either case, having a gauge of how common this is, either on module
removal or generally, would be wonderful. It is easier to answer the
question from a module removal perspective though.

> > > 2. rule: kobject_del()/kobject_put() must not be called from the
> > > 	related store() callbacks.
> > > 
> > >    reason: same as in 1st rule.
> > 
> > Sensible corollary.
> > 
> > Given that the exact kobject_del() / kobject_put() which must not be
> > called from the respective sysfs ops depends on which kobject is
> > underneath the device for which the sysfs ops is being created,
> > it would make this hunt in Coccinelle a bit tricky. My current iteration
> > of a Coccinelle hunt cheats and looks at any sysfs-looking op and
> > ensures a module exit exists.
> 
> Actually kernfs/sysfs provides an interface for deleting a
> kobject/attr from the attr's own show()/store(); see
> sdev_store_delete() for an example, and the livepatch example:
> 
> https://lore.kernel.org/lkml/20211102145932.3623108-4-ming.lei@redhat.com/

Imagine that.. is that the suicidal thing?

> > > 3. rule: module_exit() must wait until all release() callbacks are called
> > > 	 when kobjects are static.
> > > 
> > >    reason: kobject_put() must be called to clean up internal
> > > 	dependencies. The clean up might be done asynchronously
> > > 	and needs access to the kobject structure.
> > 
> > This might be an easier rule to implement a respective Coccinelle rule
> > for.
> 
> If kobject_del() is done in module_exit() or before module_exit(),
> kobject should have been freed in module_exit() via kobject_put().
> 
> But yes, it can be asynchronous because of CONFIG_DEBUG_KOBJECT_RELEASE,
> which seems like a real issue.

Alright thanks for confirming.

  Luis


end of thread, other threads:[~2021-11-03 12:44 UTC | newest]

Thread overview: 94+ messages (download: mbox.gz / follow: Atom feed)
2021-09-27 16:37 [PATCH v8 00/12] syfs: generic deadlock fix with module removal Luis Chamberlain
2021-09-27 16:37 ` [PATCH v8 01/12] LICENSES: Add the copyleft-next-0.3.1 license Luis Chamberlain
     [not found]   ` <202110050907.35FBD2A1@keescook>
     [not found]     ` <YWR2ZrtzChamY1y4@bombadil.infradead.org>
2021-10-11 17:57       ` Kees Cook
2021-09-27 16:37 ` [PATCH v8 02/12] testing: use the copyleft-next-0.3.1 SPDX tag Luis Chamberlain
2021-10-05 16:11   ` Kees Cook
2021-09-27 16:37 ` [PATCH v8 03/12] selftests: add tests_sysfs module Luis Chamberlain
2021-10-05 14:16   ` Greg KH
2021-10-05 16:57     ` Tim.Bird
2021-10-11 17:40       ` Luis Chamberlain
2021-10-11 17:38     ` Luis Chamberlain
2021-10-07 14:23   ` Miroslav Benes
2021-10-11 19:11     ` Luis Chamberlain
     [not found]   ` <202110050912.3DF681ED@keescook>
2021-10-11 19:03     ` Luis Chamberlain
2021-09-27 16:37 ` [PATCH v8 04/12] kernfs: add initial failure injection support Luis Chamberlain
2021-10-05 19:47   ` Kees Cook
2021-10-11 20:44     ` Luis Chamberlain
2021-09-27 16:37 ` [PATCH v8 05/12] test_sysfs: add support to use kernfs failure injection Luis Chamberlain
2021-10-05 19:51   ` Kees Cook
2021-10-11 20:56     ` Luis Chamberlain
2021-09-27 16:37 ` [PATCH v8 06/12] kernel/module: add documentation for try_module_get() Luis Chamberlain
2021-10-05 19:58   ` Kees Cook
2021-10-11 21:16     ` Luis Chamberlain
2021-09-27 16:38 ` [PATCH v8 07/12] fs/kernfs/symlink.c: replace S_IRWXUGO with 0777 on kernfs_create_link() Luis Chamberlain
2021-10-05 19:59   ` Kees Cook
2021-09-27 16:38 ` [PATCH v8 08/12] fs/sysfs/dir.c: replace S_IRWXU|S_IRUGO|S_IXUGO with 0755 sysfs_create_dir_ns() Luis Chamberlain
2021-10-05 16:05   ` Kees Cook
2021-09-27 16:38 ` [PATCH v8 09/12] sysfs: fix deadlock race with module removal Luis Chamberlain
2021-10-05  9:24   ` Ming Lei
2021-10-11 21:25     ` Luis Chamberlain
2021-10-12  0:20       ` Ming Lei
2021-10-12 21:18         ` Luis Chamberlain
2021-10-13  1:07           ` Ming Lei
2021-10-13 12:35             ` Luis Chamberlain
2021-10-13 15:04               ` Ming Lei
2021-10-13 21:16                 ` Luis Chamberlain
2021-10-05 20:50   ` Kees Cook
2021-10-11 22:26     ` Luis Chamberlain
2021-10-13 12:41       ` Luis Chamberlain
2021-09-27 16:38 ` [PATCH v8 10/12] test_sysfs: enable deadlock tests by default Luis Chamberlain
2021-09-27 16:38 ` [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate Luis Chamberlain
2021-10-05 20:55   ` Kees Cook
2021-10-11 18:27     ` Luis Chamberlain
2021-10-14  1:55   ` Ming Lei
2021-10-14  2:11     ` Ming Lei
2021-10-14 20:24       ` Luis Chamberlain
2021-10-14 23:52         ` Ming Lei
2021-10-15  0:22           ` Luis Chamberlain
2021-10-15  8:36             ` Ming Lei
2021-10-15  8:52               ` Greg KH
2021-10-15 17:31               ` Luis Chamberlain
2021-10-16 11:28                 ` Ming Lei
2021-10-18 19:32                   ` Luis Chamberlain
2021-10-19  2:34                     ` Ming Lei
2021-10-19  6:23                       ` Miroslav Benes
2021-10-19  9:23                         ` Ming Lei
2021-10-20  6:43                           ` Miroslav Benes
2021-10-20  7:49                             ` Ming Lei
2021-10-20  8:19                               ` Miroslav Benes
2021-10-20  8:28                                 ` Greg KH
2021-10-25  9:58                                   ` Miroslav Benes
2021-10-20 10:09                                 ` Ming Lei
2021-10-26  8:48                                   ` Petr Mladek
2021-10-26 15:37                                     ` Ming Lei
2021-10-26 17:01                                       ` Luis Chamberlain
2021-10-27 11:57                                         ` Miroslav Benes
2021-10-27 14:27                                           ` Luis Chamberlain
2021-11-02 15:24                                           ` Petr Mladek
2021-11-02 16:25                                             ` Luis Chamberlain
2021-11-03  0:01                                               ` Ming Lei
2021-11-03 12:44                                                 ` Luis Chamberlain
2021-10-27 11:42                                       ` Miroslav Benes
2021-11-02 14:15                                       ` Petr Mladek
2021-11-02 14:51                                         ` Petr Mladek
2021-11-02 15:17                                           ` Ming Lei
2021-11-02 14:56                                         ` Ming Lei
2021-10-19 15:28                       ` Luis Chamberlain
2021-10-19 16:29                         ` Ming Lei
2021-10-19 19:36                           ` Luis Chamberlain
2021-10-20  1:15                             ` Ming Lei
2021-10-20 15:48                               ` Luis Chamberlain
2021-10-21  0:39                                 ` Ming Lei
2021-10-21 17:18                                   ` Luis Chamberlain
2021-10-22  0:05                                     ` Ming Lei
2021-10-19 15:50                       ` Luis Chamberlain
2021-10-19 16:25                         ` Greg KH
2021-10-19 16:30                           ` Luis Chamberlain
2021-10-19 17:28                             ` Greg KH
2021-10-19 19:46                               ` Luis Chamberlain
2021-10-19 16:39                         ` Ming Lei
2021-10-19 19:38                           ` Luis Chamberlain
2021-10-20  0:55                             ` Ming Lei
2021-09-27 16:38 ` [PATCH v8 12/12] zram: use ATTRIBUTE_GROUPS to fix sysfs deadlock module removal Luis Chamberlain
2021-10-05 20:57   ` Kees Cook
2021-10-11 18:28     ` Luis Chamberlain
