linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/3] zram: fix few sysfs races
@ 2021-06-21 23:30 Luis Chamberlain
  2021-06-21 23:30 ` [PATCH v3 1/3] zram: fix crashes due to use of cpu hotplug multistate Luis Chamberlain
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Luis Chamberlain @ 2021-06-21 23:30 UTC (permalink / raw)
  To: minchan, gregkh, jeyu, ngupta, sergey.senozhatsky.work
  Cc: mcgrof, axboe, mbenes, jpoimboe, tglx, keescook, jikos, rostedt,
	peterz, linux-block, linux-kernel

This v3 changes the approach for the deadlock fix with wrappers which
provide the same solution of the try_module_get(). This is less code,
should be easier to review.

The last patch is also new. I dropped the bdget() stuff for the block
device and instead am doing a direct kobject fetch as well as a bus get to
provide a more generic solution without having to require each type to
implement its own refcounts.

This series also available on my linux-next repository:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210427-sysfs-fix-races-v5

Luis Chamberlain (3):
  zram: fix crashes due to use of cpu hotplug multistate
  zram: fix deadlock with sysfs attribute usage and driver removal
  drivers/base/core: refcount kobject and bus on device attribute read /
    store

 drivers/base/base.h           |   2 +
 drivers/base/bus.c            |   4 +-
 drivers/base/core.c           |  42 +++++++++-
 drivers/block/zram/zram_drv.c | 141 ++++++++++++++++++++++++----------
 drivers/block/zram/zram_drv.h |  40 ++++++++++
 5 files changed, 181 insertions(+), 48 deletions(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 1/3] zram: fix crashes due to use of cpu hotplug multistate
  2021-06-21 23:30 [PATCH v3 0/3] zram: fix few sysfs races Luis Chamberlain
@ 2021-06-21 23:30 ` Luis Chamberlain
  2021-06-22  7:39   ` Greg KH
  2021-06-21 23:36 ` [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal Luis Chamberlain
  2021-06-21 23:36 ` [PATCH v3 3/3] drivers/base/core: refcount kobject and bus on device attribute read / store Luis Chamberlain
  2 siblings, 1 reply; 21+ messages in thread
From: Luis Chamberlain @ 2021-06-21 23:30 UTC (permalink / raw)
  To: minchan, gregkh, jeyu, ngupta, sergey.senozhatsky.work
  Cc: mcgrof, axboe, mbenes, jpoimboe, tglx, keescook, jikos, rostedt,
	peterz, linux-block, linux-kernel

Provide a simple state machine to fix races with driver exit where we
remove the CPU multistate callbacks and re-initialization / creation of
new per CPU instances which should be managed by these callbacks.

The zram driver makes use of cpu hotplug multistate support, whereby it
associates a struct zcomp per CPU. Each struct zcomp represents a
compression algorithm in charge of managing compression streams per CPU.
Although a compiled zram driver only supports a fixed set of compression
algorithms, each zram device gets a struct zcomp allocated per CPU. The
"multi" in CPU hotplug multstate refers to these per cpu struct zcomp
instances. Each of these will have the CPU hotplug callback called for
it on CPU plug / unplug. The kernel's CPU hotplug multistate keeps a
linked list of these different structures so that it will iterate over
them on CPU transitions.

By default at driver initialization we will create just one zram device
(num_devices=1) and a zcomp structure then set for the now default
lzo-rle comrpession algorithm. At driver removal we first remove each
zram device, and so we destroy the associated struct zcomp per CPU. But
since we expose sysfs attributes to create new devices or reset / initialize
existing zram devices, we can easily end up re-initializing a struct zcomp
for a zram device before the exit routine of the module removes the cpu
hotplug callback. When this happens the kernel's CPU hotplug will detect
that at least one instance (struct zcomp for us) exists. This can happen
in the following situation:

CPU 1                            CPU 2

class_unregister(...);
idr_for_each(...);
zram_debugfs_destroy();
                                disksize_store(...);
idr_destroy(...);
unregister_blkdev(...);
cpuhp_remove_multi_state(...);

The warning comes up on cpuhp_remove_multi_state() when it sees that the
state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list.
In this case, that a struct zcom still exists, the driver allowed its
creation per CPU even though we could have just freed them per CPU
though a call on another CPU, and we are then later trying to remove the
hotplug callback.

Fix all this by providing a zram initialization boolean
protected the the shared in the driver zram_index_mutex,
which we can use to annotate when sysfs attributes are
safe to use or not -- once the driver is properly initialized.
When the driver is going down we also are sure to not let
userspace muck with attributes which may affect each per cpu
struct zcomp.

This also fixes a series of possible memory leaks. The
crashes and memory leaks can easily be caused by issuing
the zram02.sh script from the LTP project [0] in a loop
in two separate windows:

  cd testcases/kernel/device-drivers/zram
  while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done

You end up with a splat as follows:

kernel: zram: Removed device: zram0
kernel: zram: Added device: zram0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: Adding 104857596k swap on /dev/zram0.  Priority:-2 extents:1 across:104857596k SSFS
kernel: zram0: detected capacitky change from 209715200 to 0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: ------------[ cut here ]------------
kernel: Error: Removing state 63 which has instances left.
kernel: WARNING: CPU: 7 PID: 70457 at kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G            E     5.12.0-rc1-next-20210304 #3
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Code: <etc>
kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
kernel: RBP: 0000000000000000i R08: 0000000000000000 R09: ffffa800c139bcb8
kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES 0000 CR0: 0000000080050033
kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
kernel: Call Trace:
kernel: __cpuhp_remove_state+0x2e/0x80
kernel: __do_sys_delete_module+0x190/0x2a0
kernel:  do_syscall_64+0x33/0x80
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae

The "Error: Removing state 63 which has instances left" refers
to the zram per CPU struc zcomp instances left.

[0] https://github.com/linux-test-project/ltp.git

Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/block/zram/zram_drv.c | 63 ++++++++++++++++++++++++++++++-----
 1 file changed, 55 insertions(+), 8 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index cf8deecc39ef..431b60cd85c1 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex);
 static int zram_major;
 static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
 
+bool zram_up;
+
 /* Module params (documentation at end) */
 static unsigned int num_devices = 1;
 /*
@@ -1704,6 +1706,7 @@ static void zram_reset_device(struct zram *zram)
 	comp = zram->comp;
 	disksize = zram->disksize;
 	zram->disksize = 0;
+	zram->comp = NULL;
 
 	set_capacity_and_notify(zram->disk, 0);
 	part_stat_set_all(zram->disk->part0, 0);
@@ -1724,9 +1727,18 @@ static ssize_t disksize_store(struct device *dev,
 	struct zram *zram = dev_to_zram(dev);
 	int err;
 
+	mutex_lock(&zram_index_mutex);
+
+	if (!zram_up) {
+		err = -ENODEV;
+		goto out;
+	}
+
 	disksize = memparse(buf, NULL);
-	if (!disksize)
-		return -EINVAL;
+	if (!disksize) {
+		err = -EINVAL;
+		goto out;
+	}
 
 	down_write(&zram->init_lock);
 	if (init_done(zram)) {
@@ -1754,12 +1766,16 @@ static ssize_t disksize_store(struct device *dev,
 	set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
 	up_write(&zram->init_lock);
 
+	mutex_unlock(&zram_index_mutex);
+
 	return len;
 
 out_free_meta:
 	zram_meta_free(zram, disksize);
 out_unlock:
 	up_write(&zram->init_lock);
+out:
+	mutex_unlock(&zram_index_mutex);
 	return err;
 }
 
@@ -1775,8 +1791,17 @@ static ssize_t reset_store(struct device *dev,
 	if (ret)
 		return ret;
 
-	if (!do_reset)
-		return -EINVAL;
+	mutex_lock(&zram_index_mutex);
+
+	if (!zram_up) {
+		len = -ENODEV;
+		goto out;
+	}
+
+	if (!do_reset) {
+		len = -EINVAL;
+		goto out;
+	}
 
 	zram = dev_to_zram(dev);
 	bdev = zram->disk->part0;
@@ -1785,7 +1810,8 @@ static ssize_t reset_store(struct device *dev,
 	/* Do not reset an active device or claimed device */
 	if (bdev->bd_openers || zram->claim) {
 		mutex_unlock(&bdev->bd_mutex);
-		return -EBUSY;
+		len = -EBUSY;
+		goto out;
 	}
 
 	/* From now on, anyone can't open /dev/zram[0-9] */
@@ -1800,6 +1826,8 @@ static ssize_t reset_store(struct device *dev,
 	zram->claim = false;
 	mutex_unlock(&bdev->bd_mutex);
 
+out:
+	mutex_unlock(&zram_index_mutex);
 	return len;
 }
 
@@ -2021,6 +2049,10 @@ static ssize_t hot_add_show(struct class *class,
 	int ret;
 
 	mutex_lock(&zram_index_mutex);
+	if (!zram_up) {
+		mutex_unlock(&zram_index_mutex);
+		return -ENODEV;
+	}
 	ret = zram_add();
 	mutex_unlock(&zram_index_mutex);
 
@@ -2048,6 +2080,11 @@ static ssize_t hot_remove_store(struct class *class,
 
 	mutex_lock(&zram_index_mutex);
 
+	if (!zram_up) {
+		ret = -ENODEV;
+		goto out;
+	}
+
 	zram = idr_find(&zram_index_idr, dev_id);
 	if (zram) {
 		ret = zram_remove(zram);
@@ -2057,6 +2094,7 @@ static ssize_t hot_remove_store(struct class *class,
 		ret = -ENODEV;
 	}
 
+out:
 	mutex_unlock(&zram_index_mutex);
 	return ret ? ret : count;
 }
@@ -2083,12 +2121,15 @@ static int zram_remove_cb(int id, void *ptr, void *data)
 
 static void destroy_devices(void)
 {
+	mutex_lock(&zram_index_mutex);
+	zram_up = false;
 	class_unregister(&zram_control_class);
 	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
 	zram_debugfs_destroy();
 	idr_destroy(&zram_index_idr);
 	unregister_blkdev(zram_major, "zram");
 	cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
+	mutex_unlock(&zram_index_mutex);
 }
 
 static int __init zram_init(void)
@@ -2116,15 +2157,21 @@ static int __init zram_init(void)
 		return -EBUSY;
 	}
 
+	mutex_lock(&zram_index_mutex);
+
 	while (num_devices != 0) {
-		mutex_lock(&zram_index_mutex);
 		ret = zram_add();
-		mutex_unlock(&zram_index_mutex);
-		if (ret < 0)
+		if (ret < 0) {
+			mutex_unlock(&zram_index_mutex);
 			goto out_error;
+		}
 		num_devices--;
 	}
 
+	zram_up = true;
+
+	mutex_unlock(&zram_index_mutex);
+
 	return 0;
 
 out_error:
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal
  2021-06-21 23:30 [PATCH v3 0/3] zram: fix few sysfs races Luis Chamberlain
  2021-06-21 23:30 ` [PATCH v3 1/3] zram: fix crashes due to use of cpu hotplug multistate Luis Chamberlain
@ 2021-06-21 23:36 ` Luis Chamberlain
  2021-06-22  7:41   ` Greg KH
  2021-06-22  7:45   ` Greg KH
  2021-06-21 23:36 ` [PATCH v3 3/3] drivers/base/core: refcount kobject and bus on device attribute read / store Luis Chamberlain
  2 siblings, 2 replies; 21+ messages in thread
From: Luis Chamberlain @ 2021-06-21 23:36 UTC (permalink / raw)
  To: minchan, gregkh, jeyu, ngupta, sergey.senozhatsky.work
  Cc: mcgrof, axboe, mbenes, jpoimboe, tglx, keescook, jikos, rostedt,
	peterz, linux-block, linux-kernel

When sysfs attributes use a lock also used on driver removal we can
potentially deadlock. This happens when for instance a sysfs file on
a driver is used, then at the same time we have driver removal trigger.
The driver removal code holds a lock, and then the sysfs file entry waits
for the same lock. While holding the lock the driver removal tries to
remove the sysfs entries, but these cannot be removed yet as one is
waiting for a lock. This won't complete as the lock is already held.
Likewise module removal cannot complete, and so we deadlock.

To fix this we just *try* to get a refcount to the module when a shared
lock is used, prior to mucking with a sysfs attribute. If this fails we
just give up right away.

We use a try method as a full lock means we'd then make our sysfs attributes
busy us out from possible module removal, and so userspace could force denying
module removal, a silly form of "DOS" against module removal. A try lock on
the module removal ensures we give priority to module removal and interacting
with sysfs attributes only comes second. Using a full lock could mean for
instance that if you don't stop poking at sysfs files you cannot remove a
module.

This deadlock was first reported with the zram driver, a sketch of how
this can happen follows:

CPU A                              CPU B
                                   whatever_store()
module_unload
  mutex_lock(foo)
                                   mutex_lock(foo)
   del_gendisk(zram->disk);
     device_del()
       device_remove_groups()

In this situation whatever_store() is waiting for the mutex foo to
become unlocked, but that won't happen until module removal is complete.
But module removal won't complete until the syfs file being poked completes
which is waiting for a lock already held.

This is a generic kernel issue with sysfs files which use any lock also
used on module removal. Different generic solutions have been proposed.
One approach proposed is by directly by augmenting attributes with module
information [0]. This patch implements a solution by adding macros with
the prefix MODULE_DEVICE_ATTR_*() which accomplish the same. Until we
don't have a generic agreed upon solution for this shared between drivers,
we must implement a fix for this on each driver.

We make zram use the new MODULE_DEVICE_ATTR_*() helpers, and completely
open code the solution for class attributes as there are only a few of
those.

This issue can be reproduced easily on the zram driver as follows:

Loop 1 on one terminal:

while true;
	do modprobe zram;
	modprobe -r zram;
done

Loop 2 on a second terminal:
while true; do
	echo 1024 >  /sys/block/zram0/disksize;
	echo 1 > /sys/block/zram0/reset;
done

Without this patch we end up in a deadlock, and the following
stack trace is produced which hints to us what the issue was:

INFO: task bash:888 blocked for more than 120 seconds.
      Tainted: G            E 5.12.0-rc1-next-20210304+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:bash            state:D stack:    0 pid:  888 ppid: 887 flags:0x00000004
Call Trace:
 __schedule+0x2e4/0x900
 schedule+0x46/0xb0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.constprop.0+0x2c3/0x490
 ? _kstrtoull+0x35/0xd0
 reset_store+0x6c/0x160 [zram]
 kernfs_fop_write_iter+0x124/0x1b0
 new_sync_write+0x11c/0x1b0
 vfs_write+0x1c2/0x260
 ksys_write+0x5f/0xe0
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f34f2c3df33
RSP: 002b:00007ffe751df6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f34f2c3df33
RDX: 0000000000000002 RSI: 0000561ccb06ec10 RDI: 0000000000000001
RBP: 0000561ccb06ec10 R08: 000000000000000a R09: 0000000000000001
R10: 0000561ccb157590 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f34f2d0e6a0 R14: 0000000000000002 R15: 00007f34f2d0e8a0
INFO: task modprobe:1104 can't die for more than 120 seconds.
task:modprobe        state:D stack:    0 pid: 1104 ppid: 916 flags:0x00004004
Call Trace:
 __schedule+0x2e4/0x900
 schedule+0x46/0xb0
 __kernfs_remove.part.0+0x228/0x2b0
 ? finish_wait+0x80/0x80
 kernfs_remove_by_name_ns+0x50/0x90
 remove_files+0x2b/0x60
 sysfs_remove_group+0x38/0x80
 sysfs_remove_groups+0x29/0x40
 device_remove_attrs+0x4a/0x80
 device_del+0x183/0x3e0
 ? mutex_lock+0xe/0x30
 del_gendisk+0x27a/0x2d0
 zram_remove+0x8a/0xb0 [zram]
 ? hot_remove_store+0xf0/0xf0 [zram]
 zram_remove_cb+0xd/0x10 [zram]
 idr_for_each+0x5e/0xd0
 destroy_devices+0x39/0x6f [zram]
 __do_sys_delete_module+0x190/0x2a0
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f32adf727d7
RSP: 002b:00007ffc08bb38a8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 000055eea23cbb10 RCX: 00007f32adf727d7
RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055eea23cbb78
RBP: 000055eea23cbb10 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f32adfe5ac0 R11: 0000000000000206 R12: 000055eea23cbb78
R13: 0000000000000000 R14: 0000000000000000 R15: 000055eea23cbc20

[0] https://lkml.kernel.org/r/20210401235925.GR4332@42.do-not-panic.com

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/block/zram/zram_drv.c | 80 ++++++++++++++++++++---------------
 drivers/block/zram/zram_drv.h | 40 ++++++++++++++++++
 2 files changed, 85 insertions(+), 35 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 431b60cd85c1..21d66415aa91 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1134,12 +1134,12 @@ static ssize_t debug_stat_show(struct device *dev,
 	return ret;
 }
 
-static DEVICE_ATTR_RO(io_stat);
-static DEVICE_ATTR_RO(mm_stat);
+MODULE_DEVICE_ATTR_RO(io_stat);
+MODULE_DEVICE_ATTR_RO(mm_stat);
 #ifdef CONFIG_ZRAM_WRITEBACK
-static DEVICE_ATTR_RO(bd_stat);
+MODULE_DEVICE_ATTR_RO(bd_stat);
 #endif
-static DEVICE_ATTR_RO(debug_stat);
+MODULE_DEVICE_ATTR_RO(debug_stat);
 
 static void zram_meta_free(struct zram *zram, u64 disksize)
 {
@@ -1861,44 +1861,44 @@ static const struct block_device_operations zram_wb_devops = {
 	.owner = THIS_MODULE
 };
 
-static DEVICE_ATTR_WO(compact);
-static DEVICE_ATTR_RW(disksize);
-static DEVICE_ATTR_RO(initstate);
-static DEVICE_ATTR_WO(reset);
-static DEVICE_ATTR_WO(mem_limit);
-static DEVICE_ATTR_WO(mem_used_max);
-static DEVICE_ATTR_WO(idle);
-static DEVICE_ATTR_RW(max_comp_streams);
-static DEVICE_ATTR_RW(comp_algorithm);
+MODULE_DEVICE_ATTR_WO(compact);
+MODULE_DEVICE_ATTR_RW(disksize);
+MODULE_DEVICE_ATTR_RO(initstate);
+MODULE_DEVICE_ATTR_WO(reset);
+MODULE_DEVICE_ATTR_WO(mem_limit);
+MODULE_DEVICE_ATTR_WO(mem_used_max);
+MODULE_DEVICE_ATTR_WO(idle);
+MODULE_DEVICE_ATTR_RW(max_comp_streams);
+MODULE_DEVICE_ATTR_RW(comp_algorithm);
 #ifdef CONFIG_ZRAM_WRITEBACK
-static DEVICE_ATTR_RW(backing_dev);
-static DEVICE_ATTR_WO(writeback);
-static DEVICE_ATTR_RW(writeback_limit);
-static DEVICE_ATTR_RW(writeback_limit_enable);
+MODULE_DEVICE_ATTR_RW(backing_dev);
+MODULE_DEVICE_ATTR_WO(writeback);
+MODULE_DEVICE_ATTR_RW(writeback_limit);
+MODULE_DEVICE_ATTR_RW(writeback_limit_enable);
 #endif
 
 static struct attribute *zram_disk_attrs[] = {
-	&dev_attr_disksize.attr,
-	&dev_attr_initstate.attr,
-	&dev_attr_reset.attr,
-	&dev_attr_compact.attr,
-	&dev_attr_mem_limit.attr,
-	&dev_attr_mem_used_max.attr,
-	&dev_attr_idle.attr,
-	&dev_attr_max_comp_streams.attr,
-	&dev_attr_comp_algorithm.attr,
+	&dev_attr_module_disksize.attr,
+	&dev_attr_module_initstate.attr,
+	&dev_attr_module_reset.attr,
+	&dev_attr_module_compact.attr,
+	&dev_attr_module_mem_limit.attr,
+	&dev_attr_module_mem_used_max.attr,
+	&dev_attr_module_idle.attr,
+	&dev_attr_module_max_comp_streams.attr,
+	&dev_attr_module_comp_algorithm.attr,
 #ifdef CONFIG_ZRAM_WRITEBACK
-	&dev_attr_backing_dev.attr,
-	&dev_attr_writeback.attr,
-	&dev_attr_writeback_limit.attr,
-	&dev_attr_writeback_limit_enable.attr,
+	&dev_attr_module_backing_dev.attr,
+	&dev_attr_module_writeback.attr,
+	&dev_attr_module_writeback_limit.attr,
+	&dev_attr_module_writeback_limit_enable.attr,
 #endif
-	&dev_attr_io_stat.attr,
-	&dev_attr_mm_stat.attr,
+	&dev_attr_module_io_stat.attr,
+	&dev_attr_module_mm_stat.attr,
 #ifdef CONFIG_ZRAM_WRITEBACK
-	&dev_attr_bd_stat.attr,
+	&dev_attr_module_bd_stat.attr,
 #endif
-	&dev_attr_debug_stat.attr,
+	&dev_attr_module_debug_stat.attr,
 	NULL,
 };
 
@@ -2048,13 +2048,19 @@ static ssize_t hot_add_show(struct class *class,
 {
 	int ret;
 
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
 	mutex_lock(&zram_index_mutex);
 	if (!zram_up) {
 		mutex_unlock(&zram_index_mutex);
-		return -ENODEV;
+		ret = -ENODEV;
+		goto out;
 	}
 	ret = zram_add();
+out:
 	mutex_unlock(&zram_index_mutex);
+	module_put(THIS_MODULE);
 
 	if (ret < 0)
 		return ret;
@@ -2078,6 +2084,9 @@ static ssize_t hot_remove_store(struct class *class,
 	if (dev_id < 0)
 		return -EINVAL;
 
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
 	mutex_lock(&zram_index_mutex);
 
 	if (!zram_up) {
@@ -2096,6 +2105,7 @@ static ssize_t hot_remove_store(struct class *class,
 
 out:
 	mutex_unlock(&zram_index_mutex);
+	module_put(THIS_MODULE);
 	return ret ? ret : count;
 }
 static CLASS_ATTR_WO(hot_remove);
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 419a7e8281ee..026eb8d41327 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -126,4 +126,44 @@ struct zram {
 	struct dentry *debugfs_dir;
 #endif
 };
+
+#define MODULE_DEVICE_ATTR_FUNC_STORE(_name) \
+static ssize_t module_ ## _name ## _store(struct device *dev, \
+				   struct device_attribute *attr, \
+				   const char *buf, size_t len) \
+{ \
+	ssize_t __ret; \
+	if (!try_module_get(THIS_MODULE)) \
+		return -ENODEV; \
+	__ret = _name ## _store(dev, attr, buf, len); \
+	module_put(THIS_MODULE); \
+	return __ret; \
+}
+
+#define MODULE_DEVICE_ATTR_FUNC_SHOW(_name) \
+static ssize_t module_ ## _name ## _show(struct device *dev, \
+					 struct device_attribute *attr, \
+					 char *buf) \
+{ \
+	ssize_t __ret; \
+	if (!try_module_get(THIS_MODULE)) \
+		return -ENODEV; \
+	__ret = _name ## _show(dev, attr, buf); \
+	module_put(THIS_MODULE); \
+	return __ret; \
+}
+
+#define MODULE_DEVICE_ATTR_WO(_name) \
+MODULE_DEVICE_ATTR_FUNC_STORE(_name); \
+static DEVICE_ATTR_WO(module_ ## _name)
+
+#define MODULE_DEVICE_ATTR_RW(_name) \
+MODULE_DEVICE_ATTR_FUNC_STORE(_name); \
+MODULE_DEVICE_ATTR_FUNC_SHOW(_name); \
+static DEVICE_ATTR_RW(module_ ## _name)
+
+#define MODULE_DEVICE_ATTR_RO(_name) \
+MODULE_DEVICE_ATTR_FUNC_SHOW(_name); \
+static DEVICE_ATTR_RO(module_ ## _name)
+
 #endif
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 3/3] drivers/base/core: refcount kobject and bus on device attribute read / store
  2021-06-21 23:30 [PATCH v3 0/3] zram: fix few sysfs races Luis Chamberlain
  2021-06-21 23:30 ` [PATCH v3 1/3] zram: fix crashes due to use of cpu hotplug multistate Luis Chamberlain
  2021-06-21 23:36 ` [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal Luis Chamberlain
@ 2021-06-21 23:36 ` Luis Chamberlain
  2021-06-22  7:46   ` Greg KH
  2 siblings, 1 reply; 21+ messages in thread
From: Luis Chamberlain @ 2021-06-21 23:36 UTC (permalink / raw)
  To: minchan, gregkh, jeyu, ngupta, sergey.senozhatsky.work
  Cc: mcgrof, axboe, mbenes, jpoimboe, tglx, keescook, jikos, rostedt,
	peterz, linux-block, linux-kernel

It's possible today to have a device attribute read or store
race against device removal. When this happens there is a small
chance that the derefence for the private data area of the driver
is NULL.

Let's consider the zram driver as an example. Its possible to run into
a race where a sysfs knob is being used, we get preempted, and a zram
device is removed before we complete use of the sysfs knob. This can happen
for instance on block devices, where for instance the zram block devices
just part of the private data of the block device.

For instance this can happen in the following two situations
as examples to illustrate this better:

        CPU 1                            CPU 2
destroy_devices
...
                                 compact_store()
                                 zram = dev_to_zram(dev);
idr_for_each(zram_remove_cb
  zram_remove
  ...
  kfree(zram)
                                 down_read(&zram->init_lock);

        CPU 1                            CPU 2
hot_remove_store
                                 compact_store()
                                 zram = dev_to_zram(dev);
  zram_remove
    kfree(zram)
                                 down_read(&zram->init_lock);

To ensure the private data pointer is valid we could use bdget() / bdput()
in between access, however that would mean doing that in all sysfs
reads/stores on the driver. Instead a generic solution for all drivers
is to ensure the device kobject is still valid and also the bus, if
a bus is present.

This issue does not fix a known crash, however this race was
spotted by Minchan Kim through code inspection upon code review
of another zram patch.

Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/base/base.h |  2 ++
 drivers/base/bus.c  |  4 ++--
 drivers/base/core.c | 42 ++++++++++++++++++++++++++++++++++++++----
 3 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/drivers/base/base.h b/drivers/base/base.h
index e5f9b7e656c3..3f95b125b667 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -127,6 +127,8 @@ static inline void auxiliary_bus_init(void) { }
 
 struct kobject *virtual_device_parent(struct device *dev);
 
+extern struct bus_type *bus_get(struct bus_type *bus);
+extern void bus_put(struct bus_type *bus);
 extern int bus_add_device(struct device *dev);
 extern void bus_probe_device(struct device *dev);
 extern void bus_remove_device(struct device *dev);
diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index 36d0c654ea61..21c80d7d6433 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -39,7 +39,7 @@ static struct kset *system_kset;
 static int __must_check bus_rescan_devices_helper(struct device *dev,
 						void *data);
 
-static struct bus_type *bus_get(struct bus_type *bus)
+struct bus_type *bus_get(struct bus_type *bus)
 {
 	if (bus) {
 		kset_get(&bus->p->subsys);
@@ -48,7 +48,7 @@ static struct bus_type *bus_get(struct bus_type *bus)
 	return NULL;
 }
 
-static void bus_put(struct bus_type *bus)
+void bus_put(struct bus_type *bus)
 {
 	if (bus)
 		kset_put(&bus->p->subsys);
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 4a8bf8cda52b..2fc52264b897 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2042,28 +2042,62 @@ EXPORT_SYMBOL(dev_driver_string);
 static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
 			     char *buf)
 {
-	struct device_attribute *dev_attr = to_dev_attr(attr);
-	struct device *dev = kobj_to_dev(kobj);
+	struct device_attribute *dev_attr;
+	struct device *dev;
+	struct bus_type *bus = NULL;
 	ssize_t ret = -EIO;
 
+	dev = get_device(kobj_to_dev(kobj));
+	if (!dev)
+		return ret;
+
+	if (dev->bus) {
+		bus = bus_get(dev->bus);
+		if (!bus)
+			goto out;
+	}
+
+	dev_attr = to_dev_attr(attr);
 	if (dev_attr->show)
 		ret = dev_attr->show(dev, dev_attr, buf);
 	if (ret >= (ssize_t)PAGE_SIZE) {
 		printk("dev_attr_show: %pS returned bad count\n",
 				dev_attr->show);
 	}
+
+	bus_put(bus);
+out:
+	put_device(dev);
+
 	return ret;
 }
 
 static ssize_t dev_attr_store(struct kobject *kobj, struct attribute *attr,
 			      const char *buf, size_t count)
 {
-	struct device_attribute *dev_attr = to_dev_attr(attr);
-	struct device *dev = kobj_to_dev(kobj);
+	struct device_attribute *dev_attr;
+	struct device *dev;
+	struct bus_type *bus = NULL;
 	ssize_t ret = -EIO;
 
+	dev = get_device(kobj_to_dev(kobj));
+	if (!dev)
+		return ret;
+
+	if (dev->bus) {
+		bus = bus_get(dev->bus);
+		if (!bus)
+			goto out;
+	}
+
+	dev_attr = to_dev_attr(attr);
 	if (dev_attr->store)
 		ret = dev_attr->store(dev, dev_attr, buf, count);
+
+	bus_put(bus);
+out:
+	put_device(dev);
+
 	return ret;
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/3] zram: fix crashes due to use of cpu hotplug multistate
  2021-06-21 23:30 ` [PATCH v3 1/3] zram: fix crashes due to use of cpu hotplug multistate Luis Chamberlain
@ 2021-06-22  7:39   ` Greg KH
  2021-06-22 15:12     ` Luis Chamberlain
  0 siblings, 1 reply; 21+ messages in thread
From: Greg KH @ 2021-06-22  7:39 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: minchan, jeyu, ngupta, sergey.senozhatsky.work, axboe, mbenes,
	jpoimboe, tglx, keescook, jikos, rostedt, peterz, linux-block,
	linux-kernel

On Mon, Jun 21, 2021 at 04:30:11PM -0700, Luis Chamberlain wrote:
> Provide a simple state machine to fix races with driver exit where we
> remove the CPU multistate callbacks and re-initialization / creation of
> new per CPU instances which should be managed by these callbacks.
> 
> The zram driver makes use of cpu hotplug multistate support, whereby it
> associates a struct zcomp per CPU. Each struct zcomp represents a
> compression algorithm in charge of managing compression streams per CPU.
> Although a compiled zram driver only supports a fixed set of compression
> algorithms, each zram device gets a struct zcomp allocated per CPU. The
> "multi" in CPU hotplug multstate refers to these per cpu struct zcomp
> instances. Each of these will have the CPU hotplug callback called for
> it on CPU plug / unplug. The kernel's CPU hotplug multistate keeps a
> linked list of these different structures so that it will iterate over
> them on CPU transitions.
> 
> By default at driver initialization we will create just one zram device
> (num_devices=1) and a zcomp structure then set for the now default
> lzo-rle comrpession algorithm. At driver removal we first remove each
> zram device, and so we destroy the associated struct zcomp per CPU. But
> since we expose sysfs attributes to create new devices or reset / initialize
> existing zram devices, we can easily end up re-initializing a struct zcomp
> for a zram device before the exit routine of the module removes the cpu
> hotplug callback. When this happens the kernel's CPU hotplug will detect
> that at least one instance (struct zcomp for us) exists. This can happen
> in the following situation:
> 
> CPU 1                            CPU 2
> 
> class_unregister(...);

Now the sysfs files are removed and invalidated for all devices
associated with that class.

> idr_for_each(...);
> zram_debugfs_destroy();
>                                 disksize_store(...);

How will this call into the kobject's store function if
class_unregister() has already happened?

> idr_destroy(...);
> unregister_blkdev(...);

Ah, it's a block device's store function you are worried about, not the
class one?

> cpuhp_remove_multi_state(...);
> 
> The warning comes up on cpuhp_remove_multi_state() when it sees that the
> state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list.
> In this case, that a struct zcom still exists, the driver allowed its
> creation per CPU even though we could have just freed them per CPU
> though a call on another CPU, and we are then later trying to remove the
> hotplug callback.
> 
> Fix all this by providing a zram initialization boolean
> protected the the shared in the driver zram_index_mutex,
> which we can use to annotate when sysfs attributes are
> safe to use or not -- once the driver is properly initialized.
> When the driver is going down we also are sure to not let
> userspace muck with attributes which may affect each per cpu
> struct zcomp.
> 
> This also fixes a series of possible memory leaks. The
> crashes and memory leaks can easily be caused by issuing
> the zram02.sh script from the LTP project [0] in a loop
> in two separate windows:
> 
>   cd testcases/kernel/device-drivers/zram
>   while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done
> 
> You end up with a splat as follows:
> 
> kernel: zram: Removed device: zram0
> kernel: zram: Added device: zram0
> kernel: zram0: detected capacity change from 0 to 209715200
> kernel: Adding 104857596k swap on /dev/zram0.  Priority:-2 extents:1 across:104857596k SSFS
> kernel: zram0: detected capacitky change from 209715200 to 0
> kernel: zram0: detected capacity change from 0 to 209715200
> kernel: ------------[ cut here ]------------
> kernel: Error: Removing state 63 which has instances left.
> kernel: WARNING: CPU: 7 PID: 70457 at kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
> kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
> kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G            E     5.12.0-rc1-next-20210304 #3
> kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
> kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
> kernel: Code: <etc>
> kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
> kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
> kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
> kernel: RBP: 0000000000000000i R08: 0000000000000000 R09: ffffa800c139bcb8
> kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
> kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
> kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:0000000000000000
> kernel: CS:  0010 DS: 0000 ES 0000 CR0: 0000000080050033
> kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
> kernel: Call Trace:
> kernel: __cpuhp_remove_state+0x2e/0x80
> kernel: __do_sys_delete_module+0x190/0x2a0
> kernel:  do_syscall_64+0x33/0x80
> kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> The "Error: Removing state 63 which has instances left" refers
> to the zram per CPU struc zcomp instances left.
> 
> [0] https://github.com/linux-test-project/ltp.git
> 
> Acked-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  drivers/block/zram/zram_drv.c | 63 ++++++++++++++++++++++++++++++-----
>  1 file changed, 55 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index cf8deecc39ef..431b60cd85c1 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex);
>  static int zram_major;
>  static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
>  
> +bool zram_up;

static?

> +
>  /* Module params (documentation at end) */
>  static unsigned int num_devices = 1;
>  /*
> @@ -1704,6 +1706,7 @@ static void zram_reset_device(struct zram *zram)
>  	comp = zram->comp;
>  	disksize = zram->disksize;
>  	zram->disksize = 0;
> +	zram->comp = NULL;

Is this a new change?

Other than these two things, seems reasonable, if not total overkill :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal
  2021-06-21 23:36 ` [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal Luis Chamberlain
@ 2021-06-22  7:41   ` Greg KH
  2021-06-22 15:27     ` Luis Chamberlain
  2021-06-22  7:45   ` Greg KH
  1 sibling, 1 reply; 21+ messages in thread
From: Greg KH @ 2021-06-22  7:41 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: minchan, jeyu, ngupta, sergey.senozhatsky.work, axboe, mbenes,
	jpoimboe, tglx, keescook, jikos, rostedt, peterz, linux-block,
	linux-kernel

On Mon, Jun 21, 2021 at 04:36:34PM -0700, Luis Chamberlain wrote:
> +	ssize_t __ret; \
> +	if (!try_module_get(THIS_MODULE)) \

try_module_get(THIS_MODULE) is always racy and probably does not do what
you want it to do.  You always want to get/put module references from
code that is NOT the code calling these functions.

> +		return -ENODEV; \
> +	__ret = _name ## _store(dev, attr, buf, len); \
> +	module_put(THIS_MODULE); \

This too is going to be racy.

While fun to poke at, I still think this is pointless.

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal
  2021-06-21 23:36 ` [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal Luis Chamberlain
  2021-06-22  7:41   ` Greg KH
@ 2021-06-22  7:45   ` Greg KH
  2021-06-22 16:32     ` Luis Chamberlain
  1 sibling, 1 reply; 21+ messages in thread
From: Greg KH @ 2021-06-22  7:45 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: minchan, jeyu, ngupta, sergey.senozhatsky.work, axboe, mbenes,
	jpoimboe, tglx, keescook, jikos, rostedt, peterz, linux-block,
	linux-kernel

On Mon, Jun 21, 2021 at 04:36:34PM -0700, Luis Chamberlain wrote:
> When sysfs attributes use a lock also used on driver removal we can
> potentially deadlock. This happens when for instance a sysfs file on
> a driver is used, then at the same time we have driver removal trigger.
> The driver removal code holds a lock, and then the sysfs file entry waits
> for the same lock. While holding the lock the driver removal tries to
> remove the sysfs entries, but these cannot be removed yet as one is
> waiting for a lock. This won't complete as the lock is already held.
> Likewise module removal cannot complete, and so we deadlock.

This is all about removing modules, not about "driver removal" from a
device.  Please make the subject line here more explicit.

> 
> To fix this we just *try* to get a refcount to the module when a shared
> lock is used, prior to mucking with a sysfs attribute. If this fails we
> just give up right away.
> 
> We use a try method as a full lock means we'd then make our sysfs attributes
> busy us out from possible module removal, and so userspace could force denying
> module removal, a silly form of "DOS" against module removal. A try lock on
> the module removal ensures we give priority to module removal and interacting
> with sysfs attributes only comes second. Using a full lock could mean for
> instance that if you don't stop poking at sysfs files you cannot remove a
> module.
> 
> This deadlock was first reported with the zram driver, a sketch of how
> this can happen follows:
> 
> CPU A                              CPU B
>                                    whatever_store()
> module_unload
>   mutex_lock(foo)
>                                    mutex_lock(foo)
>    del_gendisk(zram->disk);
>      device_del()
>        device_remove_groups()

Can you duplicate this in a real-world situation?

What tools remove the zram module from the system on the fly?

> In this situation whatever_store() is waiting for the mutex foo to
> become unlocked, but that won't happen until module removal is complete.
> But module removal won't complete until the syfs file being poked completes
> which is waiting for a lock already held.
> 
> This is a generic kernel issue with sysfs files which use any lock also
> used on module removal. Different generic solutions have been proposed.
> One approach proposed is by directly by augmenting attributes with module
> information [0]. This patch implements a solution by adding macros with
> the prefix MODULE_DEVICE_ATTR_*() which accomplish the same. Until we
> don't have a generic agreed upon solution for this shared between drivers,
> we must implement a fix for this on each driver.
> 
> We make zram use the new MODULE_DEVICE_ATTR_*() helpers, and completely
> open code the solution for class attributes as there are only a few of
> those.
> 
> This issue can be reproduced easily on the zram driver as follows:
> 
> Loop 1 on one terminal:
> 
> while true;
> 	do modprobe zram;
> 	modprobe -r zram;
> done
> 
> Loop 2 on a second terminal:
> while true; do
> 	echo 1024 >  /sys/block/zram0/disksize;
> 	echo 1 > /sys/block/zram0/reset;
> done

As fun as this is, it's not a real workload, please do not pretend that
it is.

And your code is still racy, see below.  You just made the window even
smaller, which you still should be objecting to as you somehow feel this
is a valid usecase :)

> @@ -2048,13 +2048,19 @@ static ssize_t hot_add_show(struct class *class,
>  {
>  	int ret;
>  
> +	if (!try_module_get(THIS_MODULE))
> +		return -ENODEV;
> +

You can not increment/decrement your own module's reference count and
expect it to work properly, as it is still a race.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 3/3] drivers/base/core: refcount kobject and bus on device attribute read / store
  2021-06-21 23:36 ` [PATCH v3 3/3] drivers/base/core: refcount kobject and bus on device attribute read / store Luis Chamberlain
@ 2021-06-22  7:46   ` Greg KH
  2021-06-22 16:44     ` Luis Chamberlain
  0 siblings, 1 reply; 21+ messages in thread
From: Greg KH @ 2021-06-22  7:46 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: minchan, jeyu, ngupta, sergey.senozhatsky.work, axboe, mbenes,
	jpoimboe, tglx, keescook, jikos, rostedt, peterz, linux-block,
	linux-kernel

On Mon, Jun 21, 2021 at 04:36:51PM -0700, Luis Chamberlain wrote:
> It's possible today to have a device attribute read or store
> race against device removal. When this happens there is a small
> chance that the derefence for the private data area of the driver
> is NULL.
> 
> Let's consider the zram driver as an example. Its possible to run into
> a race where a sysfs knob is being used, we get preempted, and a zram
> device is removed before we complete use of the sysfs knob. This can happen
> for instance on block devices, where for instance the zram block devices
> just part of the private data of the block device.
> 
> For instance this can happen in the following two situations
> as examples to illustrate this better:
> 
>         CPU 1                            CPU 2
> destroy_devices
> ...
>                                  compact_store()
>                                  zram = dev_to_zram(dev);
> idr_for_each(zram_remove_cb
>   zram_remove
>   ...
>   kfree(zram)
>                                  down_read(&zram->init_lock);
> 
>         CPU 1                            CPU 2
> hot_remove_store
>                                  compact_store()
>                                  zram = dev_to_zram(dev);
>   zram_remove
>     kfree(zram)
>                                  down_read(&zram->init_lock);
> 
> To ensure the private data pointer is valid we could use bdget() / bdput()
> in between access, however that would mean doing that in all sysfs
> reads/stores on the driver. Instead a generic solution for all drivers
> is to ensure the device kobject is still valid and also the bus, if
> a bus is present.
> 
> This issue does not fix a known crash, however this race was
> spotted by Minchan Kim through code inspection upon code review
> of another zram patch.
> 
> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  drivers/base/base.h |  2 ++
>  drivers/base/bus.c  |  4 ++--
>  drivers/base/core.c | 42 ++++++++++++++++++++++++++++++++++++++----
>  3 files changed, 42 insertions(+), 6 deletions(-)

Please make this an independent patch of the zram mess and I will be
glad to consider it for the driver core tree then.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/3] zram: fix crashes due to use of cpu hotplug multistate
  2021-06-22  7:39   ` Greg KH
@ 2021-06-22 15:12     ` Luis Chamberlain
  0 siblings, 0 replies; 21+ messages in thread
From: Luis Chamberlain @ 2021-06-22 15:12 UTC (permalink / raw)
  To: Greg KH
  Cc: minchan, jeyu, ngupta, sergey.senozhatsky.work, axboe, mbenes,
	jpoimboe, tglx, keescook, jikos, rostedt, peterz, linux-block,
	linux-kernel

On Tue, Jun 22, 2021 at 09:39:40AM +0200, Greg KH wrote:
> On Mon, Jun 21, 2021 at 04:30:11PM -0700, Luis Chamberlain wrote:
> > CPU 1                            CPU 2
> > 
> > class_unregister(...);
> 
> Now the sysfs files are removed and invalidated for all devices
> associated with that class.

You're right the disksize_store() would have to happen before.

> > idr_for_each(...);
> > zram_debugfs_destroy();
> >                                 disksize_store(...);
> 
> How will this call into the kobject's store function if
> class_unregister() has already happened?

The disksize_store() would indeed have to happen before.

> > idr_destroy(...);
> > unregister_blkdev(...);
> 
> Ah, it's a block device's store function you are worried about, not the
> class one?

In this case its about any files which can change the cpu compression
streams.

> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index cf8deecc39ef..431b60cd85c1 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex);
> >  static int zram_major;
> >  static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
> >  
> > +bool zram_up;
> 
> static?

Will fix.

> > +
> >  /* Module params (documentation at end) */
> >  static unsigned int num_devices = 1;
> >  /*
> > @@ -1704,6 +1706,7 @@ static void zram_reset_device(struct zram *zram)
> >  	comp = zram->comp;
> >  	disksize = zram->disksize;
> >  	zram->disksize = 0;
> > +	zram->comp = NULL;
> 
> Is this a new change?

It is a sanity change, but indeed, it is separate. I'll let Minchan
decide if he would prefer this to go out as a separate change.

  Luis

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal
  2021-06-22  7:41   ` Greg KH
@ 2021-06-22 15:27     ` Luis Chamberlain
  2021-06-22 16:27       ` Greg KH
  0 siblings, 1 reply; 21+ messages in thread
From: Luis Chamberlain @ 2021-06-22 15:27 UTC (permalink / raw)
  To: Greg KH
  Cc: minchan, jeyu, ngupta, sergey.senozhatsky.work, axboe, mbenes,
	jpoimboe, tglx, keescook, jikos, rostedt, peterz, linux-block,
	linux-kernel

On Tue, Jun 22, 2021 at 09:41:23AM +0200, Greg KH wrote:
> On Mon, Jun 21, 2021 at 04:36:34PM -0700, Luis Chamberlain wrote:
> > +	ssize_t __ret; \
> > +	if (!try_module_get(THIS_MODULE)) \
> 
> try_module_get(THIS_MODULE) is always racy and probably does not do what
> you want it to do.  You always want to get/put module references from
> code that is NOT the code calling these functions.

In this case, we want it to trump module removal if it succeeds. That's all.

> > +		return -ENODEV; \
> > +	__ret = _name ## _store(dev, attr, buf, len); \
> > +	module_put(THIS_MODULE); \
> 
> This too is going to be racy.
> 
> While fun to poke at, I still think this is pointless.

If you have a better idea, which does not "DOS" module removal, please
let me know!

  Luis

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal
  2021-06-22 15:27     ` Luis Chamberlain
@ 2021-06-22 16:27       ` Greg KH
  2021-06-22 16:40         ` Luis Chamberlain
  0 siblings, 1 reply; 21+ messages in thread
From: Greg KH @ 2021-06-22 16:27 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: minchan, jeyu, ngupta, sergey.senozhatsky.work, axboe, mbenes,
	jpoimboe, tglx, keescook, jikos, rostedt, peterz, linux-block,
	linux-kernel

On Tue, Jun 22, 2021 at 08:27:13AM -0700, Luis Chamberlain wrote:
> On Tue, Jun 22, 2021 at 09:41:23AM +0200, Greg KH wrote:
> > On Mon, Jun 21, 2021 at 04:36:34PM -0700, Luis Chamberlain wrote:
> > > +	ssize_t __ret; \
> > > +	if (!try_module_get(THIS_MODULE)) \
> > 
> > try_module_get(THIS_MODULE) is always racy and probably does not do what
> > you want it to do.  You always want to get/put module references from
> > code that is NOT the code calling these functions.
> 
> In this case, we want it to trump module removal if it succeeds. That's all.

True, but either you stop the race, or you do not right?  If you are so
invested in your load/unload test, this should show up with this code
eventually as well.

> > > +		return -ENODEV; \
> > > +	__ret = _name ## _store(dev, attr, buf, len); \
> > > +	module_put(THIS_MODULE); \
> > 
> > This too is going to be racy.
> > 
> > While fun to poke at, I still think this is pointless.
> 
> If you have a better idea, which does not "DOS" module removal, please
> let me know!

I have yet to understand why you think that the load/unload in a loop is
a valid use case.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal
  2021-06-22  7:45   ` Greg KH
@ 2021-06-22 16:32     ` Luis Chamberlain
  2021-06-22 17:16       ` Greg KH
  0 siblings, 1 reply; 21+ messages in thread
From: Luis Chamberlain @ 2021-06-22 16:32 UTC (permalink / raw)
  To: Greg KH
  Cc: minchan, jeyu, ngupta, sergey.senozhatsky.work, axboe, mbenes,
	jpoimboe, tglx, keescook, jikos, rostedt, peterz, linux-block,
	linux-kernel

On Tue, Jun 22, 2021 at 09:45:39AM +0200, Greg KH wrote:
> On Mon, Jun 21, 2021 at 04:36:34PM -0700, Luis Chamberlain wrote:
> > When sysfs attributes use a lock also used on driver removal we can
> > potentially deadlock. This happens when for instance a sysfs file on
> > a driver is used, then at the same time we have driver removal trigger.
> > The driver removal code holds a lock, and then the sysfs file entry waits
> > for the same lock. While holding the lock the driver removal tries to
> > remove the sysfs entries, but these cannot be removed yet as one is
> > waiting for a lock. This won't complete as the lock is already held.
> > Likewise module removal cannot complete, and so we deadlock.
> 
> This is all about removing modules, not about "driver removal" from a
> device.  Please make the subject line here more explicit.

Sure.

> > To fix this we just *try* to get a refcount to the module when a shared
> > lock is used, prior to mucking with a sysfs attribute. If this fails we
> > just give up right away.
> > 
> > We use a try method as a full lock means we'd then make our sysfs attributes
> > busy us out from possible module removal, and so userspace could force denying
> > module removal, a silly form of "DOS" against module removal. A try lock on
> > the module removal ensures we give priority to module removal and interacting
> > with sysfs attributes only comes second. Using a full lock could mean for
> > instance that if you don't stop poking at sysfs files you cannot remove a
> > module.
> > 
> > This deadlock was first reported with the zram driver, a sketch of how
> > this can happen follows:
> > 
> > CPU A                              CPU B
> >                                    whatever_store()
> > module_unload
> >   mutex_lock(foo)
> >                                    mutex_lock(foo)
> >    del_gendisk(zram->disk);
> >      device_del()
> >        device_remove_groups()
> 
> Can you duplicate this in a real-world situation?
> 
> What tools remove the zram module from the system on the fly?

A customer did run into it through a series of automated tests. I was
able to finally reproduce with the instructions given below. I
simplified it given that the series of test the customer was running
was much more complex.

> > In this situation whatever_store() is waiting for the mutex foo to
> > become unlocked, but that won't happen until module removal is complete.
> > But module removal won't complete until the syfs file being poked completes
> > which is waiting for a lock already held.
> > 
> > This is a generic kernel issue with sysfs files which use any lock also
> > used on module removal. Different generic solutions have been proposed.
> > One approach proposed is by directly by augmenting attributes with module
> > information [0]. This patch implements a solution by adding macros with
> > the prefix MODULE_DEVICE_ATTR_*() which accomplish the same. Until we
> > don't have a generic agreed upon solution for this shared between drivers,
> > we must implement a fix for this on each driver.
> > 
> > We make zram use the new MODULE_DEVICE_ATTR_*() helpers, and completely
> > open code the solution for class attributes as there are only a few of
> > those.
> > 
> > This issue can be reproduced easily on the zram driver as follows:
> > 
> > Loop 1 on one terminal:
> > 
> > while true;
> > 	do modprobe zram;
> > 	modprobe -r zram;
> > done
> > 
> > Loop 2 on a second terminal:
> > while true; do
> > 	echo 1024 >  /sys/block/zram0/disksize;
> > 	echo 1 > /sys/block/zram0/reset;
> > done
> 
> As fun as this is, it's not a real workload, please do not pretend that
> it is.

Whoever said that it was? This is just a way to reproduce an issue which
was reported.

> And your code is still racy, see below.  You just made the window even
> smaller, which you still should be objecting to as you somehow feel this
> is a valid usecase :)
> 
> > @@ -2048,13 +2048,19 @@ static ssize_t hot_add_show(struct class *class,
> >  {
> >  	int ret;
> >  
> > +	if (!try_module_get(THIS_MODULE))
> > +		return -ENODEV;
> > +
> 
> You can not increment/decrement your own module's reference count and
> expect it to work properly, as it is still a race.

The goal here is to prevent an rmmod call if this succeeds. If it
succeeds then any subsequent rmmod will fail. Can you explain how
this is still racy?

  Luis

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal
  2021-06-22 16:27       ` Greg KH
@ 2021-06-22 16:40         ` Luis Chamberlain
  2021-06-22 16:51           ` Greg KH
  0 siblings, 1 reply; 21+ messages in thread
From: Luis Chamberlain @ 2021-06-22 16:40 UTC (permalink / raw)
  To: Greg KH
  Cc: minchan, jeyu, ngupta, sergey.senozhatsky.work, axboe, mbenes,
	jpoimboe, tglx, keescook, jikos, rostedt, peterz, linux-block,
	linux-kernel

On Tue, Jun 22, 2021 at 06:27:52PM +0200, Greg KH wrote:
> On Tue, Jun 22, 2021 at 08:27:13AM -0700, Luis Chamberlain wrote:
> > On Tue, Jun 22, 2021 at 09:41:23AM +0200, Greg KH wrote:
> > > On Mon, Jun 21, 2021 at 04:36:34PM -0700, Luis Chamberlain wrote:
> > > > +	ssize_t __ret; \
> > > > +	if (!try_module_get(THIS_MODULE)) \
> > > 
> > > try_module_get(THIS_MODULE) is always racy and probably does not do what
> > > you want it to do.  You always want to get/put module references from
> > > code that is NOT the code calling these functions.
> > 
> > In this case, we want it to trump module removal if it succeeds. That's all.
> 
> True, but either you stop the race, or you do not right?  If you are so
> invested in your load/unload test, this should show up with this code
> eventually as well.

I still do not see how the race is possible give the goal to prevent
module removal if a sysfs file is being used. If rmmod is taking
place, this simply will bail out.

> > > > +		return -ENODEV; \
> > > > +	__ret = _name ## _store(dev, attr, buf, len); \
> > > > +	module_put(THIS_MODULE); \
> > > 
> > > This too is going to be racy.
> > > 
> > > While fun to poke at, I still think this is pointless.
> > 
> > If you have a better idea, which does not "DOS" module removal, please
> > let me know!
> 
> I have yet to understand why you think that the load/unload in a loop is
> a valid use case.

That is dependent upon the intrastructure tests built for a driver.

In the case of fstests and blktests we have drivers which *always* get
removed and loaded on each test. Take for instance scsi_debug, which
creates / destroys virtual devices on the per test. Likewise, to build
confidence that failure rate is as close as possible to 0, one must run
a test as many times as possible in a loop. And, to build confidence in
a test, in some situations one ends up running modprobe / rmmod in a
loop.

In this case a customer does have a complex system of tests, and by looking
at the crash logs I managed to simplify the way to reproduce it using
simple shell scripts.

  Luis

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 3/3] drivers/base/core: refcount kobject and bus on device attribute read / store
  2021-06-22  7:46   ` Greg KH
@ 2021-06-22 16:44     ` Luis Chamberlain
  2021-06-22 16:48       ` Greg KH
  0 siblings, 1 reply; 21+ messages in thread
From: Luis Chamberlain @ 2021-06-22 16:44 UTC (permalink / raw)
  To: Greg KH
  Cc: minchan, jeyu, ngupta, sergey.senozhatsky.work, axboe, mbenes,
	jpoimboe, tglx, keescook, jikos, rostedt, peterz, linux-block,
	linux-kernel

On Tue, Jun 22, 2021 at 09:46:26AM +0200, Greg KH wrote:
> On Mon, Jun 21, 2021 at 04:36:51PM -0700, Luis Chamberlain wrote:
> > It's possible today to have a device attribute read or store
> > race against device removal. When this happens there is a small
> > chance that the derefence for the private data area of the driver
> > is NULL.
> > 
> > Let's consider the zram driver as an example. Its possible to run into
> > a race where a sysfs knob is being used, we get preempted, and a zram
> > device is removed before we complete use of the sysfs knob. This can happen
> > for instance on block devices, where for instance the zram block devices
> > just part of the private data of the block device.
> > 
> > For instance this can happen in the following two situations
> > as examples to illustrate this better:
> > 
> >         CPU 1                            CPU 2
> > destroy_devices
> > ...
> >                                  compact_store()
> >                                  zram = dev_to_zram(dev);
> > idr_for_each(zram_remove_cb
> >   zram_remove
> >   ...
> >   kfree(zram)
> >                                  down_read(&zram->init_lock);
> > 
> >         CPU 1                            CPU 2
> > hot_remove_store
> >                                  compact_store()
> >                                  zram = dev_to_zram(dev);
> >   zram_remove
> >     kfree(zram)
> >                                  down_read(&zram->init_lock);
> > 
> > To ensure the private data pointer is valid we could use bdget() / bdput()
> > in between access, however that would mean doing that in all sysfs
> > reads/stores on the driver. Instead a generic solution for all drivers
> > is to ensure the device kobject is still valid and also the bus, if
> > a bus is present.
> > 
> > This issue does not fix a known crash, however this race was
> > spotted by Minchan Kim through code inspection upon code review
> > of another zram patch.
> > 
> > Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > ---
> >  drivers/base/base.h |  2 ++
> >  drivers/base/bus.c  |  4 ++--
> >  drivers/base/core.c | 42 ++++++++++++++++++++++++++++++++++++++----
> >  3 files changed, 42 insertions(+), 6 deletions(-)
> 
> Please make this an independent patch of the zram mess  and I will be
> glad to consider it for the driver core tree then.

What do you mean by making it independent?

The patch does not depend on the zram changes, and so, this can
be merged separately as-is.

If you mean that I should not mention zram on the commit log, please
let me know. I however think a concrete example is useful.

Or do you just mean that I should resend this out as a new patch
without it being attached to the zram thread?

 Luis

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 3/3] drivers/base/core: refcount kobject and bus on device attribute read / store
  2021-06-22 16:44     ` Luis Chamberlain
@ 2021-06-22 16:48       ` Greg KH
  0 siblings, 0 replies; 21+ messages in thread
From: Greg KH @ 2021-06-22 16:48 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: minchan, jeyu, ngupta, sergey.senozhatsky.work, axboe, mbenes,
	jpoimboe, tglx, keescook, jikos, rostedt, peterz, linux-block,
	linux-kernel

On Tue, Jun 22, 2021 at 09:44:02AM -0700, Luis Chamberlain wrote:
> On Tue, Jun 22, 2021 at 09:46:26AM +0200, Greg KH wrote:
> > On Mon, Jun 21, 2021 at 04:36:51PM -0700, Luis Chamberlain wrote:
> > > It's possible today to have a device attribute read or store
> > > race against device removal. When this happens there is a small
> > > chance that the derefence for the private data area of the driver
> > > is NULL.
> > > 
> > > Let's consider the zram driver as an example. Its possible to run into
> > > a race where a sysfs knob is being used, we get preempted, and a zram
> > > device is removed before we complete use of the sysfs knob. This can happen
> > > for instance on block devices, where for instance the zram block devices
> > > just part of the private data of the block device.
> > > 
> > > For instance this can happen in the following two situations
> > > as examples to illustrate this better:
> > > 
> > >         CPU 1                            CPU 2
> > > destroy_devices
> > > ...
> > >                                  compact_store()
> > >                                  zram = dev_to_zram(dev);
> > > idr_for_each(zram_remove_cb
> > >   zram_remove
> > >   ...
> > >   kfree(zram)
> > >                                  down_read(&zram->init_lock);
> > > 
> > >         CPU 1                            CPU 2
> > > hot_remove_store
> > >                                  compact_store()
> > >                                  zram = dev_to_zram(dev);
> > >   zram_remove
> > >     kfree(zram)
> > >                                  down_read(&zram->init_lock);
> > > 
> > > To ensure the private data pointer is valid we could use bdget() / bdput()
> > > in between access, however that would mean doing that in all sysfs
> > > reads/stores on the driver. Instead a generic solution for all drivers
> > > is to ensure the device kobject is still valid and also the bus, if
> > > a bus is present.
> > > 
> > > This issue does not fix a known crash, however this race was
> > > spotted by Minchan Kim through code inspection upon code review
> > > of another zram patch.
> > > 
> > > Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > > ---
> > >  drivers/base/base.h |  2 ++
> > >  drivers/base/bus.c  |  4 ++--
> > >  drivers/base/core.c | 42 ++++++++++++++++++++++++++++++++++++++----
> > >  3 files changed, 42 insertions(+), 6 deletions(-)
> > 
> > Please make this an independent patch of the zram mess  and I will be
> > glad to consider it for the driver core tree then.
> 
> What do you mean by making it independent?
> 
> The patch does not depend on the zram changes, and so, this can
> be merged separately as-is.

Great, then make it a 1/1 patch.  Putting it as patch 3 here means I can
not take it as our tools pull in the full series.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal
  2021-06-22 16:40         ` Luis Chamberlain
@ 2021-06-22 16:51           ` Greg KH
  2021-06-22 17:00             ` Luis Chamberlain
  0 siblings, 1 reply; 21+ messages in thread
From: Greg KH @ 2021-06-22 16:51 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: minchan, jeyu, ngupta, sergey.senozhatsky.work, axboe, mbenes,
	jpoimboe, tglx, keescook, jikos, rostedt, peterz, linux-block,
	linux-kernel

On Tue, Jun 22, 2021 at 09:40:27AM -0700, Luis Chamberlain wrote:
> On Tue, Jun 22, 2021 at 06:27:52PM +0200, Greg KH wrote:
> > On Tue, Jun 22, 2021 at 08:27:13AM -0700, Luis Chamberlain wrote:
> > > On Tue, Jun 22, 2021 at 09:41:23AM +0200, Greg KH wrote:
> > > > On Mon, Jun 21, 2021 at 04:36:34PM -0700, Luis Chamberlain wrote:
> > > > > +	ssize_t __ret; \
> > > > > +	if (!try_module_get(THIS_MODULE)) \
> > > > 
> > > > try_module_get(THIS_MODULE) is always racy and probably does not do what
> > > > you want it to do.  You always want to get/put module references from
> > > > code that is NOT the code calling these functions.
> > > 
> > > In this case, we want it to trump module removal if it succeeds. That's all.
> > 
> > True, but either you stop the race, or you do not right?  If you are so
> > invested in your load/unload test, this should show up with this code
> > eventually as well.
> 
> I still do not see how the race is possible give the goal to prevent
> module removal if a sysfs file is being used. If rmmod is taking
> place, this simply will bail out.
> 
> > > > > +		return -ENODEV; \
> > > > > +	__ret = _name ## _store(dev, attr, buf, len); \
> > > > > +	module_put(THIS_MODULE); \
> > > > 
> > > > This too is going to be racy.
> > > > 
> > > > While fun to poke at, I still think this is pointless.
> > > 
> > > If you have a better idea, which does not "DOS" module removal, please
> > > let me know!
> > 
> > I have yet to understand why you think that the load/unload in a loop is
> > a valid use case.
> 
> That is dependent upon the intrastructure tests built for a driver.
> 
> In the case of fstests and blktests we have drivers which *always* get
> removed and loaded on each test. Take for instance scsi_debug, which
> creates / destroys virtual devices on the per test. Likewise, to build
> confidence that failure rate is as close as possible to 0, one must run
> a test as many times as possible in a loop. And, to build confidence in
> a test, in some situations one ends up running modprobe / rmmod in a
> loop.
> 
> In this case a customer does have a complex system of tests, and by looking
> at the crash logs I managed to simplify the way to reproduce it using
> simple shell scripts.

And is _this_ change needed even with the changes in patch 1/3?

I think that commit fixes your issues given that you will not unload the
module until after the sysfs devices are removed from the system.  Have
you tried that alone with your test?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal
  2021-06-22 16:51           ` Greg KH
@ 2021-06-22 17:00             ` Luis Chamberlain
  0 siblings, 0 replies; 21+ messages in thread
From: Luis Chamberlain @ 2021-06-22 17:00 UTC (permalink / raw)
  To: Greg KH
  Cc: minchan, jeyu, ngupta, sergey.senozhatsky.work, axboe, mbenes,
	jpoimboe, tglx, keescook, jikos, rostedt, peterz, linux-block,
	linux-kernel

On Tue, Jun 22, 2021 at 06:51:13PM +0200, Greg KH wrote:
> On Tue, Jun 22, 2021 at 09:40:27AM -0700, Luis Chamberlain wrote:
> > On Tue, Jun 22, 2021 at 06:27:52PM +0200, Greg KH wrote:
> > > On Tue, Jun 22, 2021 at 08:27:13AM -0700, Luis Chamberlain wrote:
> > > > On Tue, Jun 22, 2021 at 09:41:23AM +0200, Greg KH wrote:
> > > > > On Mon, Jun 21, 2021 at 04:36:34PM -0700, Luis Chamberlain wrote:
> > > > > > +	ssize_t __ret; \
> > > > > > +	if (!try_module_get(THIS_MODULE)) \
> > > > > 
> > > > > try_module_get(THIS_MODULE) is always racy and probably does not do what
> > > > > you want it to do.  You always want to get/put module references from
> > > > > code that is NOT the code calling these functions.
> > > > 
> > > > In this case, we want it to trump module removal if it succeeds. That's all.
> > > 
> > > True, but either you stop the race, or you do not right?  If you are so
> > > invested in your load/unload test, this should show up with this code
> > > eventually as well.
> > 
> > I still do not see how the race is possible give the goal to prevent
> > module removal if a sysfs file is being used. If rmmod is taking
> > place, this simply will bail out.
> > 
> > > > > > +		return -ENODEV; \
> > > > > > +	__ret = _name ## _store(dev, attr, buf, len); \
> > > > > > +	module_put(THIS_MODULE); \
> > > > > 
> > > > > This too is going to be racy.
> > > > > 
> > > > > While fun to poke at, I still think this is pointless.
> > > > 
> > > > If you have a better idea, which does not "DOS" module removal, please
> > > > let me know!
> > > 
> > > I have yet to understand why you think that the load/unload in a loop is
> > > a valid use case.
> > 
> > That is dependent upon the intrastructure tests built for a driver.
> > 
> > In the case of fstests and blktests we have drivers which *always* get
> > removed and loaded on each test. Take for instance scsi_debug, which
> > creates / destroys virtual devices on the per test. Likewise, to build
> > confidence that failure rate is as close as possible to 0, one must run
> > a test as many times as possible in a loop. And, to build confidence in
> > a test, in some situations one ends up running modprobe / rmmod in a
> > loop.
> > 
> > In this case a customer does have a complex system of tests, and by looking
> > at the crash logs I managed to simplify the way to reproduce it using
> > simple shell scripts.
> 
> And is _this_ change needed even with the changes in patch 1/3?

Oh absolutely. This patch is needed 100%. Without it, it is actually
pretty trivial to deadlock as noted in my instructions on how to
reproduce.

> I think that commit fixes your issues given that you will not unload the
> module until after the sysfs devices are removed from the system.  Have
> you tried that alone with your test?

I have tried that, and it does not resolve the deadlock.

It was *why* I have been insisting that this is a real issue, and why I
decided to instead try to implement something generic after I was hinted
by livepatch folks that they also had observed a similar deadlock, and
so that a generic solution would be appreciated by them.

  Luis

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal
  2021-06-22 16:32     ` Luis Chamberlain
@ 2021-06-22 17:16       ` Greg KH
  2021-06-22 17:27         ` Luis Chamberlain
  0 siblings, 1 reply; 21+ messages in thread
From: Greg KH @ 2021-06-22 17:16 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: minchan, jeyu, ngupta, sergey.senozhatsky.work, axboe, mbenes,
	jpoimboe, tglx, keescook, jikos, rostedt, peterz, linux-block,
	linux-kernel

On Tue, Jun 22, 2021 at 09:32:08AM -0700, Luis Chamberlain wrote:
> On Tue, Jun 22, 2021 at 09:45:39AM +0200, Greg KH wrote:
> > On Mon, Jun 21, 2021 at 04:36:34PM -0700, Luis Chamberlain wrote:
> > > When sysfs attributes use a lock also used on driver removal we can
> > > potentially deadlock. This happens when for instance a sysfs file on
> > > a driver is used, then at the same time we have driver removal trigger.
> > > The driver removal code holds a lock, and then the sysfs file entry waits
> > > for the same lock. While holding the lock the driver removal tries to
> > > remove the sysfs entries, but these cannot be removed yet as one is
> > > waiting for a lock. This won't complete as the lock is already held.
> > > Likewise module removal cannot complete, and so we deadlock.
> > 
> > This is all about removing modules, not about "driver removal" from a
> > device.  Please make the subject line here more explicit.
> 
> Sure.
> 
> > > To fix this we just *try* to get a refcount to the module when a shared
> > > lock is used, prior to mucking with a sysfs attribute. If this fails we
> > > just give up right away.
> > > 
> > > We use a try method as a full lock means we'd then make our sysfs attributes
> > > busy us out from possible module removal, and so userspace could force denying
> > > module removal, a silly form of "DOS" against module removal. A try lock on
> > > the module removal ensures we give priority to module removal and interacting
> > > with sysfs attributes only comes second. Using a full lock could mean for
> > > instance that if you don't stop poking at sysfs files you cannot remove a
> > > module.
> > > 
> > > This deadlock was first reported with the zram driver, a sketch of how
> > > this can happen follows:
> > > 
> > > CPU A                              CPU B
> > >                                    whatever_store()
> > > module_unload
> > >   mutex_lock(foo)
> > >                                    mutex_lock(foo)
> > >    del_gendisk(zram->disk);
> > >      device_del()
> > >        device_remove_groups()
> > 
> > Can you duplicate this in a real-world situation?
> > 
> > What tools remove the zram module from the system on the fly?
> 
> A customer did run into it through a series of automated tests. I was
> able to finally reproduce with the instructions given below. I
> simplified it given that the series of test the customer was running
> was much more complex.
> 
> > > In this situation whatever_store() is waiting for the mutex foo to
> > > become unlocked, but that won't happen until module removal is complete.
> > > But module removal won't complete until the syfs file being poked completes
> > > which is waiting for a lock already held.
> > > 
> > > This is a generic kernel issue with sysfs files which use any lock also
> > > used on module removal. Different generic solutions have been proposed.
> > > One approach proposed is by directly by augmenting attributes with module
> > > information [0]. This patch implements a solution by adding macros with
> > > the prefix MODULE_DEVICE_ATTR_*() which accomplish the same. Until we
> > > don't have a generic agreed upon solution for this shared between drivers,
> > > we must implement a fix for this on each driver.
> > > 
> > > We make zram use the new MODULE_DEVICE_ATTR_*() helpers, and completely
> > > open code the solution for class attributes as there are only a few of
> > > those.
> > > 
> > > This issue can be reproduced easily on the zram driver as follows:
> > > 
> > > Loop 1 on one terminal:
> > > 
> > > while true;
> > > 	do modprobe zram;
> > > 	modprobe -r zram;
> > > done
> > > 
> > > Loop 2 on a second terminal:
> > > while true; do
> > > 	echo 1024 >  /sys/block/zram0/disksize;
> > > 	echo 1 > /sys/block/zram0/reset;
> > > done
> > 
> > As fun as this is, it's not a real workload, please do not pretend that
> > it is.
> 
> Whoever said that it was? This is just a way to reproduce an issue which
> was reported.
> 
> > And your code is still racy, see below.  You just made the window even
> > smaller, which you still should be objecting to as you somehow feel this
> > is a valid usecase :)
> > 
> > > @@ -2048,13 +2048,19 @@ static ssize_t hot_add_show(struct class *class,
> > >  {
> > >  	int ret;
> > >  
> > > +	if (!try_module_get(THIS_MODULE))
> > > +		return -ENODEV;
> > > +
> > 
> > You can not increment/decrement your own module's reference count and
> > expect it to work properly, as it is still a race.
> 
> The goal here is to prevent an rmmod call if this succeeds. If it
> succeeds then any subsequent rmmod will fail. Can you explain how
> this is still racy?

{sigh}

What happens if the driver core is just about to call hot_add_show() and
the module is removed from the system.  It then calls to the memory
location that hot_add_show() was previously at, but now that is not a
valid pointer to code, and boom.

That all happened _BEFORE_ you were even able to call try_module_get().

You have to check if a function pointer is valid _BEFORE_ you call into
it.

So you might have reduced the race, but it is still there.  Odds are you
can reproduce it if you enable "overwrite module memory with 0xff when
removed" debug options.

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal
  2021-06-22 17:16       ` Greg KH
@ 2021-06-22 17:27         ` Luis Chamberlain
  2021-06-22 18:05           ` Greg KH
  0 siblings, 1 reply; 21+ messages in thread
From: Luis Chamberlain @ 2021-06-22 17:27 UTC (permalink / raw)
  To: Greg KH
  Cc: minchan, jeyu, ngupta, sergey.senozhatsky.work, axboe, mbenes,
	jpoimboe, tglx, keescook, jikos, rostedt, peterz, linux-block,
	linux-kernel

On Tue, Jun 22, 2021 at 07:16:34PM +0200, Greg KH wrote:
> On Tue, Jun 22, 2021 at 09:32:08AM -0700, Luis Chamberlain wrote:
> > On Tue, Jun 22, 2021 at 09:45:39AM +0200, Greg KH wrote:
> > > On Mon, Jun 21, 2021 at 04:36:34PM -0700, Luis Chamberlain wrote:
> > > > @@ -2048,13 +2048,19 @@ static ssize_t hot_add_show(struct class *class,
> > > >  {
> > > >  	int ret;
> > > >  
> > > > +	if (!try_module_get(THIS_MODULE))
> > > > +		return -ENODEV;
> > > > +
> > > 
> > > You can not increment/decrement your own module's reference count and
> > > expect it to work properly, as it is still a race.
> > 
> > The goal here is to prevent an rmmod call if this succeeds. If it
> > succeeds then any subsequent rmmod will fail. Can you explain how
> > this is still racy?
> 
> {sigh}
> 
> What happens if the driver core is just about to call hot_add_show() and
> the module is removed from the system.  It then calls to the memory
> location that hot_add_show() was previously at, but now that is not a
> valid pointer to code, and boom.

The new kobject_get() on patch 3/3 ensures that the device will be up
throughout the entire life of the store call, and thus prevent the
code being executed being removed, no?

  Luis

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal
  2021-06-22 17:27         ` Luis Chamberlain
@ 2021-06-22 18:05           ` Greg KH
  2021-06-22 19:57             ` Luis Chamberlain
  0 siblings, 1 reply; 21+ messages in thread
From: Greg KH @ 2021-06-22 18:05 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: minchan, jeyu, ngupta, sergey.senozhatsky.work, axboe, mbenes,
	jpoimboe, tglx, keescook, jikos, rostedt, peterz, linux-block,
	linux-kernel

On Tue, Jun 22, 2021 at 10:27:12AM -0700, Luis Chamberlain wrote:
> On Tue, Jun 22, 2021 at 07:16:34PM +0200, Greg KH wrote:
> > On Tue, Jun 22, 2021 at 09:32:08AM -0700, Luis Chamberlain wrote:
> > > On Tue, Jun 22, 2021 at 09:45:39AM +0200, Greg KH wrote:
> > > > On Mon, Jun 21, 2021 at 04:36:34PM -0700, Luis Chamberlain wrote:
> > > > > @@ -2048,13 +2048,19 @@ static ssize_t hot_add_show(struct class *class,
> > > > >  {
> > > > >  	int ret;
> > > > >  
> > > > > +	if (!try_module_get(THIS_MODULE))
> > > > > +		return -ENODEV;
> > > > > +
> > > > 
> > > > You can not increment/decrement your own module's reference count and
> > > > expect it to work properly, as it is still a race.
> > > 
> > > The goal here is to prevent an rmmod call if this succeeds. If it
> > > succeeds then any subsequent rmmod will fail. Can you explain how
> > > this is still racy?
> > 
> > {sigh}
> > 
> > What happens if the driver core is just about to call hot_add_show() and
> > the module is removed from the system.  It then calls to the memory
> > location that hot_add_show() was previously at, but now that is not a
> > valid pointer to code, and boom.
> 
> The new kobject_get() on patch 3/3 ensures that the device will be up
> throughout the entire life of the store call, and thus prevent the
> code being executed being removed, no?

I do not know, I no longer remember what is in that patch at the moment
as it is long-gone from my queue.

Also, if the device will be "up" for the whole lifetime, why do you need
to increment the module reference count?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal
  2021-06-22 18:05           ` Greg KH
@ 2021-06-22 19:57             ` Luis Chamberlain
  0 siblings, 0 replies; 21+ messages in thread
From: Luis Chamberlain @ 2021-06-22 19:57 UTC (permalink / raw)
  To: Greg KH
  Cc: minchan, jeyu, ngupta, sergey.senozhatsky.work, axboe, mbenes,
	jpoimboe, tglx, keescook, jikos, rostedt, peterz, linux-block,
	linux-kernel

On Tue, Jun 22, 2021 at 08:05:12PM +0200, Greg KH wrote:
> On Tue, Jun 22, 2021 at 10:27:12AM -0700, Luis Chamberlain wrote:
> > On Tue, Jun 22, 2021 at 07:16:34PM +0200, Greg KH wrote:
> > > On Tue, Jun 22, 2021 at 09:32:08AM -0700, Luis Chamberlain wrote:
> > > > On Tue, Jun 22, 2021 at 09:45:39AM +0200, Greg KH wrote:
> > > > > On Mon, Jun 21, 2021 at 04:36:34PM -0700, Luis Chamberlain wrote:
> > > > > > @@ -2048,13 +2048,19 @@ static ssize_t hot_add_show(struct class *class,
> > > > > >  {
> > > > > >  	int ret;
> > > > > >  
> > > > > > +	if (!try_module_get(THIS_MODULE))
> > > > > > +		return -ENODEV;
> > > > > > +
> > > > > 
> > > > > You can not increment/decrement your own module's reference count and
> > > > > expect it to work properly, as it is still a race.
> > > > 
> > > > The goal here is to prevent an rmmod call if this succeeds. If it
> > > > succeeds then any subsequent rmmod will fail. Can you explain how
> > > > this is still racy?
> > > 
> > > {sigh}
> > > 
> > > What happens if the driver core is just about to call hot_add_show() and
> > > the module is removed from the system.  It then calls to the memory
> > > location that hot_add_show() was previously at, but now that is not a
> > > valid pointer to code, and boom.
> > 
> > The new kobject_get() on patch 3/3 ensures that the device will be up
> > throughout the entire life of the store call, and thus prevent the
> > code being executed being removed, no?
> 
> I do not know, I no longer remember what is in that patch at the moment
> as it is long-gone from my queue.

It was the changes *you* recommended, a generic way to ensure the
lifetime of the derefernce is valid. I had used bdgrab()/bdget() and you
suggested we generalize it with the kobject_get() for the device and a
bus get. With that change, I confirm that the device will still be
present during the lifetime of the sysfs knobs call.

> Also, if the device will be "up" for the whole lifetime, why do you need
> to increment the module reference count?

The goal is to prevent a deadlock. The lifetime of the device is not
an issue in this deadlock case, the issue is a race with module removal
and that code path using a lock which is also used on a sysfs knob.

  Luis

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-06-22 19:57 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-21 23:30 [PATCH v3 0/3] zram: fix few sysfs races Luis Chamberlain
2021-06-21 23:30 ` [PATCH v3 1/3] zram: fix crashes due to use of cpu hotplug multistate Luis Chamberlain
2021-06-22  7:39   ` Greg KH
2021-06-22 15:12     ` Luis Chamberlain
2021-06-21 23:36 ` [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal Luis Chamberlain
2021-06-22  7:41   ` Greg KH
2021-06-22 15:27     ` Luis Chamberlain
2021-06-22 16:27       ` Greg KH
2021-06-22 16:40         ` Luis Chamberlain
2021-06-22 16:51           ` Greg KH
2021-06-22 17:00             ` Luis Chamberlain
2021-06-22  7:45   ` Greg KH
2021-06-22 16:32     ` Luis Chamberlain
2021-06-22 17:16       ` Greg KH
2021-06-22 17:27         ` Luis Chamberlain
2021-06-22 18:05           ` Greg KH
2021-06-22 19:57             ` Luis Chamberlain
2021-06-21 23:36 ` [PATCH v3 3/3] drivers/base/core: refcount kobject and bus on device attribute read / store Luis Chamberlain
2021-06-22  7:46   ` Greg KH
2021-06-22 16:44     ` Luis Chamberlain
2021-06-22 16:48       ` Greg KH

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).