* [PATCH 0/2] zram: fix few ltp zram02.sh crashes @ 2021-03-06 2:20 Luis Chamberlain 2021-03-06 2:20 ` [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate Luis Chamberlain 2021-03-06 2:20 ` [PATCH 2/2] zram: fix races of sysfs attribute removal and usage Luis Chamberlain 0 siblings, 2 replies; 44+ messages in thread From: Luis Chamberlain @ 2021-03-06 2:20 UTC (permalink / raw) To: minchan, ngupta, sergey.senozhatsky.work Cc: axboe, mbenes, mcgrof, linux-block, linux-kernel LTP's zram02.sh script can be used to crash your kernel pretty badly. cd testcases/kernel/device-drivers/zram while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done This won't do much on its own; however, if you run this in two separate windows you'll see the kernel become unhappy quite fast. The crux of this issue was mishandling of cpu hotplug multistate on the zram driver in consideration for driver unload. However there was still another long-lasting bug present: races between sysfs attributes and driver unload. This series fixes these issues. [0] https://github.com/linux-test-project/ltp.git Luis Chamberlain (2): zram: fix crashes due to use of cpu hotplug multistate zram: fix races of sysfs attribute removal and usage drivers/block/zram/zram_drv.c | 82 +++++++++++++++++++++++++++++++---- 1 file changed, 74 insertions(+), 8 deletions(-) -- 2.30.1 ^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-03-06 2:20 [PATCH 0/2] zram: fix few ltp zram02.sh crashes Luis Chamberlain @ 2021-03-06 2:20 ` Luis Chamberlain 2021-03-09 2:55 ` Minchan Kim 2021-03-06 2:20 ` [PATCH 2/2] zram: fix races of sysfs attribute removal and usage Luis Chamberlain 1 sibling, 1 reply; 44+ messages in thread From: Luis Chamberlain @ 2021-03-06 2:20 UTC (permalink / raw) To: minchan, ngupta, sergey.senozhatsky.work Cc: axboe, mbenes, mcgrof, linux-block, linux-kernel The zram driver makes use of cpu hotplug multistate support, whereby it associates a zram compression stream per CPU. To support CPU hotplug multistate, a callback is enabled to allow the driver to do what it needs when a CPU is hotplugged. It is however currently possible to end up removing the zram driver callback prior to removing the zram compression streams per CPU. This would leave these compression streams hanging. We need to fix the ordering of driver load / removal and zram device additions in light of the driver's use of cpu hotplug multistate. Since the zram driver exposes many sysfs attributes which can also muck with the compression streams, this also means we can easily hit page faults today. Fix all this by providing a zram initialization boolean, protected by the driver's shared zram_index_mutex, which we can use to annotate when sysfs attributes are safe to use or not -- once the driver is properly initialized. When the driver is going down we also make sure not to let userspace muck with attributes which may affect zram cpu compression streams. This also fixes a series of possible memory leaks. 
The crashes and memory leaks can easily be caused by issuing the zram02.sh script from the LTP project [0] in a loop in two separate windows: cd testcases/kernel/device-drivers/zram while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done You end up with a splat as follows: kernel: zram: Removed device: zram0 kernel: zram: Added device: zram0 kernel: zram0: detected capacity change from 0 to 209715200 kernel: Adding 104857596k swap on /dev/zram0. Priority:-2 extents:1 across:104857596k SSFS kernel: zram0: detected capacity change from 209715200 to 0 kernel: zram0: detected capacity change from 0 to 209715200 kernel: ------------[ cut here ]------------ kernel: Error: Removing state 63 which has instances left. kernel: WARNING: CPU: 7 PID: 70457 at kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100 kernel: Modules linked in: zram(E-) zsmalloc(E) <etc> kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G E 5.12.0-rc1-next-20210304 #3 kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100 kernel: Code: <etc> kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282 kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8 kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: ffffa800c139bcb8 kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000 kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0 kernel: Call Trace: kernel: __cpuhp_remove_state+0x2e/0x80 kernel: __do_sys_delete_module+0x190/0x2a0 kernel: do_syscall_64+0x33/0x80 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae The "Error: Removing state 63 which has 
instances left" refers to the zram per CPU compression streams still left. [0] https://github.com/linux-test-project/ltp.git Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> --- drivers/block/zram/zram_drv.c | 63 ++++++++++++++++++++++++++++++----- 1 file changed, 55 insertions(+), 8 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index a711a2e2a794..63b6119cee93 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex); static int zram_major; static const char *default_compressor = CONFIG_ZRAM_DEF_COMP; +bool zram_up; + /* Module params (documentation at end) */ static unsigned int num_devices = 1; /* @@ -1699,6 +1701,7 @@ static void zram_reset_device(struct zram *zram) comp = zram->comp; disksize = zram->disksize; zram->disksize = 0; + zram->comp = NULL; set_capacity_and_notify(zram->disk, 0); part_stat_set_all(zram->disk->part0, 0); @@ -1719,9 +1722,18 @@ static ssize_t disksize_store(struct device *dev, struct zram *zram = dev_to_zram(dev); int err; + mutex_lock(&zram_index_mutex); + + if (!zram_up) { + err = -ENODEV; + goto out; + } + disksize = memparse(buf, NULL); - if (!disksize) - return -EINVAL; + if (!disksize) { + err = -EINVAL; + goto out; + } down_write(&zram->init_lock); if (init_done(zram)) { @@ -1749,12 +1761,16 @@ static ssize_t disksize_store(struct device *dev, set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT); up_write(&zram->init_lock); + mutex_unlock(&zram_index_mutex); + return len; out_free_meta: zram_meta_free(zram, disksize); out_unlock: up_write(&zram->init_lock); +out: + mutex_unlock(&zram_index_mutex); return err; } @@ -1770,8 +1786,17 @@ static ssize_t reset_store(struct device *dev, if (ret) return ret; - if (!do_reset) - return -EINVAL; + mutex_lock(&zram_index_mutex); + + if (!zram_up) { + len = -ENODEV; + goto out; + } + + if (!do_reset) { + len = -EINVAL; + goto out; + } zram = 
dev_to_zram(dev); bdev = zram->disk->part0; @@ -1780,7 +1805,8 @@ static ssize_t reset_store(struct device *dev, /* Do not reset an active device or claimed device */ if (bdev->bd_openers || zram->claim) { mutex_unlock(&bdev->bd_mutex); - return -EBUSY; + len = -EBUSY; + goto out; } /* From now on, anyone can't open /dev/zram[0-9] */ @@ -1795,6 +1821,8 @@ static ssize_t reset_store(struct device *dev, zram->claim = false; mutex_unlock(&bdev->bd_mutex); +out: + mutex_unlock(&zram_index_mutex); return len; } @@ -2016,6 +2044,10 @@ static ssize_t hot_add_show(struct class *class, int ret; mutex_lock(&zram_index_mutex); + if (!zram_up) { + mutex_unlock(&zram_index_mutex); + return -ENODEV; + } ret = zram_add(); mutex_unlock(&zram_index_mutex); @@ -2043,6 +2075,11 @@ static ssize_t hot_remove_store(struct class *class, mutex_lock(&zram_index_mutex); + if (!zram_up) { + ret = -ENODEV; + goto out; + } + zram = idr_find(&zram_index_idr, dev_id); if (zram) { ret = zram_remove(zram); @@ -2052,6 +2089,7 @@ static ssize_t hot_remove_store(struct class *class, ret = -ENODEV; } +out: mutex_unlock(&zram_index_mutex); return ret ? 
ret : count; } @@ -2078,12 +2116,15 @@ static int zram_remove_cb(int id, void *ptr, void *data) static void destroy_devices(void) { + mutex_lock(&zram_index_mutex); + zram_up = false; class_unregister(&zram_control_class); idr_for_each(&zram_index_idr, &zram_remove_cb, NULL); zram_debugfs_destroy(); idr_destroy(&zram_index_idr); unregister_blkdev(zram_major, "zram"); cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE); + mutex_unlock(&zram_index_mutex); } static int __init zram_init(void) @@ -2111,15 +2152,21 @@ static int __init zram_init(void) return -EBUSY; } + mutex_lock(&zram_index_mutex); + while (num_devices != 0) { - mutex_lock(&zram_index_mutex); ret = zram_add(); - mutex_unlock(&zram_index_mutex); - if (ret < 0) + if (ret < 0) { + mutex_unlock(&zram_index_mutex); goto out_error; + } num_devices--; } + zram_up = true; + + mutex_unlock(&zram_index_mutex); + return 0; out_error: -- 2.30.1 ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-03-06 2:20 ` [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate Luis Chamberlain @ 2021-03-09 2:55 ` Minchan Kim 2021-03-10 13:11 ` Luis Chamberlain 2021-03-10 21:21 ` Luis Chamberlain 0 siblings, 2 replies; 44+ messages in thread From: Minchan Kim @ 2021-03-09 2:55 UTC (permalink / raw) To: Luis Chamberlain Cc: ngupta, sergey.senozhatsky.work, axboe, mbenes, linux-block, linux-kernel On Sat, Mar 06, 2021 at 02:20:34AM +0000, Luis Chamberlain wrote: > The zram driver makes use of cpu hotplug multistate support, > whereby it associates a zram compression stream per CPU. To > support CPU hotplug multistate a callback enabled to allow > the driver to do what it needs when a CPU hotplugs. > > It is however currently possible to end up removing the > zram driver callback prior to removing the zram compression > streams per CPU. This would leave these compression streams > hanging. > > We need to fix ordering for driver load / removal, zram > device additions, in light of the driver's use of cpu > hotplug multistate. Since the zram driver exposes many > sysfs attribute which can also muck with the comrpession > streams this also means we can hit page faults today easily. Hi Luis, First of all, thanks for reporting the ancient bugs. Looks like you found several bugs and I am trying to digest them from your description to understand them more clearly. I need to ask stupid questions first. If I understand correctly, the bugs you found were related to a module unloading race while the zram devices are still working. If so, couldn't we fix the bug like this (it's not a perfect patch, but I just wanted to show how to close the module unloading race)? (I might miss other pieces here. Let me know. Thanks!) 
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index a711a2e2a794..646ae9e0b710 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -1696,6 +1696,8 @@ static void zram_reset_device(struct zram *zram) return; } + module_put(THIS_MODULE); + comp = zram->comp; disksize = zram->disksize; zram->disksize = 0; @@ -1744,13 +1746,19 @@ static ssize_t disksize_store(struct device *dev, goto out_free_meta; } + if (!try_module_get(THIS_MODULE)) + goto out_free_zcomp; + zram->comp = comp; zram->disksize = disksize; + set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT); up_write(&zram->init_lock); return len; +out_free_zcomp: + zcomp_destroy(comp); out_free_meta: zram_meta_free(zram, disksize); out_unlock: ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-03-09 2:55 ` Minchan Kim @ 2021-03-10 13:11 ` Luis Chamberlain 2021-03-10 21:25 ` Luis Chamberlain 2021-03-12 2:08 ` Minchan Kim 2021-03-10 21:21 ` Luis Chamberlain 1 sibling, 2 replies; 44+ messages in thread From: Luis Chamberlain @ 2021-03-10 13:11 UTC (permalink / raw) To: Minchan Kim Cc: ngupta, sergey.senozhatsky.work, axboe, mbenes, linux-block, linux-kernel On Mon, Mar 08, 2021 at 06:55:30PM -0800, Minchan Kim wrote: > On Sat, Mar 06, 2021 at 02:20:34AM +0000, Luis Chamberlain wrote: > > The zram driver makes use of cpu hotplug multistate support, > > whereby it associates a zram compression stream per CPU. To > > support CPU hotplug multistate a callback enabled to allow > > the driver to do what it needs when a CPU hotplugs. > > > > It is however currently possible to end up removing the > > zram driver callback prior to removing the zram compression > > streams per CPU. This would leave these compression streams > > hanging. > > > > We need to fix ordering for driver load / removal, zram > > device additions, in light of the driver's use of cpu > > hotplug multistate. Since the zram driver exposes many > > sysfs attribute which can also muck with the comrpession > > streams this also means we can hit page faults today easily. > > Hi Luis, > > First of all, thanks for reporting the ancient bugs. > > Looks like you found several bugs and I am trying to digest them > from your description to understand more clear. I need to ask > stupid questions, first. > > If I understand correctly, bugs you found were related to module > unloading race while the zram are still working. > If so, couldn't we fix the bug like this(it's not a perfect > patch but just wanted to show close module unloading race)? > (I might miss other pieces here. Let me know. Thanks!) 
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c > index a711a2e2a794..646ae9e0b710 100644 > --- a/drivers/block/zram/zram_drv.c > +++ b/drivers/block/zram/zram_drv.c > @@ -1696,6 +1696,8 @@ static void zram_reset_device(struct zram *zram) > return; > } > > + module_put(THIS_MODULE); > + > comp = zram->comp; > disksize = zram->disksize; > zram->disksize = 0; > @@ -1744,13 +1746,19 @@ static ssize_t disksize_store(struct device *dev, > goto out_free_meta; > } > > + if (!try_module_get(THIS_MODULE)) > + goto out_free_zcomp; > + > zram->comp = comp; > zram->disksize = disksize; > + > set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT); > up_write(&zram->init_lock); > > return len; > > +out_free_zcomp: > + zcomp_destroy(comp); > out_free_meta: > zram_meta_free(zram, disksize); > out_unlock: This still allows for a crash on the driver by running the zram02.sh script twice. Mar 09 14:52:19 kdevops-blktests-block-dev kernel: zram0: detected capacity change from 209715200 to 0 Mar 09 14:52:19 kdevops-blktests-block-dev kernel: BUG: unable to handle page fault for address: ffffba7db7904008 Mar 09 14:52:19 kdevops-blktests-block-dev kernel: #PF: supervisor read access in kernel mode Mar 09 14:52:20 kdevops-blktests-block-dev kernel: #PF: error_code(0x0000) - not-present page Mar 09 14:52:20 kdevops-blktests-block-dev kernel: PGD 100000067 P4D 100000067 PUD 100311067 PMD 14cd2f067 PTE 0 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Oops: 0000 [#1] SMP NOPTI Mar 09 14:52:20 kdevops-blktests-block-dev kernel: CPU: 0 PID: 2137 Comm: zram02.sh Tainted: G E 5.12.0-rc1-next-20210304+ #4 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RIP: 0010:zram_free_page+0x1b/0xf0 [zram] Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Code: 1f 44 00 00 48 89 c8 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 41 54 49 
89 f4 55 89 f5 53 48 8b 17 48 c1 e5 04 48 89 f> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RSP:0018:ffffba7d808f3d88 EFLAGS: 00010286 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RAX: 0000000000000000 RBX: ffff9eee5317d200 RCX: 0000000000000000 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RDX: ffffba7db7904000 RSI: 0000000000000000 RDI: ffff9eee5317d200 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RBP: 0000000000000000 R08: 00000008f78bb1d3 R09: 0000000000000000 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000000 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: R13: ffff9eee53d4cb00 R14: ffff9eee5317d220 R15: ffff9eee70508b80 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: FS: 00007f4bb1482580(0000) GS:ffff9eef77c00000(0000) knlGS:0000000000000000 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: CR2: ffffba7db7904008 CR3: 0000000107e9c000 CR4: 0000000000350ef0 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Call Trace: Mar 09 14:52:20 kdevops-blktests-block-dev kernel: zram_reset_device+0xe9/0x150 [zram] Mar 09 14:52:20 kdevops-blktests-block-dev kernel: reset_store+0x9a/0x100 [zram] Mar 09 14:52:20 kdevops-blktests-block-dev kernel: kernfs_fop_write_iter+0x124/0x1b0 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: new_sync_write+0x11c/0x1b0 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: vfs_write+0x1c2/0x260 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: ksys_write+0x5f/0xe0 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: do_syscall_64+0x33/0x80 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RIP: 0033:0x7f4bb13aaf33 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Code: 8b 15 61 ef 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 64 8b 04 25 18 00 00 
00 85 c0 75 14 b8 01 00 0> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RSP: 002b:00007ffce0090d88 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RAX: ffffffffffffffda RBX: 000055a17c4a14b0 RCX: 00007f4bb13aaf33 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RDX: 0000000000000002 RSI: 000055a17c4a14b0 RDI: 0000000000000001 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RBP: 0000000000000002 R08: 000055a17c48a1d0 R09: 0000000000000000 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: R10: 000055a17c48a1d1 R11: 0000000000000246 R12: 0000000000000001 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: R13: 0000000000000002 R14: 7fffffffffffffff R15: 0000000000000000 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Modules linked in: zram(E) zsmalloc(E) xfs(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) joydev(E) evdev(> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: zram0: detected capacity change from 0 to 209715200 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: CR2: ffffba7db7904008 Mar 09 14:52:20 kdevops-blktests-block-dev kernel: ---[ end trace 534ee1d0b880dd90 ]--- I can try to modify it to include the second patch first, as that is required. There are two separate bugs here. Or was your goal to try to address both with only one patch? Luis ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-03-10 13:11 ` Luis Chamberlain @ 2021-03-10 21:25 ` Luis Chamberlain 2021-03-12 2:08 ` Minchan Kim 1 sibling, 0 replies; 44+ messages in thread From: Luis Chamberlain @ 2021-03-10 21:25 UTC (permalink / raw) To: Minchan Kim Cc: ngupta, sergey.senozhatsky.work, axboe, mbenes, linux-block, linux-kernel On Wed, Mar 10, 2021 at 01:11:15PM +0000, Luis Chamberlain wrote: > I can try to modify it to include second patch first, as that is > required. There are two separate bugs here. I tried this; applying the required sysfs changes first and then applying your idea as a second patch ends up like this: diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 4b57c84ba9d4..bb45c1e0f3e0 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -1702,6 +1702,7 @@ static void zram_reset_device(struct zram *zram) set_capacity_and_notify(zram->disk, 0); part_stat_set_all(zram->disk->part0, 0); + module_put(THIS_MODULE); up_write(&zram->init_lock); /* I/O operation under all of CPU are done so let's free */ @@ -1747,6 +1748,7 @@ static ssize_t disksize_store(struct device *dev, goto out_free_meta; } + BUG_ON(!try_module_get(THIS_MODULE)); zram->comp = comp; zram->disksize = disksize; set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT); The BUG_ON() is doable as we *know* we already have a reference to the module due to the other try_module_get(), which the first patch would place at the top of disksize_store(). This however doesn't fix the issue. We end up in a situation where we cannot unload the zram driver. Luis ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-03-10 13:11 ` Luis Chamberlain 2021-03-10 21:25 ` Luis Chamberlain @ 2021-03-12 2:08 ` Minchan Kim 1 sibling, 0 replies; 44+ messages in thread From: Minchan Kim @ 2021-03-12 2:08 UTC (permalink / raw) To: Luis Chamberlain Cc: ngupta, sergey.senozhatsky.work, axboe, mbenes, linux-block, linux-kernel On Wed, Mar 10, 2021 at 01:11:15PM +0000, Luis Chamberlain wrote: > On Mon, Mar 08, 2021 at 06:55:30PM -0800, Minchan Kim wrote: > > On Sat, Mar 06, 2021 at 02:20:34AM +0000, Luis Chamberlain wrote: > > > The zram driver makes use of cpu hotplug multistate support, > > > whereby it associates a zram compression stream per CPU. To > > > support CPU hotplug multistate a callback enabled to allow > > > the driver to do what it needs when a CPU hotplugs. > > > > > > It is however currently possible to end up removing the > > > zram driver callback prior to removing the zram compression > > > streams per CPU. This would leave these compression streams > > > hanging. > > > > > > We need to fix ordering for driver load / removal, zram > > > device additions, in light of the driver's use of cpu > > > hotplug multistate. Since the zram driver exposes many > > > sysfs attribute which can also muck with the comrpession > > > streams this also means we can hit page faults today easily. > > > > Hi Luis, > > > > First of all, thanks for reporting the ancient bugs. > > > > Looks like you found several bugs and I am trying to digest them > > from your description to understand more clear. I need to ask > > stupid questions, first. > > > > If I understand correctly, bugs you found were related to module > > unloading race while the zram are still working. > > If so, couldn't we fix the bug like this(it's not a perfect > > patch but just wanted to show close module unloading race)? > > (I might miss other pieces here. Let me know. Thanks!) 
> > > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c > > index a711a2e2a794..646ae9e0b710 100644 > > --- a/drivers/block/zram/zram_drv.c > > +++ b/drivers/block/zram/zram_drv.c > > @@ -1696,6 +1696,8 @@ static void zram_reset_device(struct zram *zram) > > return; > > } > > > > + module_put(THIS_MODULE); > > + > > comp = zram->comp; > > disksize = zram->disksize; > > zram->disksize = 0; > > @@ -1744,13 +1746,19 @@ static ssize_t disksize_store(struct device *dev, > > goto out_free_meta; > > } > > > > + if (!try_module_get(THIS_MODULE)) > > + goto out_free_zcomp; > > + > > zram->comp = comp; > > zram->disksize = disksize; > > + > > set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT); > > up_write(&zram->init_lock); > > > > return len; > > > > +out_free_zcomp: > > + zcomp_destroy(comp); > > out_free_meta: > > zram_meta_free(zram, disksize); > > out_unlock: > > This still allows for a crash on the driver by running the zram02.sh script twice. > > Mar 09 14:52:19 kdevops-blktests-block-dev kernel: zram0: detected capacity change from 209715200 to 0 > Mar 09 14:52:19 kdevops-blktests-block-dev kernel: BUG: unable to handle page fault for address: ffffba7db7904008 > Mar 09 14:52:19 kdevops-blktests-block-dev kernel: #PF: supervisor read access in kernel mode > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: #PF: error_code(0x0000) - not-present page > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: PGD 100000067 P4D 100000067 PUD 100311067 PMD 14cd2f067 PTE 0 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Oops: 0000 [#1] SMP NOPTI > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: CPU: 0 PID: 2137 Comm: zram02.sh Tainted: G E 5.12.0-rc1-next-20210304+ #4 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RIP: 0010:zram_free_page+0x1b/0xf0 [zram] > Mar 09 14:52:20 
kdevops-blktests-block-dev kernel: Code: 1f 44 00 00 48 89 c8 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 41 54 49 89 f4 55 89 f5 53 48 8b 17 48 c1 e5 04 48 89 f> > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RSP:0018:ffffba7d808f3d88 EFLAGS: 00010286 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RAX: 0000000000000000 RBX: ffff9eee5317d200 RCX: 0000000000000000 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RDX: ffffba7db7904000 RSI: 0000000000000000 RDI: ffff9eee5317d200 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RBP: 0000000000000000 R08: 00000008f78bb1d3 R09: 0000000000000000 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000000 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: R13: ffff9eee53d4cb00 R14: ffff9eee5317d220 R15: ffff9eee70508b80 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: FS: 00007f4bb1482580(0000) GS:ffff9eef77c00000(0000) knlGS:0000000000000000 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: CR2: ffffba7db7904008 CR3: 0000000107e9c000 CR4: 0000000000350ef0 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Call Trace: > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: zram_reset_device+0xe9/0x150 [zram] > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: reset_store+0x9a/0x100 [zram] > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: kernfs_fop_write_iter+0x124/0x1b0 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: new_sync_write+0x11c/0x1b0 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: vfs_write+0x1c2/0x260 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: ksys_write+0x5f/0xe0 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: do_syscall_64+0x33/0x80 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RIP: 
0033:0x7f4bb13aaf33 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Code: 8b 15 61 ef 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 64 8b 04 25 18 00 > 00 00 85 c0 75 14 b8 01 00 0> > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RSP: 002b:00007ffce0090d88 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RAX: ffffffffffffffda RBX: 000055a17c4a14b0 RCX: 00007f4bb13aaf33 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RDX: 0000000000000002 RSI: 000055a17c4a14b0 RDI: 0000000000000001 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RBP: 0000000000000002 R08: 000055a17c48a1d0 R09: 0000000000000000 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: R10: 000055a17c48a1d1 R11: 0000000000000246 R12: 0000000000000001 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: R13: 0000000000000002 R14: 7fffffffffffffff R15: 0000000000000000 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Modules linked in: zram(E) zsmalloc(E) xfs(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) joydev(E) evdev(> > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: zram0: detected capacity change from 0 to 209715200 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: CR2: ffffba7db7904008 > Mar 09 14:52:20 kdevops-blktests-block-dev kernel: ---[ end trace 534ee1d0b880dd90 ]--- > > I can try to modify it to include second patch first, as that is > required. There are two separate bugs here. Or was your goalt to > try to address both with only one patch? I am trying to understand the problem first. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-03-09 2:55 ` Minchan Kim 2021-03-10 13:11 ` Luis Chamberlain @ 2021-03-10 21:21 ` Luis Chamberlain 2021-03-12 2:14 ` Minchan Kim 1 sibling, 1 reply; 44+ messages in thread From: Luis Chamberlain @ 2021-03-10 21:21 UTC (permalink / raw) To: Minchan Kim Cc: ngupta, sergey.senozhatsky.work, axboe, mbenes, linux-block, linux-kernel On Mon, Mar 08, 2021 at 06:55:30PM -0800, Minchan Kim wrote: > If I understand correctly, bugs you found were related to module > unloading race while the zram are still working. No, that is a simplification of the issue. It actually consists of two separate issues: a) a race against module unloading in light of incorrect, racy use of cpu hotplug multistate support b) a module unload race with sysfs attributes on *any* driver whose sysfs attributes share the same lock as used during module unload It is important to realize that issue b) is actually a generic kernel issue: it would be present on *any* driver which shares a lock used on a sysfs attribute and module unload. I looked to implement a generic solution in kernfs, however we don't yet have semantics to enable this generically, and so for now each driver needs to deal with those races on its own. That sysfs race is dealt with in the second patch. The first patch only deals with the cpu hotplug races exposed at module unloading. Luis ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-03-10 21:21 ` Luis Chamberlain @ 2021-03-12 2:14 ` Minchan Kim 2021-03-12 18:32 ` Luis Chamberlain 0 siblings, 1 reply; 44+ messages in thread From: Minchan Kim @ 2021-03-12 2:14 UTC (permalink / raw) To: Luis Chamberlain, gregkh Cc: ngupta, sergey.senozhatsky.work, axboe, mbenes, linux-block, linux-kernel On Wed, Mar 10, 2021 at 09:21:28PM +0000, Luis Chamberlain wrote: > On Mon, Mar 08, 2021 at 06:55:30PM -0800, Minchan Kim wrote: > > If I understand correctly, bugs you found were related to module > > unloading race while the zram are still working. > > No, that is a simplifcation of the issue. The issue consists of > two separate issues: > > a) race against module unloading in light of incorrect racty use of > cpu hotplug multistate support Could you add some pseudo code sequence to show the race clearly? It would be great if it went in the description too, since it's clearer to show the problem. > b) module unload race with sysfs attribute race on *any* driver which > has sysfs attributes which also shares the same lock as used during > module unload Yup, that part I missed. Maybe we need some wrapper for the zram sysfs handlers to call try_module_get() in the wrapper function and then call the subroutine only if it got the refcount. zram_sysfs_wrapper(func, A, B) if (!try_module_get(THIS_MODULE)) return -ENODEV; ret = func(A, B); module_put(THIS_MODULE); return ret; > > It is important to realize that issue b) is actually a generic kernel > issue, it would be present on *any* driver which shares a lock used on > a sysfs attribute and module unload. I looked to implement a generic > solution on kernfs, however we don't yet have semantics to enable this > generically, and so for now each driver needs to deal with those races > on their own. That sysfs race is dealt with in the second patch. It's unfortunate. Let me Cc Greg, who might have some ideas or find the mess in the zram sysfs implementation. 
> > The first patch only deals with the cpu hotplug races exposed at module > unloading. True. Thanks for the clarification. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-03-12 2:14 ` Minchan Kim @ 2021-03-12 18:32 ` Luis Chamberlain 2021-03-12 19:28 ` Minchan Kim 0 siblings, 1 reply; 44+ messages in thread From: Luis Chamberlain @ 2021-03-12 18:32 UTC (permalink / raw) To: Minchan Kim Cc: gregkh, ngupta, sergey.senozhatsky.work, axboe, mbenes, linux-block, linux-kernel On Thu, Mar 11, 2021 at 06:14:40PM -0800, Minchan Kim wrote: > On Wed, Mar 10, 2021 at 09:21:28PM +0000, Luis Chamberlain wrote: > > On Mon, Mar 08, 2021 at 06:55:30PM -0800, Minchan Kim wrote: > > > If I understand correctly, bugs you found were related to module > > > unloading race while the zram are still working. > > > > No, that is a simplifcation of the issue. The issue consists of > > two separate issues: > > > > a) race against module unloading in light of incorrect racty use of > > cpu hotplug multistate support > > > Could you add some pusedo code sequence to show the race cleary? Let us deal with each issue one at a time. First, let's address how the kernel warning can be reproduced easily by triggering zram02.sh from LTP twice: kernel: ------------[ cut here ]------------ kernel: Error: Removing state 63 which has instances left. kernel: WARNING: CPU: 7 PID: 70457 at kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100 kernel: Modules linked in: zram(E-) zsmalloc(E) <etc> The first patch prevents this race. This race is possible because on module init we associate callbacks for CPU hotplug add / remove: static int __init zram_init(void) { ... ret = cpuhp_setup_state_multi(CPUHP_ZCOMP_PREPARE, "block/zram:prepare", zcomp_cpu_up_prepare, zcomp_cpu_dead); ... 
}

The zcomp_cpu_dead() accesses zcomp->comp, and if zcomp->comp is
removed and this function is called, clearly we'll be accessing some
random data here and can easily crash afterwards:

int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node)
{
	struct zcomp *comp = hlist_entry(node, struct zcomp, node);
	struct zcomp_strm *zstrm;

	zstrm = per_cpu_ptr(comp->stream, cpu);
	zcomp_strm_free(zstrm);
	return 0;
}

And zram's sysfs reset_store() lets userspace call zram_reset_device()
which calls zcomp_destroy():

void zcomp_destroy(struct zcomp *comp)
{
	cpuhp_state_remove_instance(CPUHP_ZCOMP_PREPARE, &comp->node);
	free_percpu(comp->stream);
	kfree(comp);
}

> It would be great if it goes in the description, too, since it's
> clearer for showing the problem.

Does the above do it?

> > b) module unload race with sysfs attribute race on *any* driver which
> >    has sysfs attributes which also shares the same lock as used during
> >    module unload
>
> Yub, that part I missed. Maybe we need some wrapper for the zram sysfs
> attributes to call try_module_get in the wrapper function and then call
> the subroutine only if it got the refcount.
>
> zram_sysfs_wrapper(func, A, B)
>     if (!try_module_get(THIS_MODULE))
>         return -ENODEV;
>     ret = func(A, B);
>     module_put(THIS_MODULE);
>     return ret;

I'd much prefer this be resolved in kernfs later; if you look at the
kernel, there are already some drivers which may have realized this
requirement the hard way. Open coding this I think makes the race /
intent clearer.

Right now we have no semantics possible for a generic solution, but I
can work on one later.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-03-12 18:32 ` Luis Chamberlain @ 2021-03-12 19:28 ` Minchan Kim 2021-03-19 19:09 ` Luis Chamberlain 0 siblings, 1 reply; 44+ messages in thread From: Minchan Kim @ 2021-03-12 19:28 UTC (permalink / raw) To: Luis Chamberlain Cc: gregkh, ngupta, sergey.senozhatsky.work, axboe, mbenes, linux-block, linux-kernel On Fri, Mar 12, 2021 at 06:32:38PM +0000, Luis Chamberlain wrote: > On Thu, Mar 11, 2021 at 06:14:40PM -0800, Minchan Kim wrote: > > On Wed, Mar 10, 2021 at 09:21:28PM +0000, Luis Chamberlain wrote: > > > On Mon, Mar 08, 2021 at 06:55:30PM -0800, Minchan Kim wrote: > > > > If I understand correctly, bugs you found were related to module > > > > unloading race while the zram are still working. > > > > > > No, that is a simplifcation of the issue. The issue consists of > > > two separate issues: > > > > > > a) race against module unloading in light of incorrect racty use of > > > cpu hotplug multistate support > > > > > > Could you add some pusedo code sequence to show the race cleary? > > Let us deal with each issue one at time. First, let's address > understanding the kernel warning can be reproduced easily by > triggering zram02.sh from LTP twice: > > kernel: ------------[ cut here ]------------ > kernel: Error: Removing state 63 which has instances left. > kernel: WARNING: CPU: 7 PID: 70457 at kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100 > kernel: Modules linked in: zram(E-) zsmalloc(E) <etc> > > The first patch prevents this race. This race is possible because on > module init we associate callbacks for CPU hotplug add / remove: > > static int __init zram_init(void) > { > ... > ret = cpuhp_setup_state_multi(CPUHP_ZCOMP_PREPARE, "block/zram:prepare", > zcomp_cpu_up_prepare, zcomp_cpu_dead); > ... 
> }
>
> The zcomp_cpu_dead() accesses zcomp->comp, and if zcomp->comp is
> removed and this function is called, clearly we'll be accessing some
> random data here and can easily crash afterwards:

I am trying to understand this crash. You mentioned the warning above,
but here you mention a crash (I understand the sysfs race, but that is
a different topic). What's the relation between the warning above and
the crash here? Are you saying the warning could sometimes turn into a
crash?

> int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node)
> {
> 	struct zcomp *comp = hlist_entry(node, struct zcomp, node);
> 	struct zcomp_strm *zstrm;
>
> 	zstrm = per_cpu_ptr(comp->stream, cpu);
> 	zcomp_strm_free(zstrm);
> 	return 0;
> }
>
> And zram's sysfs reset_store() lets userspace call zram_reset_device()
> which calls zcomp_destroy():
>
> void zcomp_destroy(struct zcomp *comp)
> {
> 	cpuhp_state_remove_instance(CPUHP_ZCOMP_PREPARE, &comp->node);
> 	free_percpu(comp->stream);
> 	kfree(comp);
> }

Do you mean the race between module unloading and zram_reset_device()?
If I understood correctly, the approach I put in the first reply would
prevent this problem.

> > It would be great if it goes in the description, too, since it's
> > clearer for showing the problem.
>
> Does the above do it?

Rather than that, let's have this kind of explanation, which is
clearer (it's just an example, not describing the exact race you saw;
you can describe it better):

CPU A                          CPU B
module_unload
destroy_devices
zram_remove_cb                 reset_store
zcomp_cpu_dead                 zram_reset_device
                               zcomp_destroy
                               cpuhp_state_remove_instance
zcomp_strm_free

> > b) module unload race with sysfs attribute race on *any* driver which
> >    has sysfs attributes which also shares the same lock as used during
> >    module unload
>
> > Yub, that part I missed. Maybe we need some wrapper for the zram sysfs
> > attributes to call try_module_get in the wrapper function and then call
> > the subroutine only if it got the refcount.
> >
> > zram_sysfs_wrapper(func, A, B)
> >     if (!try_module_get(THIS_MODULE))
> >         return -ENODEV;
> >     ret = func(A, B);
> >     module_put(THIS_MODULE);
> >     return ret;
>
> I'd much prefer this be resolved in kernfs later; if you look at the
> kernel, there are already some drivers which may have realized this
> requirement the hard way. Open coding this I think makes the race /
> intent clearer.
>
> Right now we have no semantics possible for a generic solution, but I
> can work on one later.

Yub, no problem. My point is, let's have some zram sysfs wrapper to
manage module_get/put automatically as I mentioned above, so there is
no need to deal with the module refcount in the worker functions.

^ permalink raw reply	[flat|nested] 44+ messages in thread
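The wrapper discussed above can be modeled outside the kernel. Below is a minimal userspace sketch of the idea, not zram code: `mod_alive`, `mod_refs`, `try_module_get_sim`, `module_put_sim` and `fake_store` are all hypothetical stand-ins for the kernel's module refcounting, and unlike the real try_module_get() the check-then-increment here is not atomic with respect to unload:

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's module refcounting. */
static atomic_bool mod_alive = true;  /* cleared when "unload" begins */
static atomic_int  mod_refs  = 0;     /* in-flight attribute handlers */

static bool try_module_get_sim(void)
{
	/* Note: unlike the real try_module_get(), the check and the
	 * increment are two separate steps here, so this is only a sketch. */
	if (!atomic_load(&mod_alive))
		return false;
	atomic_fetch_add(&mod_refs, 1);
	return true;
}

static void module_put_sim(void)
{
	atomic_fetch_sub(&mod_refs, 1);
}

/* The wrapper: only run the store/show routine while a ref is held. */
static int zram_sysfs_wrapper(int (*func)(int), int arg)
{
	int ret;

	if (!try_module_get_sim())
		return -ENODEV;
	ret = func(arg);
	module_put_sim();
	return ret;
}

/* A hypothetical attribute routine guarded by the wrapper. */
static int fake_store(int v)
{
	return v * 2;
}
```

Once the teardown path clears `mod_alive`, every subsequent wrapped call returns -ENODEV instead of touching driver state, which is the behavior the wrapper is meant to centralize.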
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-03-12 19:28 ` Minchan Kim @ 2021-03-19 19:09 ` Luis Chamberlain 2021-03-22 16:37 ` Minchan Kim 0 siblings, 1 reply; 44+ messages in thread From: Luis Chamberlain @ 2021-03-19 19:09 UTC (permalink / raw) To: Minchan Kim Cc: gregkh, ngupta, sergey.senozhatsky.work, axboe, mbenes, linux-block, linux-kernel On Fri, Mar 12, 2021 at 11:28:21AM -0800, Minchan Kim wrote: > On Fri, Mar 12, 2021 at 06:32:38PM +0000, Luis Chamberlain wrote: > > On Thu, Mar 11, 2021 at 06:14:40PM -0800, Minchan Kim wrote: > > > On Wed, Mar 10, 2021 at 09:21:28PM +0000, Luis Chamberlain wrote: > > > > On Mon, Mar 08, 2021 at 06:55:30PM -0800, Minchan Kim wrote: > > > > > If I understand correctly, bugs you found were related to module > > > > > unloading race while the zram are still working. > > > > > > > > No, that is a simplifcation of the issue. The issue consists of > > > > two separate issues: > > > > > > > > a) race against module unloading in light of incorrect racty use of > > > > cpu hotplug multistate support > > > > > > > > > Could you add some pusedo code sequence to show the race cleary? > > > > Let us deal with each issue one at time. First, let's address > > understanding the kernel warning can be reproduced easily by > > triggering zram02.sh from LTP twice: > > > > kernel: ------------[ cut here ]------------ > > kernel: Error: Removing state 63 which has instances left. > > kernel: WARNING: CPU: 7 PID: 70457 at kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100 > > kernel: Modules linked in: zram(E-) zsmalloc(E) <etc> > > > > The first patch prevents this race. This race is possible because on > > module init we associate callbacks for CPU hotplug add / remove: > > > > static int __init zram_init(void) > > { > > ... > > ret = cpuhp_setup_state_multi(CPUHP_ZCOMP_PREPARE, "block/zram:prepare", > > zcomp_cpu_up_prepare, zcomp_cpu_dead); > > ... 
> > }
> >
> > The zcomp_cpu_dead() accesses zcomp->comp, and if zcomp->comp is
> > removed and this function is called, clearly we'll be accessing some
> > random data here and can easily crash afterwards:
>
> I am trying to understand this crash. You mentioned the warning above,
> but here you mention a crash (I understand the sysfs race, but that is
> a different topic). What's the relation between the warning above and
> the crash here? Are you saying the warning could sometimes turn into a
> crash?

Indeed one issue is a consequence of the other, but a better
description can be put together for each separately.

The warning comes up when cpu hotplug detects that the callback
is being removed, but yet "instances" for multistate are left behind,
ie, not removed. CPU hotplug multistate allows us to dedicate a callback
for zram so that it gets called every time a CPU hot plugs or unplugs.
In the zram driver's case we want to trigger the callback per each
struct zcomp, one is used per supported and used supported compression
algorithm. The zram driver allocates a struct zcomp for a compression
algorithm on a need basis. The default compressor is CONFIG_ZRAM_DEF_COMP,
which today is CONFIG_ZRAM_DEF_COMP_LZORLE ("lzo-rle"). Although we may
have different zram devices, zram devices which use the same compression
algorithm share the same struct zcomp. The "multi" in CPU hotplug
multistate refers to these different struct zcomp instances; each of
these will have the registered callback routine run for it. The kernel's
CPU hotplug multistate keeps a linked list of these different structures
so that it will iterate over them on CPU transitions. By default at
driver initialization we will create just one zram device (num_devices=1)
and a zcomp structure then set for the lzo-rle compression algorithm.
At driver removal we first remove each zram device, and so we destroy
the struct zcomp.
But since we expose sysfs attributes to create new devices or reset /
initialize existing ones, we can easily end up re-initializing a struct
zcomp before the exit routine of the module removes the cpu hotplug
callback. When this happens the kernel's CPU hotplug will detect that
at least one instance (struct zcomp for us) exists.

And so we can have:

static void destroy_devices(void)
{
	class_unregister(&zram_control_class);
	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
	zram_debugfs_destroy();
	idr_destroy(&zram_index_idr);
	unregister_blkdev(zram_major, "zram");
	cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
}

While destroy_devices() runs we can have this race:

CPU 1                            CPU 2

class_unregister(...);
idr_for_each(...);
zram_debugfs_destroy();
                                 disksize_store(...)
idr_destroy(...);
unregister_blkdev(...);
cpuhp_remove_multi_state(...);

The warning comes up on cpuhp_remove_multi_state() when it sees
that the state for CPUHP_ZCOMP_PREPARE does not have an empty
instance linked list.

After the idr_destroy() it's also easy to run into a crash. The
above predicament also leads to memory leaks.

And so we need to have a state machine for when we're up and when
we're going down generically.

Let me know if it makes sense now; if so, I can amend the commit log
accordingly.

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread
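The warning described above boils down to one invariant: the multistate state's instance list must be empty before the state is removed. A small userspace sketch of just that bookkeeping — `struct instance`, `add_instance`, `remove_instance` and `remove_multi_state` are hypothetical stand-ins, not the kernel's cpuhp API:

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for one registered struct zcomp. */
struct instance {
	struct instance *next;
};

static struct instance *instances;  /* per-state instance list head */

/* Analogue of cpuhp_state_add_instance(): link a new zcomp in. */
static void add_instance(struct instance *inst)
{
	inst->next = instances;
	instances = inst;
}

/* Analogue of cpuhp_state_remove_instance(): unlink one zcomp. */
static void remove_instance(struct instance *inst)
{
	struct instance **p = &instances;

	while (*p && *p != inst)
		p = &(*p)->next;
	if (*p)
		*p = inst->next;
}

/* Analogue of removing the multistate callback: doing so while a
 * zcomp is still linked is exactly the warning quoted in the log. */
static int remove_multi_state(void)
{
	if (instances) {
		fprintf(stderr, "Error: Removing state which has instances left.\n");
		return -EBUSY;
	}
	return 0;
}
```

In these terms, if a store handler re-creates an instance after teardown has already emptied the list, the final state removal finds the list non-empty and warns.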
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-03-19 19:09 ` Luis Chamberlain @ 2021-03-22 16:37 ` Minchan Kim 2021-03-22 20:41 ` Luis Chamberlain 0 siblings, 1 reply; 44+ messages in thread From: Minchan Kim @ 2021-03-22 16:37 UTC (permalink / raw) To: Luis Chamberlain Cc: gregkh, ngupta, sergey.senozhatsky.work, axboe, mbenes, linux-block, linux-kernel On Fri, Mar 19, 2021 at 07:09:24PM +0000, Luis Chamberlain wrote: > On Fri, Mar 12, 2021 at 11:28:21AM -0800, Minchan Kim wrote: > > On Fri, Mar 12, 2021 at 06:32:38PM +0000, Luis Chamberlain wrote: > > > On Thu, Mar 11, 2021 at 06:14:40PM -0800, Minchan Kim wrote: > > > > On Wed, Mar 10, 2021 at 09:21:28PM +0000, Luis Chamberlain wrote: > > > > > On Mon, Mar 08, 2021 at 06:55:30PM -0800, Minchan Kim wrote: > > > > > > If I understand correctly, bugs you found were related to module > > > > > > unloading race while the zram are still working. > > > > > > > > > > No, that is a simplifcation of the issue. The issue consists of > > > > > two separate issues: > > > > > > > > > > a) race against module unloading in light of incorrect racty use of > > > > > cpu hotplug multistate support > > > > > > > > > > > > Could you add some pusedo code sequence to show the race cleary? > > > > > > Let us deal with each issue one at time. First, let's address > > > understanding the kernel warning can be reproduced easily by > > > triggering zram02.sh from LTP twice: > > > > > > kernel: ------------[ cut here ]------------ > > > kernel: Error: Removing state 63 which has instances left. > > > kernel: WARNING: CPU: 7 PID: 70457 at kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100 > > > kernel: Modules linked in: zram(E-) zsmalloc(E) <etc> > > > > > > The first patch prevents this race. This race is possible because on > > > module init we associate callbacks for CPU hotplug add / remove: > > > > > > static int __init zram_init(void) > > > { > > > ... 
> > > ret = cpuhp_setup_state_multi(CPUHP_ZCOMP_PREPARE, "block/zram:prepare",
> > > 			      zcomp_cpu_up_prepare, zcomp_cpu_dead);
> > > ...
> > > }
> > >
> > > The zcomp_cpu_dead() accesses zcomp->comp, and if zcomp->comp is
> > > removed and this function is called, clearly we'll be accessing some
> > > random data here and can easily crash afterwards:
> >
> > I am trying to understand this crash. You mentioned the warning above,
> > but here you mention a crash (I understand the sysfs race, but that is
> > a different topic). What's the relation between the warning above and
> > the crash here? Are you saying the warning could sometimes turn into a
> > crash?

Hi Luis,

> Indeed one issue is a consequence of the other, but a better
> description can be put together for each separately.
>
> The warning comes up when cpu hotplug detects that the callback
> is being removed, but yet "instances" for multistate are left behind,
> ie, not removed. CPU hotplug multistate allows us to dedicate a callback
> for zram so that it gets called every time a CPU hot plugs or unplugs.
> In the zram driver's case we want to trigger the callback per each
> struct zcomp, one is used per supported and used supported compression

you meant "each one is used per supported compression"?

> algorithm. The zram driver allocates a struct zcomp for a compression
> algorithm on a need basis. The default compressor is CONFIG_ZRAM_DEF_COMP,
> which today is CONFIG_ZRAM_DEF_COMP_LZORLE ("lzo-rle"). Although we may
> have different zram devices, zram devices which use the same compression
> algorithm share the same struct zcomp. The "multi" in CPU hotplug multistate

It allocates its own zcomp per device; they are not shared even though
zram devices use the same compression algorithm.

> refers to these different struct zcomp instances; each of these will
> have the registered callback routine run for it.
> The kernel's CPU hotplug multistate keeps a linked list of these
> different structures so that it will iterate over them on CPU
> transitions. By default at driver initialization we will create just
> one zram device (num_devices=1) and a zcomp structure then set for the
> lzo-rle compression algorithm. At driver removal we first remove each
> zram device, and so we destroy the struct zcomp. But since we expose
> sysfs attributes to create new devices or reset / initialize existing
> ones, we can easily end up re-initializing a struct zcomp before the
> exit routine of the module removes the cpu hotplug callback. When this
> happens the kernel's CPU hotplug will detect that at least one
> instance (struct zcomp for us) exists.

It's very helpful for understanding. Thanks a lot!

So IIUC, it's fundamentally a race between destroy_devices() (i.e.,
module unload) and every sysfs knob in zram, isn't it?

The specific example below is one of them, and every piece of sysfs
code in zram could access a freed object (e.g., zram->something).
And you are claiming there isn't a good way to fix it in kernfs
(please add that in the description, too) even though it's a general
problem. (If we had one, we could detect the race and make
zram_remove_cb fail, so unloading the module would fail, finally.)

So, shouldn't we introduce a global rw_semaphore?

destroy_devices
    class_unregister
    down_write(&zram_unload);
    idr_for_each(zram_remove_cb);
    up_write(&zram_unload);

And every sysfs code path holds the lock with down_read in the first
place and releases the lock right before return. So, introduce a zram
sysfs wrapper to centralize all of the locking logic.

Does it make sense?
>
> And so we can have:
>
> static void destroy_devices(void)
> {
> 	class_unregister(&zram_control_class);
> 	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> 	zram_debugfs_destroy();
> 	idr_destroy(&zram_index_idr);
> 	unregister_blkdev(zram_major, "zram");
> 	cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> }
>
> While destroy_devices() runs we can have this race:
>
> CPU 1                            CPU 2
>
> class_unregister(...);
> idr_for_each(...);

I think sysfs was already removed here.

> zram_debugfs_destroy();
>                                  disksize_store(...)
> idr_destroy(...);
> unregister_blkdev(...);
> cpuhp_remove_multi_state(...);

So,

CPU 1                           CPU 2

destroy_devices
..                              disksize_store()
                                  zcomp_create
                                    zcomp_init
idr_for_each(zram_remove_cb
  zram_remove
    zram_reset
      zcomp_destroy
        cpuhp_state_remove_instance
                                  cpuhp_state_add_instance
..
cpuhp_remove_multi_state(...)
  Detect!
kernel: Error: Removing state 63 which has instances left.

> The warning comes up on cpuhp_remove_multi_state() when it sees
> that the state for CPUHP_ZCOMP_PREPARE does not have an empty
> instance linked list.
>
> After the idr_destroy() it's also easy to run into a crash. The
> above predicament also leads to memory leaks.
>
> And so we need to have a state machine for when we're up and when
> we're going down generically.
>
> Let me know if it makes sense now; if so, I can amend the commit log
> accordingly.

Yub, it's much better. Let's have it in the commit log.

Thanks.

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-03-22 16:37 ` Minchan Kim @ 2021-03-22 20:41 ` Luis Chamberlain 2021-03-22 22:12 ` Minchan Kim 0 siblings, 1 reply; 44+ messages in thread From: Luis Chamberlain @ 2021-03-22 20:41 UTC (permalink / raw) To: Minchan Kim Cc: gregkh, ngupta, sergey.senozhatsky.work, axboe, mbenes, linux-block, linux-kernel On Mon, Mar 22, 2021 at 09:37:17AM -0700, Minchan Kim wrote: > On Fri, Mar 19, 2021 at 07:09:24PM +0000, Luis Chamberlain wrote: > > Indeed one issue is a consequence of the other but a bit better > > description can be put together for both separately. > > > > The warning comes up when cpu hotplug detects that the callback > > is being removed, but yet "instances" for multistate are left behind, > > ie, not removed. CPU hotplug multistate allows us to dedicate a callback > > for zram so that it gets called every time a CPU hot plugs or unplugs. > > In the zram driver's case we want to trigger the callback per each > > struct zcomp, one is used per supported and used supported compression > > you meant "each one is used per supported compression"? Yup. > > algorithm. The zram driver allocates a struct zcomp for an compression > > algorithm on a need basis. The default compressor is CONFIG_ZRAM_DEF_COMP > > which today is CONFIG_ZRAM_DEF_COMP_LZORLE ("lzo-rle"). Although we may > > have different zram devices, zram devices which use the same compression > > algorithm share the same struct zcomp. The "multi" in CPU hotplug multstate > > It allocates each own zcomp, not sharing even though zram devices > use the same compression algorithm. Right. > > refers to these different struct zcomp instances, each of these will > > have the callback routine registered run for it. The kernel's CPU > > hotplug multistate keeps a linked list of these different structures > > so that it will iterate over them on CPU transitions. 
> > By default at driver initialization we will create just one zram
> > device (num_devices=1) and a zcomp structure then set for the lzo-rle
> > compression algorithm. At driver removal we first remove each zram
> > device, and so we destroy the struct zcomp. But since we expose sysfs
> > attributes to create new devices or reset / initialize existing ones,
> > we can easily end up re-initializing a struct zcomp before the exit
> > routine of the module removes the cpu hotplug callback. When this
> > happens the kernel's CPU hotplug will detect that at least one
> > instance (struct zcomp for us) exists.
>
> It's very helpful for understanding. Thanks a lot!
>
> So IIUC, it's fundamentally a race between destroy_devices() (i.e.,
> module unload) and every sysfs knob in zram, isn't it?

I would not call it *every* sysfs knob, as not all of them deal with
things which are related to CPU hotplug multistate, right? Note that
using just try_module_get() alone (that is, with the second patch only)
does not fix the race I am describing above.

There are two issues.

> The specific example below is one of them, and every piece of sysfs
> code in zram could access a freed object (e.g., zram->something).
So, introduce a zram sysfs > wrapper to centralize all of locking logic. Actually that does work but only if we use write lock attempts so to be able to knock two birds with one stone, so to address the deadlock with sysfs attribute removal. We're not asking politely to read some value off of a zram devices with these locks, we're ensuring to not run if our driver is going to be removed, so replacing the try_module_get() with something similar. > Does it make sense? > > > > > And so we can have: > > > > static void destroy_devices(void) > > { > > class_unregister(&zram_control_class); > > idr_for_each(&zram_index_idr, &zram_remove_cb, NULL); > > zram_debugfs_destroy(); > > idr_destroy(&zram_index_idr); > > unregister_blkdev(zram_major, "zram"); > > cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE); > > } > > > > While destroy_devices() runs we can have this race: > > > > > > CPU 1 CPU 2 > > > > class_unregister(...); > > idr_for_each(...); > I think sysfs was already remove here. > > zram_debugfs_destroy(); > > disksize_store(...) > > idr_destroy(...); > > unregister_blkdev(...); > > cpuhp_remove_multi_state(...); > > So, > > CPU 1 CPU 2 > > destroy_devices > .. > disksize_store() > zcomp_create > zcomp_init > idr_for_each(zram_remove_cb > zram_remove > zram_reset > zcomp_destroy > cpuhp_state_remove_instance > cpuhp_state_add_instance > .. > > cpuhp_remove_multi_state(...) > Detect! > kernel: Error: Removing state 63 which has instances left. Yup true. > > The warning comes up on cpuhp_remove_multi_state() when it sees > > that the state for CPUHP_ZCOMP_PREPARE does not have an empty > > instance linked list. > > > > After the idr_destory() its also easy to run into a crash. The > > above predicament also leads to memory leaks. > > > > And so we need to have a state machine for when we're up and when > > we're going down generically. > > > > Let me know if it makes sense now, if so I can amend the commit log > > accordingly. > > Yub, It's much better. 
Let's have it in the commit log. Sure OK, well the following is what I end up with. With this approach only one patch is needed. diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index a711a2e2a794..3b86ee94309e 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -41,6 +41,15 @@ static DEFINE_IDR(zram_index_idr); /* idr index must be protected */ static DEFINE_MUTEX(zram_index_mutex); +/* + * Protects against: + * a) Hotplug cpu multistate races against compression algorithm removals + * and additions prior to its removal of the zram CPU hotplug callback + * b) Deadlock which is possible when sysfs attributes are used and we + * share a lock on removal. + */ +DECLARE_RWSEM(zram_unload); + static int zram_major; static const char *default_compressor = CONFIG_ZRAM_DEF_COMP; @@ -1723,6 +1732,9 @@ static ssize_t disksize_store(struct device *dev, if (!disksize) return -EINVAL; + if (!down_write_trylock(&zram_unload)) + return -ENODEV; + down_write(&zram->init_lock); if (init_done(zram)) { pr_info("Cannot change disksize for initialized device\n"); @@ -1748,6 +1760,7 @@ static ssize_t disksize_store(struct device *dev, zram->disksize = disksize; set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT); up_write(&zram->init_lock); + up_write(&zram_unload); return len; @@ -1755,6 +1768,7 @@ static ssize_t disksize_store(struct device *dev, zram_meta_free(zram, disksize); out_unlock: up_write(&zram->init_lock); + up_write(&zram_unload); return err; } @@ -1773,14 +1787,17 @@ static ssize_t reset_store(struct device *dev, if (!do_reset) return -EINVAL; + if (!down_write_trylock(&zram_unload)) + return -ENODEV; + zram = dev_to_zram(dev); bdev = zram->disk->part0; mutex_lock(&bdev->bd_mutex); /* Do not reset an active device or claimed device */ if (bdev->bd_openers || zram->claim) { - mutex_unlock(&bdev->bd_mutex); - return -EBUSY; + len = -EBUSY; + goto out; } /* From now on, anyone can't open /dev/zram[0-9] 
*/ @@ -1793,7 +1810,10 @@ static ssize_t reset_store(struct device *dev, mutex_lock(&bdev->bd_mutex); zram->claim = false; + +out: mutex_unlock(&bdev->bd_mutex); + up_write(&zram_unload); return len; } @@ -2015,10 +2035,15 @@ static ssize_t hot_add_show(struct class *class, { int ret; + if (!down_write_trylock(&zram_unload)) + return -ENODEV; + mutex_lock(&zram_index_mutex); ret = zram_add(); mutex_unlock(&zram_index_mutex); + up_write(&zram_unload); + if (ret < 0) return ret; return scnprintf(buf, PAGE_SIZE, "%d\n", ret); @@ -2041,6 +2066,9 @@ static ssize_t hot_remove_store(struct class *class, if (dev_id < 0) return -EINVAL; + if (!down_write_trylock(&zram_unload)) + return -ENODEV; + mutex_lock(&zram_index_mutex); zram = idr_find(&zram_index_idr, dev_id); @@ -2053,6 +2081,7 @@ static ssize_t hot_remove_store(struct class *class, } mutex_unlock(&zram_index_mutex); + up_write(&zram_unload); return ret ? ret : count; } static CLASS_ATTR_WO(hot_remove); @@ -2078,12 +2107,14 @@ static int zram_remove_cb(int id, void *ptr, void *data) static void destroy_devices(void) { + down_write(&zram_unload); class_unregister(&zram_control_class); idr_for_each(&zram_index_idr, &zram_remove_cb, NULL); zram_debugfs_destroy(); idr_destroy(&zram_index_idr); unregister_blkdev(zram_major, "zram"); cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE); + up_write(&zram_unload); } static int __init zram_init(void) @@ -2095,10 +2126,13 @@ static int __init zram_init(void) if (ret < 0) return ret; + down_write(&zram_unload); + ret = class_register(&zram_control_class); if (ret) { pr_err("Unable to register zram-control class\n"); cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE); + up_write(&zram_unload); return ret; } @@ -2108,6 +2142,7 @@ static int __init zram_init(void) pr_err("Unable to get major number\n"); class_unregister(&zram_control_class); cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE); + up_write(&zram_unload); return -EBUSY; } @@ -2119,10 +2154,12 @@ static int __init zram_init(void) 
goto out_error; num_devices--; } + up_write(&zram_unload); return 0; out_error: + up_write(&zram_unload); destroy_devices(); return ret; } ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-03-22 20:41 ` Luis Chamberlain @ 2021-03-22 22:12 ` Minchan Kim 2021-04-01 23:59 ` Luis Chamberlain 0 siblings, 1 reply; 44+ messages in thread From: Minchan Kim @ 2021-03-22 22:12 UTC (permalink / raw) To: Luis Chamberlain Cc: gregkh, ngupta, sergey.senozhatsky.work, axboe, mbenes, linux-block, linux-kernel On Mon, Mar 22, 2021 at 08:41:56PM +0000, Luis Chamberlain wrote: > On Mon, Mar 22, 2021 at 09:37:17AM -0700, Minchan Kim wrote: > > On Fri, Mar 19, 2021 at 07:09:24PM +0000, Luis Chamberlain wrote: > > > Indeed one issue is a consequence of the other but a bit better > > > description can be put together for both separately. > > > > > > The warning comes up when cpu hotplug detects that the callback > > > is being removed, but yet "instances" for multistate are left behind, > > > ie, not removed. CPU hotplug multistate allows us to dedicate a callback > > > for zram so that it gets called every time a CPU hot plugs or unplugs. > > > In the zram driver's case we want to trigger the callback per each > > > struct zcomp, one is used per supported and used supported compression > > > > you meant "each one is used per supported compression"? > > Yup. > > > > algorithm. The zram driver allocates a struct zcomp for an compression > > > algorithm on a need basis. The default compressor is CONFIG_ZRAM_DEF_COMP > > > which today is CONFIG_ZRAM_DEF_COMP_LZORLE ("lzo-rle"). Although we may > > > have different zram devices, zram devices which use the same compression > > > algorithm share the same struct zcomp. The "multi" in CPU hotplug multstate > > > > It allocates each own zcomp, not sharing even though zram devices > > use the same compression algorithm. > > Right. > > > > refers to these different struct zcomp instances, each of these will > > > have the callback routine registered run for it. 
> > > The kernel's CPU hotplug multistate keeps a linked list of these
> > > different structures so that it will iterate over them on CPU
> > > transitions. By default at driver initialization we will create just
> > > one zram device (num_devices=1) and a zcomp structure then set for
> > > the lzo-rle compression algorithm. At driver removal we first remove
> > > each zram device, and so we destroy the struct zcomp. But since we
> > > expose sysfs attributes to create new devices or reset / initialize
> > > existing ones, we can easily end up re-initializing a struct zcomp
> > > before the exit routine of the module removes the cpu hotplug
> > > callback. When this happens the kernel's CPU hotplug will detect
> > > that at least one instance (struct zcomp for us) exists.
> >
> > It's very helpful for understanding. Thanks a lot!
> >
> > So IIUC, it's fundamentally a race between destroy_devices() (i.e.,
> > module unload) and every sysfs knob in zram, isn't it?
>
> I would not call it *every* sysfs knob, as not all of them deal with
> things which are related to CPU hotplug multistate, right? Note that
> using just try_module_get() alone (that is, with the second patch only)
> does not fix the race I am describing above.

It wouldn't be a CPU hotplug multistate issue; I'd rather call it a
"zram instance race" bug. What happens in this case?

CPU 1                           CPU 2

destroy_devices
..                              compact_store()
                                zram = dev_to_zram(dev);
idr_for_each(zram_remove_cb
  zram_remove
  ..
  kfree(zram)
                                down_read(&zram->init_lock);


CPU 1                           CPU 2

hot_remove_store
                                compact_store()
                                zram = dev_to_zram(dev);
zram_remove
  kfree(zram)
                                down_read(&zram->init_lock);

So, for me, we need to close the race between zram instance creation /
removal and the sysfs code rather than focusing on the CPU hotplug
issue. Maybe we could reuse zram_index_mutex by converting it to a
rw_semaphore. What do you think?

> There are two issues.
>
> > The specific example below is one of them, and every piece of sysfs
> > code in zram could access a freed object (e.g., zram->something).
> > And you are claiming there isn't a good way to fix it in kernfs (please > > add it in the description, too) even though it's a general problem. > > Correct, the easier route would have been through the block layer as > it's the one adding our sysfs attributes for us, but even it cannot deal > with this race on behalf of drivers given the currently exposed > semantics on kernfs. > > > (If we had, we may detect the race and make zram_remove_cb fail > > so unloading the module finally fails) > > > > So, shouldn't we introduce a global rw_semaphore? > > > > destroy_devices > > class_unregister > > down_write(&zram_unload); > > idr_for_each(zram_remove_cb); > > up_write(&zram_unload); > > > > And every sysfs handler holds the lock with down_read in the first place > > and releases the lock right before return. So, introduce a zram sysfs > > wrapper to centralize all of the locking logic. > > Actually that does work, but only if we use write lock attempts, so as to > knock two birds with one stone and also address the deadlock > with sysfs attribute removal. We're not asking politely to read some > value off of a zram device with these locks, we're ensuring we do not > run if our driver is going to be removed, so replacing > try_module_get() with something similar. > > > Does it make sense? > > > > > > > > And so we can have: > > > > > > static void destroy_devices(void) > > > { > > > class_unregister(&zram_control_class); > > > idr_for_each(&zram_index_idr, &zram_remove_cb, NULL); > > > zram_debugfs_destroy(); > > > idr_destroy(&zram_index_idr); > > > unregister_blkdev(zram_major, "zram"); > > > cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE); > > > } > > > > > > While destroy_devices() runs we can have this race: > > > > > > > > > CPU 1 CPU 2 > > > > > > class_unregister(...); > > > idr_for_each(...); > > I think sysfs was already removed here. > > > zram_debugfs_destroy(); > > > disksize_store(...)
> > > idr_destroy(...); > > > unregister_blkdev(...); > > > cpuhp_remove_multi_state(...); > > > > So, > > > > CPU 1 CPU 2 > > > > destroy_devices > > .. > > disksize_store() > > zcomp_create > > zcomp_init > > idr_for_each(zram_remove_cb > > zram_remove > > zram_reset > > zcomp_destroy > > cpuhp_state_remove_instance > > cpuhp_state_add_instance > > .. > > > > cpuhp_remove_multi_state(...) > > Detect! > > kernel: Error: Removing state 63 which has instances left. > > Yup true. > > > > The warning comes up on cpuhp_remove_multi_state() when it sees > > > that the state for CPUHP_ZCOMP_PREPARE does not have an empty > > > instance linked list. > > > > > > After the idr_destory() its also easy to run into a crash. The > > > above predicament also leads to memory leaks. > > > > > > And so we need to have a state machine for when we're up and when > > > we're going down generically. > > > > > > Let me know if it makes sense now, if so I can amend the commit log > > > accordingly. > > > > Yub, It's much better. Let's have it in the commit log. > > Sure OK, well the following is what I end up with. With this approach > only one patch is needed. > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c > index a711a2e2a794..3b86ee94309e 100644 > --- a/drivers/block/zram/zram_drv.c > +++ b/drivers/block/zram/zram_drv.c > @@ -41,6 +41,15 @@ static DEFINE_IDR(zram_index_idr); > /* idr index must be protected */ > static DEFINE_MUTEX(zram_index_mutex); > > +/* > + * Protects against: > + * a) Hotplug cpu multistate races against compression algorithm removals > + * and additions prior to its removal of the zram CPU hotplug callback > + * b) Deadlock which is possible when sysfs attributes are used and we > + * share a lock on removal. 
> + */ > +DECLARE_RWSEM(zram_unload); > + > static int zram_major; > static const char *default_compressor = CONFIG_ZRAM_DEF_COMP; > > @@ -1723,6 +1732,9 @@ static ssize_t disksize_store(struct device *dev, > if (!disksize) > return -EINVAL; > > + if (!down_write_trylock(&zram_unload)) > + return -ENODEV; > + > down_write(&zram->init_lock); > if (init_done(zram)) { > pr_info("Cannot change disksize for initialized device\n"); > @@ -1748,6 +1760,7 @@ static ssize_t disksize_store(struct device *dev, > zram->disksize = disksize; > set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT); > up_write(&zram->init_lock); > + up_write(&zram_unload); > > return len; > > @@ -1755,6 +1768,7 @@ static ssize_t disksize_store(struct device *dev, > zram_meta_free(zram, disksize); > out_unlock: > up_write(&zram->init_lock); > + up_write(&zram_unload); > return err; > } > > @@ -1773,14 +1787,17 @@ static ssize_t reset_store(struct device *dev, > if (!do_reset) > return -EINVAL; > > + if (!down_write_trylock(&zram_unload)) > + return -ENODEV; > + > zram = dev_to_zram(dev); > bdev = zram->disk->part0; > > mutex_lock(&bdev->bd_mutex); > /* Do not reset an active device or claimed device */ > if (bdev->bd_openers || zram->claim) { > - mutex_unlock(&bdev->bd_mutex); > - return -EBUSY; > + len = -EBUSY; > + goto out; > } > > /* From now on, anyone can't open /dev/zram[0-9] */ > @@ -1793,7 +1810,10 @@ static ssize_t reset_store(struct device *dev, > > mutex_lock(&bdev->bd_mutex); > zram->claim = false; > + > +out: > mutex_unlock(&bdev->bd_mutex); > + up_write(&zram_unload); > > return len; > } > @@ -2015,10 +2035,15 @@ static ssize_t hot_add_show(struct class *class, > { > int ret; > > + if (!down_write_trylock(&zram_unload)) > + return -ENODEV; > + > mutex_lock(&zram_index_mutex); > ret = zram_add(); > mutex_unlock(&zram_index_mutex); > > + up_write(&zram_unload); > + > if (ret < 0) > return ret; > return scnprintf(buf, PAGE_SIZE, "%d\n", ret); > @@ -2041,6 +2066,9 @@ static 
ssize_t hot_remove_store(struct class *class, > if (dev_id < 0) > return -EINVAL; > > + if (!down_write_trylock(&zram_unload)) > + return -ENODEV; > + > mutex_lock(&zram_index_mutex); > > zram = idr_find(&zram_index_idr, dev_id); > @@ -2053,6 +2081,7 @@ static ssize_t hot_remove_store(struct class *class, > } > > mutex_unlock(&zram_index_mutex); > + up_write(&zram_unload); > return ret ? ret : count; > } > static CLASS_ATTR_WO(hot_remove); > @@ -2078,12 +2107,14 @@ static int zram_remove_cb(int id, void *ptr, void *data) > > static void destroy_devices(void) > { > + down_write(&zram_unload); > class_unregister(&zram_control_class); > idr_for_each(&zram_index_idr, &zram_remove_cb, NULL); > zram_debugfs_destroy(); > idr_destroy(&zram_index_idr); > unregister_blkdev(zram_major, "zram"); > cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE); > + up_write(&zram_unload); > } > > static int __init zram_init(void) > @@ -2095,10 +2126,13 @@ static int __init zram_init(void) > if (ret < 0) > return ret; > > + down_write(&zram_unload); > + > ret = class_register(&zram_control_class); > if (ret) { > pr_err("Unable to register zram-control class\n"); > cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE); > + up_write(&zram_unload); > return ret; > } > > @@ -2108,6 +2142,7 @@ static int __init zram_init(void) > pr_err("Unable to get major number\n"); > class_unregister(&zram_control_class); > cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE); > + up_write(&zram_unload); > return -EBUSY; > } > > @@ -2119,10 +2154,12 @@ static int __init zram_init(void) > goto out_error; > num_devices--; > } > + up_write(&zram_unload); > > return 0; > > out_error: > + up_write(&zram_unload); > destroy_devices(); > return ret; > } As I mentioned above, it didn't close the race I gave as examples unless I miss something. Let's discuss further. Thank you! ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-03-22 22:12 ` Minchan Kim @ 2021-04-01 23:59 ` Luis Chamberlain 2021-04-02 7:54 ` Greg KH 2021-04-05 17:07 ` Minchan Kim 0 siblings, 2 replies; 44+ messages in thread From: Luis Chamberlain @ 2021-04-01 23:59 UTC (permalink / raw) To: Minchan Kim, keescook, dhowells, hch, mbenes, gregkh Cc: ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel On Mon, Mar 22, 2021 at 03:12:01PM -0700, Minchan Kim wrote: > On Mon, Mar 22, 2021 at 08:41:56PM +0000, Luis Chamberlain wrote: > > > > I would not call it *every* syfs knob as not all deal with things which > > are related to CPU hotplug multistate, right? Note that using just > > try_module_get() alone (that is the second patch only, does not fix the > > race I am describing above). > > It wouldn't be CPU hotplug multistate issue but I'd like to call it > as more "zram instance race" bug. > What happens in this case? > > CPU 1 CPU 2 > > destroy_devices > .. > compact_store() > zram = dev_to_zram(dev); > idr_for_each(zram_remove_cb > zram_remove > .. > kfree(zram) > down_read(&zram->init_lock); > > > CPU 1 CPU 2 > > hot_remove_store > compact_store() > zram = dev_to_zram(dev); > zram_remove > kfree(zram) > down_read(&zram->init_lock); > > So, for me we need to close the zram instance create/removal > with sysfs rather than focusing on CPU hotplug issue. Sure, that's a good point. The issue which I noted for the race which ends up in a deadlock is only possible if a shared lock is used on removal but also on sysfs knobs. At first glance, the issue you describe above *seems* to be just proper care driver developers must take with structures used. It is certainly yet another issue we need to address, and if we can generalize a solution even better. I now recall I *think* I spotted that race a while ago and mentioned it to Kees and David Howells but I didn't have a solution for it yet. More on this below. 
The issue you point out is real; however, you cannot disregard the possible CPU hotplug race as well. It is a design consideration which the CPU hotplug multistate support warns about -- consider driver removal. I agree that perhaps solving this "zram instance race" can fix the hotplug race as well. If we can solve all 3 issues in one shot, even better. But let's evaluate that prospect... > Maybe, we could reuse zram_index_mutex with modifying it with > rw_semaphore. What do you think? Although ideal, given it would knock 3 birds with 1 stone, it ends up actually making the sysfs attributes rather useless in light of the requirements for each of the races. Namely, the sysfs deadlock race *must* use a trylock approach, just as in the try_module_get() case. It must use this approach so as to immediately bail out if our module is being removed, and so on our __exit path. By trying to repurpose zram_index_mutex we end up doing too much with it and making the sysfs attributes rather fragile for most uses. Consider disksize_show(), which would have to become:

static ssize_t disksize_show(struct device *dev,
		struct device_attribute *attr, char *buf)
{
	struct zram *zram = dev_to_zram(dev);
+	size_t disksize;

+	down_read(&zram_index_rwlock);
+	disksize = zram->disksize;
+	up_read(&zram_index_rwlock);

-	return scnprintf(buf, PAGE_SIZE, "%llu\n", zram->disksize);
+	return scnprintf(buf, PAGE_SIZE, "%llu\n", disksize);
}

What's wrong with this? It can block during a write, yes, but there is a type of write which will make this crash after the read lock is acquired: when the instance is removed. What if we try down_read_trylock()?
static ssize_t disksize_show(struct device *dev,
		struct device_attribute *attr, char *buf)
{
	struct zram *zram = dev_to_zram(dev);
+	size_t disksize;

+	if (!down_read_trylock(&zram_index_rwlock))
+		return -ENODEV;
+
+	disksize = zram->disksize;
+	up_read(&zram_index_rwlock);

-	return scnprintf(buf, PAGE_SIZE, "%llu\n", zram->disksize);
+	return scnprintf(buf, PAGE_SIZE, "%llu\n", disksize);
}

What's wrong with this? If it got the lock, it should be OK, as it is preventing writes from taking the lock for a bit. But then this just becomes pretty fragile: it will fail whenever another read or write is happening, perhaps triggering quite a few regressions in tests. And if we use write_trylock() we end up with the same fragile nature of failing the read with ENODEV for any silly thing going on with the driver. And come to think of it, the last patch I had sent with a new DECLARE_RWSEM(zram_unload) also has this same issue, making most sysfs attributes rather fragile. So, I still believe we should split this up into two separate patches, as I had originally proposed. I looked into the race you described and I believe I have a generic solution for that as well. As for the sysfs deadlock possible with drivers, this fixes it in a generic way: commit fac43d8025727a74f80a183cc5eb74ed902a5d14 Author: Luis Chamberlain <mcgrof@kernel.org> Date: Sat Mar 27 14:58:15 2021 +0000 sysfs: add optional module_owner to attribute This is needed as otherwise the owner of the attribute or group read/store might have a shared lock used on driver removal, and deadlock if we race with driver removal.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index b44cd8ee2eb7..494695ff227e 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -1899,6 +1899,7 @@ static struct attribute *zram_disk_attrs[] = { static const struct attribute_group zram_disk_attr_group = { .attrs = zram_disk_attrs, + .owner = THIS_MODULE, }; static const struct attribute_group *zram_disk_attr_groups[] = { diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 7e0e62deab53..faedaaaed247 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -14,6 +14,7 @@ #include <linux/slab.h> #include <linux/security.h> #include <linux/hash.h> +#include <linux/module.h> #include "kernfs-internal.h" @@ -415,9 +416,15 @@ struct kernfs_node *kernfs_get_active(struct kernfs_node *kn) if (unlikely(!kn)) return NULL; - if (!atomic_inc_unless_negative(&kn->active)) + if (unlikely(kn->owner && !try_module_get(kn->owner))) return NULL; + if (!atomic_inc_unless_negative(&kn->active)) { + if (kn->owner) + module_put(kn->owner); + return NULL; + } + if (kernfs_lockdep(kn)) rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_); return kn; @@ -440,6 +447,8 @@ void kernfs_put_active(struct kernfs_node *kn) if (kernfs_lockdep(kn)) rwsem_release(&kn->dep_map, _RET_IP_); v = atomic_dec_return(&kn->active); + if (kn->owner) + module_put(kn->owner); if (likely(v != KN_DEACTIVATED_BIAS)) return; @@ -612,6 +621,7 @@ struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry) static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root, struct kernfs_node *parent, const char *name, umode_t mode, + struct module *owner, kuid_t uid, kgid_t gid, unsigned flags) { @@ -647,6 +657,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root, kn->name = name; kn->mode = mode; + kn->owner = owner; kn->flags = flags; if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) { @@ -680,13 +691,14 @@ 
static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root, struct kernfs_node *kernfs_new_node(struct kernfs_node *parent, const char *name, umode_t mode, + struct module *owner, kuid_t uid, kgid_t gid, unsigned flags) { struct kernfs_node *kn; kn = __kernfs_new_node(kernfs_root(parent), parent, - name, mode, uid, gid, flags); + name, mode, owner, uid, gid, flags); if (kn) { kernfs_get(parent); kn->parent = parent; @@ -967,7 +979,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops, root->id_highbits = 1; kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO, - GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, + NULL, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR); if (!kn) { idr_destroy(&root->ino_idr); @@ -1006,6 +1018,7 @@ void kernfs_destroy_root(struct kernfs_root *root) * @parent: parent in which to create a new directory * @name: name of the new directory * @mode: mode of the new directory + * @owner: if set, the module owner of this directory * @uid: uid of the new directory * @gid: gid of the new directory * @priv: opaque data associated with the new directory @@ -1015,6 +1028,7 @@ void kernfs_destroy_root(struct kernfs_root *root) */ struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent, const char *name, umode_t mode, + struct module *owner, kuid_t uid, kgid_t gid, void *priv, const void *ns) { @@ -1023,7 +1037,7 @@ struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent, /* allocate */ kn = kernfs_new_node(parent, name, mode | S_IFDIR, - uid, gid, KERNFS_DIR); + owner, uid, gid, KERNFS_DIR); if (!kn) return ERR_PTR(-ENOMEM); @@ -1055,7 +1069,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent, /* allocate */ kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR, - GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR); + NULL, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR); if (!kn) return ERR_PTR(-ENOMEM); diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c index 
c75719312147..69080869abfc 100644 --- a/fs/kernfs/file.c +++ b/fs/kernfs/file.c @@ -961,6 +961,7 @@ const struct file_operations kernfs_file_fops = { * @uid: uid of the file * @gid: gid of the file * @size: size of the file + * @owner: if set, the module owner of the file * @ops: kernfs operations for the file * @priv: private data for the file * @ns: optional namespace tag of the file @@ -970,7 +971,9 @@ const struct file_operations kernfs_file_fops = { */ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent, const char *name, - umode_t mode, kuid_t uid, kgid_t gid, + umode_t mode, + struct module *owner, + kuid_t uid, kgid_t gid, loff_t size, const struct kernfs_ops *ops, void *priv, const void *ns, @@ -983,7 +986,7 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent, flags = KERNFS_FILE; kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG, - uid, gid, flags); + owner, uid, gid, flags); if (!kn) return ERR_PTR(-ENOMEM); diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h index ccc3b44f6306..5598c1127db7 100644 --- a/fs/kernfs/kernfs-internal.h +++ b/fs/kernfs/kernfs-internal.h @@ -112,6 +112,7 @@ void kernfs_put_active(struct kernfs_node *kn); int kernfs_add_one(struct kernfs_node *kn); struct kernfs_node *kernfs_new_node(struct kernfs_node *parent, const char *name, umode_t mode, + struct module *owner, kuid_t uid, kgid_t gid, unsigned flags); diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c index 5432883d819f..8e130a411a9f 100644 --- a/fs/kernfs/symlink.c +++ b/fs/kernfs/symlink.c @@ -36,7 +36,8 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent, gid = target->iattr->ia_gid; } - kn = kernfs_new_node(parent, name, S_IFLNK|S_IRWXUGO, uid, gid, + kn = kernfs_new_node(parent, name, S_IFLNK|S_IRWXUGO, + target->owner, uid, gid, KERNFS_LINK); if (!kn) return ERR_PTR(-ENOMEM); diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c index 59dffd5ca517..5aff6ff07392 100644 --- a/fs/sysfs/dir.c 
+++ b/fs/sysfs/dir.c @@ -57,7 +57,8 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns) kobject_get_ownership(kobj, &uid, &gid); kn = kernfs_create_dir_ns(parent, kobject_name(kobj), - S_IRWXU | S_IRUGO | S_IXUGO, uid, gid, + S_IRWXU | S_IRUGO | S_IXUGO, NULL, + uid, gid, kobj, ns); if (IS_ERR(kn)) { if (PTR_ERR(kn) == -EEXIST) diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c index 9aefa7779b29..acd0756199e1 100644 --- a/fs/sysfs/file.c +++ b/fs/sysfs/file.c @@ -314,8 +314,8 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent, key = attr->key ?: (struct lock_class_key *)&attr->skey; #endif - kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid, - size, ops, (void *)attr, ns, key); + kn = __kernfs_create_file(parent, attr->name, mode & 0777, attr->owner, + uid, gid, size, ops, (void *)attr, ns, key); if (IS_ERR(kn)) { if (PTR_ERR(kn) == -EEXIST) sysfs_warn_dup(parent, attr->name); diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c index 64e6a6698935..010c2dade2d6 100644 --- a/fs/sysfs/group.c +++ b/fs/sysfs/group.c @@ -136,6 +136,7 @@ static int internal_create_group(struct kobject *kobj, int update, } else { kn = kernfs_create_dir_ns(kobj->sd, grp->name, S_IRWXU | S_IRUGO | S_IXUGO, + grp->owner, uid, gid, kobj, NULL); if (IS_ERR(kn)) { if (PTR_ERR(kn) == -EEXIST) diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h index 9e8ca8743c26..43abaed26735 100644 --- a/include/linux/kernfs.h +++ b/include/linux/kernfs.h @@ -155,6 +155,7 @@ struct kernfs_node { unsigned short flags; umode_t mode; + struct module *owner; struct kernfs_iattrs *iattr; }; @@ -368,12 +369,14 @@ void kernfs_destroy_root(struct kernfs_root *root); struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent, const char *name, umode_t mode, + struct module *owner, kuid_t uid, kgid_t gid, void *priv, const void *ns); struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent, const char *name); struct kernfs_node *__kernfs_create_file(struct 
kernfs_node *parent, const char *name, umode_t mode, + struct module *owner, kuid_t uid, kgid_t gid, loff_t size, const struct kernfs_ops *ops, @@ -465,13 +468,15 @@ static inline void kernfs_destroy_root(struct kernfs_root *root) { } static inline struct kernfs_node * kernfs_create_dir_ns(struct kernfs_node *parent, const char *name, - umode_t mode, kuid_t uid, kgid_t gid, + umode_t mode, struct module *owner, + kuid_t uid, kgid_t gid, void *priv, const void *ns) { return ERR_PTR(-ENOSYS); } static inline struct kernfs_node * __kernfs_create_file(struct kernfs_node *parent, const char *name, - umode_t mode, kuid_t uid, kgid_t gid, + umode_t mode, struct module *owner, + kuid_t uid, kgid_t gid, loff_t size, const struct kernfs_ops *ops, void *priv, const void *ns, struct lock_class_key *key) { return ERR_PTR(-ENOSYS); } @@ -558,14 +563,15 @@ static inline struct kernfs_node * kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode, void *priv) { - return kernfs_create_dir_ns(parent, name, mode, + return kernfs_create_dir_ns(parent, name, mode, parent->owner, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, priv, NULL); } static inline struct kernfs_node * kernfs_create_file_ns(struct kernfs_node *parent, const char *name, - umode_t mode, kuid_t uid, kgid_t gid, + umode_t mode, struct module *owner, + kuid_t uid, kgid_t gid, loff_t size, const struct kernfs_ops *ops, void *priv, const void *ns) { @@ -574,15 +580,16 @@ kernfs_create_file_ns(struct kernfs_node *parent, const char *name, #ifdef CONFIG_DEBUG_LOCK_ALLOC key = (struct lock_class_key *)&ops->lockdep_key; #endif - return __kernfs_create_file(parent, name, mode, uid, gid, + return __kernfs_create_file(parent, name, mode, owner, uid, gid, size, ops, priv, ns, key); } static inline struct kernfs_node * kernfs_create_file(struct kernfs_node *parent, const char *name, umode_t mode, + struct module *owner, loff_t size, const struct kernfs_ops *ops, void *priv) { - return kernfs_create_file_ns(parent, name, 
mode, + return kernfs_create_file_ns(parent, name, mode, owner, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, size, ops, priv, NULL); } diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h index d76a1ddf83a3..6652504b860e 100644 --- a/include/linux/sysfs.h +++ b/include/linux/sysfs.h @@ -30,6 +30,7 @@ enum kobj_ns_type; struct attribute { const char *name; umode_t mode; + struct module *owner; #ifdef CONFIG_DEBUG_LOCK_ALLOC bool ignore_lockdep:1; struct lock_class_key *key; @@ -80,6 +81,7 @@ do { \ * @attrs: Pointer to NULL terminated list of attributes. * @bin_attrs: Pointer to NULL terminated list of binary attributes. * Either attrs or bin_attrs or both must be provided. + * @module: If set, module responsible for this attribute group */ struct attribute_group { const char *name; @@ -89,6 +91,7 @@ struct attribute_group { struct bin_attribute *, int); struct attribute **attrs; struct bin_attribute **bin_attrs; + struct module *owner; }; /* @@ -100,38 +103,52 @@ struct attribute_group { #define __ATTR(_name, _mode, _show, _store) { \ .attr = {.name = __stringify(_name), \ - .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \ + .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \ + .owner = THIS_MODULE, \ + }, \ .show = _show, \ .store = _store, \ } #define __ATTR_PREALLOC(_name, _mode, _show, _store) { \ .attr = {.name = __stringify(_name), \ - .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode) },\ + .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode),\ + .owner = THIS_MODULE, \ + }, \ .show = _show, \ .store = _store, \ } #define __ATTR_RO(_name) { \ - .attr = { .name = __stringify(_name), .mode = 0444 }, \ + .attr = { .name = __stringify(_name), \ + .mode = 0444, \ + .owner = THIS_MODULE, \ + }, \ .show = _name##_show, \ } #define __ATTR_RO_MODE(_name, _mode) { \ .attr = { .name = __stringify(_name), \ - .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \ + .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \ + .owner = THIS_MODULE, \ + }, \ .show = _name##_show, \ } #define 
__ATTR_RW_MODE(_name, _mode) { \ .attr = { .name = __stringify(_name), \ - .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \ + .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \ + .owner = THIS_MODULE, \ + }, \ .show = _name##_show, \ .store = _name##_store, \ } #define __ATTR_WO(_name) { \ - .attr = { .name = __stringify(_name), .mode = 0200 }, \ + .attr = { .name = __stringify(_name), \ + .mode = 0200, \ + .owner = THIS_MODULE, \ + }, \ .store = _name##_store, \ } @@ -141,8 +158,11 @@ struct attribute_group { #ifdef CONFIG_DEBUG_LOCK_ALLOC #define __ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) { \ - .attr = {.name = __stringify(_name), .mode = _mode, \ - .ignore_lockdep = true }, \ + .attr = {.name = __stringify(_name), \ + .mode = _mode, \ + .ignore_lockdep = true, \ + .owner = THIS_MODULE, \ + }, \ .show = _show, \ .store = _store, \ } @@ -159,6 +179,7 @@ static const struct attribute_group *_name##_groups[] = { \ #define ATTRIBUTE_GROUPS(_name) \ static const struct attribute_group _name##_group = { \ .attrs = _name##_attrs, \ + .owner = THIS_MODULE, \ }; \ __ATTRIBUTE_GROUPS(_name) @@ -193,20 +214,29 @@ struct bin_attribute { /* macros to create static binary attributes easier */ #define __BIN_ATTR(_name, _mode, _read, _write, _size) { \ - .attr = { .name = __stringify(_name), .mode = _mode }, \ + .attr = { .name = __stringify(_name), \ + .mode = _mode, \ + .owner = THIS_MODULE, \ + }, \ .read = _read, \ .write = _write, \ .size = _size, \ } #define __BIN_ATTR_RO(_name, _size) { \ - .attr = { .name = __stringify(_name), .mode = 0444 }, \ + .attr = { .name = __stringify(_name), \ + .mode = 0444, \ + .owner = THIS_MODULE, \ + }, \ .read = _name##_read, \ .size = _size, \ } #define __BIN_ATTR_WO(_name, _size) { \ - .attr = { .name = __stringify(_name), .mode = 0200 }, \ + .attr = { .name = __stringify(_name), \ + .mode = 0200, \ + .owner = THIS_MODULE, \ + }, \ .write = _name##_write, \ .size = _size, \ } diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 
9153b20e5cc6..c3b650748029 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3820,7 +3820,7 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp, key = &cft->lockdep_key; #endif kn = __kernfs_create_file(cgrp->kn, cgroup_file_name(cgrp, cft, name), - cgroup_file_mode(cft), + cgroup_file_mode(cft), NULL, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0, cft->kf_ops, cft, NULL, key); As for the race you described, I think this might resolve it: diff --git a/drivers/base/core.c b/drivers/base/core.c index f29839382f81..5e3f65ab8148 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -29,6 +29,7 @@ #include <linux/sched/mm.h> #include <linux/sysfs.h> #include <linux/dma-map-ops.h> /* for dma_default_coherent */ +#include <linux/blkdev.h> #include "base.h" #include "power/power.h" @@ -1911,11 +1912,20 @@ static inline int device_is_not_partition(struct device *dev) { return !(dev->type == &part_type); } +static inline bool device_is_disk(struct device *dev) +{ + return (dev->type == &disk_type); +} #else static inline int device_is_not_partition(struct device *dev) { return 1; } + +static inline bool device_is_disk(struct device *dev) +{ + return false; +} #endif static int @@ -1960,6 +1970,19 @@ const char *dev_driver_string(const struct device *dev) } EXPORT_SYMBOL(dev_driver_string); +static int dev_type_get(struct device *dev) +{ + if (device_is_disk(dev)) + return !!bdgrab(dev_to_bdev(dev)); + return 1; +} + +static void dev_type_put(struct device *dev) +{ + if (device_is_disk(dev)) + bdput(dev_to_bdev(dev)); +} + #define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr) static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, @@ -1969,12 +1992,20 @@ static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, struct device *dev = kobj_to_dev(kobj); ssize_t ret = -EIO; + if (!dev_type_get(dev)) + return -ENODEV; + if (dev_attr->show) ret = dev_attr->show(dev, dev_attr, 
buf); if (ret >= (ssize_t)PAGE_SIZE) { printk("dev_attr_show: %pS returned bad count\n", dev_attr->show); } + + dev_type_put(dev); + + put_device(dev); + return ret; } @@ -1985,8 +2016,14 @@ static ssize_t dev_attr_store(struct kobject *kobj, struct attribute *attr, struct device *dev = kobj_to_dev(kobj); ssize_t ret = -EIO; + if (!dev_type_get(dev)) + return -ENODEV; + if (dev_attr->store) ret = dev_attr->store(dev, dev_attr, buf, count); + + dev_type_put(dev); + return ret; } ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-01 23:59 ` Luis Chamberlain @ 2021-04-02 7:54 ` Greg KH 2021-04-02 18:30 ` Luis Chamberlain ` (2 more replies) 2021-04-05 17:07 ` Minchan Kim 1 sibling, 3 replies; 44+ messages in thread From: Greg KH @ 2021-04-02 7:54 UTC (permalink / raw) To: Luis Chamberlain Cc: Minchan Kim, keescook, dhowells, hch, mbenes, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote: > As for the sysfs deadlock possible with drivers, this fixes it in a generic way: > > commit fac43d8025727a74f80a183cc5eb74ed902a5d14 > Author: Luis Chamberlain <mcgrof@kernel.org> > Date: Sat Mar 27 14:58:15 2021 +0000 > > sysfs: add optional module_owner to attribute > > This is needed as otherwise the owner of the attribute > or group read/store might have a shared lock used on driver removal, > and deadlock if we race with driver removal. > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> No, please no. Module removal is a "best effort"; if the system dies when it happens, that's on you. I am not willing to expend extra energy and maintenance on core things like sysfs for stuff like this that does not matter in any system other than a developer's box. Lock data, not code please. Trying to tie data structures' lifespans to the lifespan of code is a tangled mess, and one that I do not want to add to in any form. sorry, greg k-h ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-02 7:54 ` Greg KH @ 2021-04-02 18:30 ` Luis Chamberlain 2021-04-03 6:13 ` Greg KH 2021-04-07 20:17 ` Josh Poimboeuf 2021-04-08 1:37 ` Thomas Gleixner 2 siblings, 1 reply; 44+ messages in thread From: Luis Chamberlain @ 2021-04-02 18:30 UTC (permalink / raw) To: Greg KH Cc: Minchan Kim, keescook, dhowells, hch, mbenes, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote: > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote: > > As for the sysfs deadlock possible with drivers, this fixes it in a generic way: > > > > commit fac43d8025727a74f80a183cc5eb74ed902a5d14 > > Author: Luis Chamberlain <mcgrof@kernel.org> > > Date: Sat Mar 27 14:58:15 2021 +0000 > > > > sysfs: add optional module_owner to attribute > > > > This is needed as otherwise the owner of the attribute > > or group read/store might have a shared lock used on driver removal, > > and deadlock if we race with driver removal. > > > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > > No, please no. Module removal is a "best effort", Not for live patching. I am not sure if I am missing any other valid use case? > if the system dies when it happens, that's on you. I think the better approach for now is simply to have testers and the like deal with this open coded. I cannot be sure whether, beyond live patching, there are other valid use cases for module removal, and there are races we really may care about where userspace *will* typically be mucking with sysfs attributes. Monitoring my systems' sysfs attributes, I am actually quite surprised at the random pokes at them. > I am not willing to expend extra energy > and maintenance of core things like sysfs for stuff like this that does > not matter in any system other than a developer's box. Should we document this as well? Without that, it is unclear that tons of random tests are quietly invalidated.
At least the deadlock I spotted can take a pretty common form on many drivers. > Lock data, not code please. Trying to tie data structure's lifespans > to the lifespan of code is a tangled mess, and one that I do not want to > add to in any form. Driver developers will simply have to open code these protections. In light of what I see on LTP / fuzzing, I suspect the use case will grow and we'll have to revisit this in the future. But for now, sure, we can just open code the required protections everywhere to not crash on module removal. Luis ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-02 18:30 ` Luis Chamberlain @ 2021-04-03 6:13 ` Greg KH [not found] ` <20210406003152.GZ4332@42.do-not-panic.com> 0 siblings, 1 reply; 44+ messages in thread From: Greg KH @ 2021-04-03 6:13 UTC (permalink / raw) To: Luis Chamberlain Cc: Minchan Kim, keescook, dhowells, hch, mbenes, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel On Fri, Apr 02, 2021 at 06:30:16PM +0000, Luis Chamberlain wrote: > On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote: > > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote: > > > As for the syfs deadlock possible with drivers, this fixes it in a generic way: > > > > > > commit fac43d8025727a74f80a183cc5eb74ed902a5d14 > > > Author: Luis Chamberlain <mcgrof@kernel.org> > > > Date: Sat Mar 27 14:58:15 2021 +0000 > > > > > > sysfs: add optional module_owner to attribute > > > > > > This is needed as otherwise the owner of the attribute > > > or group read/store might have a shared lock used on driver removal, > > > and deadlock if we race with driver removal. > > > > > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > > > > No, please no. Module removal is a "best effort", > > Not for live patching. I am not sure if I am missing any other valid > use case? live patching removes modules? We have so many code paths that are "best effort" when it comes to module unloading, trying to resolve this one is a valiant try, but not realistic. > > if the system dies when it happens, that's on you. > > I think the better approach for now is simply to call testers / etc to > deal with this open coded. I cannot be sure that other than live > patching there may be other valid use cases for module removal, and for > races we really may care for where userspace *will* typically be mucking > with sysfs attributes. Monitoring my systems's sysfs attributes I am > actually quite surprised at the random pokes at them. 
> > > I am not willing to expend extra energy > > and maintance of core things like sysfs for stuff like this that does > > not matter in any system other than a developer's box. > > Should we document this as well? Without this it is unclear that tons of > random tests are sanely nullified. At least this dead lock I spotted can > be pretty common form on many drivers. What other drivers have this problem? > > Lock data, not code please. Trying to tie data structure's lifespans > > to the lifespan of code is a tangled mess, and one that I do not want to > > add to in any form. > > Driver developers will simply have to open code these protections. In > light of what I see on LTP / fuzzing, I suspect the use case will grow > and we'll have to revisit this in the future. But for now, sure, we can > just open code the required protections everywhere to not crash on module > removal. LTP and fuzzing too do not remove modules. So I do not understand the root problem here, that's just something that does not happen on a real system. thanks, greg k-h ^ permalink raw reply [flat|nested] 44+ messages in thread
[parent not found: <20210406003152.GZ4332@42.do-not-panic.com>]
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate [not found] ` <20210406003152.GZ4332@42.do-not-panic.com> @ 2021-04-06 12:00 ` Miroslav Benes 2021-04-06 15:54 ` Josh Poimboeuf 0 siblings, 1 reply; 44+ messages in thread From: Miroslav Benes @ 2021-04-06 12:00 UTC (permalink / raw) To: Greg KH, Luis Chamberlain Cc: mbenes, Minchan Kim, keescook, dhowells, hch, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel, live-patching Hi, > > Driver developers will simply have to open code these protections. In > > light of what I see on LTP / fuzzing, I suspect the use case will grow > > and we'll have to revisit this in the future. But for now, sure, we can > > just open code the required protections everywhere to not crash on module > > removal. > > LTP and fuzzing too do not remove modules. So I do not understand the > root problem here, that's just something that does not happen on a real > system. If I am not mistaken, the issue that Luis tries to solve here was indeed found by running LTP. > On Sat, Apr 03, 2021 at 08:13:23AM +0200, Greg KH wrote: > > On Fri, Apr 02, 2021 at 06:30:16PM +0000, Luis Chamberlain wrote: > > > On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote: > > > > No, please no. Module removal is a "best effort", > > > > > > Not for live patching. I am not sure if I am missing any other valid > > > use case? > > > > live patching removes modules? We have so many code paths that are > > "best effort" when it comes to module unloading, trying to resolve this > > one is a valiant try, but not realistic. > > Miroslav, your input / help here would be valuable. I did the > generalization work because you said it would be worthy for you too... Yes, we have the option to revert and remove the existing live patch from the system. I am not sure how (if) it is used in practice. At least at SUSE we do not support the option. But we are only one of the many downstream users. So yes, there is the option. 
Miroslav ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-06 12:00 ` Miroslav Benes @ 2021-04-06 15:54 ` Josh Poimboeuf 2021-04-07 14:09 ` Peter Zijlstra 0 siblings, 1 reply; 44+ messages in thread From: Josh Poimboeuf @ 2021-04-06 15:54 UTC (permalink / raw) To: Miroslav Benes Cc: Greg KH, Luis Chamberlain, mbenes, Minchan Kim, keescook, dhowells, hch, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel, live-patching, Jessica Yu, Peter Zijlstra On Tue, Apr 06, 2021 at 02:00:19PM +0200, Miroslav Benes wrote: > Hi, > > > > Driver developers will simply have to open code these protections. In > > > light of what I see on LTP / fuzzing, I suspect the use case will grow > > > and we'll have to revisit this in the future. But for now, sure, we can > > > just open code the required protections everywhere to not crash on module > > > removal. > > > > LTP and fuzzing too do not remove modules. So I do not understand the > > root problem here, that's just something that does not happen on a real > > system. > > If I am not mistaken, the issue that Luis tries to solve here was indeed > found by running LTP. > > > On Sat, Apr 03, 2021 at 08:13:23AM +0200, Greg KH wrote: > > > On Fri, Apr 02, 2021 at 06:30:16PM +0000, Luis Chamberlain wrote: > > > > On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote: > > > > > No, please no. Module removal is a "best effort", > > > > > > > > Not for live patching. I am not sure if I am missing any other valid > > > > use case? > > > > > > live patching removes modules? We have so many code paths that are > > > "best effort" when it comes to module unloading, trying to resolve this > > > one is a valiant try, but not realistic. > > > > Miroslav, your input / help here would be valuable. I did the > > generalization work because you said it would be worthy for you too... > > Yes, we have the option to revert and remove the existing live patch from > the system. 
I am not sure how (if) it is used in practice. > > At least at SUSE we do not support the option. But we are only one of the > many downstream users. So yes, there is the option. Same for Red Hat. Unloading livepatch modules seems to work fine, but isn't officially supported. That said, if rmmod is just considered a development aid, and we're going to be ignoring bugs, we should make it official with a new TAINT_RMMOD. -- Josh ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-06 15:54 ` Josh Poimboeuf @ 2021-04-07 14:09 ` Peter Zijlstra 2021-04-07 15:30 ` Josh Poimboeuf 0 siblings, 1 reply; 44+ messages in thread From: Peter Zijlstra @ 2021-04-07 14:09 UTC (permalink / raw) To: Josh Poimboeuf Cc: Miroslav Benes, Greg KH, Luis Chamberlain, mbenes, Minchan Kim, keescook, dhowells, hch, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel, live-patching, Jessica Yu On Tue, Apr 06, 2021 at 10:54:23AM -0500, Josh Poimboeuf wrote: > Same for Red Hat. Unloading livepatch modules seems to work fine, but > isn't officially supported. > > That said, if rmmod is just considered a development aid, and we're > going to be ignoring bugs, we should make it official with a new > TAINT_RMMOD. Another option would be to have live-patch modules leak a module reference by default, except when some debug sysctl is set or something. Then only those LP modules loaded while the sysctl is set to 'YOLO' can be unloaded. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-07 14:09 ` Peter Zijlstra @ 2021-04-07 15:30 ` Josh Poimboeuf 2021-04-07 16:48 ` Peter Zijlstra 0 siblings, 1 reply; 44+ messages in thread From: Josh Poimboeuf @ 2021-04-07 15:30 UTC (permalink / raw) To: Peter Zijlstra Cc: Miroslav Benes, Greg KH, Luis Chamberlain, mbenes, Minchan Kim, keescook, dhowells, hch, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel, live-patching, Jessica Yu On Wed, Apr 07, 2021 at 04:09:44PM +0200, Peter Zijlstra wrote: > On Tue, Apr 06, 2021 at 10:54:23AM -0500, Josh Poimboeuf wrote: > > > Same for Red Hat. Unloading livepatch modules seems to work fine, but > > isn't officially supported. > > > > That said, if rmmod is just considered a development aid, and we're > > going to be ignoring bugs, we should make it official with a new > > TAINT_RMMOD. > > Another option would be to have live-patch modules leak a module > reference by default, except when some debug sysctl is set or something. > Then only those LP modules loaded while the sysctl is set to 'YOLO' can > be unloaded. The issue is broader than just live patching. My suggestion was that if we aren't going to fix bugs in kernel module unloading, then unloading modules shouldn't be supported, and should taint the kernel. -- Josh ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-07 15:30 ` Josh Poimboeuf @ 2021-04-07 16:48 ` Peter Zijlstra 0 siblings, 0 replies; 44+ messages in thread From: Peter Zijlstra @ 2021-04-07 16:48 UTC (permalink / raw) To: Josh Poimboeuf Cc: Miroslav Benes, Greg KH, Luis Chamberlain, mbenes, Minchan Kim, keescook, dhowells, hch, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel, live-patching, Jessica Yu On Wed, Apr 07, 2021 at 10:30:31AM -0500, Josh Poimboeuf wrote: > On Wed, Apr 07, 2021 at 04:09:44PM +0200, Peter Zijlstra wrote: > > On Tue, Apr 06, 2021 at 10:54:23AM -0500, Josh Poimboeuf wrote: > > > > > Same for Red Hat. Unloading livepatch modules seems to work fine, but > > > isn't officially supported. > > > > > > That said, if rmmod is just considered a development aid, and we're > > > going to be ignoring bugs, we should make it official with a new > > > TAINT_RMMOD. > > > > Another option would be to have live-patch modules leak a module > > reference by default, except when some debug sysctl is set or something. > > Then only those LP modules loaded while the sysctl is set to 'YOLO' can > > be unloaded. > > The issue is broader than just live patching. > > My suggestion was that if we aren't going to fix bugs in kernel module > unloading, then unloading modules shouldn't be supported, and should > taint the kernel. Hold up, what? However much I dislike modules (and that is lots), if you don't want to support rmmod, you have to leak a reference to self in init. Barring that you get to fix any and all unload bugs. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-02 7:54 ` Greg KH 2021-04-02 18:30 ` Luis Chamberlain @ 2021-04-07 20:17 ` Josh Poimboeuf 2021-04-08 6:18 ` Greg KH 2021-04-08 1:37 ` Thomas Gleixner 2 siblings, 1 reply; 44+ messages in thread From: Josh Poimboeuf @ 2021-04-07 20:17 UTC (permalink / raw) To: Greg KH Cc: Luis Chamberlain, Minchan Kim, keescook, dhowells, hch, mbenes, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel, Steven Rostedt, Peter Zijlstra, Jessica Yu, Thomas Gleixner, Linus Torvalds On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote: > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote: > > As for the syfs deadlock possible with drivers, this fixes it in a generic way: > > > > commit fac43d8025727a74f80a183cc5eb74ed902a5d14 > > Author: Luis Chamberlain <mcgrof@kernel.org> > > Date: Sat Mar 27 14:58:15 2021 +0000 > > > > sysfs: add optional module_owner to attribute > > > > This is needed as otherwise the owner of the attribute > > or group read/store might have a shared lock used on driver removal, > > and deadlock if we race with driver removal. > > > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > > No, please no. Module removal is a "best effort", if the system dies > when it happens, that's on you. I am not willing to expend extra energy > and maintance of core things like sysfs for stuff like this that does > not matter in any system other than a developer's box. So I mentioned this on IRC, and some folks were surprised to hear that module unloading is unsupported and is just a development aid. Is this stance documented anywhere? If we really believe this to be true, we should make rmmod taint the kernel. -- Josh ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-07 20:17 ` Josh Poimboeuf @ 2021-04-08 6:18 ` Greg KH 2021-04-08 13:16 ` Steven Rostedt 2021-04-08 13:37 ` Josh Poimboeuf 0 siblings, 2 replies; 44+ messages in thread From: Greg KH @ 2021-04-08 6:18 UTC (permalink / raw) To: Josh Poimboeuf Cc: Luis Chamberlain, Minchan Kim, keescook, dhowells, hch, mbenes, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel, Steven Rostedt, Peter Zijlstra, Jessica Yu, Thomas Gleixner, Linus Torvalds On Wed, Apr 07, 2021 at 03:17:46PM -0500, Josh Poimboeuf wrote: > On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote: > > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote: > > > As for the syfs deadlock possible with drivers, this fixes it in a generic way: > > > > > > commit fac43d8025727a74f80a183cc5eb74ed902a5d14 > > > Author: Luis Chamberlain <mcgrof@kernel.org> > > > Date: Sat Mar 27 14:58:15 2021 +0000 > > > > > > sysfs: add optional module_owner to attribute > > > > > > This is needed as otherwise the owner of the attribute > > > or group read/store might have a shared lock used on driver removal, > > > and deadlock if we race with driver removal. > > > > > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > > > > No, please no. Module removal is a "best effort", if the system dies > > when it happens, that's on you. I am not willing to expend extra energy > > and maintance of core things like sysfs for stuff like this that does > > not matter in any system other than a developer's box. > > So I mentioned this on IRC, and some folks were surprised to hear that > module unloading is unsupported and is just a development aid. > > Is this stance documented anywhere? > > If we really believe this to be true, we should make rmmod taint the > kernel. My throw-away comment here seems to have gotten way more attention than it deserved, sorry about that everyone. 
Nothing is supported for anything here, it's all "best effort" :) And I would love a taint for rmmod, but what is that going to help out with? thanks, greg k-h ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-08 6:18 ` Greg KH @ 2021-04-08 13:16 ` Steven Rostedt 2021-04-08 13:37 ` Josh Poimboeuf 1 sibling, 0 replies; 44+ messages in thread From: Steven Rostedt @ 2021-04-08 13:16 UTC (permalink / raw) To: Greg KH Cc: Josh Poimboeuf, Luis Chamberlain, Minchan Kim, keescook, dhowells, hch, mbenes, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel, Peter Zijlstra, Jessica Yu, Thomas Gleixner, Linus Torvalds On Thu, 8 Apr 2021 08:18:21 +0200 Greg KH <gregkh@linuxfoundation.org> wrote: > And I would love a taint for rmmod, but what is that going to help out > with? Just like any other taint. If a rmmod can cause the system to lose integrity, the rmmod could cause a subtle issue that manifests itself into something more serious and may look unrelated. If you have a bug report with the rmmod taint, one could ask to try to recreate the bug without doing rmmod. Or perhaps we have a similar bug reports that all show the rmmod taint. That would give us an impression that something was removed and caused the system to lose stability. -- Steve ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-08 6:18 ` Greg KH 2021-04-08 13:16 ` Steven Rostedt @ 2021-04-08 13:37 ` Josh Poimboeuf 1 sibling, 0 replies; 44+ messages in thread From: Josh Poimboeuf @ 2021-04-08 13:37 UTC (permalink / raw) To: Greg KH Cc: Luis Chamberlain, Minchan Kim, keescook, dhowells, hch, mbenes, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel, Steven Rostedt, Peter Zijlstra, Jessica Yu, Thomas Gleixner, Linus Torvalds On Thu, Apr 08, 2021 at 08:18:21AM +0200, Greg KH wrote: > On Wed, Apr 07, 2021 at 03:17:46PM -0500, Josh Poimboeuf wrote: > > On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote: > > > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote: > > > > As for the syfs deadlock possible with drivers, this fixes it in a generic way: > > > > > > > > commit fac43d8025727a74f80a183cc5eb74ed902a5d14 > > > > Author: Luis Chamberlain <mcgrof@kernel.org> > > > > Date: Sat Mar 27 14:58:15 2021 +0000 > > > > > > > > sysfs: add optional module_owner to attribute > > > > > > > > This is needed as otherwise the owner of the attribute > > > > or group read/store might have a shared lock used on driver removal, > > > > and deadlock if we race with driver removal. > > > > > > > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > > > > > > No, please no. Module removal is a "best effort", if the system dies > > > when it happens, that's on you. I am not willing to expend extra energy > > > and maintance of core things like sysfs for stuff like this that does > > > not matter in any system other than a developer's box. > > > > So I mentioned this on IRC, and some folks were surprised to hear that > > module unloading is unsupported and is just a development aid. > > > > Is this stance documented anywhere? > > > > If we really believe this to be true, we should make rmmod taint the > > kernel. 
> > My throw-away comment here seems to have gotten way more attention than > it deserved, sorry about that everyone. > > Nothing is supported for anything here, it's all "best effort" :) > > And I would love a taint for rmmod, but what is that going to help out > with? I'm having trouble following this conclusion. Sure, we give our best effort, but if "nothing is supported" then what are we even doing here? Either we aim to support module unloading, or we don't. If we do support it, "best effort" isn't a valid reason for nacking fixes. If we don't support it, but still want to keep it half-working for development purposes, tainting on rmmod will make it crystal clear that the system has been destabilized. -- Josh ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-02 7:54 ` Greg KH 2021-04-02 18:30 ` Luis Chamberlain 2021-04-07 20:17 ` Josh Poimboeuf @ 2021-04-08 1:37 ` Thomas Gleixner 2021-04-08 6:16 ` Greg KH 2 siblings, 1 reply; 44+ messages in thread From: Thomas Gleixner @ 2021-04-08 1:37 UTC (permalink / raw) To: Greg KH, Luis Chamberlain Cc: Minchan Kim, keescook, dhowells, hch, mbenes, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel Greg, On Fri, Apr 02 2021 at 09:54, Greg KH wrote: > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote: >> As for the syfs deadlock possible with drivers, this fixes it in a generic way: >> >> commit fac43d8025727a74f80a183cc5eb74ed902a5d14 >> Author: Luis Chamberlain <mcgrof@kernel.org> >> Date: Sat Mar 27 14:58:15 2021 +0000 >> >> sysfs: add optional module_owner to attribute >> >> This is needed as otherwise the owner of the attribute >> or group read/store might have a shared lock used on driver removal, >> and deadlock if we race with driver removal. >> >> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > > No, please no. Module removal is a "best effort", if the system dies > when it happens, that's on you. I am not willing to expend extra energy > and maintance of core things like sysfs for stuff like this that does > not matter in any system other than a developer's box. > > Lock data, not code please. Trying to tie data structure's lifespans > to the lifespan of code is a tangled mess, and one that I do not want to > add to in any form. > > sorry, Sorry, but you are fundamentaly off track here. This has absolutely nothing to do with module removal. The point is that module removal is the reverse operation of module insertion. So far so good. But module insertion can fail. So if you have nested functionalities which hang off or are enabled by moduled insertion then any fail in that sequence has to be able to roll back and clean up properly no matter what. 
Which in turn makes module removal the reverse operation of module insertion. If you think otherwise, then please provide a proper plan for how nested operations like sysfs - not to talk about more complex things like multi-instance discovery, which can happen inside a module insertion sequence - can be properly rolled back. Just declaring that rmmod is evil does not cut it. rmmod is the least of the problems. If that fails, then a lot of rollback and failure handling mechanisms are missing in the setup path already. Anything which cannot cleanly roll back, no matter whether the failure or rollback request happens at insertion time or later, is broken by design. So either you declare module removal as dysfunctional or you stop making up semantically ill-defined and therefore useless claims about it. Your argument in: https://lore.kernel.org/linux-block/YGbNpLKXfWpy0ZZa@kroah.com/ "Lock data, not code please. Trying to tie data structure's lifespans to the lifespan of code is a tangled mess, and one that I do not want to add to in any form" is just useless blurb because the fundamental purpose of discovery code is to create the data structures which are tied to the code which is associated with it. Please stop this 'module removal is not supported' nonsense unless you can prove complete independence of module init/discovery code from the subsequently discovered entities that depend on it. Thanks, tglx ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-08 1:37 ` Thomas Gleixner @ 2021-04-08 6:16 ` Greg KH 2021-04-08 8:01 ` Jiri Kosina 0 siblings, 1 reply; 44+ messages in thread From: Greg KH @ 2021-04-08 6:16 UTC (permalink / raw) To: Thomas Gleixner Cc: Luis Chamberlain, Minchan Kim, keescook, dhowells, hch, mbenes, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel On Thu, Apr 08, 2021 at 03:37:53AM +0200, Thomas Gleixner wrote: > Greg, > > On Fri, Apr 02 2021 at 09:54, Greg KH wrote: > > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote: > >> As for the syfs deadlock possible with drivers, this fixes it in a generic way: > >> > >> commit fac43d8025727a74f80a183cc5eb74ed902a5d14 > >> Author: Luis Chamberlain <mcgrof@kernel.org> > >> Date: Sat Mar 27 14:58:15 2021 +0000 > >> > >> sysfs: add optional module_owner to attribute > >> > >> This is needed as otherwise the owner of the attribute > >> or group read/store might have a shared lock used on driver removal, > >> and deadlock if we race with driver removal. > >> > >> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > > > > No, please no. Module removal is a "best effort", if the system dies > > when it happens, that's on you. I am not willing to expend extra energy > > and maintance of core things like sysfs for stuff like this that does > > not matter in any system other than a developer's box. > > > > Lock data, not code please. Trying to tie data structure's lifespans > > to the lifespan of code is a tangled mess, and one that I do not want to > > add to in any form. > > > > sorry, > > Sorry, but you are fundamentaly off track here. This has absolutely > nothing to do with module removal. > > The point is that module removal is the reverse operation of module > insertion. So far so good. > > But module insertion can fail. 
So if you have nested functionalities > which hang off or are enabled by moduled insertion then any fail in that > sequence has to be able to roll back and clean up properly no matter > what. > > Which it turn makes modules removal a reverse operation of module > insertion. > > If you think otherwise, then please provide a proper plan how nested > operations like sysfs - not to talk about more complex things like multi > instance discovery which can happen inside a module insertion sequence > can be properly rolled back. > > Just declaring that rmmod is evil does not cut it. rmmod is the least of > the problems. If that fails, then a lot of rollback, failure handling > mechanisms are missing in the setup path already. > > Anything which cannot cleanly rollback no matter whether the fail or > rollback request happens at insertion time or later is broken by design. > > So either you declare module removal as disfunctional or you stop making > up semantically ill defined and therefore useless claims about it. > > Your argument in: > > https://lore.kernel.org/linux-block/YGbNpLKXfWpy0ZZa@kroah.com/ > > "Lock data, not code please. Trying to tie data structure's lifespans > to the lifespan of code is a tangled mess, and one that I do not want to > add to in any form" > > is just useless blurb because the fundamental purpose of discovery code > is to create the data structures which are tied to the code which is > associated to it. > > Please stop this 'module removal' is not supported nonsense unless you > can prove a complete indepenence of module init/discovery code to > subsequent discovered entities depending on it. Ok, but to be fair, trying to add the crazy hacks that were being proposed to sysfs for something that is obviously not a code path that can be taken by a normal user or operation is just not going to fly. Removing a module from a system has always been "let's try it and see!" type of operation for a very long time. 
While we try our best, doing horrible hacks for this rare type of thing is generally not considered a good idea, which is why I said that. thanks, greg k-h ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-08 6:16 ` Greg KH @ 2021-04-08 8:01 ` Jiri Kosina 2021-04-08 8:09 ` Greg KH 0 siblings, 1 reply; 44+ messages in thread From: Jiri Kosina @ 2021-04-08 8:01 UTC (permalink / raw) To: Greg KH Cc: Thomas Gleixner, Luis Chamberlain, Minchan Kim, keescook, dhowells, hch, mbenes, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel, Jiri Kosina On Thu, 8 Apr 2021, Greg KH wrote: > Removing a module from a system has always been "let's try it and see!" > type of operation for a very long time. Which part of it? If there is driver/subsystem code that can't handle the reverse operation to modprobe, it clearly can't handle error handling during modprobe (which, one would hope, is supported), and should be fixed. If there is a particular issue in the kernel dynamic linker that causes a crash on module removal, we'd be better off fixing it. Is there one that makes you claim module removal unsupported? Thanks, -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-08 8:01 ` Jiri Kosina @ 2021-04-08 8:09 ` Greg KH 2021-04-08 8:35 ` Jiri Kosina 0 siblings, 1 reply; 44+ messages in thread From: Greg KH @ 2021-04-08 8:09 UTC (permalink / raw) To: Jiri Kosina Cc: Thomas Gleixner, Luis Chamberlain, Minchan Kim, keescook, dhowells, hch, mbenes, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel, Jiri Kosina On Thu, Apr 08, 2021 at 10:01:23AM +0200, Jiri Kosina wrote: > On Thu, 8 Apr 2021, Greg KH wrote: > > > Removing a module from a system has always been "let's try it and see!" > > type of operation for a very long time. > > Which part of it? > > If there is a driver/subsystem code that can't handle the reverse > operation to modprobe, it clearly can't handle error handling during > modprobe (which, one would hope, is supported), and should be fixed. Huh? No, that's not the issue here, it's the issue of different userspace code paths into the module at the same time that it is trying to be unloaded. That has nothing to do with loading the module the first time as userspace is not touching those apis yet. > If there is a particular issue in kernel dynamic linker that causes crash > on module removal, we'd better off fixing it. Is there one such that makes > you claim module removal unsupported? The linker has nothing to do with this, it's userspace tasks touching code paths. thanks, greg k-h ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-08 8:09 ` Greg KH @ 2021-04-08 8:35 ` Jiri Kosina 2021-04-08 8:55 ` Greg KH 0 siblings, 1 reply; 44+ messages in thread From: Jiri Kosina @ 2021-04-08 8:35 UTC (permalink / raw) To: Greg KH Cc: Thomas Gleixner, Luis Chamberlain, Minchan Kim, keescook, dhowells, hch, mbenes, ngupta, sergey.senozhatsky.work, Jens Axboe, linux-block, linux-kernel On Thu, 8 Apr 2021, Greg KH wrote: > > If there is a driver/subsystem code that can't handle the reverse > > operation to modprobe, it clearly can't handle error handling during > > modprobe (which, one would hope, is supported), and should be fixed. > > Huh? No, that's not the issue here, it's the issue of different > userspace code paths into the module at the same time that it is trying > to be unloaded. That has nothing to do with loading the module the > first time as userspace is not touching those apis yet. So do you claim that once the first (out of possibly many) userspace-visible sysfs entry has been created during module insertion and made available to userspace, there is never going to be rollback happening that'd be removing that first sysfs entry again? Thanks, -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-08 8:35 ` Jiri Kosina @ 2021-04-08 8:55 ` Greg KH 2021-04-08 18:40 ` Luis Chamberlain 2021-04-09 3:01 ` Kees Cook 0 siblings, 2 replies; 44+ messages in thread From: Greg KH @ 2021-04-08 8:55 UTC (permalink / raw) To: Jiri Kosina Cc: Thomas Gleixner, Luis Chamberlain, Minchan Kim, keescook, dhowells, hch, mbenes, ngupta, sergey.senozhatsky.work, Jens Axboe, linux-block, linux-kernel On Thu, Apr 08, 2021 at 10:35:17AM +0200, Jiri Kosina wrote: > On Thu, 8 Apr 2021, Greg KH wrote: > > > > If there is a driver/subsystem code that can't handle the reverse > > > operation to modprobe, it clearly can't handle error handling during > > > modprobe (which, one would hope, is supported), and should be fixed. > > > > Huh? No, that's not the issue here, it's the issue of different > > userspace code paths into the module at the same time that it is trying > > to be unloaded. That has nothing to do with loading the module the > > first time as userspace is not touching those apis yet. > > So do you claim that once the first (out of possibly many) > userspace-visible sysfs entry has been created during module insertion and > made available to userspace, there is never going to be rollback happening > that'd be removing that first sysfs entry again? {sigh} I'm not trying to argue that, no. What I am arguing is that the complexity of the original patch was not worth the low probability of this actually being an issue in real-life operations. That's all; messing around with sysfs entries and module reference counts is tricky and complex and a total mess. We have a separation between normal sysfs files and devices being removed that should handle the normal operations, but there are still some crazy corner cases, of which this seems to be one. Module removal is not a "normal" operation that can be triggered by a system automatically without a user asking for it.
As someone reminded me on IRC, we used to do this "automatically" for many problematic drivers years ago for suspend/resume; that should all now be long fixed up. So to add crazy complexity to the kernel, for an operation that can only be triggered manually by a root user, is not worth it in my opinion, as the maintainer of the code that the complexity was asked to be added to. My throw-away comment of "module unloading is not supported" was an attempt to summarize all of the above into one single sentence, which seems to have struck a nerve with a lot of people, and I apologize for that :( thanks, greg k-h ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-08 8:55 ` Greg KH @ 2021-04-08 18:40 ` Luis Chamberlain 2021-04-09 3:01 ` Kees Cook 1 sibling, 0 replies; 44+ messages in thread From: Luis Chamberlain @ 2021-04-08 18:40 UTC (permalink / raw) To: Greg KH Cc: Jiri Kosina, Thomas Gleixner, Minchan Kim, keescook, dhowells, hch, mbenes, ngupta, sergey.senozhatsky.work, Jens Axboe, linux-block, linux-kernel On Thu, Apr 08, 2021 at 10:55:53AM +0200, Greg KH wrote: > So to add crazy complexity to the kernel, I agree that this can be tricky. However, driver developers are going to open code this either way. The problem with that as well, and one of my own reasons for striving for at least *trying* for a generic solution, was that I am aware that driver developers may be trying a busy solution rather than the try method. The busy approach means you could also end up in a situation where userspace can prevent full undoing / removal of a driver. The try method is best effort in the sense that if the driver is going away, access simply won't be allowed. So a sensible consideration I think is that we at least choose one of these:

a) We allow driver developers to open code it on drivers which need it, on each and every single sysfs knob of the driver where it's needed, and accept folks might do it wrong

b) Provide a macro which is opt-in, ie, not doing it for all attributes, but one which the driver author *is* aware of and which does the try / put around the driver's attribute method.

c) Document this race and other races / failings so that driver authors who do care about module removal are aware and know what to do.

In this thread two races were exposed on sysfs:

* deadlock when a sysfs attribute uses a lock which is also used on module __exit
* possible race against the device's private data, and this is type specific.

I think many people probably missed the last hunks of my proposed patch which added dev_type_get(), which were not part of the deadlock fix.
At least I showed how attributes for all block devices have an exposed race, which can crash if removal of a block device with del_gendisk() happens while a sysfs attribute is about to be used. I don't think either race is trivial for a driver developer to assume a solution for. Most focus on this thread was about the first race; the second however is also very real, and likely can cause more issues on rollbacks in error paths unrelated to rmmod... > for an operation that can only > be triggered manually by a root user, is not worth it in my opinion, as > the maintainer of the code that the complexity was asked to be added to. Three things:

1) Many driver maintainers *do* care that rmmod works well. To the point that if it does not, they feel ashamed. Reason for this is that in some subsystems this *is* a typical test case. So although rmmod may be a voodoo thing for core, many driver developers do want this to work well. As someone who also works on many test cases to expose odd issues in the kernel unrelated to module removal, I can also say that module removal does also expose other possible races which would otherwise be difficult to trigger. So it is also a helpful aid for a developer working on different areas of the kernel.

2) Folks have pointed out that this is not just about rmmod; rolling back removal of sysfs attributes due to error code paths is another real scenario to consider. I don't think we have rules to tell userspace to not muck with sysfs files after they are exposed. In fact those uevents we send to userspace allow userspace to know exactly when to muck with them.

3) Sadly, some sysfs attributes also purposely do things which *also* mimic what is typically done on module removal, such as removal of an interface, or block device. That was the case with zram, it allows removal / reset of a device... Yes these are odd things to do, but we allow for it. And sysfs users *do* need to be aware of these corner cases if they want to support them.
There **may** be some severity to some of the issues mentioned above, allowing really bad things to be done from userspace even without module removal... but I haven't had time yet to build a coccinelle pattern to see how commonplace some of these issues are. I was focusing at first more on a generic solution if possible, as I thought that would be better evaluated first, and it would let others slowly become aware of the issue. Luis ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-08 8:55 ` Greg KH 2021-04-08 18:40 ` Luis Chamberlain @ 2021-04-09 3:01 ` Kees Cook 1 sibling, 0 replies; 44+ messages in thread From: Kees Cook @ 2021-04-09 3:01 UTC (permalink / raw) To: Greg KH Cc: Jiri Kosina, Thomas Gleixner, Luis Chamberlain, Minchan Kim, dhowells, hch, mbenes, ngupta, sergey.senozhatsky.work, Jens Axboe, linux-block, linux-kernel On Thu, Apr 08, 2021 at 10:55:53AM +0200, Greg KH wrote: > Module removal is not a "normal" operation that can be triggered by a > system automatically without a user asking for it. As someone reminded > me on IRC, we used to do this "automatically" for many problematic > drivers years ago for suspend/resume, that should all now be long fixed > up. I know what you're trying to say here, but I think you're just completely wrong. :P We absolutely must handle module unloading, and it is expected to work correctly. _The reason it doesn't work correctly sometimes is because it is a less common operation_. But that doesn't make it any less important. Disk failures are rare too, but RAID handles it. If there are bugs here it is due to _our lack of testing_. Module unload needs to work for developers loading/unloading while they work on a module[1], it needs to work for sysadmins who want to unload a now-unused network protocol[2], it needs to work for people trying to reset device hardware state[3], it needs to work for unloading unused modules after root device, partition, and filesystem discovery in an initramfs[4], and every other case. The kernel module subsystem is not "best effort" and removal is not "not normal". If unloading a module destabilizes the kernel, then there is a bug that needs fixing. I'm not advocating for any path to fixing the technical issues in this thread, but we absolutely cannot suddenly claim that it's the user's fault that their system dies if they use "rmmod". That's outrageous.
O_o -Kees [1] Yes, I'm linking to the book you wrote ;) https://www.oreilly.com/library/view/linux-device-drivers/0596005903/ch02.html [2] https://www.cs.unh.edu/cnrg/people/gherrin/linux-net.html#tth_sEc11.2.2 [3] https://opensource.com/article/18/5/how-load-or-unload-linux-kernel-module https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/sec-unloading_a_module [4] https://git.busybox.net/busybox/tree/modutils/rmmod.c -- Kees Cook ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-01 23:59 ` Luis Chamberlain 2021-04-02 7:54 ` Greg KH @ 2021-04-05 17:07 ` Minchan Kim 2021-04-05 19:00 ` Luis Chamberlain 1 sibling, 1 reply; 44+ messages in thread From: Minchan Kim @ 2021-04-05 17:07 UTC (permalink / raw) To: Luis Chamberlain Cc: keescook, dhowells, hch, mbenes, gregkh, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote: > On Mon, Mar 22, 2021 at 03:12:01PM -0700, Minchan Kim wrote: > > On Mon, Mar 22, 2021 at 08:41:56PM +0000, Luis Chamberlain wrote: > > > > > > I would not call it *every* sysfs knob as not all deal with things which > > > are related to CPU hotplug multistate, right? Note that using just > > > try_module_get() alone (that is, the second patch only) does not fix the > > > race I am describing above. > > > > It wouldn't be a CPU hotplug multistate issue but I'd like to call it > > more a "zram instance race" bug. What happens in this case?
> >
> > CPU 1                            CPU 2
> >
> > destroy_devices
> > ..                               compact_store()
> >                                  zram = dev_to_zram(dev);
> > idr_for_each(zram_remove_cb
> > zram_remove
> > ..
> > kfree(zram)
> >                                  down_read(&zram->init_lock);
> >
> >
> > CPU 1                            CPU 2
> >
> > hot_remove_store
> >                                  compact_store()
> >                                  zram = dev_to_zram(dev);
> > zram_remove
> > kfree(zram)
> >                                  down_read(&zram->init_lock);
> >
> > So, for me we need to close the zram instance create/removal > > with sysfs rather than focusing on the CPU hotplug issue. > > Sure, that's a good point. > > The issue which I noted for the race which ends up in a deadlock is only > possible if a shared lock is used on removal but also on sysfs knobs. > > At first glance, the issue you describe above *seems* to be just proper > care driver developers must take with structures used. It is certainly > yet another issue we need to address, and if we can generalize a > solution even better.
I now recall I *think* I spotted that race a while > ago and mentioned it to Kees and David Howells but I didn't have a > solution for it yet. More on this below. > > The issue you point out is real, however you cannot disregard the > possible CPU hotplug race as well, it is a design consideration which > the CPU hotplug multistate support warns for -- consider driver removal. > I agree that perhaps solving this "zram instance race" can fix the > hotplug race as well. If we can solve all 3 issues in one shot even > better. But let's evaluate that prospect... > > > Maybe, we could reuse zram_index_mutex, modifying it into an > > rw_semaphore. What do you think? > > Although ideal given it would knock 3 birds with 1 stone, it ends up > actually making the sysfs attributes rather useless in light of the > requirements for each of the races. Namely, the sysfs deadlock race > *must* use a try lock approach, just as the try_module_get() case. > It must use this approach so as to immediately just bail out if we have > our module being removed, and so on our __exit path. By trying to > repurpose zram_index_mutex we end up then just doing too much with it > and making the sysfs attributes rather fragile for most uses: > > Consider disksize_show(), that would have to become:
>
> static ssize_t disksize_show(struct device *dev, struct device_attribute *attr, char *buf)
> {
> 	struct zram *zram = dev_to_zram(dev);
> +	size_t disksize;
>
> +	down_read(&zram_index_rwlock);
> +	disksize = zram->disksize;
> +	up_read(&zram_index_rwlock);
> -	return scnprintf(buf, PAGE_SIZE, "%llu\n", zram->disksize);
> +	return scnprintf(buf, PAGE_SIZE, "%llu\n", disksize);
> }
>
> What's wrong with this? > > It can block during a write, yes, but there is a type of write which > will make this crash after the read lock is acquired: when the instance > is removed. What if we try down_read_trylock()?
> static ssize_t disksize_show(struct device *dev, struct device_attribute *attr, char *buf)
> {
> 	struct zram *zram = dev_to_zram(dev);
> +	size_t disksize;
>
> +	if (!down_read_trylock(&zram_index_rwlock))
> +		return -ENODEV;
> +
> +	disksize = zram->disksize;
> +	up_read(&zram_index_rwlock);
> -	return scnprintf(buf, PAGE_SIZE, "%llu\n", zram->disksize);
> +	return scnprintf(buf, PAGE_SIZE, "%llu\n", disksize);
> }
>
> What's wrong with this? > > If it got the lock, it should be OK as it is preventing writes from > taking the lock for a bit. But then this just becomes pretty fragile, > it will fail whenever another read or write is happening, triggering > perhaps quite a few regressions on tests. > > And if we use write_trylock() we end up with the same fragile nature > of failing the read with ENODEV for any silly thing going on with the > driver. > > And come to think of it the last patch I had sent with a new > DECLARE_RWSEM(zram_unload) also has this same issue making most > sysfs attributes rather fragile. Thanks for looking into this. I agree the single zram_index_rwlock is not the right approach to fix it. However, I still hope we find a more generic solution to fix them at once since I see it as a zram instance racing problem. An approach I am considering is to make struct zram include a kobject and then make the zram sysfs auto populated under the kobject. So, the zram/sysfs lifetime should be under the kobject. With it, the sysfs race problem I mentioned above should be gone. Furthermore, zram_remove should fail if one of the zram objects is still alive (i.e., zram->kobject->refcount > 1) so module_exit will fail, too. I see one of the problems is how I could make the new zram object's attribute group for zram knobs under /sys/block/zram0 since the block layer already made the zram0 kobject via device_add_disk. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-05 17:07 ` Minchan Kim @ 2021-04-05 19:00 ` Luis Chamberlain 2021-04-05 19:58 ` Minchan Kim 0 siblings, 1 reply; 44+ messages in thread From: Luis Chamberlain @ 2021-04-05 19:00 UTC (permalink / raw) To: Minchan Kim Cc: keescook, dhowells, hch, mbenes, gregkh, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel On Mon, Apr 05, 2021 at 10:07:24AM -0700, Minchan Kim wrote: > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote: > > And come to think of it the last patch I had sent with a new > > DECLARE_RWSEM(zram_unload) also has this same issue making most > > sysfs attributes rather fragile. > > Thanks for looking into this. I agree the single zram_index_rwlock is > not the right approach to fix it. However, I still hope we find a more > generic solution to fix them at once since I see it as a zram instance > racing problem. They are 3 separate different problems. Related, but different. At this point I think it would be difficult to resolve all 3 with one solution without creating side issues, but hey, I'm curious if you find a solution that does not create other issues. > An approach I am considering is to make struct zram include a kobject > and then make the zram sysfs auto populated under the kobject. So, the zram/sysfs > lifetime should be under the kobject. With it, the sysfs race problem I > mentioned above should be gone. Furthermore, zram_remove should fail > if one of the zram objects is still alive > (i.e., zram->kobject->refcount > 1) so module_exit will fail, too. If the idea then is to busy out rmmod if a sysfs attribute is being read, that could then mean rmmod can sometimes never complete. Hogging up / busying out sysfs attributes means the module cannot be removed. Which is why the *try_module_get()* approach I think is much more suitable, as it will always fail if we're already going down.
> I see one of the problems is how I could make the new zram object's > attribute group for zram knobs under /sys/block/zram0 since the block > layer already made the zram0 kobject via device_add_disk. Right.. well the sysfs attribute races uncovered here actually do apply to any block driver as well. And which is why I was aiming for something generic if possible. I am not sure if you missed the last hunks of the generic solution, but that would resolve the issue you noted. Here is the same approach but in a non-generic solution, specific to just one attribute so far and to zram:

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 494695ff227e..b566916e4ad9 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1724,6 +1724,11 @@ static ssize_t disksize_store(struct device *dev,
 
 	mutex_lock(&zram_index_mutex);
 
+	if (!bdgrab(dev_to_bdev(dev))) {
+		err = -ENODEV;
+		goto out_nodev;
+	}
+
 	if (!zram_up || zram->claim) {
 		err = -ENODEV;
 		goto out;
@@ -1760,6 +1765,7 @@ static ssize_t disksize_store(struct device *dev,
 	zram->disksize = disksize;
 	set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
 	up_write(&zram->init_lock);
+	bdput(dev_to_bdev(dev));
 
 	mutex_unlock(&zram_index_mutex);
 
@@ -1770,6 +1776,8 @@ static ssize_t disksize_store(struct device *dev,
 out_unlock:
 	up_write(&zram->init_lock);
 out:
+	bdput(dev_to_bdev(dev));
+out_nodev:
 	mutex_unlock(&zram_index_mutex);
 	return err;
 }

^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-05 19:00 ` Luis Chamberlain @ 2021-04-05 19:58 ` Minchan Kim 2021-04-06 0:29 ` Luis Chamberlain 0 siblings, 1 reply; 44+ messages in thread From: Minchan Kim @ 2021-04-05 19:58 UTC (permalink / raw) To: Luis Chamberlain Cc: keescook, dhowells, hch, mbenes, gregkh, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel On Mon, Apr 05, 2021 at 07:00:23PM +0000, Luis Chamberlain wrote: > On Mon, Apr 05, 2021 at 10:07:24AM -0700, Minchan Kim wrote: > > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote: > > > And come to think of it the last patch I had sent with a new > > > DECLARE_RWSEM(zram_unload) also has this same issue making most > > > sysfs attributes rather fragile. > > > > Thanks for looking into this. I agree the single zram_index_rwlock is > > not the right approach to fix it. However, I still hope we find a more > > generic solution to fix them at once since I see it as a zram instance > > racing problem. > > They are 3 separate different problems. Related, but different. What are the 3 different problems? I am asking since I remember only two: one for CPU multistate and the other one for sysfs during rmmod. Knowing the missing one would help me understand why you are saying they are all different problems (even though they're a bit related). > At this point I think it would be difficult to resolve all 3 > with one solution without creating side issues, but hey, > I'm curious if you find a solution that does not create other > issues. Yeah, let me try something with the kobject stuff to see how far I can go, but first let me understand what I was missing from the problems you mentioned above. > > An approach I am considering is to make struct zram include a kobject > > and then make the zram sysfs auto populated under the kobject. So, the zram/sysfs > > lifetime should be under the kobject. With it, the sysfs race problem I > > mentioned above should be gone.
Furthermore, zram_remove should fail > > if one of the zram objects is still alive > > (i.e., zram->kobject->refcount > 1) so module_exit will fail, too. > > If the idea then is to busy out rmmod if a sysfs attribute is being > read, that could then mean rmmod can sometimes never complete. Hogging > up / busying out sysfs attributes means the module cannot be removed. It's true, but is it a big problem? There are many cases where the system just returns an error if it's busy and relies on the admin. IMHO, rmmod should be one of them. > Which is why the *try_module_get()* approach I think is much more suitable, as > it will always fail if we're already going down. How does try_module_get solve the problem? I thought it was a general problem for every sysfs knob in zram. Do you mean you will add module_get/put for every sysfs knob in zram? > > I see one of the problems is how I could make the new zram object's > > attribute group for zram knobs under /sys/block/zram0 since the block > > layer already made the zram0 kobject via device_add_disk. > > Right.. well the sysfs attribute races uncovered here actually do > apply to any block driver as well. And which is why I was aiming > for something generic if possible. It would be great but that's not what we have atm, so I want to proceed with a fix anyway. > I am not sure if you missed the last hunks of the generic solution, > but that would resolve the issue you noted. Here is the same approach > but in a non-generic solution, specific to just one attribute so far > and to zram: So the idea is to refcount the block_device's inode, and the module_exit path also checks the inode refcount to make rmmod fail? Hmm, might work, but I am not sure it's the right layer, either.
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 494695ff227e..b566916e4ad9 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1724,6 +1724,11 @@ static ssize_t disksize_store(struct device *dev,
>
> 	mutex_lock(&zram_index_mutex);
>
> +	if (!bdgrab(dev_to_bdev(dev))) {
> +		err = -ENODEV;
> +		goto out_nodev;
> +	}
> +
> 	if (!zram_up || zram->claim) {
> 		err = -ENODEV;
> 		goto out;
> @@ -1760,6 +1765,7 @@ static ssize_t disksize_store(struct device *dev,
> 	zram->disksize = disksize;
> 	set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
> 	up_write(&zram->init_lock);
> +	bdput(dev_to_bdev(dev));
>
> 	mutex_unlock(&zram_index_mutex);
>
> @@ -1770,6 +1776,8 @@ static ssize_t disksize_store(struct device *dev,
> out_unlock:
> 	up_write(&zram->init_lock);
> out:
> +	bdput(dev_to_bdev(dev));
> +out_nodev:
> 	mutex_unlock(&zram_index_mutex);
> 	return err;
> }
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-05 19:58 ` Minchan Kim @ 2021-04-06 0:29 ` Luis Chamberlain 2021-04-07 1:23 ` Minchan Kim 0 siblings, 1 reply; 44+ messages in thread From: Luis Chamberlain @ 2021-04-06 0:29 UTC (permalink / raw) To: Minchan Kim Cc: keescook, dhowells, hch, mbenes, gregkh, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel On Mon, Apr 05, 2021 at 12:58:05PM -0700, Minchan Kim wrote: > On Mon, Apr 05, 2021 at 07:00:23PM +0000, Luis Chamberlain wrote: > > On Mon, Apr 05, 2021 at 10:07:24AM -0700, Minchan Kim wrote: > > > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote: > > > > And come to think of it the last patch I had sent with a new > > > > DECLARE_RWSEM(zram_unload) also has this same issue making most > > > > sysfs attributes rather fragile. > > > > > > Thanks for looking into this. I agree the single zram_index_rwlock is > > > not the right approach to fix it. However, I still hope we find a more > > > generic solution to fix them at once since I see it as a zram instance > > > racing problem. > > > > They are 3 separate different problems. Related, but different. > > What are the 3 different problems? I am asking since I remember only two: > one for CPU multistate and the other one for sysfs during rmmod. The third one is the race to use sysfs attributes and have those routines then dereference the gendisk private_data. > > If the idea then is to busy out rmmod if a sysfs attribute is being > > read, that could then mean rmmod can sometimes never complete. Hogging > > up / busying out sysfs attributes means the module cannot be removed. > > It's true, but is it a big problem? It depends on existing userspace scripts which are used to test and expectations set. Consider existing tests, you would know better, and since you are the maintainer you decide.
I at least know for many other types of device drivers an rmmod is a sledgehammer. You decide. I just thought it would be good to highlight the effect now rather than us considering it later. > > Which is why the *try_module_get()* approach I think is much more suitable, as > > it will always fail if we're already going down. > > How does try_module_get solve the problem? The try stuff only resolves the deadlock. The bdget() / bdput() resolves the race of access to the gendisk private_data. > > > I see one of the problems is how I could make the new zram object's > > > attribute group for zram knobs under /sys/block/zram0 since the block > > > layer already made the zram0 kobject via device_add_disk. > > > > Right.. well the sysfs attribute races uncovered here actually do > > apply to any block driver as well. And which is why I was aiming > > for something generic if possible. > > It would be great but that's not what we have atm, so I want to > proceed with a fix anyway. What is not what we have atm? I *do* have a proposed generic solution for 2/3 issues we have been discussing:

a) deadlock on sysfs access
b) gendisk private_data race

But so far Greg does not see enough justification for a), so we can either show how widespread this issue is (which I can do using coccinelle), or we just open code the try_module_get() / put on each driver that needs it for now. Either way it would resolve the issue. As for b), given that I think even you had missed my attempt to generalize the bdget/bdput solution for any attribute type (did you see my dev_type_get() and dev_type_put() proposed changes?), I don't think this problem is yet well defined in a generic way for us to rule it out. It is however easier to simply open code this per driver that needs it for now, given that I don't think Greg is yet convinced the deadlock is a widespread issue. I however am pretty sure both races *do* exist outside of zram in many places.
> > I am not sure if you missed the last hunks of the generic solution, > > but that would resolve the issue you noted. Here is the same approach > > but in a non-generic solution, specific to just one attribute so far > > and to zram: > > So the idea is to refcount the block_device's inode Yes, that itself prevents races against the gendisk private_data after it becomes invalid. Why? Because a bdget() would not be successful after del_gendisk(): del_gendisk() --> invalidate_partition() --> __invalidate_device() --> invalidate_inodes() > and the module_exit path > also checks the inode refcount to make rmmod fail? The try_module_get() approach resolves the deadlock race, but it does so in a lazy way. I mean lazy in that then rmmod wins over sysfs knobs. So touching sysfs knobs won't make an rmmod fail. I think that's the more typically expected behaviour. Why? Because I find it odd that looping forever touching sysfs attributes should prevent a module removal. But that's a personal preference. Luis ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-06 0:29 ` Luis Chamberlain @ 2021-04-07 1:23 ` Minchan Kim 2021-04-07 1:38 ` Minchan Kim 2021-04-07 14:50 ` Luis Chamberlain 0 siblings, 2 replies; 44+ messages in thread From: Minchan Kim @ 2021-04-07 1:23 UTC (permalink / raw) To: Luis Chamberlain Cc: keescook, dhowells, hch, mbenes, gregkh, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel On Tue, Apr 06, 2021 at 12:29:09AM +0000, Luis Chamberlain wrote: > On Mon, Apr 05, 2021 at 12:58:05PM -0700, Minchan Kim wrote: > > On Mon, Apr 05, 2021 at 07:00:23PM +0000, Luis Chamberlain wrote: > > > On Mon, Apr 05, 2021 at 10:07:24AM -0700, Minchan Kim wrote: > > > > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote: > > > > > And come to think of it the last patch I had sent with a new > > > > > DECLARE_RWSEM(zram_unload) also has this same issue making most > > > > > sysfs attributes rather fragile. > > > > > > > > Thanks for looking into this. I agree the single zram_index_rwlock is > > > > not the right approach to fix it. However, I still hope we find a more > > > > generic solution to fix them at once since I see it as a zram instance > > > > racing problem. > > > > > > They are 3 separate different problems. Related, but different. > > > > What are the 3 different problems? I am asking since I remember only two: > > one for CPU multistate and the other one for sysfs during rmmod. > > The third one is the race to use sysfs attributes and have those routines > then dereference the gendisk private_data. First of all, thanks for keeping the discussion going, Luis. That was the one I had thought of as the race between sysfs and rmmod. > > > > If the idea then is to busy out rmmod if a sysfs attribute is being > > > read, that could then mean rmmod can sometimes never complete. Hogging > > > up / busying out sysfs attributes means the module cannot be removed. > > > > It's true, but is it a big problem?
There are many cases where the system > > just returns an error if it's busy and relies on the admin. IMHO, rmmod should > > be one of them. > > It depends on existing userspace scripts which are used to test and > expectations set. Consider existing tests, you would know better, and > since you are the maintainer you decide. > > I at least know for many other types of device drivers an rmmod is > a sledgehammer. > > You decide. I just thought it would be good to highlight the effect now > rather than us considering it later. To me, an rmmod failure is not a big problem for zram, since returning -EBUSY is a common case in the system (Having said that, I agree it's best if we could avoid the fail-and-retry. IOW, -EBUSY should be a last resort unless we have a nicer way.) > > > Which is why the *try_module_get()* approach I think is much more suitable, as > > > it will always fail if we're already going down. > > > > How does try_module_get solve the problem? > > The try stuff only resolves the deadlock. The bdget() / bdput() resolves > the race of access to the gendisk private_data. That's the one I missed in this discussion. Now I am reading your [2/2] in the original patch. I thought it was just that the zram instance was destroyed by the sysfs race problem, and that's why you had seen the deadlock. I might be missing the point here, too. Hmm, we are discussing several problems all at once. I feel it's time to jump to v2 with your approach at this point. You said three different problems. As I asked, please write them down in more detail with the code sequence, as we discussed in the other thread. If you mean a deadlock, please write down which specific locks were involved in the deadlock. It would make the discussion much easier. Let's discuss the issues one by one, each in its own thread. > > > > > I see one of the problems is how I could make the new zram object's > > > > attribute group for zram knobs under /sys/block/zram0 since the block > > > > layer already made the zram0 kobject via device_add_disk. > > > > > > Right..
well the sysfs attribute races uncovered here actually do > > > apply to any block driver as well. And which is why I was aiming > > > for something generic if possible. > > > > It would be great but that's not what we have atm, so I want to > > proceed with a fix anyway. > > What is not what we have atm? I *do* have a proposed generic solution > for 2/3 issues we have been discussing: > > a) deadlock on sysfs access This is the one I didn't understand. > b) gendisk private_data race Yub. > > But so far Greg does not see enough justification for a), so we can either > show how widespread this issue is (which I can do using coccinelle), or we > just open code the try_module_get() / put on each driver that needs it > for now. Either way it would resolve the issue. I second that: if it's a general problem for drivers, I agree it's worth addressing in the core, unless the driver itself introduces the mess. I have no idea here since I didn't understand the problem yet. > As for b), given that I think even you had missed my attempt to > generalize the bdget/bdput solution for any attribute type (did you see > my dev_type_get() and dev_type_put() proposed changes?), I don't think > this problem is yet well defined in a generic way for us to rule it out. > It is however easier to simply open code this per driver that needs it > for now, given that I don't think Greg is yet convinced the deadlock is > a widespread issue. I however am pretty sure both races *do* > exist outside of zram in many places. It would be a good sign to propose a general solution. > > > I am not sure if you missed the last hunks of the generic solution, > > > but that would resolve the issue you noted. Here is the same approach > > > but in a non-generic solution, specific to just one attribute so far > > > and to zram: > > So the idea is to refcount the block_device's inode > > Yes, that itself prevents races against the gendisk private_data after > it becomes invalid.
Why? Because a bdget() would not be successful after > del_gendisk(): > > del_gendisk() --> invalidate_partition() --> __invalidate_device() --> invalidate_inodes() > > > and the module_exit path > > also checks the inode refcount to make rmmod fail? > > The try_module_get() approach resolves the deadlock race, but it does > so in a lazy way. I mean lazy in that rmmod then wins over sysfs knobs. > So touching sysfs knobs won't make an rmmod fail. I think that's the more > typically expected behaviour. Why? Because I find it odd that looping > forever touching sysfs attributes should prevent a module removal. But > that's a personal preference. I agree with you that that would be better, but let's see how clean it can be made. FYI, please look at hot_remove_store, which can also remove a zram instance on demand. I am looking forward to seeing your v2. Thanks for your patience, Luis. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-07 1:23 ` Minchan Kim @ 2021-04-07 1:38 ` Minchan Kim 2021-04-07 14:52 ` Luis Chamberlain 2021-04-07 14:50 ` Luis Chamberlain 1 sibling, 1 reply; 44+ messages in thread From: Minchan Kim @ 2021-04-07 1:38 UTC (permalink / raw) To: Luis Chamberlain Cc: keescook, dhowells, hch, mbenes, gregkh, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel On Tue, Apr 06, 2021 at 06:23:46PM -0700, Minchan Kim wrote: > On Tue, Apr 06, 2021 at 12:29:09AM +0000, Luis Chamberlain wrote: > > On Mon, Apr 05, 2021 at 12:58:05PM -0700, Minchan Kim wrote: > > > On Mon, Apr 05, 2021 at 07:00:23PM +0000, Luis Chamberlain wrote: > > > > On Mon, Apr 05, 2021 at 10:07:24AM -0700, Minchan Kim wrote: > > > > > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote: > > > > > > And come to think of it the last patch I had sent with a new > > > > > > DECLARE_RWSEM(zram_unload) also has this same issue, making most > > > > > > sysfs attributes rather fragile. > > > > > > > > > > Thanks for looking into this. I agree the single zram_index_rwlock is > > > > > not the right approach to fix it. However, I still hope we find a more > > > > > generic solution to fix them at once, since I see it as a zram instance > > > > > racing problem. > > > > > > > > They are 3 separate problems. Related, but different. > > > > > > What are the 3 different problems? I am asking since I remember only two: > > > one for CPU multistate and the other for sysfs during rmmod. > > > > The third one is the race where sysfs attributes are used and those routines > > then dereference the gendisk private_data. > > First of all, thanks for keeping the discussion going, Luis. > > That was the one I thought of as the race between sysfs and rmmod. > > > > > > If the idea then is to busy out rmmod if a sysfs attribute is being > > > > read, that could then mean rmmod can sometimes never complete. Hogging
Hogging > > > > up / busying out sysfs attributes means the module cannto be removed. > > > > > > It's true but is it a big problem? There are many cases that system > > > just return error if it's busy and rely on the admin. IMHO, rmmod should > > > be part of them. > > > > It depends on existing userspace scripts which are used to test and > > expectations set. Consider existing tests, you would know better, and > > since you are the maintainer you decide. > > > > I at least know for many other types of device drivers an rmmod is > > a sledge hammer. > > > > You decide. I just thought it would be good to highlight the effect now > > rather than us considering it later. > > To me, the rmmod faillure is not a big problem for zram since it's > common cases in the system with -EBUSY(Having said, I agree that's the > best if we could avoid the fail-and-retrial. IOW, -EBUSY should be > last resort unless we have nicer way.) > > > > > > > Which is why the *try_module_get()* I think is much more suitable, as > > > > it will always fails if we're already going down. > > > > > > How does the try_module_get solved the problem? > > > > The try stuff only resolves the deadlock. The bget() / bdput() resolves > > the race to access to the gendisk private_data. > > That's the one I missed in this discussion. Now I am reading your [2/2] > in original patch. I thought it was just zram instance was destroyed > by sysfs race problem so you had seen the deadlock. I might miss the > point here, too. > > Hmm, we are discussing several problems all at once. I feel it's time > to jump v2 with your way in this point. You said three different > problems. As I asked, please write it down with more detail with > code sequence as we discussed other thread. If you mean a deadlock, > please write what specific locks was deadlock with it. > It would make discussion much easier. Let's discuss the issue > one by one in each thread. To clarify what I understood form the discussion until now: 1. 
zram shouldn't allow creating more zram instances during rmmod (it causes a CPU multistate splat) 2. the private data of the gendisk shouldn't be destroyed while a zram sysfs knob is working (it makes the system crash) Thank you. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-07 1:38 ` Minchan Kim @ 2021-04-07 14:52 ` Luis Chamberlain 0 siblings, 0 replies; 44+ messages in thread From: Luis Chamberlain @ 2021-04-07 14:52 UTC (permalink / raw) To: Minchan Kim Cc: keescook, dhowells, hch, mbenes, gregkh, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel On Tue, Apr 06, 2021 at 06:38:24PM -0700, Minchan Kim wrote: > To clarify what I understood from the discussion until now: > > 1. zram shouldn't allow creating more zram instances during > rmmod (it causes a CPU multistate splat) Right! > 2. the private data of the gendisk shouldn't be destroyed while a zram > sysfs knob is working (it makes the system crash) Yup, this is resolved with the bdgrab / bdput on each sysfs knob. And the last issue is: 3) the one which patch 2/2 addressed. If a mutex is shared between sysfs knobs and module removal, you must do try_module_get() to prevent the deadlock. Luis ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate 2021-04-07 1:23 ` Minchan Kim 2021-04-07 1:38 ` Minchan Kim @ 2021-04-07 14:50 ` Luis Chamberlain 1 sibling, 0 replies; 44+ messages in thread From: Luis Chamberlain @ 2021-04-07 14:50 UTC (permalink / raw) To: Minchan Kim Cc: keescook, dhowells, hch, mbenes, gregkh, ngupta, sergey.senozhatsky.work, axboe, linux-block, linux-kernel On Tue, Apr 06, 2021 at 06:23:46PM -0700, Minchan Kim wrote: > On Tue, Apr 06, 2021 at 12:29:09AM +0000, Luis Chamberlain wrote: > > On Mon, Apr 05, 2021 at 12:58:05PM -0700, Minchan Kim wrote: > > > On Mon, Apr 05, 2021 at 07:00:23PM +0000, Luis Chamberlain wrote: > > > > Which is why the *try_module_get()* I think is much more suitable, as > > > > it will always fail if we're already going down. > > > > > > How does try_module_get solve the problem? > > > > The try stuff only resolves the deadlock. The bdget() / bdput() resolves > > the race on access to the gendisk private_data. > > That's the one I missed in this discussion. Now I am reading your [2/2] > in the original patch. I thought it was just that the zram instance was destroyed > by the sysfs race, and that's why you had seen the deadlock. Patch [2/2] indeed dealt with a zram instance race, but the issue was not a crash due to indirection, it was a deadlock. The deadlock happens because a shared mutex is used both in sysfs and on __exit. I'll expand on the example as you requested so that both issues are clearly understood. The zram race you spotted which could lead to a dereference and crash is a separate one, which the bdget() / bdput() on sysfs knobs resolves. That race happens because zram's sysfs knobs currently don't prevent del_gendisk() from completing, and so a dereference can happen. > Hmm, we are discussing several problems all at once. I feel it's time > to jump to v2 with your approach at this point. You said there are three different > problems.
As I asked, please write them down in more detail with the > code sequences, as we discussed in the other thread. If you mean a deadlock, > please write down which specific locks deadlocked. > It would make the discussion much easier. Let's discuss the issues > one by one, each in its own thread. Sure. Will post a v2. > > But so far Greg does not see enough justification for a), so we can either > > show how widespread this issue is (which I can do using coccinelle), or we > > just open code the try_module_get() / put on each driver that needs it > > for now. Either way it would resolve the issue. > > I second that: if it's a general problem for drivers, I agree it's worth > addressing in the core, unless the driver introduces the mess. I have > no idea here since I didn't understand the problem yet. I suspect these issues are generic, however hard to reproduce, and this just means busying out sysfs knobs while doing module removal can likely cause crashes in some kernel drivers. Since the position seems to be that module removal is best effort, if crashes on module removal matter to module maintainers, such races are best addressed on the driver side, not in core. > > As for b), given that I think even you had missed my attempt to > > generalize the bdget/bdput solution for any attribute type (did you see > > my dev_type_get() and dev_type_put() proposed changes?), I don't think > > this problem is yet well defined in a generic way for us to rule it out. > > It is however easier to simply open code this per driver that needs it > > for now, given that I don't think Greg is yet convinced the deadlock is > > a widespread issue. I however am pretty sure both races *do* > > exist outside of zram in many places. > > It would be a good sign to propose a general solution. Same situation here, as the race is with module removal.
Even in the case where module removal is possible today and nominally "supported" -- livepatching -- it seems we're slowly shying away from that situation, and will likely expose a knob so that removing a livepatch (and doing module removal) requires flipping it to say, "Yes, I know what I am doing". > > The try_module_get() approach resolves the deadlock race, but it does > > so in a lazy way. I mean lazy in that rmmod then wins over sysfs knobs. > > So touching sysfs knobs won't make an rmmod fail. I think that's the more > > typically expected behaviour. Why? Because I find it odd that looping > > forever touching sysfs attributes should prevent a module removal. But > > that's a personal preference. > > I agree with you that that would be better, but let's see how clean it can > be made. OK great. > FYI, please look at hot_remove_store, which can also remove a zram > instance on demand. I am looking forward to seeing your v2. Sure, I'll address all sysfs knobs in v2. Luis ^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 2/2] zram: fix races of sysfs attribute removal and usage 2021-03-06 2:20 [PATCH 0/2] zram: fix few ltp zram02.sh crashes Luis Chamberlain 2021-03-06 2:20 ` [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate Luis Chamberlain @ 2021-03-06 2:20 ` Luis Chamberlain 1 sibling, 0 replies; 44+ messages in thread From: Luis Chamberlain @ 2021-03-06 2:20 UTC (permalink / raw) To: minchan, ngupta, sergey.senozhatsky.work Cc: axboe, mbenes, mcgrof, linux-block, linux-kernel When we have sysfs attributes which muck with the driver heavily, we may end up with situations where the core kernel driver removal call races with usage of a sysfs attribute. This can happen when, for instance, a lock used in a sysfs attribute is also used for driver removal. To fix this we just *try* to get a refcount to the module prior to mucking with a sysfs attribute. If this fails we just give up right away. Ideally we'd want a generic solution, however this requires a bit more work. If we tried to generalize this on the block layer, the closest we get is disk->fops->owner; however zram is an example of a driver where the disk->fops is actually changed *after* module load, and so the original disk->fops->owner can be dynamic. In zram's case the fops->owner is the same, but we have no semantics to ensure this is the case for all block drivers. Using these two loops in two separate terminals can easily reproduce this hang:

Loop 1 on one terminal:

  while true; do modprobe zram; modprobe -r zram; done

Loop 2 on a second terminal:

  while true; do
    echo 1024 > /sys/block/zram0/disksize;
    echo 1 > /sys/block/zram0/reset;
  done

The splat which follows comes up without this patch.

INFO: task bash:888 blocked for more than 120 seconds.
      Tainted: G            E     5.12.0-rc1-next-20210304+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:bash            state:D stack:    0 pid:  888 ppid:   887 flags:0x00000004
Call Trace:
 __schedule+0x2e4/0x900
 schedule+0x46/0xb0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.constprop.0+0x2c3/0x490
 ? _kstrtoull+0x35/0xd0
 reset_store+0x6c/0x160 [zram]
 kernfs_fop_write_iter+0x124/0x1b0
 new_sync_write+0x11c/0x1b0
 vfs_write+0x1c2/0x260
 ksys_write+0x5f/0xe0
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f34f2c3df33
RSP: 002b:00007ffe751df6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f34f2c3df33
RDX: 0000000000000002 RSI: 0000561ccb06ec10 RDI: 0000000000000001
RBP: 0000561ccb06ec10 R08: 000000000000000a R09: 0000000000000001
R10: 0000561ccb157590 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f34f2d0e6a0 R14: 0000000000000002 R15: 00007f34f2d0e8a0
INFO: task modprobe:1104 can't die for more than 120 seconds.
task:modprobe        state:D stack:    0 pid: 1104 ppid:   916 flags:0x00004004
Call Trace:
 __schedule+0x2e4/0x900
 schedule+0x46/0xb0
 __kernfs_remove.part.0+0x228/0x2b0
 ? finish_wait+0x80/0x80
 kernfs_remove_by_name_ns+0x50/0x90
 remove_files+0x2b/0x60
 sysfs_remove_group+0x38/0x80
 sysfs_remove_groups+0x29/0x40
 device_remove_attrs+0x4a/0x80
 device_del+0x183/0x3e0
 ? mutex_lock+0xe/0x30
 del_gendisk+0x27a/0x2d0
 zram_remove+0x8a/0xb0 [zram]
 ? hot_remove_store+0xf0/0xf0 [zram]
 zram_remove_cb+0xd/0x10 [zram]
 idr_for_each+0x5e/0xd0
 destroy_devices+0x39/0x6f [zram]
 __do_sys_delete_module+0x190/0x2a0
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f32adf727d7
RSP: 002b:00007ffc08bb38a8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 000055eea23cbb10 RCX: 00007f32adf727d7
RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055eea23cbb78
RBP: 000055eea23cbb10 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f32adfe5ac0 R11: 0000000000000206 R12: 000055eea23cbb78
R13: 0000000000000000 R14: 0000000000000000 R15: 000055eea23cbc20

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/block/zram/zram_drv.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 63b6119cee93..ce54f4bf5a5b 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1722,6 +1722,9 @@ static ssize_t disksize_store(struct device *dev,
 	struct zram *zram = dev_to_zram(dev);
 	int err;
 
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
 	mutex_lock(&zram_index_mutex);
 
 	if (!zram_up) {
@@ -1762,6 +1765,7 @@ static ssize_t disksize_store(struct device *dev,
 
 	up_write(&zram->init_lock);
 	mutex_unlock(&zram_index_mutex);
+	module_put(THIS_MODULE);
 
 	return len;
 
@@ -1771,6 +1775,7 @@ static ssize_t disksize_store(struct device *dev,
 
 	up_write(&zram->init_lock);
 out:
 	mutex_unlock(&zram_index_mutex);
+	module_put(THIS_MODULE);
 	return err;
 }
@@ -1786,6 +1791,9 @@ static ssize_t reset_store(struct device *dev,
 	if (ret)
 		return ret;
 
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
 	mutex_lock(&zram_index_mutex);
 
 	if (!zram_up) {
@@ -1823,6 +1831,7 @@ static ssize_t reset_store(struct device *dev,
 
 out:
 	mutex_unlock(&zram_index_mutex);
+	module_put(THIS_MODULE);
 
 	return len;
 }
@@ -2043,13 +2052,19 @@ static ssize_t hot_add_show(struct class *class,
 {
 	int ret;
 
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
 	mutex_lock(&zram_index_mutex);
 	if (!zram_up) {
 		mutex_unlock(&zram_index_mutex);
-		return -ENODEV;
+		ret = -ENODEV;
+		goto out;
 	}
 	ret = zram_add();
+out:
 	mutex_unlock(&zram_index_mutex);
+	module_put(THIS_MODULE);
 
 	if (ret < 0)
 		return ret;
@@ -2073,6 +2088,9 @@ static ssize_t hot_remove_store(struct class *class,
 	if (dev_id < 0)
 		return -EINVAL;
 
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
 	mutex_lock(&zram_index_mutex);
 
 	if (!zram_up) {
@@ -2091,6 +2109,7 @@ static ssize_t hot_remove_store(struct class *class,
 out:
 	mutex_unlock(&zram_index_mutex);
+	module_put(THIS_MODULE);
 	return ret ? ret : count;
 }
 
 static CLASS_ATTR_WO(hot_remove);
-- 
2.30.1

^ permalink raw reply related [flat|nested] 44+ messages in thread
end of thread, other threads:[~2021-04-09 3:01 UTC | newest] Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-03-06 2:20 [PATCH 0/2] zram: fix few ltp zram02.sh crashes Luis Chamberlain 2021-03-06 2:20 ` [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate Luis Chamberlain 2021-03-09 2:55 ` Minchan Kim 2021-03-10 13:11 ` Luis Chamberlain 2021-03-10 21:25 ` Luis Chamberlain 2021-03-12 2:08 ` Minchan Kim 2021-03-10 21:21 ` Luis Chamberlain 2021-03-12 2:14 ` Minchan Kim 2021-03-12 18:32 ` Luis Chamberlain 2021-03-12 19:28 ` Minchan Kim 2021-03-19 19:09 ` Luis Chamberlain 2021-03-22 16:37 ` Minchan Kim 2021-03-22 20:41 ` Luis Chamberlain 2021-03-22 22:12 ` Minchan Kim 2021-04-01 23:59 ` Luis Chamberlain 2021-04-02 7:54 ` Greg KH 2021-04-02 18:30 ` Luis Chamberlain 2021-04-03 6:13 ` Greg KH [not found] ` <20210406003152.GZ4332@42.do-not-panic.com> 2021-04-06 12:00 ` Miroslav Benes 2021-04-06 15:54 ` Josh Poimboeuf 2021-04-07 14:09 ` Peter Zijlstra 2021-04-07 15:30 ` Josh Poimboeuf 2021-04-07 16:48 ` Peter Zijlstra 2021-04-07 20:17 ` Josh Poimboeuf 2021-04-08 6:18 ` Greg KH 2021-04-08 13:16 ` Steven Rostedt 2021-04-08 13:37 ` Josh Poimboeuf 2021-04-08 1:37 ` Thomas Gleixner 2021-04-08 6:16 ` Greg KH 2021-04-08 8:01 ` Jiri Kosina 2021-04-08 8:09 ` Greg KH 2021-04-08 8:35 ` Jiri Kosina 2021-04-08 8:55 ` Greg KH 2021-04-08 18:40 ` Luis Chamberlain 2021-04-09 3:01 ` Kees Cook 2021-04-05 17:07 ` Minchan Kim 2021-04-05 19:00 ` Luis Chamberlain 2021-04-05 19:58 ` Minchan Kim 2021-04-06 0:29 ` Luis Chamberlain 2021-04-07 1:23 ` Minchan Kim 2021-04-07 1:38 ` Minchan Kim 2021-04-07 14:52 ` Luis Chamberlain 2021-04-07 14:50 ` Luis Chamberlain 2021-03-06 2:20 ` [PATCH 2/2] zram: fix races of sysfs attribute removal and usage Luis Chamberlain