Re: [PATCH v5 2/2] zram: fix deadlock with sysfs attribute usage and module removal

From: Greg KH <gregkh@linuxfoundation.org>
To: Minchan Kim <minchan@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>,
	jeyu@kernel.org, ngupta@vflare.org,
	sergey.senozhatsky.work@gmail.com, rafael@kernel.org,
	axboe@kernel.dk, tj@kernel.org, mbenes@suse.com,
	jpoimboe@redhat.com, tglx@linutronix.de, keescook@chromium.org,
	jikos@kernel.org, rostedt@goodmis.org, peterz@infradead.org,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 2/2] zram: fix deadlock with sysfs attribute usage and module removal
Date: Fri, 2 Jul 2021 09:11:39 +0200	[thread overview]
Message-ID: <YN68K7JaAnM/TX69@kroah.com> (raw)
In-Reply-To: <YN6rBKIHmLhEo36w@google.com>

On Thu, Jul 01, 2021 at 10:58:28PM -0700, Minchan Kim wrote:
> On Thu, Jul 01, 2021 at 09:37:16PM -0700, Luis Chamberlain wrote:
> > When sysfs attributes use a lock also used on module removal we can
> > potentially deadlock. This happens when for instance a sysfs file on
> > a driver is used, then at the same time we have module removal call
> > trigger. The module removal call code holds a lock, and then the sysfs
> > file entry waits for the same lock. While holding the lock the module
> > removal tries to remove the sysfs entries, but these cannot be removed
> > yet as one is waiting for a lock. This won't complete as the lock is
> > already held. Likewise module removal cannot complete, and so we deadlock.
> > 
> > To fix this we just *try* to get a refcount to the module when a shared
> > lock is used, prior to mucking with a sysfs attribute. If this fails we
> > just give up right away.
> > 
> > We use a try method as a full lock means we'd then make our sysfs
> > attributes busy us out from possible module removal, and so userspace
> > could force denying module removal, a silly form of "DOS" against module
> > removal. A try lock on the module removal ensures we give priority to
> > module removal and interacting with sysfs attributes only comes second.
> > Using a full lock could mean for instance that if you don't stop poking
> > at sysfs files you cannot remove a module.
> > 
> > This deadlock was first reported with the zram driver, a sketch of how
> > this can happen follows:
> > 
> > CPU A                              CPU B
> >                                    whatever_store()
> > module_unload
> >   mutex_lock(foo)
> >                                    mutex_lock(foo)
> >    del_gendisk(zram->disk);
> >      device_del()
> >        device_remove_groups()
> > 
> > In this situation whatever_store() is waiting for the mutex foo to
> > become unlocked, but that won't happen until module removal is complete.
> > But module removal won't complete until the sysfs file being poked
> > completes which is waiting for a lock already held.
> > 
> > This is a generic kernel issue with sysfs files which use any lock also
> > used on module removal. Different generic solutions have been proposed.
> > One approach proposed is by directly by augmenting attributes with module
> > information [0]. This patch implements a solution by adding macros with
> > the prefix MODULE_DEVICE_ATTR_*() which accomplish the same. Until we
> > don't have a generic agreed upon solution for this shared between drivers,
> > we must implement a fix for this on each driver.
> > 
> > We make zram use the new MODULE_DEVICE_ATTR_*() helpers, and completely
> > open code the solution for class attributes as there are only a few of
> > those.
> > 
> > This issue can be reproduced easily on the zram driver as follows:
> > 
> > Loop 1 on one terminal:
> > 
> > while true;
> > 	do modprobe zram;
> > 	modprobe -r zram;
> > done
> > 
> > Loop 2 on a second terminal:
> > while true; do
> > 	echo 1024 >  /sys/block/zram0/disksize;
> > 	echo 1 > /sys/block/zram0/reset;
> > done
> > 
> > Without this patch we end up in a deadlock, and the following
> > stack trace is produced which hints to us what the issue was:
> > 
> > INFO: task bash:888 blocked for more than 120 seconds.
> >       Tainted: G            E 5.12.0-rc1-next-20210304+ #4
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > task:bash            state:D stack:    0 pid:  888 ppid: 887 flags:<etc>
> > Call Trace:
> >  __schedule+0x2e4/0x900
> >  schedule+0x46/0xb0
> >  schedule_preempt_disabled+0xa/0x10
> >  __mutex_lock.constprop.0+0x2c3/0x490
> >  ? _kstrtoull+0x35/0xd0
> >  reset_store+0x6c/0x160 [zram]
> >  kernfs_fop_write_iter+0x124/0x1b0
> >  new_sync_write+0x11c/0x1b0
> >  vfs_write+0x1c2/0x260
> >  ksys_write+0x5f/0xe0
> >  do_syscall_64+0x33/0x80
> >  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > RIP: 0033:0x7f34f2c3df33
> > RSP: 002b:00007ffe751df6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> > RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f34f2c3df33
> > RDX: 0000000000000002 RSI: 0000561ccb06ec10 RDI: 0000000000000001
> > RBP: 0000561ccb06ec10 R08: 000000000000000a R09: 0000000000000001
> > R10: 0000561ccb157590 R11: 0000000000000246 R12: 0000000000000002
> > R13: 00007f34f2d0e6a0 R14: 0000000000000002 R15: 00007f34f2d0e8a0
> > INFO: task modprobe:1104 can't die for more than 120 seconds.
> > task:modprobe        state:D stack:    0 pid: 1104 ppid: 916 flags:<etc>
> > Call Trace:
> >  __schedule+0x2e4/0x900
> >  schedule+0x46/0xb0
> >  __kernfs_remove.part.0+0x228/0x2b0
> >  ? finish_wait+0x80/0x80
> >  kernfs_remove_by_name_ns+0x50/0x90
> >  remove_files+0x2b/0x60
> >  sysfs_remove_group+0x38/0x80
> >  sysfs_remove_groups+0x29/0x40
> >  device_remove_attrs+0x4a/0x80
> >  device_del+0x183/0x3e0
> >  ? mutex_lock+0xe/0x30
> >  del_gendisk+0x27a/0x2d0
> >  zram_remove+0x8a/0xb0 [zram]
> >  ? hot_remove_store+0xf0/0xf0 [zram]
> >  zram_remove_cb+0xd/0x10 [zram]
> >  idr_for_each+0x5e/0xd0
> >  destroy_devices+0x39/0x6f [zram]
> >  __do_sys_delete_module+0x190/0x2a0
> >  do_syscall_64+0x33/0x80
> >  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > RIP: 0033:0x7f32adf727d7
> > RSP: 002b:00007ffc08bb38a8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> > RAX: ffffffffffffffda RBX: 000055eea23cbb10 RCX: 00007f32adf727d7
> > RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055eea23cbb78
> > RBP: 000055eea23cbb10 R08: 0000000000000000 R09: 0000000000000000
> > R10: 00007f32adfe5ac0 R11: 0000000000000206 R12: 000055eea23cbb78
> > R13: 0000000000000000 R14: 0000000000000000 R15: 000055eea23cbc20
> > 
> > [0] https://lkml.kernel.org/r/20210401235925.GR4332@42.do-not-panic.com
> > 
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> 
> Acked-by: Minchan Kim <minchan@kernel.org>
> 
> Much simple/clean now. Thanks for persuing the effort.

No, please let us NOT do this.  Let's revisit this after 5.14-rc1 is
out, I still do not think this is the correct thing to do as this still
contains races, your window is now just smaller.

thanks,

greg k-h