All of lore.kernel.org
 help / color / mirror / Atom feed
From: Greg KH <gregkh@linuxfoundation.org>
To: Minchan Kim <minchan@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>,
	jeyu@kernel.org, ngupta@vflare.org,
	sergey.senozhatsky.work@gmail.com, rafael@kernel.org,
	axboe@kernel.dk, tj@kernel.org, mbenes@suse.com,
	jpoimboe@redhat.com, tglx@linutronix.de, keescook@chromium.org,
	jikos@kernel.org, rostedt@goodmis.org, peterz@infradead.org,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 2/2] zram: fix deadlock with sysfs attribute usage and module removal
Date: Fri, 2 Jul 2021 09:11:39 +0200	[thread overview]
Message-ID: <YN68K7JaAnM/TX69@kroah.com> (raw)
In-Reply-To: <YN6rBKIHmLhEo36w@google.com>

On Thu, Jul 01, 2021 at 10:58:28PM -0700, Minchan Kim wrote:
> On Thu, Jul 01, 2021 at 09:37:16PM -0700, Luis Chamberlain wrote:
> > When sysfs attributes use a lock also used on module removal we can
> > potentially deadlock. This happens when for instance a sysfs file on
> > a driver is used, then at the same time we have module removal call
> > trigger. The module removal call code holds a lock, and then the sysfs
> > file entry waits for the same lock. While holding the lock the module
> > removal tries to remove the sysfs entries, but these cannot be removed
> > yet as one is waiting for a lock. This won't complete as the lock is
> > already held. Likewise module removal cannot complete, and so we deadlock.
> > 
> > To fix this we just *try* to get a refcount to the module when a shared
> > lock is used, prior to mucking with a sysfs attribute. If this fails we
> > just give up right away.
> > 
> > We use a try method as a full lock means we'd then make our sysfs
> > attributes busy us out from possible module removal, and so userspace
> > could force denying module removal, a silly form of "DOS" against module
> > removal. A try lock on the module removal ensures we give priority to
> > module removal and interacting with sysfs attributes only comes second.
> > Using a full lock could mean for instance that if you don't stop poking
> > at sysfs files you cannot remove a module.
> > 
> > This deadlock was first reported with the zram driver, a sketch of how
> > this can happen follows:
> > 
> > CPU A                              CPU B
> >                                    whatever_store()
> > module_unload
> >   mutex_lock(foo)
> >                                    mutex_lock(foo)
> >    del_gendisk(zram->disk);
> >      device_del()
> >        device_remove_groups()
> > 
> > In this situation whatever_store() is waiting for the mutex foo to
> > become unlocked, but that won't happen until module removal is complete.
> > But module removal won't complete until the sysfs file being poked
> > completes which is waiting for a lock already held.
> > 
> > This is a generic kernel issue with sysfs files which use any lock also
> > used on module removal. Different generic solutions have been proposed.
> > One approach proposed is by directly by augmenting attributes with module
> > information [0]. This patch implements a solution by adding macros with
> > the prefix MODULE_DEVICE_ATTR_*() which accomplish the same. Until we
> > don't have a generic agreed upon solution for this shared between drivers,
> > we must implement a fix for this on each driver.
> > 
> > We make zram use the new MODULE_DEVICE_ATTR_*() helpers, and completely
> > open code the solution for class attributes as there are only a few of
> > those.
> > 
> > This issue can be reproduced easily on the zram driver as follows:
> > 
> > Loop 1 on one terminal:
> > 
> > while true;
> > 	do modprobe zram;
> > 	modprobe -r zram;
> > done
> > 
> > Loop 2 on a second terminal:
> > while true; do
> > 	echo 1024 >  /sys/block/zram0/disksize;
> > 	echo 1 > /sys/block/zram0/reset;
> > done
> > 
> > Without this patch we end up in a deadlock, and the following
> > stack trace is produced which hints to us what the issue was:
> > 
> > INFO: task bash:888 blocked for more than 120 seconds.
> >       Tainted: G            E 5.12.0-rc1-next-20210304+ #4
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > task:bash            state:D stack:    0 pid:  888 ppid: 887 flags:<etc>
> > Call Trace:
> >  __schedule+0x2e4/0x900
> >  schedule+0x46/0xb0
> >  schedule_preempt_disabled+0xa/0x10
> >  __mutex_lock.constprop.0+0x2c3/0x490
> >  ? _kstrtoull+0x35/0xd0
> >  reset_store+0x6c/0x160 [zram]
> >  kernfs_fop_write_iter+0x124/0x1b0
> >  new_sync_write+0x11c/0x1b0
> >  vfs_write+0x1c2/0x260
> >  ksys_write+0x5f/0xe0
> >  do_syscall_64+0x33/0x80
> >  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > RIP: 0033:0x7f34f2c3df33
> > RSP: 002b:00007ffe751df6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> > RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f34f2c3df33
> > RDX: 0000000000000002 RSI: 0000561ccb06ec10 RDI: 0000000000000001
> > RBP: 0000561ccb06ec10 R08: 000000000000000a R09: 0000000000000001
> > R10: 0000561ccb157590 R11: 0000000000000246 R12: 0000000000000002
> > R13: 00007f34f2d0e6a0 R14: 0000000000000002 R15: 00007f34f2d0e8a0
> > INFO: task modprobe:1104 can't die for more than 120 seconds.
> > task:modprobe        state:D stack:    0 pid: 1104 ppid: 916 flags:<etc>
> > Call Trace:
> >  __schedule+0x2e4/0x900
> >  schedule+0x46/0xb0
> >  __kernfs_remove.part.0+0x228/0x2b0
> >  ? finish_wait+0x80/0x80
> >  kernfs_remove_by_name_ns+0x50/0x90
> >  remove_files+0x2b/0x60
> >  sysfs_remove_group+0x38/0x80
> >  sysfs_remove_groups+0x29/0x40
> >  device_remove_attrs+0x4a/0x80
> >  device_del+0x183/0x3e0
> >  ? mutex_lock+0xe/0x30
> >  del_gendisk+0x27a/0x2d0
> >  zram_remove+0x8a/0xb0 [zram]
> >  ? hot_remove_store+0xf0/0xf0 [zram]
> >  zram_remove_cb+0xd/0x10 [zram]
> >  idr_for_each+0x5e/0xd0
> >  destroy_devices+0x39/0x6f [zram]
> >  __do_sys_delete_module+0x190/0x2a0
> >  do_syscall_64+0x33/0x80
> >  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > RIP: 0033:0x7f32adf727d7
> > RSP: 002b:00007ffc08bb38a8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> > RAX: ffffffffffffffda RBX: 000055eea23cbb10 RCX: 00007f32adf727d7
> > RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055eea23cbb78
> > RBP: 000055eea23cbb10 R08: 0000000000000000 R09: 0000000000000000
> > R10: 00007f32adfe5ac0 R11: 0000000000000206 R12: 000055eea23cbb78
> > R13: 0000000000000000 R14: 0000000000000000 R15: 000055eea23cbc20
> > 
> > [0] https://lkml.kernel.org/r/20210401235925.GR4332@42.do-not-panic.com
> > 
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> 
> Acked-by: Minchan Kim <minchan@kernel.org>
> 
> Much simple/clean now. Thanks for persuing the effort.

No, please let us NOT do this.  Let's revisit this after 5.14-rc1 is
out, I still do not think this is the correct thing to do as this still
contains races, your window is now just smaller.

thanks,

greg k-h

  reply	other threads:[~2021-07-02  7:11 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-02  4:37 [PATCH v5 0/2] zram: fix few sysfs races Luis Chamberlain
2021-07-02  4:37 ` [PATCH v5 1/2] zram: fix crashes with cpu hotplug multistate Luis Chamberlain
2021-07-02  4:37 ` [PATCH v5 2/2] zram: fix deadlock with sysfs attribute usage and module removal Luis Chamberlain
2021-07-02  5:58   ` Minchan Kim
2021-07-02  7:11     ` Greg KH [this message]
2021-07-02  6:01 ` [PATCH v5 0/2] zram: fix few sysfs races Minchan Kim
2021-07-02 19:36   ` Luis Chamberlain

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YN68K7JaAnM/TX69@kroah.com \
    --to=gregkh@linuxfoundation.org \
    --cc=axboe@kernel.dk \
    --cc=jeyu@kernel.org \
    --cc=jikos@kernel.org \
    --cc=jpoimboe@redhat.com \
    --cc=keescook@chromium.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mbenes@suse.com \
    --cc=mcgrof@kernel.org \
    --cc=minchan@kernel.org \
    --cc=ngupta@vflare.org \
    --cc=peterz@infradead.org \
    --cc=rafael@kernel.org \
    --cc=rostedt@goodmis.org \
    --cc=sergey.senozhatsky.work@gmail.com \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.