Re: [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal

From: Luis Chamberlain <mcgrof@kernel.org>
To: Greg KH <gregkh@linuxfoundation.org>
Cc: minchan@kernel.org, jeyu@kernel.org, ngupta@vflare.org,
	sergey.senozhatsky.work@gmail.com, axboe@kernel.dk,
	mbenes@suse.com, jpoimboe@redhat.com, tglx@linutronix.de,
	keescook@chromium.org, jikos@kernel.org, rostedt@goodmis.org,
	peterz@infradead.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal
Date: Tue, 22 Jun 2021 09:32:08 -0700	[thread overview]
Message-ID: <20210622163208.epx4lf3pv2x2d5b4@garbanzo> (raw)
In-Reply-To: <YNGVI/vKSBAM8dlh@kroah.com>

On Tue, Jun 22, 2021 at 09:45:39AM +0200, Greg KH wrote:
> On Mon, Jun 21, 2021 at 04:36:34PM -0700, Luis Chamberlain wrote:
> > When sysfs attributes use a lock also used on driver removal we can
> > potentially deadlock. This happens when for instance a sysfs file on
> > a driver is used, then at the same time we have driver removal trigger.
> > The driver removal code holds a lock, and then the sysfs file entry waits
> > for the same lock. While holding the lock the driver removal tries to
> > remove the sysfs entries, but these cannot be removed yet as one is
> > waiting for a lock. This won't complete as the lock is already held.
> > Likewise module removal cannot complete, and so we deadlock.
> 
> This is all about removing modules, not about "driver removal" from a
> device.  Please make the subject line here more explicit.

Sure.

> > To fix this we just *try* to get a refcount to the module when a shared
> > lock is used, prior to mucking with a sysfs attribute. If this fails we
> > just give up right away.
> > 
> > We use a try method as a full lock means we'd then make our sysfs attributes
> > busy us out from possible module removal, and so userspace could force denying
> > module removal, a silly form of "DOS" against module removal. A try lock on
> > the module removal ensures we give priority to module removal and interacting
> > with sysfs attributes only comes second. Using a full lock could mean for
> > instance that if you don't stop poking at sysfs files you cannot remove a
> > module.
> > 
> > This deadlock was first reported with the zram driver, a sketch of how
> > this can happen follows:
> > 
> > CPU A                              CPU B
> >                                    whatever_store()
> > module_unload
> >   mutex_lock(foo)
> >                                    mutex_lock(foo)
> >    del_gendisk(zram->disk);
> >      device_del()
> >        device_remove_groups()
> 
> Can you duplicate this in a real-world situation?
> 
> What tools remove the zram module from the system on the fly?

A customer did run into it through a series of automated tests. I was
able to finally reproduce with the instructions given below. I
simplified it given that the series of test the customer was running
was much more complex.

> > In this situation whatever_store() is waiting for the mutex foo to
> > become unlocked, but that won't happen until module removal is complete.
> > But module removal won't complete until the syfs file being poked completes
> > which is waiting for a lock already held.
> > 
> > This is a generic kernel issue with sysfs files which use any lock also
> > used on module removal. Different generic solutions have been proposed.
> > One approach proposed is by directly by augmenting attributes with module
> > information [0]. This patch implements a solution by adding macros with
> > the prefix MODULE_DEVICE_ATTR_*() which accomplish the same. Until we
> > don't have a generic agreed upon solution for this shared between drivers,
> > we must implement a fix for this on each driver.
> > 
> > We make zram use the new MODULE_DEVICE_ATTR_*() helpers, and completely
> > open code the solution for class attributes as there are only a few of
> > those.
> > 
> > This issue can be reproduced easily on the zram driver as follows:
> > 
> > Loop 1 on one terminal:
> > 
> > while true;
> > 	do modprobe zram;
> > 	modprobe -r zram;
> > done
> > 
> > Loop 2 on a second terminal:
> > while true; do
> > 	echo 1024 >  /sys/block/zram0/disksize;
> > 	echo 1 > /sys/block/zram0/reset;
> > done
> 
> As fun as this is, it's not a real workload, please do not pretend that
> it is.

Whoever said that it was? This is just a way to reproduce an issue which
was reported.

> And your code is still racy, see below.  You just made the window even
> smaller, which you still should be objecting to as you somehow feel this
> is a valid usecase :)
> 
> > @@ -2048,13 +2048,19 @@ static ssize_t hot_add_show(struct class *class,
> >  {
> >  	int ret;
> >  
> > +	if (!try_module_get(THIS_MODULE))
> > +		return -ENODEV;
> > +
> 
> You can not increment/decrement your own module's reference count and
> expect it to work properly, as it is still a race.

The goal here is to prevent an rmmod call if this succeeds. If it
succeeds then any subsequent rmmod will fail. Can you explain how
this is still racy?

  Luis