From: Minchan Kim <minchan@kernel.org>
To: Luis Chamberlain <mcgrof@kernel.org>
Cc: keescook@chromium.org, dhowells@redhat.com, hch@infradead.org,
	mbenes@suse.com, gregkh@linuxfoundation.org, ngupta@vflare.org,
	sergey.senozhatsky.work@gmail.com, axboe@kernel.dk,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate
Date: Mon, 5 Apr 2021 10:07:24 -0700	[thread overview]
Message-ID: <YGtDzH0dEfEngCij@google.com> (raw)
In-Reply-To: <20210401235925.GR4332@42.do-not-panic.com>

On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote:
> On Mon, Mar 22, 2021 at 03:12:01PM -0700, Minchan Kim wrote:
> > On Mon, Mar 22, 2021 at 08:41:56PM +0000, Luis Chamberlain wrote:
> > > 
> > > I would not call it *every* sysfs knob, as not all of them deal with
> > > things related to CPU hotplug multistate, right? Note that using just
> > > try_module_get() alone (that is, the second patch only) does not fix
> > > the race I am describing above.
> > 
> > It wouldn't be a CPU hotplug multistate issue; I'd rather call it
> > a "zram instance race" bug.
> > What happens in this case?
> > 
> >         CPU 1                            CPU 2
> > 
> > destroy_devices
> > ..
> >                                  compact_store()
> >                                  zram = dev_to_zram(dev);
> > idr_for_each(zram_remove_cb
> >   zram_remove
> >   ..
> >   kfree(zram)
> >                                  down_read(&zram->init_lock);
> > 
> > 
> >         CPU 1                            CPU 2
> > 
> > hot_remove_store
> >                                  compact_store()
> >                                  zram = dev_to_zram(dev);
> >   zram_remove
> >     kfree(zram)
> >                                  down_read(&zram->init_lock);
> > 
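> > For reference, the reader side looks roughly like this (a sketch of
> > compact_store() from drivers/block/zram/zram_drv.c; note that
> > dev_to_zram() hands back a pointer which zram_remove() may already
> > have freed by the time we take init_lock):
> > 
> > static ssize_t compact_store(struct device *dev,
> > 		struct device_attribute *attr, const char *buf, size_t len)
> > {
> > 	struct zram *zram = dev_to_zram(dev);
> > 
> > 	/* If zram_remove() won the race, zram was already kfree()d. */
> > 	down_read(&zram->init_lock);
> > 	if (!init_done(zram)) {
> > 		up_read(&zram->init_lock);
> > 		return -EINVAL;
> > 	}
> > 
> > 	zs_compact(zram->mem_pool);
> > 	up_read(&zram->init_lock);
> > 
> > 	return len;
> > }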
> > So, to me, we need to close the race between zram instance
> > creation/removal and sysfs access rather than focusing on the CPU
> > hotplug issue.
> 
> Sure, that's a good point.
> 
> The issue I noted, the race which ends up in a deadlock, is only
> possible if a shared lock is used on removal and also on the sysfs
> knobs.
> 
> At first glance, the issue you describe above *seems* to be just the
> proper care driver developers must take with the structures they use.
> It is certainly yet another issue we need to address, and if we can
> generalize a solution, even better. I now recall I *think* I spotted
> that race a while ago and mentioned it to Kees and David Howells, but
> I did not have a solution for it yet. More on this below.
> 
> The issue you point out is real; however, you cannot disregard the
> possible CPU hotplug race either. It is a design consideration that
> the CPU hotplug multistate support warns about -- consider driver
> removal. I agree that perhaps solving this "zram instance race" can
> fix the hotplug race as well. If we can solve all 3 issues in one
> shot, even better. But let's evaluate that prospect...
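> 
> As a concrete reminder of the hotplug angle, zram's compression
> backend registers a CPU hotplug multistate callback pair, roughly
> like this (a sketch based on drivers/block/zram/zcomp.c and
> zram_drv.c, trimmed down):
> 
> /* Module init: register the multistate callbacks once. */
> ret = cpuhp_setup_state_multi(CPUHP_ZCOMP_PREPARE,
> 			      "block/zram/comp:prepare",
> 			      zcomp_cpu_up_prepare, zcomp_cpu_dead);
> 
> /* Per zram instance: hook this instance's node into that state. */
> ret = cpuhp_state_add_instance(CPUHP_ZCOMP_PREPARE, &comp->node);
> 
> /* Module exit: must not run while instances are still registered,
>  * which is exactly the ordering the multistate API warns about. */
> cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);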
> 
> > Maybe we could reuse zram_index_mutex, converting it to a
> > rw_semaphore. What do you think?
> 
> Although ideal, given it would knock 3 birds with 1 stone, it ends up
> actually making the sysfs attributes rather useless in light of the
> requirements for each of the races. Namely, the sysfs deadlock race
> *must* use a try-lock approach, just as in the try_module_get() case
> (a sketch of that pattern appears further below). It must use this
> approach so as to bail out immediately if our module is being removed,
> and so on our __exit path. By trying to repurpose zram_index_mutex we
> end up doing too much with it and making the sysfs attributes rather
> fragile for most uses:
> 
> Consider disksize_show(), which would have to become:
> 
> static ssize_t disksize_show(struct device *dev, struct device_attribute *attr, char *buf)
> {
> 	struct zram *zram = dev_to_zram(dev);
> +	u64 disksize;
> 
> +	down_read(&zram_index_rwlock);
> +	disksize = zram->disksize;
> +	up_read(&zram_index_rwlock);
> -	return scnprintf(buf, PAGE_SIZE, "%llu\n", zram->disksize);
> +	return scnprintf(buf, PAGE_SIZE, "%llu\n", disksize);
> }
> 
> What's wrong with this?
> 
> It can block during a write, yes, but there is a type of write which
> will make this crash after the read lock is acquired: when the instance
> is removed. What if we try down_read_trylock()?
> 
> static ssize_t disksize_show(struct device *dev, struct device_attribute *attr, char *buf)
> {
> 	struct zram *zram = dev_to_zram(dev);
> +	u64 disksize;
> 
> +	if (!down_read_trylock(&zram_index_rwlock))
> +		return -ENODEV;
> +
> +	disksize = zram->disksize;
> +	up_read(&zram_index_rwlock);
> -	return scnprintf(buf, PAGE_SIZE, "%llu\n", zram->disksize);
> +	return scnprintf(buf, PAGE_SIZE, "%llu\n", disksize);
> }
> 
> What's wrong with this?
> 
> If it got the lock, it should be OK, as it prevents writers from
> taking the lock for a bit. But then this just becomes pretty fragile:
> it will fail whenever a writer holds or is waiting for the lock,
> perhaps triggering quite a few test regressions.
> 
> And if we use down_write_trylock() we end up with the same fragile
> nature of failing the read with -ENODEV for any silly thing going on
> with the driver.
> 
> And come to think of it, the last patch I sent, with its new
> DECLARE_RWSEM(zram_unload), has this same issue, making most
> sysfs attributes rather fragile.
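> 
> For completeness, the try-lock shape I keep referring to is the
> try_module_get() guard, something like this (a sketch only;
> zram_do_store() is a placeholder, not real zram code):
> 
> static ssize_t some_store(struct device *dev,
> 		struct device_attribute *attr, const char *buf, size_t len)
> {
> 	ssize_t ret;
> 
> 	/* Bail out immediately if the module is on its way out. */
> 	if (!try_module_get(THIS_MODULE))
> 		return -ENODEV;
> 
> 	ret = zram_do_store(dev, buf, len);
> 
> 	module_put(THIS_MODULE);
> 	return ret;
> }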

Thanks for looking into it. I agree the single zram_index_rwlock is
not the right approach to fix it. However, I still hope we find a more
generic solution to fix them all at once, since I see this as a zram
instance racing problem.

An approach I am considering is to make struct zram embed a kobject
and then have the zram sysfs attributes auto-populated under that
kobject, so the zram/sysfs lifetime is tied to the kobject. With it,
the sysfs race problem I mentioned above should be gone. Furthermore,
zram_remove should fail if any zram object is still alive
(i.e., zram->kobj's refcount > 1), so module_exit will fail, too.
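
Roughly, the shape I have in mind is (a sketch only; the names and the
exact refcount check are placeholders, not a tested implementation):

struct zram {
	struct kobject kobj;	/* owns the zram/sysfs lifetime */
	struct rw_semaphore init_lock;
	...
};

static void zram_kobj_release(struct kobject *kobj)
{
	struct zram *zram = container_of(kobj, struct zram, kobj);

	/* Last reference is gone: no sysfs reader can reach us now. */
	kfree(zram);
}

static struct kobj_type zram_ktype = {
	.release	= zram_kobj_release,
	.sysfs_ops	= &kobj_sysfs_ops,
};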

One problem I see is how to create the new zram object's attribute
group for the zram knobs under /sys/block/zram0, since the block
layer has already created the zram0 kobject via device_add_disk().
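
For reference, today the knobs are just bolted onto the disk's kobject
after device_add_disk(), roughly (as in the current zram_add()):

	ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
				 &zram_disk_attr_group);

so an embedded kobject of our own would naturally end up as a child
directory under /sys/block/zram0 rather than owning the existing
attributes there; that is the part I have not solved yet.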

Thread overview: 44+ messages
2021-03-06  2:20 [PATCH 0/2] zram: fix few ltp zram02.sh crashes Luis Chamberlain
2021-03-06  2:20 ` [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate Luis Chamberlain
2021-03-09  2:55   ` Minchan Kim
2021-03-10 13:11     ` Luis Chamberlain
2021-03-10 21:25       ` Luis Chamberlain
2021-03-12  2:08       ` Minchan Kim
2021-03-10 21:21     ` Luis Chamberlain
2021-03-12  2:14       ` Minchan Kim
2021-03-12 18:32         ` Luis Chamberlain
2021-03-12 19:28           ` Minchan Kim
2021-03-19 19:09             ` Luis Chamberlain
2021-03-22 16:37               ` Minchan Kim
2021-03-22 20:41                 ` Luis Chamberlain
2021-03-22 22:12                   ` Minchan Kim
2021-04-01 23:59                     ` Luis Chamberlain
2021-04-02  7:54                       ` Greg KH
2021-04-02 18:30                         ` Luis Chamberlain
2021-04-03  6:13                           ` Greg KH
     [not found]                             ` <20210406003152.GZ4332@42.do-not-panic.com>
2021-04-06 12:00                               ` Miroslav Benes
2021-04-06 15:54                                 ` Josh Poimboeuf
2021-04-07 14:09                                   ` Peter Zijlstra
2021-04-07 15:30                                     ` Josh Poimboeuf
2021-04-07 16:48                                       ` Peter Zijlstra
2021-04-07 20:17                         ` Josh Poimboeuf
2021-04-08  6:18                           ` Greg KH
2021-04-08 13:16                             ` Steven Rostedt
2021-04-08 13:37                             ` Josh Poimboeuf
2021-04-08  1:37                         ` Thomas Gleixner
2021-04-08  6:16                           ` Greg KH
2021-04-08  8:01                             ` Jiri Kosina
2021-04-08  8:09                               ` Greg KH
2021-04-08  8:35                                 ` Jiri Kosina
2021-04-08  8:55                                   ` Greg KH
2021-04-08 18:40                                     ` Luis Chamberlain
2021-04-09  3:01                                     ` Kees Cook
2021-04-05 17:07                       ` Minchan Kim [this message]
2021-04-05 19:00                         ` Luis Chamberlain
2021-04-05 19:58                           ` Minchan Kim
2021-04-06  0:29                             ` Luis Chamberlain
2021-04-07  1:23                               ` Minchan Kim
2021-04-07  1:38                                 ` Minchan Kim
2021-04-07 14:52                                   ` Luis Chamberlain
2021-04-07 14:50                                 ` Luis Chamberlain
2021-03-06  2:20 ` [PATCH 2/2] zram: fix races of sysfs attribute removal and usage Luis Chamberlain
