From: Minchan Kim <minchan@kernel.org>
To: Luis Chamberlain <mcgrof@kernel.org>
Cc: keescook@chromium.org, dhowells@redhat.com, hch@infradead.org,
mbenes@suse.com, gregkh@linuxfoundation.org, ngupta@vflare.org,
sergey.senozhatsky.work@gmail.com, axboe@kernel.dk,
linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate
Date: Tue, 6 Apr 2021 18:23:46 -0700 [thread overview]
Message-ID: <YG0JouWqrJPHbpqz@google.com> (raw)
In-Reply-To: <20210406002909.GY4332@42.do-not-panic.com>
On Tue, Apr 06, 2021 at 12:29:09AM +0000, Luis Chamberlain wrote:
> On Mon, Apr 05, 2021 at 12:58:05PM -0700, Minchan Kim wrote:
> > On Mon, Apr 05, 2021 at 07:00:23PM +0000, Luis Chamberlain wrote:
> > > On Mon, Apr 05, 2021 at 10:07:24AM -0700, Minchan Kim wrote:
> > > > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote:
> > > > > And come to think of it the last patch I had sent with a new
> > > > > DECLARE_RWSEM(zram_unload) also has this same issue making most
> > > > > sysfs attributes rather fragile.
> > > >
> > > > Thanks for looking into this. I agree the single zram_index_rwlock is
> > > > not the right approach to fix it. However, I still hope we can find a
> > > > more generic solution to fix them all at once, since I see it as a zram
> > > > instance racing problem.
> > >
> > > They are 3 separate different problems. Related, but different.
> >
> > What are 3 different problems? I am asking since I remember only two:
> > one for CPU multistate and the other one for sysfs during rmmod.
>
> The third one is the race to use sysfs attributes while those routines
> then dereference the gendisk private_data.
First of all, thanks for keeping the discussion going, Luis.
That is the one I thought of as the race between sysfs and rmmod.
>
> > > If the idea then is to busy out rmmod if a sysfs attribute is being
> > > read, that could then mean rmmod can sometimes never complete. Hogging
> > > up / busying out sysfs attributes means the module cannot be removed.
> >
> > It's true, but is it a big problem? There are many cases where the
> > system just returns an error if it's busy and relies on the admin.
> > IMHO, rmmod should be one of them.
>
> It depends on the existing userspace scripts which are used to test, and
> the expectations they set. You would know the existing tests better, and
> since you are the maintainer, you decide.
>
> I at least know that for many other types of device drivers, rmmod is
> a sledgehammer.
>
> You decide. I just thought it would be good to highlight the effect now
> rather than us considering it later.
To me, the rmmod failure is not a big problem for zram, since returning
-EBUSY is a common pattern in the system. (Having said that, I agree it
would be best if we could avoid the fail-and-retry; IOW, -EBUSY should be
a last resort unless we have a nicer way.)
>
> > > Which is why the *try_module_get()* I think is much more suitable, as
> > > it will always fail if we're already going down.
> >
> > How does try_module_get() solve the problem?
>
> The try stuff only resolves the deadlock. The bdget() / bdput() resolves
> the race to access the gendisk private_data.
That's the one I missed in this discussion. Now I am reading your [2/2]
in the original patch. I thought it was just that the zram instance was
destroyed by the sysfs race, and that was why you had seen the deadlock.
I might be missing the point here, too.
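For readers following along, the bdget()/bdput() part of the idea can be
illustrated with a userspace toy model. This is a sketch only, not kernel
code: every name below (toy_disk, toy_bdget, and so on) is invented, and
the real chain of calls in the kernel is the del_gendisk() path quoted
later in this mail. The property it models is that once teardown has run,
a "get" fails, so a sysfs-style handler can never dereference stale
private_data:

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace toy model (NOT kernel code; every name here is invented)
 * of the bdget()/bdput() idea. */

struct toy_disk {
	int refcount;        /* stands in for the inode refcount   */
	bool dead;           /* set by the del_gendisk() analogue  */
	void *private_data;  /* stands in for gendisk private_data */
};

/* Analogue of bdget(): refuse to hand out a reference after teardown. */
static bool toy_bdget(struct toy_disk *d)
{
	if (d->dead)
		return false;
	d->refcount++;
	return true;
}

/* Analogue of bdput(): drop the reference again. */
static void toy_bdput(struct toy_disk *d)
{
	d->refcount--;
}

/* Analogue of del_gendisk() -> invalidate_inodes(): after this point
 * private_data is gone and toy_bdget() must fail. */
static void toy_del_gendisk(struct toy_disk *d)
{
	d->dead = true;
	d->private_data = NULL;
}

/* A sysfs-show-style reader: only touches private_data under a ref,
 * and bails out cleanly if it raced with teardown. */
static void *toy_show(struct toy_disk *d)
{
	void *p;

	if (!toy_bdget(d))
		return NULL;  /* raced with teardown: no deref, no crash */
	p = d->private_data;
	toy_bdput(d);
	return p;
}
```

Before toy_del_gendisk() runs, toy_show() returns the live private_data;
afterwards it returns NULL instead of dereferencing freed memory, which is
the property the real bdget() gives the sysfs handlers.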
Hmm, we are discussing several problems all at once. I feel it's time
to jump to a v2 with your approach at this point. You said there are
three different problems. As I asked, please write them down in more
detail with the code sequences, as we did in the other thread. If you
mean a deadlock, please spell out which specific locks deadlock.
That would make the discussion much easier. Let's discuss the issues
one by one, each in its own thread.
>
> > > > I see one of the problems is how I could make a new zram object's
> > > > attribute group for the zram knobs under /sys/block/zram0, since the
> > > > block layer already made the zram0 kobject via device_add_disk().
> > >
> > > Right.. well the sysfs attribute races uncovered here actually do
> > > apply to any block driver as well, which is why I was aiming
> > > for something generic if possible.
> >
> > It would be great, but that's not what we have atm, so I want to
> > proceed with a fix anyway.
>
> What is not the one we have atm? I *do* have a proposed generic solution
> for 2 of the 3 issues we have been discussing:
>
> a) deadlock on sysfs access
This is the one I didn't understand.
> b) gendisk private_data race
Yup.
>
> But so far Greg does not see enough justification for a), so we can either
> show how widespread this issue is (which I can do using coccinelle), or we
> just open code the try_module_get() / put() in each driver that needs it
> for now. Either way it would resolve the issue.
If it's a general problem for drivers, I agree it's worth addressing in
the core, unless the driver itself introduced the mess. I have no
opinion here yet, since I haven't understood the problem.
>
> As for b), given that I think even you had missed my attempt to
> generalize the bdget()/bdput() solution for any attribute type (did you
> see my dev_type_get() and dev_type_put() proposed changes?), I don't think
> this problem is yet well defined in a generic way for us to rule it out.
> It is, however, easier to simply open code this per driver that needs it
> for now, given that I don't think Greg is yet convinced the deadlock is
> a widespread issue. I am, however, pretty sure both races *do*
> exist outside of zram in many places.
That would be a good reason to propose a general solution.
>
> > > I am not sure if you missed the last hunks of the generic solution,
> > > but that would resolve the issue you noted. Here is the same approach
> > > but in a non-generic solution, specific to just one attribute so far
> > > and to zram:
> >
> > So the idea is to refcount the block_device's inode,
>
> Yes, that itself prevents races where the gendisk private_data becomes
> invalid. Why? Because a bdget() will not succeed after
> del_gendisk():
>
> del_gendisk() --> invalidate_partition() --> __invalidate_device() --> invalidate_inodes()
>
> > and the module_exit path
> > also checks the inode refcount to make rmmod fail?
>
> The try_module_get() approach resolves the deadlock race, but it does
> so in a lazy way. I mean lazy in that rmmod then wins over the sysfs knobs,
> so touching sysfs knobs won't make an rmmod fail. I think that's the more
> typically expected behaviour. Why? Because I find it odd that looping
> forever touching sysfs attributes should prevent a module removal. But
> that's a personal preference.
I agree with you that would be better, but let's see how cleanly it
can be done.
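The "lazy" semantics Luis describes (rmmod wins; a late sysfs access just
fails instead of pinning the module) can also be sketched as a userspace
toy model. Again, this is not kernel code: all names are invented, and -19
merely stands in for -ENODEV:

```c
#include <stdbool.h>

/* Userspace toy model (NOT kernel code; names invented) of the
 * try_module_get() semantics: once module exit has begun, further
 * "gets" fail, so a sysfs handler errors out rather than blocking
 * rmmod -- i.e. rmmod wins. */

struct toy_module {
	int refcount;  /* users currently inside a sysfs handler    */
	bool going;    /* set once the rmmod analogue has started   */
};

/* Analogue of try_module_get(): fails if the module is going away. */
static bool toy_try_module_get(struct toy_module *m)
{
	if (m->going)
		return false;
	m->refcount++;
	return true;
}

/* Analogue of module_put(). */
static void toy_module_put(struct toy_module *m)
{
	m->refcount--;
}

/* Analogue of the rmmod path: mark the module as going away. New
 * handler invocations fail from now on; in-flight ones drain out. */
static void toy_rmmod_begin(struct toy_module *m)
{
	m->going = true;
}

/* A sysfs-store-style handler; -19 plays the role of -ENODEV. */
static int toy_store(struct toy_module *m)
{
	if (!toy_try_module_get(m))
		return -19;  /* module going away: fail, don't block rmmod */
	/* ... the real attribute work would happen here ... */
	toy_module_put(m);
	return 0;
}
```

A loop hammering the attribute can never keep the refcount pinned across
toy_rmmod_begin(): the first store after it returns the error and removal
proceeds, which is exactly the behaviour being argued for above.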
FYI, please look at hot_remove_store(), which can also remove a zram
instance on demand. I am looking forward to seeing your v2.
Thanks for your patience, Luis.