Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate

From: Minchan Kim <minchan@kernel.org>
To: Luis Chamberlain <mcgrof@kernel.org>
Cc: gregkh@linuxfoundation.org, ngupta@vflare.org,
	sergey.senozhatsky.work@gmail.com, axboe@kernel.dk,
	mbenes@suse.com, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate
Date: Mon, 22 Mar 2021 15:12:01 -0700	[thread overview]
Message-ID: <YFkWMZ0m9nKCT69T@google.com> (raw)
In-Reply-To: <20210322204156.GM4332@42.do-not-panic.com>

On Mon, Mar 22, 2021 at 08:41:56PM +0000, Luis Chamberlain wrote:
> On Mon, Mar 22, 2021 at 09:37:17AM -0700, Minchan Kim wrote:
> > On Fri, Mar 19, 2021 at 07:09:24PM +0000, Luis Chamberlain wrote:
> > > Indeed one issue is a consequence of the other but a bit better
> > > description can be put together for both separately.
> > > 
> > > The warning comes up when cpu hotplug detects that the callback
> > > is being removed, but yet "instances" for multistate are left behind,
> > > ie, not removed. CPU hotplug multistate allows us to dedicate a callback
> > > for zram so that it gets called every time a CPU hot plugs or unplugs.
> > > In the zram driver's case we want to trigger the callback per each
> > > struct zcomp, one is used per supported and used supported compression
> > 
> >      you meant "each one is used per supported compression"?
> 
> Yup.
> 
> > > algorithm. The zram driver allocates a struct zcomp for an compression
> > > algorithm on a need basis. The default compressor is CONFIG_ZRAM_DEF_COMP
> > > which today is CONFIG_ZRAM_DEF_COMP_LZORLE ("lzo-rle"). Although we may
> > > have different zram devices, zram devices which use the same compression
> > > algorithm share the same struct zcomp. The "multi" in CPU hotplug multstate
> > 
> > It allocates each own zcomp, not sharing even though zram devices
> > use the same compression algorithm.
> 
> Right.
> 
> > > refers to these different struct zcomp instances, each of these will
> > > have the callback routine registered run for it. The kernel's CPU
> > > hotplug multistate keeps a linked list of these different structures
> > > so that it will iterate over them on CPU transitions. By default at
> > > driver initialization we will create just one zram device (num_devices=1)
> > > and a zcomp structure then set for the lzo-rle comrpession algorithm.
> > > At driver removal we first remove each zram device, and so we destroy
> > > the struct zcomp. But since we expose sysfs attributes to create new
> > > devices or reset / initialize existing ones, we can easily end up
> > > re-initializing a struct zcomp before the exit routine of the module
> > > removes the cpu hotplug callback. When this happens the kernel's
> > > CPU hotplug will detect that at least one instance (struct zcomp for
> > > us) exists.
> > 
> > It's very helpful to understand. Thanks a lot!S
> > 
> > So IIUC, it's fundamentally race between destroy_devices(i.e., module
> > unload) and every sysfs knobs in zram. Isn't it?
> 
> I would not call it *every* syfs knob as not all deal with things which
> are related to CPU hotplug multistate, right? Note that using just
> try_module_get() alone (that is the second patch only, does not fix the
> race I am describing above).

It wouldn't be CPU hotplug multistate issue but I'd like to call it
as more "zram instance race" bug.
What happens in this case?

        CPU 1                            CPU 2

destroy_devices
..
                                 compact_store()
                                 zram = dev_to_zram(dev);
idr_for_each(zram_remove_cb
  zram_remove
  ..
  kfree(zram)
                                 down_read(&zram->init_lock);

        CPU 1                            CPU 2

hot_remove_store
                                 compact_store()
                                 zram = dev_to_zram(dev);
  zram_remove
    kfree(zram)
                                 down_read(&zram->init_lock);

So, for me we need to close the zram instance create/removal
with sysfs rather than focusing on CPU hotplug issue.
Maybe, we could reuse zram_index_mutex with modifying it with
rw_semaphore. What do you think?

> 
> There are two issues.
> 
> > Below specific example is one of them and every sysfs code in zram
> > could access freed object(e.g., zram->something).
> > And you are claiming there isn't good way to fix it in kernfs(please
> > add it in the description, too) even though it's general problem.
> 
> Correct, the easier route would have been through the block layer as
> its the one adding our syfs attributes for us but even it canno deal
> with this race on behalf of drivers given the currently exposed
> semantics on kernfs.
> 
> > (If we had, we may detect the race and make zram_remove_cb fails
> > so unloading modulies fails, finally)
> > 
> > So, shouldn't we introcude a global rw_semaphore?
> > 
> > destroy_devices
> >   class_unregister
> >   down_write(&zram_unload);
> >   idr_for_each(zram_remove_cb);
> >   up_write(&zram_unload);
> > 
> > And every sysfs code hold the lock with down_read in the first place
> > and release the lock right before return. So, introduce a zram sysfs
> > wrapper to centralize all of locking logic.
> 
> Actually that does work but only if we use write lock attempts so to
> be able to knock two birds with one stone, so to address the deadlock
> with sysfs attribute removal. We're not asking politely to read some
> value off of a zram devices with these locks, we're ensuring to not
> run if our driver is going to be removed, so replacing the
> try_module_get() with something similar.
> 
> > Does it make sense?
> > 
> > > 
> > > And so we can have:
> > > 
> > > static void destroy_devices(void)
> > > {
> > > 	class_unregister(&zram_control_class);
> > > 	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> > > 	zram_debugfs_destroy();
> > > 	idr_destroy(&zram_index_idr);
> > > 	unregister_blkdev(zram_major, "zram");
> > > 	cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> > > }
> > > 
> > > While destroy_devices() runs we can have this race:
> > > 
> > > 
> > > CPU 1                            CPU 2
> > > 
> > > class_unregister(...);
> > > idr_for_each(...);
> > 				  I think sysfs was already remove here.
> > > zram_debugfs_destroy();
> > >                                 disksize_store(...)
> > > idr_destroy(...);
> > > unregister_blkdev(...);
> > > cpuhp_remove_multi_state(...);
> > 
> > So,
> > 
> > 	CPU 1                            CPU 2
> >  
> > destroy_devices
> > ..
> > 				 disksize_store()
> > 				   zcomp_create
> > 				   zcomp_init
> > idr_for_each(zram_remove_cb
> >   zram_remove
> >   zram_reset
> >   zcomp_destroy
> >   cpuhp_state_remove_instance
> > 				   cpuhp_state_add_instance
> > ..
> > 
> > cpuhp_remove_multi_state(...)
> >   Detect!
> >     kernel: Error: Removing state 63 which has instances left.
> 
> Yup true.
> 
> > > The warning comes up on cpuhp_remove_multi_state() when it sees
> > > that the state for CPUHP_ZCOMP_PREPARE does not have an empty
> > > instance linked list.
> > > 
> > > After the idr_destory() its also easy to run into a crash. The
> > > above predicament also leads to memory leaks.
> > > 
> > > And so we need to have a state machine for when we're up and when
> > > we're going down generically.
> > > 
> > > Let me know if it makes sense now, if so I can amend the commit log
> > > accordingly.
> > 
> > Yub, It's much better. Let's have it in the commit log.
> 
> Sure OK, well the following is what I end up with. With this approach
> only one patch is needed.
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index a711a2e2a794..3b86ee94309e 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -41,6 +41,15 @@ static DEFINE_IDR(zram_index_idr);
>  /* idr index must be protected */
>  static DEFINE_MUTEX(zram_index_mutex);
>  
> +/*
> + * Protects against:
> + * a) Hotplug cpu multistate races against compression algorithm removals
> + *    and additions prior to its removal of the zram CPU hotplug callback
> + * b) Deadlock which is possible when sysfs attributes are used and we
> + *    share a lock on removal.
> + */
> +DECLARE_RWSEM(zram_unload);
> +
>  static int zram_major;
>  static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
>  
> @@ -1723,6 +1732,9 @@ static ssize_t disksize_store(struct device *dev,
>  	if (!disksize)
>  		return -EINVAL;
>  
> +	if (!down_write_trylock(&zram_unload))
> +		return -ENODEV;
> +
>  	down_write(&zram->init_lock);
>  	if (init_done(zram)) {
>  		pr_info("Cannot change disksize for initialized device\n");
> @@ -1748,6 +1760,7 @@ static ssize_t disksize_store(struct device *dev,
>  	zram->disksize = disksize;
>  	set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
>  	up_write(&zram->init_lock);
> +	up_write(&zram_unload);
>  
>  	return len;
>  
> @@ -1755,6 +1768,7 @@ static ssize_t disksize_store(struct device *dev,
>  	zram_meta_free(zram, disksize);
>  out_unlock:
>  	up_write(&zram->init_lock);
> +	up_write(&zram_unload);
>  	return err;
>  }
>  
> @@ -1773,14 +1787,17 @@ static ssize_t reset_store(struct device *dev,
>  	if (!do_reset)
>  		return -EINVAL;
>  
> +	if (!down_write_trylock(&zram_unload))
> +		return -ENODEV;
> +
>  	zram = dev_to_zram(dev);
>  	bdev = zram->disk->part0;
>  
>  	mutex_lock(&bdev->bd_mutex);
>  	/* Do not reset an active device or claimed device */
>  	if (bdev->bd_openers || zram->claim) {
> -		mutex_unlock(&bdev->bd_mutex);
> -		return -EBUSY;
> +		len = -EBUSY;
> +		goto out;
>  	}
>  
>  	/* From now on, anyone can't open /dev/zram[0-9] */
> @@ -1793,7 +1810,10 @@ static ssize_t reset_store(struct device *dev,
>  
>  	mutex_lock(&bdev->bd_mutex);
>  	zram->claim = false;
> +
> +out:
>  	mutex_unlock(&bdev->bd_mutex);
> +	up_write(&zram_unload);
>  
>  	return len;
>  }
> @@ -2015,10 +2035,15 @@ static ssize_t hot_add_show(struct class *class,
>  {
>  	int ret;
>  
> +	if (!down_write_trylock(&zram_unload))
> +		return -ENODEV;
> +
>  	mutex_lock(&zram_index_mutex);
>  	ret = zram_add();
>  	mutex_unlock(&zram_index_mutex);
>  
> +	up_write(&zram_unload);
> +
>  	if (ret < 0)
>  		return ret;
>  	return scnprintf(buf, PAGE_SIZE, "%d\n", ret);
> @@ -2041,6 +2066,9 @@ static ssize_t hot_remove_store(struct class *class,
>  	if (dev_id < 0)
>  		return -EINVAL;
>  
> +	if (!down_write_trylock(&zram_unload))
> +		return -ENODEV;
> +
>  	mutex_lock(&zram_index_mutex);
>  
>  	zram = idr_find(&zram_index_idr, dev_id);
> @@ -2053,6 +2081,7 @@ static ssize_t hot_remove_store(struct class *class,
>  	}
>  
>  	mutex_unlock(&zram_index_mutex);
> +	up_write(&zram_unload);
>  	return ret ? ret : count;
>  }
>  static CLASS_ATTR_WO(hot_remove);
> @@ -2078,12 +2107,14 @@ static int zram_remove_cb(int id, void *ptr, void *data)
>  
>  static void destroy_devices(void)
>  {
> +	down_write(&zram_unload);
>  	class_unregister(&zram_control_class);
>  	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
>  	zram_debugfs_destroy();
>  	idr_destroy(&zram_index_idr);
>  	unregister_blkdev(zram_major, "zram");
>  	cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> +	up_write(&zram_unload);
>  }
>  
>  static int __init zram_init(void)
> @@ -2095,10 +2126,13 @@ static int __init zram_init(void)
>  	if (ret < 0)
>  		return ret;
>  
> +	down_write(&zram_unload);
> +
>  	ret = class_register(&zram_control_class);
>  	if (ret) {
>  		pr_err("Unable to register zram-control class\n");
>  		cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> +		up_write(&zram_unload);
>  		return ret;
>  	}
>  
> @@ -2108,6 +2142,7 @@ static int __init zram_init(void)
>  		pr_err("Unable to get major number\n");
>  		class_unregister(&zram_control_class);
>  		cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> +		up_write(&zram_unload);
>  		return -EBUSY;
>  	}
>  
> @@ -2119,10 +2154,12 @@ static int __init zram_init(void)
>  			goto out_error;
>  		num_devices--;
>  	}
> +	up_write(&zram_unload);
>  
>  	return 0;
>  
>  out_error:
> +	up_write(&zram_unload);
>  	destroy_devices();
>  	return ret;
>  }

As I mentioned above, it didn't close the race I gave as examples
unless I miss something. Let's discuss further.

Thank you!