From: Minchan Kim <minchan@kernel.org>
To: Luis Chamberlain <mcgrof@kernel.org>
Cc: gregkh@linuxfoundation.org, ngupta@vflare.org,
sergey.senozhatsky.work@gmail.com, axboe@kernel.dk,
mbenes@suse.com, linux-block@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate
Date: Mon, 22 Mar 2021 15:12:01 -0700 [thread overview]
Message-ID: <YFkWMZ0m9nKCT69T@google.com> (raw)
In-Reply-To: <20210322204156.GM4332@42.do-not-panic.com>
On Mon, Mar 22, 2021 at 08:41:56PM +0000, Luis Chamberlain wrote:
> On Mon, Mar 22, 2021 at 09:37:17AM -0700, Minchan Kim wrote:
> > On Fri, Mar 19, 2021 at 07:09:24PM +0000, Luis Chamberlain wrote:
> > > Indeed one issue is a consequence of the other but a bit better
> > > description can be put together for both separately.
> > >
> > > The warning comes up when cpu hotplug detects that the callback
> > > is being removed, but yet "instances" for multistate are left behind,
> > > ie, not removed. CPU hotplug multistate allows us to dedicate a callback
> > > for zram so that it gets called every time a CPU hot plugs or unplugs.
> > > In the zram driver's case we want to trigger the callback per each
> > > struct zcomp, one is used per supported and used supported compression
> >
> > you meant "each one is used per supported compression"?
>
> Yup.
>
> > > algorithm. The zram driver allocates a struct zcomp for an compression
> > > algorithm on a need basis. The default compressor is CONFIG_ZRAM_DEF_COMP
> > > which today is CONFIG_ZRAM_DEF_COMP_LZORLE ("lzo-rle"). Although we may
> > > have different zram devices, zram devices which use the same compression
> > > algorithm share the same struct zcomp. The "multi" in CPU hotplug multstate
> >
> > It allocates each own zcomp, not sharing even though zram devices
> > use the same compression algorithm.
>
> Right.
>
> > > refers to these different struct zcomp instances, each of these will
> > > have the callback routine registered run for it. The kernel's CPU
> > > hotplug multistate keeps a linked list of these different structures
> > > so that it will iterate over them on CPU transitions. By default at
> > > driver initialization we will create just one zram device (num_devices=1)
> > > and a zcomp structure then set for the lzo-rle comrpession algorithm.
> > > At driver removal we first remove each zram device, and so we destroy
> > > the struct zcomp. But since we expose sysfs attributes to create new
> > > devices or reset / initialize existing ones, we can easily end up
> > > re-initializing a struct zcomp before the exit routine of the module
> > > removes the cpu hotplug callback. When this happens the kernel's
> > > CPU hotplug will detect that at least one instance (struct zcomp for
> > > us) exists.
> >
> > It's very helpful to understand. Thanks a lot!S
> >
> > So IIUC, it's fundamentally race between destroy_devices(i.e., module
> > unload) and every sysfs knobs in zram. Isn't it?
>
> I would not call it *every* syfs knob as not all deal with things which
> are related to CPU hotplug multistate, right? Note that using just
> try_module_get() alone (that is the second patch only, does not fix the
> race I am describing above).
It wouldn't be CPU hotplug multistate issue but I'd like to call it
as more "zram instance race" bug.
What happens in this case?
CPU 1 CPU 2
destroy_devices
..
compact_store()
zram = dev_to_zram(dev);
idr_for_each(zram_remove_cb
zram_remove
..
kfree(zram)
down_read(&zram->init_lock);
CPU 1 CPU 2
hot_remove_store
compact_store()
zram = dev_to_zram(dev);
zram_remove
kfree(zram)
down_read(&zram->init_lock);
So, for me we need to close the zram instance create/removal
with sysfs rather than focusing on CPU hotplug issue.
Maybe, we could reuse zram_index_mutex with modifying it with
rw_semaphore. What do you think?
>
> There are two issues.
>
> > Below specific example is one of them and every sysfs code in zram
> > could access freed object(e.g., zram->something).
> > And you are claiming there isn't good way to fix it in kernfs(please
> > add it in the description, too) even though it's general problem.
>
> Correct, the easier route would have been through the block layer as
> its the one adding our syfs attributes for us but even it canno deal
> with this race on behalf of drivers given the currently exposed
> semantics on kernfs.
>
> > (If we had, we may detect the race and make zram_remove_cb fails
> > so unloading modulies fails, finally)
> >
> > So, shouldn't we introcude a global rw_semaphore?
> >
> > destroy_devices
> > class_unregister
> > down_write(&zram_unload);
> > idr_for_each(zram_remove_cb);
> > up_write(&zram_unload);
> >
> > And every sysfs code hold the lock with down_read in the first place
> > and release the lock right before return. So, introduce a zram sysfs
> > wrapper to centralize all of locking logic.
>
> Actually that does work but only if we use write lock attempts so to
> be able to knock two birds with one stone, so to address the deadlock
> with sysfs attribute removal. We're not asking politely to read some
> value off of a zram devices with these locks, we're ensuring to not
> run if our driver is going to be removed, so replacing the
> try_module_get() with something similar.
>
> > Does it make sense?
> >
> > >
> > > And so we can have:
> > >
> > > static void destroy_devices(void)
> > > {
> > > class_unregister(&zram_control_class);
> > > idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> > > zram_debugfs_destroy();
> > > idr_destroy(&zram_index_idr);
> > > unregister_blkdev(zram_major, "zram");
> > > cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> > > }
> > >
> > > While destroy_devices() runs we can have this race:
> > >
> > >
> > > CPU 1 CPU 2
> > >
> > > class_unregister(...);
> > > idr_for_each(...);
> > I think sysfs was already remove here.
> > > zram_debugfs_destroy();
> > > disksize_store(...)
> > > idr_destroy(...);
> > > unregister_blkdev(...);
> > > cpuhp_remove_multi_state(...);
> >
> > So,
> >
> > CPU 1 CPU 2
> >
> > destroy_devices
> > ..
> > disksize_store()
> > zcomp_create
> > zcomp_init
> > idr_for_each(zram_remove_cb
> > zram_remove
> > zram_reset
> > zcomp_destroy
> > cpuhp_state_remove_instance
> > cpuhp_state_add_instance
> > ..
> >
> > cpuhp_remove_multi_state(...)
> > Detect!
> > kernel: Error: Removing state 63 which has instances left.
>
> Yup true.
>
> > > The warning comes up on cpuhp_remove_multi_state() when it sees
> > > that the state for CPUHP_ZCOMP_PREPARE does not have an empty
> > > instance linked list.
> > >
> > > After the idr_destory() its also easy to run into a crash. The
> > > above predicament also leads to memory leaks.
> > >
> > > And so we need to have a state machine for when we're up and when
> > > we're going down generically.
> > >
> > > Let me know if it makes sense now, if so I can amend the commit log
> > > accordingly.
> >
> > Yub, It's much better. Let's have it in the commit log.
>
> Sure OK, well the following is what I end up with. With this approach
> only one patch is needed.
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index a711a2e2a794..3b86ee94309e 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -41,6 +41,15 @@ static DEFINE_IDR(zram_index_idr);
> /* idr index must be protected */
> static DEFINE_MUTEX(zram_index_mutex);
>
> +/*
> + * Protects against:
> + * a) Hotplug cpu multistate races against compression algorithm removals
> + * and additions prior to its removal of the zram CPU hotplug callback
> + * b) Deadlock which is possible when sysfs attributes are used and we
> + * share a lock on removal.
> + */
> +DECLARE_RWSEM(zram_unload);
> +
> static int zram_major;
> static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
>
> @@ -1723,6 +1732,9 @@ static ssize_t disksize_store(struct device *dev,
> if (!disksize)
> return -EINVAL;
>
> + if (!down_write_trylock(&zram_unload))
> + return -ENODEV;
> +
> down_write(&zram->init_lock);
> if (init_done(zram)) {
> pr_info("Cannot change disksize for initialized device\n");
> @@ -1748,6 +1760,7 @@ static ssize_t disksize_store(struct device *dev,
> zram->disksize = disksize;
> set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
> up_write(&zram->init_lock);
> + up_write(&zram_unload);
>
> return len;
>
> @@ -1755,6 +1768,7 @@ static ssize_t disksize_store(struct device *dev,
> zram_meta_free(zram, disksize);
> out_unlock:
> up_write(&zram->init_lock);
> + up_write(&zram_unload);
> return err;
> }
>
> @@ -1773,14 +1787,17 @@ static ssize_t reset_store(struct device *dev,
> if (!do_reset)
> return -EINVAL;
>
> + if (!down_write_trylock(&zram_unload))
> + return -ENODEV;
> +
> zram = dev_to_zram(dev);
> bdev = zram->disk->part0;
>
> mutex_lock(&bdev->bd_mutex);
> /* Do not reset an active device or claimed device */
> if (bdev->bd_openers || zram->claim) {
> - mutex_unlock(&bdev->bd_mutex);
> - return -EBUSY;
> + len = -EBUSY;
> + goto out;
> }
>
> /* From now on, anyone can't open /dev/zram[0-9] */
> @@ -1793,7 +1810,10 @@ static ssize_t reset_store(struct device *dev,
>
> mutex_lock(&bdev->bd_mutex);
> zram->claim = false;
> +
> +out:
> mutex_unlock(&bdev->bd_mutex);
> + up_write(&zram_unload);
>
> return len;
> }
> @@ -2015,10 +2035,15 @@ static ssize_t hot_add_show(struct class *class,
> {
> int ret;
>
> + if (!down_write_trylock(&zram_unload))
> + return -ENODEV;
> +
> mutex_lock(&zram_index_mutex);
> ret = zram_add();
> mutex_unlock(&zram_index_mutex);
>
> + up_write(&zram_unload);
> +
> if (ret < 0)
> return ret;
> return scnprintf(buf, PAGE_SIZE, "%d\n", ret);
> @@ -2041,6 +2066,9 @@ static ssize_t hot_remove_store(struct class *class,
> if (dev_id < 0)
> return -EINVAL;
>
> + if (!down_write_trylock(&zram_unload))
> + return -ENODEV;
> +
> mutex_lock(&zram_index_mutex);
>
> zram = idr_find(&zram_index_idr, dev_id);
> @@ -2053,6 +2081,7 @@ static ssize_t hot_remove_store(struct class *class,
> }
>
> mutex_unlock(&zram_index_mutex);
> + up_write(&zram_unload);
> return ret ? ret : count;
> }
> static CLASS_ATTR_WO(hot_remove);
> @@ -2078,12 +2107,14 @@ static int zram_remove_cb(int id, void *ptr, void *data)
>
> static void destroy_devices(void)
> {
> + down_write(&zram_unload);
> class_unregister(&zram_control_class);
> idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> zram_debugfs_destroy();
> idr_destroy(&zram_index_idr);
> unregister_blkdev(zram_major, "zram");
> cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> + up_write(&zram_unload);
> }
>
> static int __init zram_init(void)
> @@ -2095,10 +2126,13 @@ static int __init zram_init(void)
> if (ret < 0)
> return ret;
>
> + down_write(&zram_unload);
> +
> ret = class_register(&zram_control_class);
> if (ret) {
> pr_err("Unable to register zram-control class\n");
> cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> + up_write(&zram_unload);
> return ret;
> }
>
> @@ -2108,6 +2142,7 @@ static int __init zram_init(void)
> pr_err("Unable to get major number\n");
> class_unregister(&zram_control_class);
> cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> + up_write(&zram_unload);
> return -EBUSY;
> }
>
> @@ -2119,10 +2154,12 @@ static int __init zram_init(void)
> goto out_error;
> num_devices--;
> }
> + up_write(&zram_unload);
>
> return 0;
>
> out_error:
> + up_write(&zram_unload);
> destroy_devices();
> return ret;
> }
As I mentioned above, it didn't close the race I gave as examples
unless I miss something. Let's discuss further.
Thank you!
next prev parent reply other threads:[~2021-03-22 22:12 UTC|newest]
Thread overview: 44+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-03-06 2:20 [PATCH 0/2] zram: fix few ltp zram02.sh crashes Luis Chamberlain
2021-03-06 2:20 ` [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate Luis Chamberlain
2021-03-09 2:55 ` Minchan Kim
2021-03-10 13:11 ` Luis Chamberlain
2021-03-10 21:25 ` Luis Chamberlain
2021-03-12 2:08 ` Minchan Kim
2021-03-10 21:21 ` Luis Chamberlain
2021-03-12 2:14 ` Minchan Kim
2021-03-12 18:32 ` Luis Chamberlain
2021-03-12 19:28 ` Minchan Kim
2021-03-19 19:09 ` Luis Chamberlain
2021-03-22 16:37 ` Minchan Kim
2021-03-22 20:41 ` Luis Chamberlain
2021-03-22 22:12 ` Minchan Kim [this message]
2021-04-01 23:59 ` Luis Chamberlain
2021-04-02 7:54 ` Greg KH
2021-04-02 18:30 ` Luis Chamberlain
2021-04-03 6:13 ` Greg KH
[not found] ` <20210406003152.GZ4332@42.do-not-panic.com>
2021-04-06 12:00 ` Miroslav Benes
2021-04-06 15:54 ` Josh Poimboeuf
2021-04-07 14:09 ` Peter Zijlstra
2021-04-07 15:30 ` Josh Poimboeuf
2021-04-07 16:48 ` Peter Zijlstra
2021-04-07 20:17 ` Josh Poimboeuf
2021-04-08 6:18 ` Greg KH
2021-04-08 13:16 ` Steven Rostedt
2021-04-08 13:37 ` Josh Poimboeuf
2021-04-08 1:37 ` Thomas Gleixner
2021-04-08 6:16 ` Greg KH
2021-04-08 8:01 ` Jiri Kosina
2021-04-08 8:09 ` Greg KH
2021-04-08 8:35 ` Jiri Kosina
2021-04-08 8:55 ` Greg KH
2021-04-08 18:40 ` Luis Chamberlain
2021-04-09 3:01 ` Kees Cook
2021-04-05 17:07 ` Minchan Kim
2021-04-05 19:00 ` Luis Chamberlain
2021-04-05 19:58 ` Minchan Kim
2021-04-06 0:29 ` Luis Chamberlain
2021-04-07 1:23 ` Minchan Kim
2021-04-07 1:38 ` Minchan Kim
2021-04-07 14:52 ` Luis Chamberlain
2021-04-07 14:50 ` Luis Chamberlain
2021-03-06 2:20 ` [PATCH 2/2] zram: fix races of sysfs attribute removal and usage Luis Chamberlain
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YFkWMZ0m9nKCT69T@google.com \
--to=minchan@kernel.org \
--cc=axboe@kernel.dk \
--cc=gregkh@linuxfoundation.org \
--cc=linux-block@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mbenes@suse.com \
--cc=mcgrof@kernel.org \
--cc=ngupta@vflare.org \
--cc=sergey.senozhatsky.work@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).