It's possible today to have a device attribute read or store race against device removal. This is known to happen as follows: write system call --> ksys_write () --> vfs_write() --> __vfs_write() --> kernfs_fop_write_iter() --> sysfs_kf_write() --> dev_attr_store() --> null reference This happens because the dev_attr->store() callback can be removed prior to its call, after dev_attr_store() was initiated. The null dereference is possible because the sysfs ops can be removed on module removal, for instance, when device_del() is called, and a sysfs read / store is not doing any kobject reference bumps either. This allows a read/store call to initiate, a device_del() to kick off, and then the read/store call can be gone by the time to execute it. The sysfs filesystem is not doing any kobject reference bumps during a read / store ops to prevent this. To fix this in a simplified way, just bump the kobject reference when we create a directory and remove it on directory removal. The big unfortunate eye-sore is addressing the manual kobject reference assumption on the networking code, which leads me to believe we should end up replacing that eventually with another sort of check. Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> --- This v4 moves to fixing the race condition on dev_attr_store() and dev_attr_read() to sysfs by bumping the kobject reference count on directory creation / deletion as suggested by Greg. Unfortunately at least the networking core has a manual refcount assumption, which needs to be adjusted to account for this change. This should also mean there is runtime for other kobjects which may not be explored yet which may need fixing as well. We may want to change the check to something else on the networking front, but its not clear to me yet what to use. fs/sysfs/dir.c | 3 +++ net/core/dev.c | 4 ++-- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c index 59dffd5ca517..6c47aa4af6f5 100644 --- a/fs/sysfs/dir.c +++ b/fs/sysfs/dir.c @@ -56,12 +56,14 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns) kobject_get_ownership(kobj, &uid, &gid); + kobject_get(kobj); kn = kernfs_create_dir_ns(parent, kobject_name(kobj), S_IRWXU | S_IRUGO | S_IXUGO, uid, gid, kobj, ns); if (IS_ERR(kn)) { if (PTR_ERR(kn) == -EEXIST) sysfs_warn_dup(parent, kobject_name(kobj)); + kobject_put(kobj); return PTR_ERR(kn); } @@ -100,6 +102,7 @@ void sysfs_remove_dir(struct kobject *kobj) if (kn) { WARN_ON_ONCE(kernfs_type(kn) != KERNFS_DIR); kernfs_remove(kn); + kobject_put(kobj); } } diff --git a/net/core/dev.c b/net/core/dev.c index 222b1d322c96..3a0ffa603d14 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -10429,7 +10429,7 @@ static void netdev_wait_allrefs(struct net_device *dev) rebroadcast_time = warning_time = jiffies; refcnt = netdev_refcnt_read(dev); - while (refcnt != 1) { + while (refcnt != 3) { if (time_after(jiffies, rebroadcast_time + 1 * HZ)) { rtnl_lock(); @@ -10544,7 +10544,7 @@ void netdev_run_todo(void) netdev_wait_allrefs(dev); /* paranoia */ - BUG_ON(netdev_refcnt_read(dev) != 1); + BUG_ON(netdev_refcnt_read(dev) != 3); BUG_ON(!list_empty(&dev->ptype_all)); BUG_ON(!list_empty(&dev->ptype_specific)); WARN_ON(rcu_access_pointer(dev->ip_ptr)); -- 2.27.0
On Wed, Jun 23, 2021 at 02:50:07PM -0700, Luis Chamberlain wrote: > It's possible today to have a device attribute read or store > race against device removal. This is known to happen as follows: > > write system call --> > ksys_write () --> > vfs_write() --> > __vfs_write() --> > kernfs_fop_write_iter() --> > sysfs_kf_write() --> > dev_attr_store() --> > null reference > > This happens because the dev_attr->store() callback can be > removed prior to its call, after dev_attr_store() was initiated. > The null dereference is possible because the sysfs ops can be > removed on module removal, for instance, when device_del() is > called, and a sysfs read / store is not doing any kobject reference > bumps either. This allows a read/store call to initiate, a > device_del() to kick off, and then the read/store call can be > gone by the time to execute it. > > The sysfs filesystem is not doing any kobject reference bumps during a > read / store ops to prevent this. > > To fix this in a simplified way, just bump the kobject reference when > we create a directory and remove it on directory removal. > > The big unfortunate eye-sore is addressing the manual kobject reference > assumption on the networking code, which leads me to believe we should > end up replacing that eventually with another sort of check. > > Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > --- > > This v4 moves to fixing the race condition on dev_attr_store() and > dev_attr_read() to sysfs by bumping the kobject reference count > on directory creation / deletion as suggested by Greg. > > Unfortunately at least the networking core has a manual refcount > assumption, which needs to be adjusted to account for this change. > This should also mean there is runtime for other kobjects which may > not be explored yet which may need fixing as well. We may want to > change the check to something else on the networking front, but its > not clear to me yet what to use. > > fs/sysfs/dir.c | 3 +++ > net/core/dev.c | 4 ++-- > 2 files changed, 5 insertions(+), 2 deletions(-) > > diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c > index 59dffd5ca517..6c47aa4af6f5 100644 > --- a/fs/sysfs/dir.c > +++ b/fs/sysfs/dir.c > @@ -56,12 +56,14 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns) > > kobject_get_ownership(kobj, &uid, &gid); > > + kobject_get(kobj); > kn = kernfs_create_dir_ns(parent, kobject_name(kobj), > S_IRWXU | S_IRUGO | S_IXUGO, uid, gid, > kobj, ns); > if (IS_ERR(kn)) { > if (PTR_ERR(kn) == -EEXIST) > sysfs_warn_dup(parent, kobject_name(kobj)); > + kobject_put(kobj); > return PTR_ERR(kn); > } > > @@ -100,6 +102,7 @@ void sysfs_remove_dir(struct kobject *kobj) > if (kn) { > WARN_ON_ONCE(kernfs_type(kn) != KERNFS_DIR); > kernfs_remove(kn); > + kobject_put(kobj); > } > } Shouldn't this be taken on open() not sysfs creation, or is the problem that the kobject is held the module memory rather than duplicated by sysfs? > diff --git a/net/core/dev.c b/net/core/dev.c > index 222b1d322c96..3a0ffa603d14 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -10429,7 +10429,7 @@ static void netdev_wait_allrefs(struct net_device *dev) > rebroadcast_time = warning_time = jiffies; > refcnt = netdev_refcnt_read(dev); > > - while (refcnt != 1) { > + while (refcnt != 3) { > if (time_after(jiffies, rebroadcast_time + 1 * HZ)) { > rtnl_lock(); > > @@ -10544,7 +10544,7 @@ void netdev_run_todo(void) > netdev_wait_allrefs(dev); > > /* paranoia */ > - BUG_ON(netdev_refcnt_read(dev) != 1); > + BUG_ON(netdev_refcnt_read(dev) != 3); And surely there are things besides netdevs that would suffer from this change? > BUG_ON(!list_empty(&dev->ptype_all)); > BUG_ON(!list_empty(&dev->ptype_specific)); > WARN_ON(rcu_access_pointer(dev->ip_ptr)); > -- > 2.27.0 > -- Kees Cook
On Wed, Jun 23, 2021 at 03:59:37PM -0700, Kees Cook wrote: > On Wed, Jun 23, 2021 at 02:50:07PM -0700, Luis Chamberlain wrote: > > It's possible today to have a device attribute read or store > > race against device removal. This is known to happen as follows: > > > > write system call --> > > ksys_write () --> > > vfs_write() --> > > __vfs_write() --> > > kernfs_fop_write_iter() --> > > sysfs_kf_write() --> > > dev_attr_store() --> > > null reference > > > > This happens because the dev_attr->store() callback can be > > removed prior to its call, after dev_attr_store() was initiated. > > The null dereference is possible because the sysfs ops can be > > removed on module removal, for instance, when device_del() is > > called, and a sysfs read / store is not doing any kobject reference > > bumps either. This allows a read/store call to initiate, a > > device_del() to kick off, and then the read/store call can be > > gone by the time to execute it. > > > > The sysfs filesystem is not doing any kobject reference bumps during a > > read / store ops to prevent this. > > > > To fix this in a simplified way, just bump the kobject reference when > > we create a directory and remove it on directory removal. > > > > The big unfortunate eye-sore is addressing the manual kobject reference > > assumption on the networking code, which leads me to believe we should > > end up replacing that eventually with another sort of check. > > > > Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > > --- > > > > This v4 moves to fixing the race condition on dev_attr_store() and > > dev_attr_read() to sysfs by bumping the kobject reference count > > on directory creation / deletion as suggested by Greg. > > > > Unfortunately at least the networking core has a manual refcount > > assumption, which needs to be adjusted to account for this change. > > This should also mean there is runtime for other kobjects which may > > not be explored yet which may need fixing as well. We may want to > > change the check to something else on the networking front, but its > > not clear to me yet what to use. > > > > fs/sysfs/dir.c | 3 +++ > > net/core/dev.c | 4 ++-- > > 2 files changed, 5 insertions(+), 2 deletions(-) > > > > diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c > > index 59dffd5ca517..6c47aa4af6f5 100644 > > --- a/fs/sysfs/dir.c > > +++ b/fs/sysfs/dir.c > > @@ -56,12 +56,14 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns) > > > > kobject_get_ownership(kobj, &uid, &gid); > > > > + kobject_get(kobj); > > kn = kernfs_create_dir_ns(parent, kobject_name(kobj), > > S_IRWXU | S_IRUGO | S_IXUGO, uid, gid, > > kobj, ns); > > if (IS_ERR(kn)) { > > if (PTR_ERR(kn) == -EEXIST) > > sysfs_warn_dup(parent, kobject_name(kobj)); > > + kobject_put(kobj); > > return PTR_ERR(kn); > > } > > > > @@ -100,6 +102,7 @@ void sysfs_remove_dir(struct kobject *kobj) > > if (kn) { > > WARN_ON_ONCE(kernfs_type(kn) != KERNFS_DIR); > > kernfs_remove(kn); > > + kobject_put(kobj); > > } > > } > > Shouldn't this be taken on open() not sysfs creation, or is the problem > that the kobject is held the module memory rather than duplicated by > sysfs? If you kobject_get() you also protect from module removal to complete until the calls ends. The issue is the dereference for the attribute op will be gone if we don't protect the kobject during the op. We only have an open() call for the bin sysfs files. So the alternative is to sprinkly the get on all calls. The goal with adding it to the directory creation was to simplify this. > > diff --git a/net/core/dev.c b/net/core/dev.c > > index 222b1d322c96..3a0ffa603d14 100644 > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -10429,7 +10429,7 @@ static void netdev_wait_allrefs(struct net_device *dev) > > rebroadcast_time = warning_time = jiffies; > > refcnt = netdev_refcnt_read(dev); > > > > - while (refcnt != 1) { > > + while (refcnt != 3) { > > if (time_after(jiffies, rebroadcast_time + 1 * HZ)) { > > rtnl_lock(); > > > > @@ -10544,7 +10544,7 @@ void netdev_run_todo(void) > > netdev_wait_allrefs(dev); > > > > /* paranoia */ > > - BUG_ON(netdev_refcnt_read(dev) != 1); > > + BUG_ON(netdev_refcnt_read(dev) != 3); > > And surely there are things besides netdevs that would suffer from this > change? Yes, I noted this possibility below the commit log, below the "---". Luis
On Wed, Jun 23, 2021 at 03:59:37PM -0700, Kees Cook wrote: > On Wed, Jun 23, 2021 at 02:50:07PM -0700, Luis Chamberlain wrote: > > It's possible today to have a device attribute read or store > > race against device removal. This is known to happen as follows: > > > > write system call --> > > ksys_write () --> > > vfs_write() --> > > __vfs_write() --> > > kernfs_fop_write_iter() --> > > sysfs_kf_write() --> > > dev_attr_store() --> > > null reference > > > > This happens because the dev_attr->store() callback can be > > removed prior to its call, after dev_attr_store() was initiated. > > The null dereference is possible because the sysfs ops can be > > removed on module removal, for instance, when device_del() is > > called, and a sysfs read / store is not doing any kobject reference > > bumps either. This allows a read/store call to initiate, a > > device_del() to kick off, and then the read/store call can be > > gone by the time to execute it. > > > > The sysfs filesystem is not doing any kobject reference bumps during a > > read / store ops to prevent this. > > > > To fix this in a simplified way, just bump the kobject reference when > > we create a directory and remove it on directory removal. > > > > The big unfortunate eye-sore is addressing the manual kobject reference > > assumption on the networking code, which leads me to believe we should > > end up replacing that eventually with another sort of check. > > > > Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > > --- > > > > This v4 moves to fixing the race condition on dev_attr_store() and > > dev_attr_read() to sysfs by bumping the kobject reference count > > on directory creation / deletion as suggested by Greg. > > > > Unfortunately at least the networking core has a manual refcount > > assumption, which needs to be adjusted to account for this change. > > This should also mean there is runtime for other kobjects which may > > not be explored yet which may need fixing as well. We may want to > > change the check to something else on the networking front, but its > > not clear to me yet what to use. > > > > fs/sysfs/dir.c | 3 +++ > > net/core/dev.c | 4 ++-- > > 2 files changed, 5 insertions(+), 2 deletions(-) > > > > diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c > > index 59dffd5ca517..6c47aa4af6f5 100644 > > --- a/fs/sysfs/dir.c > > +++ b/fs/sysfs/dir.c > > @@ -56,12 +56,14 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns) > > > > kobject_get_ownership(kobj, &uid, &gid); > > > > + kobject_get(kobj); > > kn = kernfs_create_dir_ns(parent, kobject_name(kobj), > > S_IRWXU | S_IRUGO | S_IXUGO, uid, gid, > > kobj, ns); > > if (IS_ERR(kn)) { > > if (PTR_ERR(kn) == -EEXIST) > > sysfs_warn_dup(parent, kobject_name(kobj)); > > + kobject_put(kobj); > > return PTR_ERR(kn); > > } > > > > @@ -100,6 +102,7 @@ void sysfs_remove_dir(struct kobject *kobj) > > if (kn) { > > WARN_ON_ONCE(kernfs_type(kn) != KERNFS_DIR); > > kernfs_remove(kn); > > + kobject_put(kobj); > > } > > } > > Shouldn't this be taken on open() not sysfs creation, or is the problem > that the kobject is held the module memory rather than duplicated by > sysfs? No, this is when we are passing a pointer off to kernfs, it needs to have the reference count incremented then otherwise it could "disappear" without kernfs knowing about it. > > diff --git a/net/core/dev.c b/net/core/dev.c > > index 222b1d322c96..3a0ffa603d14 100644 > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -10429,7 +10429,7 @@ static void netdev_wait_allrefs(struct net_device *dev) > > rebroadcast_time = warning_time = jiffies; > > refcnt = netdev_refcnt_read(dev); > > > > - while (refcnt != 1) { > > + while (refcnt != 3) { > > if (time_after(jiffies, rebroadcast_time + 1 * HZ)) { > > rtnl_lock(); > > > > @@ -10544,7 +10544,7 @@ void netdev_run_todo(void) > > netdev_wait_allrefs(dev); > > > > /* paranoia */ > > - BUG_ON(netdev_refcnt_read(dev) != 1); > > + BUG_ON(netdev_refcnt_read(dev) != 3); > > And surely there are things besides netdevs that would suffer from this > change? I _REALLY_ hope not. No one should ever be looking at the count of a kobject for anything as there are way too many ways that means nothing. This is a fun hack for networking to be paranoid that they got their initializion logic correct, so I'll leave it be right now. No one else should care, and if they do, we should fix it up... thanks, greg k-h
On Wed, Jun 23, 2021 at 02:50:07PM -0700, Luis Chamberlain wrote: > It's possible today to have a device attribute read or store > race against device removal. This is known to happen as follows: > > write system call --> > ksys_write () --> > vfs_write() --> > __vfs_write() --> > kernfs_fop_write_iter() --> > sysfs_kf_write() --> > dev_attr_store() --> > null reference > > This happens because the dev_attr->store() callback can be > removed prior to its call, after dev_attr_store() was initiated. > The null dereference is possible because the sysfs ops can be > removed on module removal, for instance, when device_del() is > called, and a sysfs read / store is not doing any kobject reference > bumps either. This allows a read/store call to initiate, a > device_del() to kick off, and then the read/store call can be > gone by the time to execute it. > > The sysfs filesystem is not doing any kobject reference bumps during a > read / store ops to prevent this. > > To fix this in a simplified way, just bump the kobject reference when > we create a directory and remove it on directory removal. > > The big unfortunate eye-sore is addressing the manual kobject reference > assumption on the networking code, which leads me to believe we should > end up replacing that eventually with another sort of check. > > Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > --- > > This v4 moves to fixing the race condition on dev_attr_store() and > dev_attr_read() to sysfs by bumping the kobject reference count > on directory creation / deletion as suggested by Greg. This looks good. It's late in the development cycle, I'll hold off on adding this to my tree until 5.14-rc1 is out because of: > Unfortunately at least the networking core has a manual refcount > assumption, which needs to be adjusted to account for this change. > This should also mean there is runtime for other kobjects which may > not be explored yet which may need fixing as well. We may want to > change the check to something else on the networking front, but its > not clear to me yet what to use. That's crazy what networking is doing here, hopefully no one else is. If they are, let's shake it out in linux-next to find the problems which is why a good "soak" there is a good idea. thanks for making this change and sticking with it! Oh, and with this change, does your modprobe/rmmod crazy test now work? greg k-h
On Thu, Jun 24, 2021 at 01:09:03PM +0200, Greg KH wrote:
> thanks for making this change and sticking with it!
>
> Oh, and with this change, does your modprobe/rmmod crazy test now work?
It does but I wrote a test_syfs driver and I believe I see an issue with
this. I'll debug a bit more and see what it was, and I'll then also use
the driver to demo the issue more clearly, and then verification can be
an easy selftest test.
Luis
On Fri, Jun 25, 2021 at 02:56:03PM -0700, Luis Chamberlain wrote:
> On Thu, Jun 24, 2021 at 01:09:03PM +0200, Greg KH wrote:
> > thanks for making this change and sticking with it!
> >
> > Oh, and with this change, does your modprobe/rmmod crazy test now work?
>
> It does but I wrote a test_syfs driver and I believe I see an issue with
> this. I'll debug a bit more and see what it was, and I'll then also use
> the driver to demo the issue more clearly, and then verification can be
> an easy selftest test.
OK my conclusion based on a new selftest driver I wrote is we can drop
this patch safely. The selftest will cover this corner case well now.
In short: the kernfs active reference will ensure the store operation
still exists. The kernfs mutex is not enough, but if the driver removes
the operation prior to getting the active reference, the write will just
fail. The deferencing inside of the sysfs operation is abstract to
kernfs, and while kernfs can't do anything to prevent a driver from
doing something stupid, it at least can ensure an open file ensure the
op is not removed until the operation completes.
Luis
On Thu, Jul 01, 2021 at 03:48:22PM -0700, Luis Chamberlain wrote:
> On Fri, Jun 25, 2021 at 02:56:03PM -0700, Luis Chamberlain wrote:
> > On Thu, Jun 24, 2021 at 01:09:03PM +0200, Greg KH wrote:
> > > thanks for making this change and sticking with it!
> > >
> > > Oh, and with this change, does your modprobe/rmmod crazy test now work?
> >
> > It does but I wrote a test_syfs driver and I believe I see an issue with
> > this. I'll debug a bit more and see what it was, and I'll then also use
> > the driver to demo the issue more clearly, and then verification can be
> > an easy selftest test.
>
> OK my conclusion based on a new selftest driver I wrote is we can drop
> this patch safely. The selftest will cover this corner case well now.
>
> In short: the kernfs active reference will ensure the store operation
> still exists. The kernfs mutex is not enough, but if the driver removes
> the operation prior to getting the active reference, the write will just
> fail. The deferencing inside of the sysfs operation is abstract to
> kernfs, and while kernfs can't do anything to prevent a driver from
> doing something stupid, it at least can ensure an open file ensure the
> op is not removed until the operation completes.
OK and now its not so clear, as it would seem the refcount can indeed
get reduced after we validated it. In any case we'll have enough tools
to reproduce any possible failure soon.
Luis
On Thu, Jul 01, 2021 at 03:48:16PM -0700, Luis Chamberlain wrote:
> On Fri, Jun 25, 2021 at 02:56:03PM -0700, Luis Chamberlain wrote:
> > On Thu, Jun 24, 2021 at 01:09:03PM +0200, Greg KH wrote:
> > > thanks for making this change and sticking with it!
> > >
> > > Oh, and with this change, does your modprobe/rmmod crazy test now work?
> >
> > It does but I wrote a test_syfs driver and I believe I see an issue with
> > this. I'll debug a bit more and see what it was, and I'll then also use
> > the driver to demo the issue more clearly, and then verification can be
> > an easy selftest test.
>
> OK my conclusion based on a new selftest driver I wrote is we can drop
> this patch safely. The selftest will cover this corner case well now.
>
> In short: the kernfs active reference will ensure the store operation
> still exists. The kernfs mutex is not enough, but if the driver removes
> the operation prior to getting the active reference, the write will just
> fail. The deferencing inside of the sysfs operation is abstract to
> kernfs, and while kernfs can't do anything to prevent a driver from
> doing something stupid, it at least can ensure an open file ensure the
> op is not removed until the operation completes.
Ok, so all is good? Then why is your zram test code blowing up so
badly? Where is the reference counting going wrong?
thanks,
greg k-h
On Wed, Jul 21, 2021 at 01:30:29PM +0200, Greg KH wrote: > On Thu, Jul 01, 2021 at 03:48:16PM -0700, Luis Chamberlain wrote: > > On Fri, Jun 25, 2021 at 02:56:03PM -0700, Luis Chamberlain wrote: > > > On Thu, Jun 24, 2021 at 01:09:03PM +0200, Greg KH wrote: > > > > thanks for making this change and sticking with it! > > > > > > > > Oh, and with this change, does your modprobe/rmmod crazy test now work? > > > > > > It does but I wrote a test_syfs driver and I believe I see an issue with > > > this. I'll debug a bit more and see what it was, and I'll then also use > > > the driver to demo the issue more clearly, and then verification can be > > > an easy selftest test. > > > > OK my conclusion based on a new selftest driver I wrote is we can drop > > this patch safely. The selftest will cover this corner case well now. > > > > In short: the kernfs active reference will ensure the store operation > > still exists. The kernfs mutex is not enough, but if the driver removes > > the operation prior to getting the active reference, the write will just > > fail. The deferencing inside of the sysfs operation is abstract to > > kernfs, and while kernfs can't do anything to prevent a driver from > > doing something stupid, it at least can ensure an open file ensure the > > op is not removed until the operation completes. > > Ok, so all is good? It would seem to be the case. > Then why is your zram test code blowing up so badly? I checked the logs for the backtrace where the crash did happen and we did see clear evidence of the race we feared here. The *first* bug that happened was the CPU hotplug race: [132004.787099] Error: Removing state 61 which has instances left. [132004.787124] WARNING: CPU: 17 PID: 9307 at ../kernel/cpu.c:1879 __cpuhp_remove_state_cpuslocked+0x1c4/0x1d0 After this the crash happen: [132005.254022] BUG: Unable to handle kernel instruction fetch [132005.254049] Faulting instruction address: 0xc0080000004a0c24 [132005.254059] Oops: Kernel access of bad area, sig: 11 [#1] And that's when the backtrace does come up with race. Given the first race though, I think we can be skeptical of the rest, specially since I cannot reproduce with a self bombing selftest. > Where is the reference counting going wrong? It's not clear, as the misuse with the CPU multistate could lead to to us leaking per cpu stuct zcomp instances, leaving these behind as there is no one to remove them. I can't think of the relationship of this leak and the crash other then memory pressure. Because of this and the deadlock which is easily triggerable, I decided to write a selftest to allow is to more cleanly be able to reproduce any races we can dream up of. Luis
On Thu, Jul 22, 2021 at 02:31:37PM -0700, Luis Chamberlain wrote:
> On Wed, Jul 21, 2021 at 01:30:29PM +0200, Greg KH wrote:
> > On Thu, Jul 01, 2021 at 03:48:16PM -0700, Luis Chamberlain wrote:
> > > On Fri, Jun 25, 2021 at 02:56:03PM -0700, Luis Chamberlain wrote:
> > > > On Thu, Jun 24, 2021 at 01:09:03PM +0200, Greg KH wrote:
> > > > > thanks for making this change and sticking with it!
> > > > >
> > > > > Oh, and with this change, does your modprobe/rmmod crazy test now work?
> > > >
> > > > It does but I wrote a test_syfs driver and I believe I see an issue with
> > > > this. I'll debug a bit more and see what it was, and I'll then also use
> > > > the driver to demo the issue more clearly, and then verification can be
> > > > an easy selftest test.
> > >
> > > OK my conclusion based on a new selftest driver I wrote is we can drop
> > > this patch safely. The selftest will cover this corner case well now.
> > >
> > > In short: the kernfs active reference will ensure the store operation
> > > still exists. The kernfs mutex is not enough, but if the driver removes
> > > the operation prior to getting the active reference, the write will just
> > > fail. The deferencing inside of the sysfs operation is abstract to
> > > kernfs, and while kernfs can't do anything to prevent a driver from
> > > doing something stupid, it at least can ensure an open file ensure the
> > > op is not removed until the operation completes.
> >
> > Ok, so all is good?
>
> It would seem to be the case.
>
> > Then why is your zram test code blowing up so badly?
>
> I checked the logs for the backtrace where the crash did happen
> and we did see clear evidence of the race we feared here. The *first*
> bug that happened was the CPU hotplug race:
>
> [132004.787099] Error: Removing state 61 which has instances left.
> [132004.787124] WARNING: CPU: 17 PID: 9307 at ../kernel/cpu.c:1879 __cpuhp_remove_state_cpuslocked+0x1c4/0x1d0
I do not understand what this issue is, is it fixed? Why is a cpu being
hot unplugged at the same time a zram?
thanks,
greg k-h
On Fri, Jul 23, 2021 at 01:14:10PM +0200, Greg KH wrote: > On Thu, Jul 22, 2021 at 02:31:37PM -0700, Luis Chamberlain wrote: > > On Wed, Jul 21, 2021 at 01:30:29PM +0200, Greg KH wrote: > > > On Thu, Jul 01, 2021 at 03:48:16PM -0700, Luis Chamberlain wrote: > > > > On Fri, Jun 25, 2021 at 02:56:03PM -0700, Luis Chamberlain wrote: > > > > > On Thu, Jun 24, 2021 at 01:09:03PM +0200, Greg KH wrote: > > > > > > thanks for making this change and sticking with it! > > > > > > > > > > > > Oh, and with this change, does your modprobe/rmmod crazy test now work? > > > > > > > > > > It does but I wrote a test_syfs driver and I believe I see an issue with > > > > > this. I'll debug a bit more and see what it was, and I'll then also use > > > > > the driver to demo the issue more clearly, and then verification can be > > > > > an easy selftest test. > > > > > > > > OK my conclusion based on a new selftest driver I wrote is we can drop > > > > this patch safely. The selftest will cover this corner case well now. > > > > > > > > In short: the kernfs active reference will ensure the store operation > > > > still exists. The kernfs mutex is not enough, but if the driver removes > > > > the operation prior to getting the active reference, the write will just > > > > fail. The deferencing inside of the sysfs operation is abstract to > > > > kernfs, and while kernfs can't do anything to prevent a driver from > > > > doing something stupid, it at least can ensure an open file ensure the > > > > op is not removed until the operation completes. > > > > > > Ok, so all is good? > > > > It would seem to be the case. > > > > > Then why is your zram test code blowing up so badly? > > > > I checked the logs for the backtrace where the crash did happen > > and we did see clear evidence of the race we feared here. The *first* > > bug that happened was the CPU hotplug race: > > > > [132004.787099] Error: Removing state 61 which has instances left. > > [132004.787124] WARNING: CPU: 17 PID: 9307 at ../kernel/cpu.c:1879 __cpuhp_remove_state_cpuslocked+0x1c4/0x1d0 > > I do not understand what this issue is, is it fixed? My first patch for zram fixes the CPU multistate mis-use. And after that patch is applied triggering the other race does not happen. It is why I decided to write a selftest driver, so that we can have a way to do all sorts of crazy races in a self contained driver example. > Why is a cpu being hot unplugged at the same time a zram? That's not what is happening. The description of the issue with zram's misuse of CPU multistate is described clearly in my commit log for the fix for that driver. You can refer to that commit log description. Luis