linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3] drivers/base/core: refcount kobject and bus on device attribute read / store
@ 2021-06-23  0:36 Luis Chamberlain
  2021-06-23  8:32 ` Greg KH
  0 siblings, 1 reply; 6+ messages in thread
From: Luis Chamberlain @ 2021-06-23  0:36 UTC (permalink / raw)
  To: gregkh, rafael
  Cc: jeyu, ngupta, sergey.senozhatsky.work, minchan, mcgrof, axboe,
	mbenes, jpoimboe, tglx, keescook, jikos, rostedt, peterz,
	linux-block, linux-kernel

It's possible today to have a device attribute read or store
race against device removal. When this happens there is a small
chance that the dereference for the private data area of the driver
is NULL.

Let's consider the zram driver as an example. Its possible to run into
a race where a sysfs knob is being used, we get preempted, and a zram
device is removed before we complete use of the sysfs knob. This can happen
for instance on block devices, where for instance the zram block devices
just part of the private data of the block device.

For instance this can happen in the following two situations
as examples to illustrate this better:

        CPU 1                            CPU 2
destroy_devices
...
                                 compact_store()
                                 zram = dev_to_zram(dev);
idr_for_each(zram_remove_cb
  zram_remove
  ...
  kfree(zram)
                                 down_read(&zram->init_lock);

        CPU 1                            CPU 2
hot_remove_store
                                 compact_store()
                                 zram = dev_to_zram(dev);
  zram_remove
    kfree(zram)
                                 down_read(&zram->init_lock);

To ensure the private data pointer is valid we could use bdget() / bdput()
in between access, however that would mean doing that in all sysfs
reads/stores on the driver. Instead a generic solution for all drivers
is to ensure the device kobject is still valid and also the bus, if
a bus is present.

This issue does not fix a known crash, however this race was
spotted by Minchan Kim through code inspection upon code review
of another zram patch.

Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---

This v3 is the same as v2, but since the I got an email asking me to
clarify the differences between the first iteration and the second, this
v3 just describes what I did not explain in the v2. Namely:

  * I removed the checks from get_device() calls as suggested
  * The functions which are now being used outside of bus.c have
    forwared declarations stuffed into base.h

 drivers/base/base.h |  2 ++
 drivers/base/bus.c  |  4 ++--
 drivers/base/core.c | 36 ++++++++++++++++++++++++++++++++----
 3 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/drivers/base/base.h b/drivers/base/base.h
index e5f9b7e656c3..3f95b125b667 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -127,6 +127,8 @@ static inline void auxiliary_bus_init(void) { }
 
 struct kobject *virtual_device_parent(struct device *dev);
 
+extern struct bus_type *bus_get(struct bus_type *bus);
+extern void bus_put(struct bus_type *bus);
 extern int bus_add_device(struct device *dev);
 extern void bus_probe_device(struct device *dev);
 extern void bus_remove_device(struct device *dev);
diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index 36d0c654ea61..21c80d7d6433 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -39,7 +39,7 @@ static struct kset *system_kset;
 static int __must_check bus_rescan_devices_helper(struct device *dev,
 						void *data);
 
-static struct bus_type *bus_get(struct bus_type *bus)
+struct bus_type *bus_get(struct bus_type *bus)
 {
 	if (bus) {
 		kset_get(&bus->p->subsys);
@@ -48,7 +48,7 @@ static struct bus_type *bus_get(struct bus_type *bus)
 	return NULL;
 }
 
-static void bus_put(struct bus_type *bus)
+void bus_put(struct bus_type *bus)
 {
 	if (bus)
 		kset_put(&bus->p->subsys);
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 4a8bf8cda52b..f69aa040b56d 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2042,28 +2042,56 @@ EXPORT_SYMBOL(dev_driver_string);
 static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
 			     char *buf)
 {
-	struct device_attribute *dev_attr = to_dev_attr(attr);
-	struct device *dev = kobj_to_dev(kobj);
+	struct device_attribute *dev_attr;
+	struct device *dev;
+	struct bus_type *bus = NULL;
 	ssize_t ret = -EIO;
 
+	dev = get_device(kobj_to_dev(kobj));
+	if (dev->bus) {
+		bus = bus_get(dev->bus);
+		if (!bus)
+			goto out;
+	}
+
+	dev_attr = to_dev_attr(attr);
 	if (dev_attr->show)
 		ret = dev_attr->show(dev, dev_attr, buf);
 	if (ret >= (ssize_t)PAGE_SIZE) {
 		printk("dev_attr_show: %pS returned bad count\n",
 				dev_attr->show);
 	}
+
+	bus_put(bus);
+out:
+	put_device(dev);
+
 	return ret;
 }
 
 static ssize_t dev_attr_store(struct kobject *kobj, struct attribute *attr,
 			      const char *buf, size_t count)
 {
-	struct device_attribute *dev_attr = to_dev_attr(attr);
-	struct device *dev = kobj_to_dev(kobj);
+	struct device_attribute *dev_attr;
+	struct device *dev;
+	struct bus_type *bus = NULL;
 	ssize_t ret = -EIO;
 
+	dev = get_device(kobj_to_dev(kobj));
+	if (dev->bus) {
+		bus = bus_get(dev->bus);
+		if (!bus)
+			goto out;
+	}
+
+	dev_attr = to_dev_attr(attr);
 	if (dev_attr->store)
 		ret = dev_attr->store(dev, dev_attr, buf, count);
+
+	bus_put(bus);
+out:
+	put_device(dev);
+
 	return ret;
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] drivers/base/core: refcount kobject and bus on device attribute read / store
  2021-06-23  0:36 [PATCH v3] drivers/base/core: refcount kobject and bus on device attribute read / store Luis Chamberlain
@ 2021-06-23  8:32 ` Greg KH
  2021-06-23 16:14   ` Luis Chamberlain
  0 siblings, 1 reply; 6+ messages in thread
From: Greg KH @ 2021-06-23  8:32 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: rafael, jeyu, ngupta, sergey.senozhatsky.work, minchan, axboe,
	mbenes, jpoimboe, tglx, keescook, jikos, rostedt, peterz,
	linux-block, linux-kernel

On Tue, Jun 22, 2021 at 05:36:30PM -0700, Luis Chamberlain wrote:
> It's possible today to have a device attribute read or store
> race against device removal. When this happens there is a small
> chance that the dereference for the private data area of the driver
> is NULL.
> 
> Let's consider the zram driver as an example. Its possible to run into
> a race where a sysfs knob is being used, we get preempted, and a zram
> device is removed before we complete use of the sysfs knob. This can happen
> for instance on block devices, where for instance the zram block devices
> just part of the private data of the block device.
> 
> For instance this can happen in the following two situations
> as examples to illustrate this better:
> 
>         CPU 1                            CPU 2
> destroy_devices
> ...
>                                  compact_store()
>                                  zram = dev_to_zram(dev);
> idr_for_each(zram_remove_cb
>   zram_remove
>   ...
>   kfree(zram)
>                                  down_read(&zram->init_lock);
> 
>         CPU 1                            CPU 2
> hot_remove_store
>                                  compact_store()
>                                  zram = dev_to_zram(dev);
>   zram_remove
>     kfree(zram)
>                                  down_read(&zram->init_lock);
> 
> To ensure the private data pointer is valid we could use bdget() / bdput()
> in between access, however that would mean doing that in all sysfs
> reads/stores on the driver. Instead a generic solution for all drivers
> is to ensure the device kobject is still valid and also the bus, if
> a bus is present.
> 
> This issue does not fix a known crash, however this race was
> spotted by Minchan Kim through code inspection upon code review
> of another zram patch.
> 
> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> 
> This v3 is the same as v2, but since the I got an email asking me to
> clarify the differences between the first iteration and the second, this
> v3 just describes what I did not explain in the v2. Namely:
> 
>   * I removed the checks from get_device() calls as suggested
>   * The functions which are now being used outside of bus.c have
>     forwared declarations stuffed into base.h
> 
>  drivers/base/base.h |  2 ++
>  drivers/base/bus.c  |  4 ++--
>  drivers/base/core.c | 36 ++++++++++++++++++++++++++++++++----
>  3 files changed, 36 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/base/base.h b/drivers/base/base.h
> index e5f9b7e656c3..3f95b125b667 100644
> --- a/drivers/base/base.h
> +++ b/drivers/base/base.h
> @@ -127,6 +127,8 @@ static inline void auxiliary_bus_init(void) { }
>  
>  struct kobject *virtual_device_parent(struct device *dev);
>  
> +extern struct bus_type *bus_get(struct bus_type *bus);
> +extern void bus_put(struct bus_type *bus);
>  extern int bus_add_device(struct device *dev);
>  extern void bus_probe_device(struct device *dev);
>  extern void bus_remove_device(struct device *dev);
> diff --git a/drivers/base/bus.c b/drivers/base/bus.c
> index 36d0c654ea61..21c80d7d6433 100644
> --- a/drivers/base/bus.c
> +++ b/drivers/base/bus.c
> @@ -39,7 +39,7 @@ static struct kset *system_kset;
>  static int __must_check bus_rescan_devices_helper(struct device *dev,
>  						void *data);
>  
> -static struct bus_type *bus_get(struct bus_type *bus)
> +struct bus_type *bus_get(struct bus_type *bus)
>  {
>  	if (bus) {
>  		kset_get(&bus->p->subsys);
> @@ -48,7 +48,7 @@ static struct bus_type *bus_get(struct bus_type *bus)
>  	return NULL;
>  }
>  
> -static void bus_put(struct bus_type *bus)
> +void bus_put(struct bus_type *bus)
>  {
>  	if (bus)
>  		kset_put(&bus->p->subsys);
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 4a8bf8cda52b..f69aa040b56d 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -2042,28 +2042,56 @@ EXPORT_SYMBOL(dev_driver_string);
>  static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
>  			     char *buf)
>  {
> -	struct device_attribute *dev_attr = to_dev_attr(attr);
> -	struct device *dev = kobj_to_dev(kobj);
> +	struct device_attribute *dev_attr;
> +	struct device *dev;
> +	struct bus_type *bus = NULL;
>  	ssize_t ret = -EIO;
>  
> +	dev = get_device(kobj_to_dev(kobj));
> +	if (dev->bus) {

No need to test for this, right?

> +		bus = bus_get(dev->bus);
> +		if (!bus)
> +			goto out;

How can this happen?

> +	}
> +
> +	dev_attr = to_dev_attr(attr);

Why did you move this call to way down here?  It's fine in the first
line of the function above as-is.

>  	if (dev_attr->show)
>  		ret = dev_attr->show(dev, dev_attr, buf);
>  	if (ret >= (ssize_t)PAGE_SIZE) {
>  		printk("dev_attr_show: %pS returned bad count\n",
>  				dev_attr->show);
>  	}
> +
> +	bus_put(bus);

You are incrementing the bus, which is nice, but I do not understand why
it is needed.  What is causing the bus to go away _before_ the devices
are going away?  Busses almost never are removed from the system, and if
they are, all devices associated with them are removed first.  So I do
not think you need to increment anything with that here.

But step back, what prevented the kobject from decrementing between the
call to dev_attr_show() and the first line of the function?

The kobject needs to be incremented before show is called, right?  Or
does kernfs handle this properly already?  I don't have the time at the
moment to dig into this, but maybe this isn't an issue?  Look at how
sysfs handles the kobject lookups for kernfs nodes to see if all is sane
there, maybe there isn't an issue?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] drivers/base/core: refcount kobject and bus on device attribute read / store
  2021-06-23  8:32 ` Greg KH
@ 2021-06-23 16:14   ` Luis Chamberlain
  2021-06-23 16:28     ` Greg KH
  2021-06-23 16:51     ` Greg KH
  0 siblings, 2 replies; 6+ messages in thread
From: Luis Chamberlain @ 2021-06-23 16:14 UTC (permalink / raw)
  To: Greg KH
  Cc: rafael, jeyu, ngupta, sergey.senozhatsky.work, minchan, axboe,
	mbenes, jpoimboe, tglx, keescook, jikos, rostedt, peterz,
	linux-block, linux-kernel

On Wed, Jun 23, 2021 at 10:32:53AM +0200, Greg KH wrote:
> On Tue, Jun 22, 2021 at 05:36:30PM -0700, Luis Chamberlain wrote:
> > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > index 4a8bf8cda52b..f69aa040b56d 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -2042,28 +2042,56 @@ EXPORT_SYMBOL(dev_driver_string);
> >  static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
> >  			     char *buf)
> >  {
> > -	struct device_attribute *dev_attr = to_dev_attr(attr);
> > -	struct device *dev = kobj_to_dev(kobj);
> > +	struct device_attribute *dev_attr;
> > +	struct device *dev;
> > +	struct bus_type *bus = NULL;
> >  	ssize_t ret = -EIO;
> >  
> > +	dev = get_device(kobj_to_dev(kobj));
> > +	if (dev->bus) {
> 
> No need to test for this, right?

dev_uevent() checks for dev->bus, so I thought that was a clear
indication this isn't always set.

> 
> > +		bus = bus_get(dev->bus);
> > +		if (!bus)
> > +			goto out;
> 
> How can this happen?

device_add() calls bus_add_device(), and the bus_add_device()
implementation seems to have a check for the bus returned from bus_get() as
well?

> > +	}
> > +
> > +	dev_attr = to_dev_attr(attr);
> 
> Why did you move this call to way down here?  It's fine in the first
> line of the function above as-is.

I had done that when I had the device kobject reference incremented.
I'll move it back up.

> >  	if (dev_attr->show)
> >  		ret = dev_attr->show(dev, dev_attr, buf);
> >  	if (ret >= (ssize_t)PAGE_SIZE) {
> >  		printk("dev_attr_show: %pS returned bad count\n",
> >  				dev_attr->show);
> >  	}
> > +
> > +	bus_put(bus);
> 
> You are incrementing the bus, which is nice, but I do not understand why
> it is needed.  What is causing the bus to go away _before_ the devices
> are going away?  Busses almost never are removed from the system, and if
> they are, all devices associated with them are removed first.  So I do
> not think you need to increment anything with that here.

You tell me. It was your suggestion as a replacement for the type
specific lock, in the zram case, its a block device so I was using
bdgrab().

> But step back, what prevented the kobject from decrementing between the
> call to dev_attr_show() and the first line of the function?
> 
> The kobject needs to be incremented before show is called, right?  Or
> does kernfs handle this properly already? 

kernfs / sysfs is in not struct device specific, it has no semantics for
modules or even devices. The aspect of kernfs which deals with locking
a struct kernfs_node is kernfs_get_active(). The refcount there of
importance is the call to atomic_inc_unless_negative(&kn->active).

struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)                   
{                                                                               
	if (unlikely(!kn))
		return NULL;                                                    

	if (!atomic_inc_unless_negative(&kn->active))                           
		return NULL;                                                    

	if (kernfs_lockdep(kn))
		rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);               
	return kn;                                                              
}

BTW this was also precisely where I had suggested to extend the
kernfs_node with an optional kn->owner and if set we try_module_get() to
prevent the deadlock case if the module exit routine also happens to use
a lock also used by a sysfs attribute store/read.

The flow would be (from a real live gdb crash backtrace from an original
bug report from a customer):

write system call -->
  ksys_write ()
    vfs_write() -->
      __vfs_write() -->
       kernfs_fop_write() (note now this is kernfs_fop_write_iter()) -->
         sysfs_kf_write() -->
           dev_attr_store() -->
	     null reference

The dereference is because the dev_attr_store() call which we are
modifying

LINE-001 static ssize_t dev_attr_store(struct kobject *kobj, struct attribute *attr,                                                                                         
LINE-002 const char *buf, size_t count)                                                                                                                 
LINE-003 {
LINE-004	struct device_attribute *dev_attr = to_dev_attr(attr);
LINE-005	struct device *dev = kobj_to_dev(kobj);
LINE-006	ssize_t ret = -EIO;
LINE-007
LINE-008	if (dev_attr->store)
LINE-009		ret = dev_attr->store(dev, dev_attr, buf, count);
	...
	}

The race happens because a sysfs store / read can be triggered, the CPU
could be preempted after LINE-008, and the ->store is gone by LINE-009.
This begs the question if kernfs_fop_write_iter() or sysfs protects this
somehow? Let's see.

For kernfs we have:

static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter) 
{
	struct kernfs_open_file *of = kernfs_of(iocb->ki_filp);
	...
	mutex_lock(&of->mutex);
	if (!kernfs_get_active(of->kn)) {
		mutex_unlock(&of->mutex);
		len = -ENODEV;
		goto out_free;
	}

	ops = kernfs_ops(of->kn);
	if (ops->write)
		len = ops->write(of, buf, len, iocb->ki_pos);
	else
		len = -EINVAL;

	kernfs_put_active(of->kn);
	mutex_unlock(&of->mutex);
	...
}

And the write call here is a syfs calls:

static ssize_t sysfs_kf_write(struct kernfs_open_file *of, char *buf,           
			      size_t count, loff_t pos)                         
{                                                                               
	const struct sysfs_ops *ops = sysfs_file_ops(of->kn);                   
	struct kobject *kobj = of->kn->parent->priv;                            

	if (!count)                                                             
		return 0;                                                       

	return ops->store(kobj, of->kn->priv, buf, count);                      
}           

As we have observed already, the active reference obtained through
kernfs_get_active() was for the struct kernfs_node. Sure, the
ops->write() is valid, in this case it sysfs_kf_write().

sysfs isn't doing any active reference check for the kobject device
attribute as it doesn't care for them. So syfs calls
dev_attr_store(), but the dev_attr_store() is not preventing the device
attribute ops to go fishing, and we destroy them while we're destroying
the device on module removal.

In the zram case, the abstraction is worse given the device attributes
are are created on behalf of the driver as group attributes.  The zram
disksize_store() for instance will use:

static inline struct zram *dev_to_zram(struct device *dev)                      
{                                                                               
	return (struct zram *)dev_to_disk(dev)->private_data;                   
} 

static ssize_t disksize_store(struct device *dev,
			      struct device_attribute *attr,
			      const char *buf, size_t len)     
{                                  
	...
	struct zram *zram = dev_to_zram(dev);
	...
}

Nothing is preventing an active block group sysfs attribute from going
fishing, in the dev_to_disk() is the struct gendisk, and that can get
free'd on module exit while the sysfs attribute is about to be poked.

> I don't have the time at the
> moment to dig into this, but maybe this isn't an issue?  Look at how
> sysfs handles the kobject lookups for kernfs nodes to see if all is sane
> there, maybe there isn't an issue?

The deadlock issue I noted and am fixing with the zram driver was one
real live situation a customer reported, however a null dereference
illustrated above turned out to also be in the logs of the same bug
report, just that it happened in other situations, so what Minchan
claimed as theoretical actually indeed was a real issue and I hope
that the above illustrates this better.

  Luis

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] drivers/base/core: refcount kobject and bus on device attribute read / store
  2021-06-23 16:14   ` Luis Chamberlain
@ 2021-06-23 16:28     ` Greg KH
  2021-06-23 16:51     ` Greg KH
  1 sibling, 0 replies; 6+ messages in thread
From: Greg KH @ 2021-06-23 16:28 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: rafael, jeyu, ngupta, sergey.senozhatsky.work, minchan, axboe,
	mbenes, jpoimboe, tglx, keescook, jikos, rostedt, peterz,
	linux-block, linux-kernel

On Wed, Jun 23, 2021 at 09:14:34AM -0700, Luis Chamberlain wrote:
> On Wed, Jun 23, 2021 at 10:32:53AM +0200, Greg KH wrote:
> > On Tue, Jun 22, 2021 at 05:36:30PM -0700, Luis Chamberlain wrote:
> > > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > > index 4a8bf8cda52b..f69aa040b56d 100644
> > > --- a/drivers/base/core.c
> > > +++ b/drivers/base/core.c
> > > @@ -2042,28 +2042,56 @@ EXPORT_SYMBOL(dev_driver_string);
> > >  static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
> > >  			     char *buf)
> > >  {
> > > -	struct device_attribute *dev_attr = to_dev_attr(attr);
> > > -	struct device *dev = kobj_to_dev(kobj);
> > > +	struct device_attribute *dev_attr;
> > > +	struct device *dev;
> > > +	struct bus_type *bus = NULL;
> > >  	ssize_t ret = -EIO;
> > >  
> > > +	dev = get_device(kobj_to_dev(kobj));
> > > +	if (dev->bus) {
> > 
> > No need to test for this, right?
> 
> dev_uevent() checks for dev->bus, so I thought that was a clear
> indication this isn't always set.
> 
> > 
> > > +		bus = bus_get(dev->bus);
> > > +		if (!bus)
> > > +			goto out;

The point is that even if dev->bus is NULL, then bus_get(NULL) is NULL.
That's the only way that bus_get() can return NULL, which means this
check too is not needed.

> > >  	if (dev_attr->show)
> > >  		ret = dev_attr->show(dev, dev_attr, buf);
> > >  	if (ret >= (ssize_t)PAGE_SIZE) {
> > >  		printk("dev_attr_show: %pS returned bad count\n",
> > >  				dev_attr->show);
> > >  	}
> > > +
> > > +	bus_put(bus);
> > 
> > You are incrementing the bus, which is nice, but I do not understand why
> > it is needed.  What is causing the bus to go away _before_ the devices
> > are going away?  Busses almost never are removed from the system, and if
> > they are, all devices associated with them are removed first.  So I do
> > not think you need to increment anything with that here.
> 
> You tell me. It was your suggestion as a replacement for the type
> specific lock, in the zram case, its a block device so I was using
> bdgrab().

I did?  Sorry, I do not remember, but this is not a lock, nor does it
protect anything.

I'll respond to the rest later...

greg k-h

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] drivers/base/core: refcount kobject and bus on device attribute read / store
  2021-06-23 16:14   ` Luis Chamberlain
  2021-06-23 16:28     ` Greg KH
@ 2021-06-23 16:51     ` Greg KH
  2021-06-23 17:23       ` Luis Chamberlain
  1 sibling, 1 reply; 6+ messages in thread
From: Greg KH @ 2021-06-23 16:51 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: rafael, jeyu, ngupta, sergey.senozhatsky.work, minchan, axboe,
	mbenes, jpoimboe, tglx, keescook, jikos, rostedt, peterz,
	linux-block, linux-kernel

On Wed, Jun 23, 2021 at 09:14:34AM -0700, Luis Chamberlain wrote:
> kernfs / sysfs is in not struct device specific, it has no semantics for
> modules or even devices. The aspect of kernfs which deals with locking
> a struct kernfs_node is kernfs_get_active(). The refcount there of
> importance is the call to atomic_inc_unless_negative(&kn->active).
> 
> struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)                   
> {                                                                               
> 	if (unlikely(!kn))
> 		return NULL;                                                    
> 
> 	if (!atomic_inc_unless_negative(&kn->active))                           
> 		return NULL;                                                    
> 
> 	if (kernfs_lockdep(kn))
> 		rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);               
> 	return kn;                                                              
> }
> 
> BTW this was also precisely where I had suggested to extend the
> kernfs_node with an optional kn->owner and if set we try_module_get() to
> prevent the deadlock case if the module exit routine also happens to use
> a lock also used by a sysfs attribute store/read.
> 
> The flow would be (from a real live gdb crash backtrace from an original
> bug report from a customer):
> 
> write system call -->
>   ksys_write ()
>     vfs_write() -->
>       __vfs_write() -->
>        kernfs_fop_write() (note now this is kernfs_fop_write_iter()) -->
>          sysfs_kf_write() -->
>            dev_attr_store() -->
> 	     null reference
> 
> The dereference is because the dev_attr_store() call which we are
> modifying
> 
> LINE-001 static ssize_t dev_attr_store(struct kobject *kobj, struct attribute *attr,                                                                                         
> LINE-002 const char *buf, size_t count)                                                                                                                 
> LINE-003 {
> LINE-004	struct device_attribute *dev_attr = to_dev_attr(attr);
> LINE-005	struct device *dev = kobj_to_dev(kobj);
> LINE-006	ssize_t ret = -EIO;
> LINE-007
> LINE-008	if (dev_attr->store)
> LINE-009		ret = dev_attr->store(dev, dev_attr, buf, count);
> 	...
> 	}
> 
> The race happens because a sysfs store / read can be triggered, the CPU
> could be preempted after LINE-008, and the ->store is gone by LINE-009.
> This begs the question if kernfs_fop_write_iter() or sysfs protects this
> somehow? Let's see.
> 
> For kernfs we have:
> 
> static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter) 
> {
> 	struct kernfs_open_file *of = kernfs_of(iocb->ki_filp);
> 	...
> 	mutex_lock(&of->mutex);
> 	if (!kernfs_get_active(of->kn)) {
> 		mutex_unlock(&of->mutex);
> 		len = -ENODEV;
> 		goto out_free;
> 	}
> 
> 	ops = kernfs_ops(of->kn);
> 	if (ops->write)
> 		len = ops->write(of, buf, len, iocb->ki_pos);
> 	else
> 		len = -EINVAL;
> 
> 	kernfs_put_active(of->kn);
> 	mutex_unlock(&of->mutex);
> 	...
> }
> 
> And the write call here is a syfs calls:
> 
> static ssize_t sysfs_kf_write(struct kernfs_open_file *of, char *buf,           
> 			      size_t count, loff_t pos)                         
> {                                                                               
> 	const struct sysfs_ops *ops = sysfs_file_ops(of->kn);                   
> 	struct kobject *kobj = of->kn->parent->priv;                            
> 
> 	if (!count)                                                             
> 		return 0;                                                       
> 
> 	return ops->store(kobj, of->kn->priv, buf, count);                      
> }           
> 
> As we have observed already, the active reference obtained through
> kernfs_get_active() was for the struct kernfs_node. Sure, the
> ops->write() is valid, in this case it sysfs_kf_write().
> 
> sysfs isn't doing any active reference check for the kobject device
> attribute as it doesn't care for them. So syfs calls
> dev_attr_store(), but the dev_attr_store() is not preventing the device
> attribute ops to go fishing, and we destroy them while we're destroying
> the device on module removal.

Ah, but sysfs _should_ be doing this properly.

I think the issue is that when we store the kobject pointer in kernfs,
it is NOT incremented.  Look at sysfs_create_dir_ns(), if we call
kobject_get(kobj) right before we call kernfs_create_dir_ns(), and then
properly clean things up later on when we remove the sysfs directory
(i.e. the kobject), it _should_ fix this problem.

Then, we know, whenever show/store/whatever is called, when we cast out
of kernfs the private pointer to a kobject, that the kobject really is
still alive, so we can use it properly.

Can you try that, it should be a much "simpler" change here.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] drivers/base/core: refcount kobject and bus on device attribute read / store
  2021-06-23 16:51     ` Greg KH
@ 2021-06-23 17:23       ` Luis Chamberlain
  0 siblings, 0 replies; 6+ messages in thread
From: Luis Chamberlain @ 2021-06-23 17:23 UTC (permalink / raw)
  To: Greg KH
  Cc: rafael, jeyu, ngupta, sergey.senozhatsky.work, minchan, axboe,
	mbenes, jpoimboe, tglx, keescook, jikos, rostedt, peterz,
	linux-block, linux-kernel

On Wed, Jun 23, 2021 at 06:51:42PM +0200, Greg KH wrote:
> On Wed, Jun 23, 2021 at 09:14:34AM -0700, Luis Chamberlain wrote:
> > sysfs isn't doing any active reference check for the kobject device
> > attribute as it doesn't care for them. So syfs calls
> > dev_attr_store(), but the dev_attr_store() is not preventing the device
> > attribute ops to go fishing, and we destroy them while we're destroying
> > the device on module removal.
> 
> Ah, but sysfs _should_ be doing this properly.
> 
> I think the issue is that when we store the kobject pointer in kernfs,
> it is NOT incremented.  Look at sysfs_create_dir_ns(), if we call
> kobject_get(kobj) right before we call kernfs_create_dir_ns(), and then
> properly clean things up later on when we remove the sysfs directory
> (i.e. the kobject), it _should_ fix this problem.
> 
> Then, we know, whenever show/store/whatever is called, when we cast out
> of kernfs the private pointer to a kobject, that the kobject really is
> still alive, so we can use it properly.
> 
> Can you try that, it should be a much "simpler" change here.

Agreed, its cleaner. It should also address the type race consideration
I had, given that in the zram case, for instance, we will call device_del() on
del_gendisk() and the order of freeing typically is something like:

	del_gendisk(zram->disk);
	blk_cleanup_queue(zram->disk->queue);
	put_disk(zram->disk);

The put_disk() is what would make our gendisk->private_data invalid.
Will spin up a v4 with your suggestion.

  Luis

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-06-23 17:24 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-23  0:36 [PATCH v3] drivers/base/core: refcount kobject and bus on device attribute read / store Luis Chamberlain
2021-06-23  8:32 ` Greg KH
2021-06-23 16:14   ` Luis Chamberlain
2021-06-23 16:28     ` Greg KH
2021-06-23 16:51     ` Greg KH
2021-06-23 17:23       ` Luis Chamberlain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).