All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFD driver-core] Lifetime problems of the current driver model
@ 2007-03-30  9:43 Tejun Heo
  2007-03-30 12:29 ` James Bottomley
                   ` (4 more replies)
  0 siblings, 5 replies; 40+ messages in thread
From: Tejun Heo @ 2007-03-30  9:43 UTC (permalink / raw)
  To: gregkh, hugh, cornelia.huck, dmitry.torokhov, oneukum, maneesh,
	rpurdie, James.Bottomley, Jeff Garzik
  Cc: lkml, linux-ide, SCSI Mailing List

Hello, all.

This document tries to describe lifetime problems of the current
device driver model primarily from the point view of device drivers
and establish consensus, or at least, start discussion about how to
solve these problems.  This is primarily based on my experience with
IDE and SCSI layers and my knowledge on other drivers is limited, so I
might have generalized too aggressively.  Feel free to point out.


Table of Contents

1. Why is it this complex?
1-1. Rule #a and #b
1-2. Rule #b and #c
2. How do we solve this?
2-1. Rule #a and #b
2-2. Rule #b and #c
3. How do we get there from here?
4. Some afterthoughts


1. Why is it this complex?

Object lifetime management in device drivers is complex and thus
error-prone under the current driver model.  The primary reason is
that device drivers have to match impedances of three different life
time rules.

Rule a. The device itself.  Many devices are hot pluggable these days.
        Devices come when they come and go when they go.

Rule b. Driver model object lifetime rule.  Fundamental data
        structures used by device drivers are reference counted and
        their lifetime can be arbitrarily extended by userspace sysfs
        access.

Rule c. Module life time rule.  Most device drivers can be compiled as
        modules and thus module busy count should be managed properly.


1-1. Rule #a and #b

Being what it is, the primary concern of a device driver is the
'device' itself.  A device driver should create and release objects
representing the devices it manages as they come and go.  In one way
or another, this just has to be done, so this rule is pretty obvious
and intuitive to device driver writers.

The current device driver model mandates implementation of rule #b in
device drivers.  This makes the object lifetime management complicated
and, more importantly, less intuitive to driver writers.  As
implementing such lifetime management in each device driver is too
painful, most device driver midlayers handle this.  e.g. SCSI midlayer
handles driver model lifetime rules and presents simplified
register/unregister-immediately interface to all low level SCSI
drivers.  IDE tries to do similar and USB and netdev have their own
mechanisms too.

I'm not familiar with other layers, but it took quite a while to get
this right in the SCSI midlayer.  The current code works and low level
drivers can do away with most of lifetime management but the added
complexity is saddening to look at.


1-2. Rule #b and #c

On the surface, it seems that we're doing this mostly correctly but
subtle bugs caused by interaction between driver model lifetime and
module lifetime rules aren't difficult to find.

This happens because the two lifetime rules are closely related but
the relation is not properly represented in device driver model
itself.  A module cannot be unloaded unless all device driver objects
it hosts are released, but the only way to guarantee that is to always
grab module reference whenever grabbing relevant kobject reference.
This is ugly to code and too easy to get wrong.  Here are two such
examples.

Example 1. sysfs_schedule_callback() not grabbing the owning module

  This function is recently added to be used by suicidal sysfs nodes
  such that they don't deadlock when trying to unregister themselves.

 +#include <linux/delay.h>
  static void sysfs_schedule_callback_work(struct work_struct *work)
  {
	struct sysfs_schedule_callback_struct *ss = container_of(work,
			struct sysfs_schedule_callback_struct, work);

 +	msleep(100);
	(ss->func)(ss->data);
	kobject_put(ss->kobj);
	kfree(ss);
  }

  int sysfs_schedule_callback(struct kobject *kobj, void (*func)(void *),
		void *data)
  {
	struct sysfs_schedule_callback_struct *ss;

	ss = kmalloc(sizeof(*ss), GFP_KERNEL);
	if (!ss)
		return -ENOMEM;
	kobject_get(kobj);
	ss->kobj = kobj;
	ss->func = func;
	ss->data = data;
	INIT_WORK(&ss->work, sysfs_schedule_callback_work);
	schedule_work(&ss->work);
	return 0;
  }

  Two lines starting with '+' are inserted to make the problem
  reliably reproducible.  With the above changes,

  # insmod drivers/scsi/scsi_mod.ko; insmod drivers/scsi/sd_mod.ko; insmod drivers/ata/libata.ko; insmod drivers/ata/ahci.ko
  # echo 1 > /sys/block/sda/device/delete; rmmod ahci; rmmod libata; rmmod sd_mod; rmmod scsi_mod

  It's assumed that ahci detects /dev/sda.  The above command sequence
  causes the following oops.

  BUG: unable to handle kernel paging request at virtual address e0984020
  [--snip--]
  EIP is at 0xe0984020
  [--snip--]
   [<c0127132>] run_workqueue+0x92/0x140
   [<c01278c7>] worker_thread+0x137/0x160
   [<c012a513>] kthread+0xa3/0xd0
   [<c0104457>] kernel_thread_helper+0x7/0x10

  The problem here is that kobjec_get() in sysfs_schedule_callback()
  doesn't grab the module backing the kobject it's grabbing.  By the
  time (ss->func)(ss->kobj) runs, scsi_mod is already gone.

Example 2. sysfs attr grabbing the wrong kobject and module

  This bug has been there for a long time, probably from the day these
  attrs are implemented.

  # insmod drivers/scsi/scsi_mod.ko; insmod drivers/scsi/sd_mod.ko; insmod drivers/ata/libata.ko; insmod drivers/ata/ahci.ko
  # cat > /sys/block/sda/device/delete &
  # rmmod ahci; rmmod libata
  # kill %%

  BUG: unable to handle kernel paging request at virtual address e083b1e8
  [--snip--]
  EIP:    0060:[<e0983310>]    Not tainted VLI
  [--snip--]
   [<c0127649>] execute_in_process_context+0x19/0x40
   [<e09820f2>] scsi_target_reap+0x82/0xa0 [scsi_mod]
   [<e0984771>] scsi_device_dev_release_usercontext+0xe1/0x110 [scsi_mod]
   [<c0127649>] execute_in_process_context+0x19/0x40
   [<e09835a3>] scsi_device_dev_release+0x13/0x20 [scsi_mod]
   [<c0287247>] device_release+0x17/0x90
   [<c0213a69>] kobject_cleanup+0x49/0x80
   [<c0213aab>] kobject_release+0xb/0x10
   [<c021482b>] kref_put+0x2b/0x80
   [<c0213a14>] kobject_put+0x14/0x20
   [<c01969af>] sysfs_release+0x5f/0xb0
   [<c015dcb4>] __fput+0xb4/0x1b0
   [<c015de18>] fput+0x18/0x20
   [<c015b207>] filp_close+0x47/0x70
   [<c0119394>] put_files_struct+0xa4/0xc0
   [<c011a4bf>] do_exit+0x11f/0x7c0
   [<c011ab89>] do_group_exit+0x29/0x70
   [<c0123415>] get_signal_to_deliver+0x265/0x3f0
   [<c01036ea>] do_notify_resume+0x8a/0x7a0
   [<c010425a>] work_notifysig+0x13/0x19

  The delete attr's owner is scsi_mod.  However, it's created under
  the sysfs node for the SCSI device under the node for SCSI host
  which is driven by libata and ahci combination.  Opening the delete
  node grabs 1. struct device for the SCSI device which propagates to
  the SCSI host node and 2. module reference for sd_mod.  So, ahci and
  libata can go away even while the SCSI host they implement is still
  alive.

Both examples can be reproduced using any SCSI low level driver.
Attrs implemented similarly as in example #2 are everywhere.  This is
the standard way to implement such attrs.


2. How do we solve this?

2-1. Rule #a and #b

If you think about it, the impedance matching between rule #a and #b
should be done somewhere.  Maybe doing it in each low level driver is
what's intended when the driver model is designed but it's
impractically painful and the reality reflects that by doing it in
midlayers.

It's better if we do it in higher layers.  For example, SCSI has a
mechanism to reject new requests and drain request_queue to allow low
level driver to just unregister an existing device.  IDE currently
doesn't have such feature but it would need to do almost the same
thing to support hotplug.  If request_queue just had a feature to
drain and kill itself, both SCSI and IDE could use it.  It would be
simpler and less error-prone.  On the other hand, if it's pushed
downward, it will cause much more pain in all the low level drivers.

Orphaning sysfs nodes on unregistration is a big step in this
direction.  With sysfs reference counting out of the picture,
implementing 'disconnect immediate' interface only on a few components
(including request_queue) should suffice for most block device
drivers.  I'm not familiar with other drivers but I don't think
they'll be very different.

All in all, I'm hoping something like the following can be done in
device drivers, midlayer or low level.

* For binding

  alloc resources;
  init controller;
  register to upper layers;

* For unbinding

  unregister from upper layers; (no lingering references or objects)
  deinit controller;
  release resources;

This basically nullifies lifetime rule #b from the POV of drivers.


2-2. Rule #b and #c

One way to solve this problem is to subordinate lifetime rule #b to
rule #c.  Each kobject points to its owning module such that grabbing
a kobject automatically grabs the module.  The problem with this
approach is that it requires wide update and makes kobject_get
heavier.

Another more appealing approach is to do nothing.  Solving the problem
between rule #a and #b in the way described above means virtually
nullifying rule #b.  With rule #b gone, rule #c can't conflict with
it.  IOW, no reference from upper layer would be remaining after a
driver unregisters itself from the upper layer - the module can be
safely unloaded.


3. How do we get there from here?

The current device driver model is used by most device drivers - huge
amount of code.  We can't update them at once.  Fortunately, the
suggested solution can easily be implemented gradually.  We can add
'disconnect now' interface to upper driver interface one-by-one and
convert its users gradually.  The existing reference counting
mechanism can be left alone and used to verify that the conversion is
correct by verifying reference count is 0 after unregistering.

Once the conversion is complete (maybe after a year), we can remove
the reference counting mechanism from device driver interface.


4. Some afterthoughts

* Doing this would make module reference counting more flexible.
  Instead of being confined by implementation, it can be used to
  define when to allow and disallow module unloading.  (you can't
  unload a block driver module if a fs on it is mounted but sysfs
  access doesn't matter.)

* Outside of device driver model, kobject is currently used to supply
  something sysfs nodes can be attached to, such that both the
  information supplier and node users have something common to
  reference count.  With orphaning added, there is no need to use
  kobject this way.  IMHO, kobject should have been implementation
  detail inside the device model proper in the first place.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [RFD driver-core] Lifetime problems of the current driver model
@ 2007-04-07 15:48 Alan Stern
  2007-04-08  2:55 ` Tejun Heo
  0 siblings, 1 reply; 40+ messages in thread
From: Alan Stern @ 2007-04-07 15:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Greg KH, hugh, cornelia.huck, Dmitry Torokhov, Oliver Neukum,
	maneesh, rpurdie, James Bottomley, Jeff Garzik,
	Kernel development list

On March 30, 2007, Tejun Heo wrote:

> Hello, all.
> 
> This document tries to describe lifetime problems of the current
> device driver model primarily from the point view of device drivers
> and establish consensus, or at least, start discussion about how to
> solve these problems.  This is primarily based on my experience with
> IDE and SCSI layers and my knowledge on other drivers is limited, so I
> might have generalized too aggressively.  Feel free to point out.

...

> Example 1. sysfs_schedule_callback() not grabbing the owning module
> 
>   This function is recently added to be used by suicidal sysfs nodes
>   such that they don't deadlock when trying to unregister themselves.
> 
>  +#include <linux/delay.h>
>   static void sysfs_schedule_callback_work(struct work_struct *work)
>   {
> 	struct sysfs_schedule_callback_struct *ss = container_of(work,
> 			struct sysfs_schedule_callback_struct, work);
> 
>  +	msleep(100);
> 	(ss->func)(ss->data);
> 	kobject_put(ss->kobj);
> 	kfree(ss);
>   }
> 
>   int sysfs_schedule_callback(struct kobject *kobj, void (*func)(void *),
> 		void *data)
>   {
> 	struct sysfs_schedule_callback_struct *ss;
> 
> 	ss = kmalloc(sizeof(*ss), GFP_KERNEL);
> 	if (!ss)
> 		return -ENOMEM;
> 	kobject_get(kobj);
> 	ss->kobj = kobj;
> 	ss->func = func;
> 	ss->data = data;
> 	INIT_WORK(&ss->work, sysfs_schedule_callback_work);
> 	schedule_work(&ss->work);
> 	return 0;
>   }
> 
>   Two lines starting with '+' are inserted to make the problem
>   reliably reproducible.  With the above changes,
> 
>   # insmod drivers/scsi/scsi_mod.ko; insmod drivers/scsi/sd_mod.ko; insmod drivers/ata/libata.ko;
> insmod drivers/ata/ahci.ko
>   # echo 1 > /sys/block/sda/device/delete; rmmod ahci; rmmod libata; rmmod sd_mod; rmmod scsi_mod
> 
>   It's assumed that ahci detects /dev/sda.  The above command sequence
>   causes the following oops.
> 
>   BUG: unable to handle kernel paging request at virtual address e0984020
>   [--snip--]
>   EIP is at 0xe0984020
>   [--snip--]
>    [<c0127132>] run_workqueue+0x92/0x140
>    [<c01278c7>] worker_thread+0x137/0x160
>    [<c012a513>] kthread+0xa3/0xd0
>    [<c0104457>] kernel_thread_helper+0x7/0x10
> 
>   The problem here is that kobjec_get() in sysfs_schedule_callback()
>   doesn't grab the module backing the kobject it's grabbing.  By the
>   time (ss->func)(ss->kobj) runs, scsi_mod is already gone.

As the author of this routine, I wish you had included my name in your
CC: list.  :-(

The problem here isn't exactly as you described.  scsi_mod needs to be
pinned (1) because it is the owner of the kobject and hence will be
called when the kobject is released, and (2) because it is the owner
of the callback routine.  However this is just a detail; clearly the
bug needs to be fixed.

One possibility would be to have scsi_mod's exit_scsi() routine call
flush_scheduled_work().  Another would be to add such a call in
sys_delete_module().  Neither of these is attractive.  They would add
overhead when it's not needed, and they would deadlock if a workqueue
routine tried to unload a module.

On balance, the patch below seems better.  Do you agree?


With regard to your analysis of lifetime issues, there is a whole
aspect you did not mention.  A basic assumption of the refcounting
approach is that once X has a reference to Y, X can freely access and
use Y as much as it wants until it drops the reference.

However this is not true when X is a device driver and Y is a device
structure.  Drivers can be unbound from devices.  If X has been
unbound from Y then it must not access Y again, no matter how many
references it possesses.  After all, some other driver may have bound
to Y in the meantime; this other driver would not appreciate the
interference.

Just as bad, if Y represents a hot-pluggable device then some other
device may have been plugged in and may be using Y's old address.  We
don't want X sending commands to a new device, thinking that it is Y!

The complications caused by this requirement affect both the subsystem
code and device drivers.  Drivers must synchronize their release()
methods with every action they take -- and refcounts cannot provide
synchronization.

A similar problem afflicts the char-device subsystem, and here even
less care has been taken to address the issues.  The race between
open() and unregister() is resolved in many places by relying on the
BKL!

We should be able to make things better and easier than they are.
Orphaning open sysfs files was a move in this direction.  But I doubt
they will ever become truly simple and clear.

Alan Stern



Index: usb-2.6/drivers/base/core.c
===================================================================
--- usb-2.6.orig/drivers/base/core.c
+++ usb-2.6/drivers/base/core.c
@@ -431,9 +431,10 @@ void device_remove_bin_file(struct devic
 EXPORT_SYMBOL_GPL(device_remove_bin_file);
 
 /**
- * device_schedule_callback - helper to schedule a callback for a device
+ * device_schedule_callback_owner - helper to schedule a callback for a device
  * @dev: device.
  * @func: callback function to invoke later.
+ * @owner: module owning the callback routine
  *
  * Attribute methods must not unregister themselves or their parent device
  * (which would amount to the same thing).  Attempts to do so will deadlock,
@@ -444,20 +445,23 @@ EXPORT_SYMBOL_GPL(device_remove_bin_file
  * argument in the workqueue's process context.  @dev will be pinned until
  * @func returns.
  *
+ * This routine is usually called via the inline device_schedule_callback(),
+ * which automatically sets @owner to THIS_MODULE.
+ *
  * Returns 0 if the request was submitted, -ENOMEM if storage could not
- * be allocated.
+ * be allocated, -ENODEV if a reference to @owner isn't available.
  *
  * NOTE: This routine won't work if CONFIG_SYSFS isn't set!  It uses an
  * underlying sysfs routine (since it is intended for use by attribute
  * methods), and if sysfs isn't available you'll get nothing but -ENOSYS.
  */
-int device_schedule_callback(struct device *dev,
-		void (*func)(struct device *))
+int device_schedule_callback_owner(struct device *dev,
+		void (*func)(struct device *), struct module *owner)
 {
 	return sysfs_schedule_callback(&dev->kobj,
-			(void (*)(void *)) func, dev);
+			(void (*)(void *)) func, dev, owner);
 }
-EXPORT_SYMBOL_GPL(device_schedule_callback);
+EXPORT_SYMBOL_GPL(device_schedule_callback_owner);
 
 static void klist_children_get(struct klist_node *n)
 {
Index: usb-2.6/fs/sysfs/file.c
===================================================================
--- usb-2.6.orig/fs/sysfs/file.c
+++ usb-2.6/fs/sysfs/file.c
@@ -647,6 +647,7 @@ struct sysfs_schedule_callback_struct {
 	struct kobject 		*kobj;
 	void			(*func)(void *);
 	void			*data;
+	struct module		*owner;
 	struct work_struct	work;
 };
 
@@ -657,6 +658,7 @@ static void sysfs_schedule_callback_work
 
 	(ss->func)(ss->data);
 	kobject_put(ss->kobj);
+	module_put(ss->owner);
 	kfree(ss);
 }
 
@@ -665,6 +667,7 @@ static void sysfs_schedule_callback_work
  * @kobj: object we're acting for.
  * @func: callback function to invoke later.
  * @data: argument to pass to @func.
+ * @owner: module owning the callback code
  *
  * sysfs attribute methods must not unregister themselves or their parent
  * kobject (which would amount to the same thing).  Attempts to do so will
@@ -677,20 +680,25 @@ static void sysfs_schedule_callback_work
  * until @func returns.
  *
  * Returns 0 if the request was submitted, -ENOMEM if storage could not
- * be allocated.
+ * be allocated, -ENODEV if a reference to @owner isn't available.
  */
 int sysfs_schedule_callback(struct kobject *kobj, void (*func)(void *),
-		void *data)
+		void *data, struct module *owner)
 {
 	struct sysfs_schedule_callback_struct *ss;
 
+	if (!try_module_get(owner))
+		return -ENODEV;
 	ss = kmalloc(sizeof(*ss), GFP_KERNEL);
-	if (!ss)
+	if (!ss) {
+		module_put(owner);
 		return -ENOMEM;
+	}
 	kobject_get(kobj);
 	ss->kobj = kobj;
 	ss->func = func;
 	ss->data = data;
+	ss->owner = owner;
 	INIT_WORK(&ss->work, sysfs_schedule_callback_work);
 	schedule_work(&ss->work);
 	return 0;
Index: usb-2.6/include/linux/device.h
===================================================================
--- usb-2.6.orig/include/linux/device.h
+++ usb-2.6/include/linux/device.h
@@ -369,8 +369,14 @@ extern int __must_check device_create_bi
 					       struct bin_attribute *attr);
 extern void device_remove_bin_file(struct device *dev,
 				   struct bin_attribute *attr);
-extern int device_schedule_callback(struct device *dev,
-		void (*func)(struct device *));
+extern int device_schedule_callback_owner(struct device *dev,
+		void (*func)(struct device *), struct module *owner);
+
+static inline int device_schedule_callback(struct device *dev,
+		void (*func)(struct device *))
+{
+	device_schedule_callback_mod(dev, func, THIS_MODULE);
+}
 
 /* device resource management */
 typedef void (*dr_release_t)(struct device *dev, void *res);
Index: usb-2.6/include/linux/sysfs.h
===================================================================
--- usb-2.6.orig/include/linux/sysfs.h
+++ usb-2.6/include/linux/sysfs.h
@@ -80,7 +80,7 @@ struct sysfs_ops {
 #ifdef CONFIG_SYSFS
 
 extern int sysfs_schedule_callback(struct kobject *kobj,
-		void (*func)(void *), void *data);
+		void (*func)(void *), void *data, struct module *owner);
 
 extern int __must_check
 sysfs_create_dir(struct kobject *, struct dentry *);
@@ -138,7 +138,7 @@ extern int __must_check sysfs_init(void)
 #else /* CONFIG_SYSFS */
 
 static inline int sysfs_schedule_callback(struct kobject *kobj,
-		void (*func)(void *), void *data)
+		void (*func)(void *), void *data, struct module *owner)
 {
 	return -ENOSYS;
 }


^ permalink raw reply	[flat|nested] 40+ messages in thread
* [RFD driver-core] Lifetime problems of the current driver model
@ 2007-03-30  9:43 Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2007-03-30  9:43 UTC (permalink / raw)
  To: gregkh, hugh, cornelia.huck, dmitry.torokhov, oneukum, maneesh,
	rpurdie, James.Bottomley, Jeff Garzik
  Cc: lkml, linux-ide, SCSI Mailing List

Hello, all.

This document tries to describe lifetime problems of the current
device driver model primarily from the point view of device drivers
and establish consensus, or at least, start discussion about how to
solve these problems.  This is primarily based on my experience with
IDE and SCSI layers and my knowledge on other drivers is limited, so I
might have generalized too aggressively.  Feel free to point out.


Table of Contents

1. Why is it this complex?
1-1. Rule #a and #b
1-2. Rule #b and #c
2. How do we solve this?
2-1. Rule #a and #b
2-2. Rule #b and #c
3. How do we get there from here?
4. Some afterthoughts


1. Why is it this complex?

Object lifetime management in device drivers is complex and thus
error-prone under the current driver model.  The primary reason is
that device drivers have to match impedances of three different life
time rules.

Rule a. The device itself.  Many devices are hot pluggable these days.
        Devices come when they come and go when they go.

Rule b. Driver model object lifetime rule.  Fundamental data
        structures used by device drivers are reference counted and
        their lifetime can be arbitrarily extended by userspace sysfs
        access.

Rule c. Module life time rule.  Most device drivers can be compiled as
        modules and thus module busy count should be managed properly.


1-1. Rule #a and #b

Being what it is, the primary concern of a device driver is the
'device' itself.  A device driver should create and release objects
representing the devices it manages as they come and go.  In one way
or another, this just has to be done, so this rule is pretty obvious
and intuitive to device driver writers.

The current device driver model mandates implementation of rule #b in
device drivers.  This makes the object lifetime management complicated
and, more importantly, less intuitive to driver writers.  As
implementing such lifetime management in each device driver is too
painful, most device driver midlayers handle this.  e.g. SCSI midlayer
handles driver model lifetime rules and presents simplified
register/unregister-immediately interface to all low level SCSI
drivers.  IDE tries to do similar and USB and netdev have their own
mechanisms too.

I'm not familiar with other layers, but it took quite a while to get
this right in the SCSI midlayer.  The current code works and low level
drivers can do away with most of lifetime management but the added
complexity is saddening to look at.


1-2. Rule #b and #c

On the surface, it seems that we're doing this mostly correctly but
subtle bugs caused by interaction between driver model lifetime and
module lifetime rules aren't difficult to find.

This happens because the two lifetime rules are closely related but
the relation is not properly represented in device driver model
itself.  A module cannot be unloaded unless all device driver objects
it hosts are released, but the only way to guarantee that is to always
grab module reference whenever grabbing relevant kobject reference.
This is ugly to code and too easy to get wrong.  Here are two such
examples.

Example 1. sysfs_schedule_callback() not grabbing the owning module

  This function is recently added to be used by suicidal sysfs nodes
  such that they don't deadlock when trying to unregister themselves.

 +#include <linux/delay.h>
  static void sysfs_schedule_callback_work(struct work_struct *work)
  {
	struct sysfs_schedule_callback_struct *ss = container_of(work,
			struct sysfs_schedule_callback_struct, work);

 +	msleep(100);
	(ss->func)(ss->data);
	kobject_put(ss->kobj);
	kfree(ss);
  }

  int sysfs_schedule_callback(struct kobject *kobj, void (*func)(void *),
		void *data)
  {
	struct sysfs_schedule_callback_struct *ss;

	ss = kmalloc(sizeof(*ss), GFP_KERNEL);
	if (!ss)
		return -ENOMEM;
	kobject_get(kobj);
	ss->kobj = kobj;
	ss->func = func;
	ss->data = data;
	INIT_WORK(&ss->work, sysfs_schedule_callback_work);
	schedule_work(&ss->work);
	return 0;
  }

  Two lines starting with '+' are inserted to make the problem
  reliably reproducible.  With the above changes,

  # insmod drivers/scsi/scsi_mod.ko; insmod drivers/scsi/sd_mod.ko; insmod drivers/ata/libata.ko; insmod drivers/ata/ahci.ko
  # echo 1 > /sys/block/sda/device/delete; rmmod ahci; rmmod libata; rmmod sd_mod; rmmod scsi_mod

  It's assumed that ahci detects /dev/sda.  The above command sequence
  causes the following oops.

  BUG: unable to handle kernel paging request at virtual address e0984020
  [--snip--]
  EIP is at 0xe0984020
  [--snip--]
   [<c0127132>] run_workqueue+0x92/0x140
   [<c01278c7>] worker_thread+0x137/0x160
   [<c012a513>] kthread+0xa3/0xd0
   [<c0104457>] kernel_thread_helper+0x7/0x10

  The problem here is that kobjec_get() in sysfs_schedule_callback()
  doesn't grab the module backing the kobject it's grabbing.  By the
  time (ss->func)(ss->kobj) runs, scsi_mod is already gone.

Example 2. sysfs attr grabbing the wrong kobject and module

  This bug has been there for a long time, probably from the day these
  attrs are implemented.

  # insmod drivers/scsi/scsi_mod.ko; insmod drivers/scsi/sd_mod.ko; insmod drivers/ata/libata.ko; insmod drivers/ata/ahci.ko
  # cat > /sys/block/sda/device/delete &
  # rmmod ahci; rmmod libata
  # kill %%

  BUG: unable to handle kernel paging request at virtual address e083b1e8
  [--snip--]
  EIP:    0060:[<e0983310>]    Not tainted VLI
  [--snip--]
   [<c0127649>] execute_in_process_context+0x19/0x40
   [<e09820f2>] scsi_target_reap+0x82/0xa0 [scsi_mod]
   [<e0984771>] scsi_device_dev_release_usercontext+0xe1/0x110 [scsi_mod]
   [<c0127649>] execute_in_process_context+0x19/0x40
   [<e09835a3>] scsi_device_dev_release+0x13/0x20 [scsi_mod]
   [<c0287247>] device_release+0x17/0x90
   [<c0213a69>] kobject_cleanup+0x49/0x80
   [<c0213aab>] kobject_release+0xb/0x10
   [<c021482b>] kref_put+0x2b/0x80
   [<c0213a14>] kobject_put+0x14/0x20
   [<c01969af>] sysfs_release+0x5f/0xb0
   [<c015dcb4>] __fput+0xb4/0x1b0
   [<c015de18>] fput+0x18/0x20
   [<c015b207>] filp_close+0x47/0x70
   [<c0119394>] put_files_struct+0xa4/0xc0
   [<c011a4bf>] do_exit+0x11f/0x7c0
   [<c011ab89>] do_group_exit+0x29/0x70
   [<c0123415>] get_signal_to_deliver+0x265/0x3f0
   [<c01036ea>] do_notify_resume+0x8a/0x7a0
   [<c010425a>] work_notifysig+0x13/0x19

  The delete attr's owner is scsi_mod.  However, it's created under
  the sysfs node for the SCSI device under the node for SCSI host
  which is driven by libata and ahci combination.  Opening the delete
  node grabs 1. struct device for the SCSI device which propagates to
  the SCSI host node and 2. module reference for sd_mod.  So, ahci and
  libata can go away even while the SCSI host they implement is still
  alive.

Both examples can be reproduced using any SCSI low level driver.
Attrs implemented similarly as in example #2 are everywhere.  This is
the standard way to implement such attrs.


2. How do we solve this?

2-1. Rule #a and #b

If you think about it, the impedance matching between rule #a and #b
should be done somewhere.  Maybe doing it in each low level driver is
what's intended when the driver model is designed but it's
impractically painful and the reality reflects that by doing it in
midlayers.

It's better if we do it in higher layers.  For example, SCSI has a
mechanism to reject new requests and drain request_queue to allow low
level driver to just unregister an existing device.  IDE currently
doesn't have such feature but it would need to do almost the same
thing to support hotplug.  If request_queue just had a feature to
drain and kill itself, both SCSI and IDE could use it.  It would be
simpler and less error-prone.  On the other hand, if it's pushed
downward, it will cause much more pain in all the low level drivers.

Orphaning sysfs nodes on unregistration is a big step in this
direction.  With sysfs reference counting out of the picture,
implementing 'disconnect immediate' interface only on a few components
(including request_queue) should suffice for most block device
drivers.  I'm not familiar with other drivers but I don't think
they'll be very different.

All in all, I'm hoping something like the following can be done in
device drivers, midlayer or low level.

* For binding

  alloc resources;
  init controller;
  register to upper layers;

* For unbinding

  unregister from upper layers; (no lingering references or objects)
  deinit controller;
  release resources;

This basically nullifies lifetime rule #b from the POV of drivers.


2-2. Rule #b and #c

One way to solve this problem is to subordinate lifetime rule #b to
rule #c.  Each kobject points to its owning module such that grabbing
a kobject automatically grabs the module.  The problem with this
approach is that it requires wide update and makes kobject_get
heavier.

Another more appealing approach is to do nothing.  Solving the problem
between rule #a and #b in the way described above means virtually
nullifying rule #b.  With rule #b gone, rule #c can't conflict with
it.  IOW, no reference from upper layer would be remaining after a
driver unregisters itself from the upper layer - the module can be
safely unloaded.


3. How do we get there from here?

The current device driver model is used by most device drivers - huge
amount of code.  We can't update them at once.  Fortunately, the
suggested solution can easily be implemented gradually.  We can add
'disconnect now' interface to upper driver interface one-by-one and
convert its users gradually.  The existing reference counting
mechanism can be left alone and used to verify that the conversion is
correct by verifying reference count is 0 after unregistering.

Once the conversion is complete (maybe after a year), we can remove
the reference counting mechanism from device driver interface.


4. Some afterthoughts

* Doing this would make module reference counting more flexible.
  Instead of being confined by implementation, it can be used to
  define when to allow and disallow module unloading.  (you can't
  unload a block driver module if a fs on it is mounted but sysfs
  access doesn't matter.)

* Outside of device driver model, kobject is currently used to supply
  something sysfs nodes can be attached to, such that both the
  information supplier and node users have something common to
  reference count.  With orphaning added, there is no need to use
  kobject this way.  IMHO, kobject should have been implementation
  detail inside the device model proper in the first place.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2007-04-08  2:55 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-30  9:43 [RFD driver-core] Lifetime problems of the current driver model Tejun Heo
2007-03-30 12:29 ` James Bottomley
2007-03-30 12:29 ` James Bottomley
2007-03-30 13:15   ` Dmitry Torokhov
2007-03-30 13:15   ` Dmitry Torokhov
2007-03-30 17:58     ` James Bottomley
2007-03-30 18:18       ` Dmitry Torokhov
2007-03-30 13:38   ` Tejun Heo
2007-03-30 17:41     ` Greg KH
2007-03-30 18:19     ` James Bottomley
2007-04-01 19:59       ` Tejun Heo
2007-04-02  9:20         ` Cornelia Huck
2007-04-02 15:34           ` Cornelia Huck
2007-04-03  3:08             ` Tejun Heo
2007-04-02  9:33         ` Greg KH
2007-04-02 12:10         ` Maneesh Soni
2007-04-02 19:33       ` Luben Tuikov
2007-03-30 13:38   ` Tejun Heo
2007-03-30 13:19 ` Cornelia Huck
2007-03-30 13:19 ` Cornelia Huck
2007-03-30 13:19   ` Tejun Heo
2007-03-30 13:19   ` Tejun Heo
2007-03-30 13:40     ` Cornelia Huck
2007-03-30 13:58       ` Tejun Heo
2007-03-30 13:58       ` Tejun Heo
2007-03-30 14:52         ` Cornelia Huck
2007-03-30 15:08           ` Tejun Heo
2007-03-30 15:08           ` Tejun Heo
2007-03-30 19:31             ` Cornelia Huck
2007-03-31  3:12               ` Tejun Heo
2007-03-31  3:15                 ` Tejun Heo
2007-03-31 16:08                 ` Cornelia Huck
2007-03-31 16:14                   ` Tejun Heo
2007-04-02 19:24                 ` Luben Tuikov
2007-03-30 14:52         ` Cornelia Huck
2007-03-30 13:40     ` Cornelia Huck
2007-03-30 17:38 ` Greg KH
  -- strict thread matches above, loose matches on Subject: below --
2007-04-07 15:48 Alan Stern
2007-04-08  2:55 ` Tejun Heo
2007-03-30  9:43 Tejun Heo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.