All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFD driver-core] Lifetime problems of the current driver model
@ 2007-03-30  9:43 Tejun Heo
  2007-03-30 12:29 ` James Bottomley
                   ` (4 more replies)
  0 siblings, 5 replies; 40+ messages in thread
From: Tejun Heo @ 2007-03-30  9:43 UTC (permalink / raw)
  To: gregkh, hugh, cornelia.huck, dmitry.torokhov, oneukum, maneesh,
	rpurdie, James.Bottomley, Jeff Garzik
  Cc: lkml, linux-ide, SCSI Mailing List

Hello, all.

This document tries to describe lifetime problems of the current
device driver model primarily from the point view of device drivers
and establish consensus, or at least, start discussion about how to
solve these problems.  This is primarily based on my experience with
IDE and SCSI layers and my knowledge on other drivers is limited, so I
might have generalized too aggressively.  Feel free to point out.


Table of Contents

1. Why is it this complex?
1-1. Rule #a and #b
1-2. Rule #b and #c
2. How do we solve this?
2-1. Rule #a and #b
2-2. Rule #b and #c
3. How do we get there from here?
4. Some afterthoughts


1. Why is it this complex?

Object lifetime management in device drivers is complex and thus
error-prone under the current driver model.  The primary reason is
that device drivers have to match impedances of three different life
time rules.

Rule a. The device itself.  Many devices are hot pluggable these days.
        Devices come when they come and go when they go.

Rule b. Driver model object lifetime rule.  Fundamental data
        structures used by device drivers are reference counted and
        their lifetime can be arbitrarily extended by userspace sysfs
        access.

Rule c. Module life time rule.  Most device drivers can be compiled as
        modules and thus module busy count should be managed properly.


1-1. Rule #a and #b

Being what it is, the primary concern of a device driver is the
'device' itself.  A device driver should create and release objects
representing the devices it manages as they come and go.  In one way
or another, this just has to be done, so this rule is pretty obvious
and intuitive to device driver writers.

The current device driver model mandates implementation of rule #b in
device drivers.  This makes the object lifetime management complicated
and, more importantly, less intuitive to driver writers.  As
implementing such lifetime management in each device driver is too
painful, most device driver midlayers handle this.  e.g. SCSI midlayer
handles driver model lifetime rules and presents simplified
register/unregister-immediately interface to all low level SCSI
drivers.  IDE tries to do similar and USB and netdev have their own
mechanisms too.

I'm not familiar with other layers, but it took quite a while to get
this right in the SCSI midlayer.  The current code works and low level
drivers can do away with most of lifetime management but the added
complexity is saddening to look at.


1-2. Rule #b and #c

On the surface, it seems that we're doing this mostly correctly but
subtle bugs caused by interaction between driver model lifetime and
module lifetime rules aren't difficult to find.

This happens because the two lifetime rules are closely related but
the relation is not properly represented in device driver model
itself.  A module cannot be unloaded unless all device driver objects
it hosts are released, but the only way to guarantee that is to always
grab module reference whenever grabbing relevant kobject reference.
This is ugly to code and too easy to get wrong.  Here are two such
examples.

Example 1. sysfs_schedule_callback() not grabbing the owning module

  This function is recently added to be used by suicidal sysfs nodes
  such that they don't deadlock when trying to unregister themselves.

 +#include <linux/delay.h>
  static void sysfs_schedule_callback_work(struct work_struct *work)
  {
	struct sysfs_schedule_callback_struct *ss = container_of(work,
			struct sysfs_schedule_callback_struct, work);

 +	msleep(100);
	(ss->func)(ss->data);
	kobject_put(ss->kobj);
	kfree(ss);
  }

  int sysfs_schedule_callback(struct kobject *kobj, void (*func)(void *),
		void *data)
  {
	struct sysfs_schedule_callback_struct *ss;

	ss = kmalloc(sizeof(*ss), GFP_KERNEL);
	if (!ss)
		return -ENOMEM;
	kobject_get(kobj);
	ss->kobj = kobj;
	ss->func = func;
	ss->data = data;
	INIT_WORK(&ss->work, sysfs_schedule_callback_work);
	schedule_work(&ss->work);
	return 0;
  }

  Two lines starting with '+' are inserted to make the problem
  reliably reproducible.  With the above changes,

  # insmod drivers/scsi/scsi_mod.ko; insmod drivers/scsi/sd_mod.ko; insmod drivers/ata/libata.ko; insmod drivers/ata/ahci.ko
  # echo 1 > /sys/block/sda/device/delete; rmmod ahci; rmmod libata; rmmod sd_mod; rmmod scsi_mod

  It's assumed that ahci detects /dev/sda.  The above command sequence
  causes the following oops.

  BUG: unable to handle kernel paging request at virtual address e0984020
  [--snip--]
  EIP is at 0xe0984020
  [--snip--]
   [<c0127132>] run_workqueue+0x92/0x140
   [<c01278c7>] worker_thread+0x137/0x160
   [<c012a513>] kthread+0xa3/0xd0
   [<c0104457>] kernel_thread_helper+0x7/0x10

  The problem here is that kobjec_get() in sysfs_schedule_callback()
  doesn't grab the module backing the kobject it's grabbing.  By the
  time (ss->func)(ss->kobj) runs, scsi_mod is already gone.

Example 2. sysfs attr grabbing the wrong kobject and module

  This bug has been there for a long time, probably from the day these
  attrs are implemented.

  # insmod drivers/scsi/scsi_mod.ko; insmod drivers/scsi/sd_mod.ko; insmod drivers/ata/libata.ko; insmod drivers/ata/ahci.ko
  # cat > /sys/block/sda/device/delete &
  # rmmod ahci; rmmod libata
  # kill %%

  BUG: unable to handle kernel paging request at virtual address e083b1e8
  [--snip--]
  EIP:    0060:[<e0983310>]    Not tainted VLI
  [--snip--]
   [<c0127649>] execute_in_process_context+0x19/0x40
   [<e09820f2>] scsi_target_reap+0x82/0xa0 [scsi_mod]
   [<e0984771>] scsi_device_dev_release_usercontext+0xe1/0x110 [scsi_mod]
   [<c0127649>] execute_in_process_context+0x19/0x40
   [<e09835a3>] scsi_device_dev_release+0x13/0x20 [scsi_mod]
   [<c0287247>] device_release+0x17/0x90
   [<c0213a69>] kobject_cleanup+0x49/0x80
   [<c0213aab>] kobject_release+0xb/0x10
   [<c021482b>] kref_put+0x2b/0x80
   [<c0213a14>] kobject_put+0x14/0x20
   [<c01969af>] sysfs_release+0x5f/0xb0
   [<c015dcb4>] __fput+0xb4/0x1b0
   [<c015de18>] fput+0x18/0x20
   [<c015b207>] filp_close+0x47/0x70
   [<c0119394>] put_files_struct+0xa4/0xc0
   [<c011a4bf>] do_exit+0x11f/0x7c0
   [<c011ab89>] do_group_exit+0x29/0x70
   [<c0123415>] get_signal_to_deliver+0x265/0x3f0
   [<c01036ea>] do_notify_resume+0x8a/0x7a0
   [<c010425a>] work_notifysig+0x13/0x19

  The delete attr's owner is scsi_mod.  However, it's created under
  the sysfs node for the SCSI device under the node for SCSI host
  which is driven by libata and ahci combination.  Opening the delete
  node grabs 1. struct device for the SCSI device which propagates to
  the SCSI host node and 2. module reference for sd_mod.  So, ahci and
  libata can go away even while the SCSI host they implement is still
  alive.

Both examples can be reproduced using any SCSI low level driver.
Attrs implemented similarly as in example #2 are everywhere.  This is
the standard way to implement such attrs.


2. How do we solve this?

2-1. Rule #a and #b

If you think about it, the impedance matching between rule #a and #b
should be done somewhere.  Maybe doing it in each low level driver is
what's intended when the driver model is designed but it's
impractically painful and the reality reflects that by doing it in
midlayers.

It's better if we do it in higher layers.  For example, SCSI has a
mechanism to reject new requests and drain request_queue to allow low
level driver to just unregister an existing device.  IDE currently
doesn't have such feature but it would need to do almost the same
thing to support hotplug.  If request_queue just had a feature to
drain and kill itself, both SCSI and IDE could use it.  It would be
simpler and less error-prone.  On the other hand, if it's pushed
downward, it will cause much more pain in all the low level drivers.

Orphaning sysfs nodes on unregistration is a big step in this
direction.  With sysfs reference counting out of the picture,
implementing 'disconnect immediate' interface only on a few components
(including request_queue) should suffice for most block device
drivers.  I'm not familiar with other drivers but I don't think
they'll be very different.

All in all, I'm hoping something like the following can be done in
device drivers, midlayer or low level.

* For binding

  alloc resources;
  init controller;
  register to upper layers;

* For unbinding

  unregister from upper layers; (no lingering references or objects)
  deinit controller;
  release resources;

This basically nullifies lifetime rule #b from the POV of drivers.


2-2. Rule #b and #c

One way to solve this problem is to subordinate lifetime rule #b to
rule #c.  Each kobject points to its owning module such that grabbing
a kobject automatically grabs the module.  The problem with this
approach is that it requires wide update and makes kobject_get
heavier.

Another more appealing approach is to do nothing.  Solving the problem
between rule #a and #b in the way described above means virtually
nullifying rule #b.  With rule #b gone, rule #c can't conflict with
it.  IOW, no reference from upper layer would be remaining after a
driver unregisters itself from the upper layer - the module can be
safely unloaded.


3. How do we get there from here?

The current device driver model is used by most device drivers - huge
amount of code.  We can't update them at once.  Fortunately, the
suggested solution can easily be implemented gradually.  We can add
'disconnect now' interface to upper driver interface one-by-one and
convert its users gradually.  The existing reference counting
mechanism can be left alone and used to verify that the conversion is
correct by verifying reference count is 0 after unregistering.

Once the conversion is complete (maybe after a year), we can remove
the reference counting mechanism from device driver interface.


4. Some afterthoughts

* Doing this would make module reference counting more flexible.
  Instead of being confined by implementation, it can be used to
  define when to allow and disallow module unloading.  (you can't
  unload a block driver module if a fs on it is mounted but sysfs
  access doesn't matter.)

* Outside of device driver model, kobject is currently used to supply
  something sysfs nodes can be attached to, such that both the
  information supplier and node users have something common to
  reference count.  With orphaning added, there is no need to use
  kobject this way.  IMHO, kobject should have been implementation
  detail inside the device model proper in the first place.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30  9:43 [RFD driver-core] Lifetime problems of the current driver model Tejun Heo
  2007-03-30 12:29 ` James Bottomley
@ 2007-03-30 12:29 ` James Bottomley
  2007-03-30 13:15   ` Dmitry Torokhov
                     ` (3 more replies)
  2007-03-30 13:19 ` Cornelia Huck
                   ` (2 subsequent siblings)
  4 siblings, 4 replies; 40+ messages in thread
From: James Bottomley @ 2007-03-30 12:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, hugh, cornelia.huck, dmitry.torokhov, oneukum, maneesh,
	rpurdie, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

On Fri, 2007-03-30 at 18:43 +0900, Tejun Heo wrote:
> Orphaning sysfs nodes on unregistration is a big step in this
> direction.  With sysfs reference counting out of the picture,
> implementing 'disconnect immediate' interface only on a few components
> (including request_queue) should suffice for most block device
> drivers.  I'm not familiar with other drivers but I don't think
> they'll be very different.

I agree with this statement.  The big question in my mind is how do we
do it?

The essential problem, and the reason why the lifetime rules are
entangled is that fundamentally, sysfs entries are managed by kobjects.
The things the device drivers are interested in is struct device, which
is usually embedded in driver data structures.  Unfortunately, the
required kobject is usually embedded in struct device meaning that the
relevant driver structure cannot be freed while the sysfs entry still
exists.

It seems to me that the only way to Orphan the sysfs entries is to
separate the kobject and the struct device so their lifetime rules are
no longer entangled.  Then we can free the driver structure with the
embedded struct device while still keeping the necessary kobject around
to perform orphaned sysfs operations.

So it seems to me that what we need to do is to give struct device its
own kref and a pointer to a kobject.  Then we can manage the separate
lifetimes independently.  The device would basically allocate and take a
reference to the kobject on device_add() and relinquish it again (and
null out the kobject pointer) on device_del().  The complexity here is
that we now have to allocate the kobject somewhere else ... probably in
device_add() (it doesn't come for free with the device structures).

James



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30  9:43 [RFD driver-core] Lifetime problems of the current driver model Tejun Heo
@ 2007-03-30 12:29 ` James Bottomley
  2007-03-30 12:29 ` James Bottomley
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 40+ messages in thread
From: James Bottomley @ 2007-03-30 12:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, hugh, cornelia.huck, dmitry.torokhov, oneukum, maneesh,
	rpurdie, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

On Fri, 2007-03-30 at 18:43 +0900, Tejun Heo wrote:
> Orphaning sysfs nodes on unregistration is a big step in this
> direction.  With sysfs reference counting out of the picture,
> implementing 'disconnect immediate' interface only on a few components
> (including request_queue) should suffice for most block device
> drivers.  I'm not familiar with other drivers but I don't think
> they'll be very different.

I agree with this statement.  The big question in my mind is how do we
do it?

The essential problem, and the reason why the lifetime rules are
entangled is that fundamentally, sysfs entries are managed by kobjects.
The things the device drivers are interested in is struct device, which
is usually embedded in driver data structures.  Unfortunately, the
required kobject is usually embedded in struct device meaning that the
relevant driver structure cannot be freed while the sysfs entry still
exists.

It seems to me that the only way to Orphan the sysfs entries is to
separate the kobject and the struct device so their lifetime rules are
no longer entangled.  Then we can free the driver structure with the
embedded struct device while still keeping the necessary kobject around
to perform orphaned sysfs operations.

So it seems to me that what we need to do is to give struct device its
own kref and a pointer to a kobject.  Then we can manage the separate
lifetimes independently.  The device would basically allocate and take a
reference to the kobject on device_add() and relinquish it again (and
null out the kobject pointer) on device_del().  The complexity here is
that we now have to allocate the kobject somewhere else ... probably in
device_add() (it doesn't come for free with the device structures).

James



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 12:29 ` James Bottomley
@ 2007-03-30 13:15   ` Dmitry Torokhov
  2007-03-30 13:15   ` Dmitry Torokhov
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 40+ messages in thread
From: Dmitry Torokhov @ 2007-03-30 13:15 UTC (permalink / raw)
  To: James Bottomley
  Cc: Tejun Heo, gregkh, hugh, cornelia.huck, oneukum, maneesh,
	rpurdie, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

Hi James,

On 3/30/07, James Bottomley <James.Bottomley@steeleye.com> wrote:
> On Fri, 2007-03-30 at 18:43 +0900, Tejun Heo wrote:
> > Orphaning sysfs nodes on unregistration is a big step in this
> > direction.  With sysfs reference counting out of the picture,
> > implementing 'disconnect immediate' interface only on a few components
> > (including request_queue) should suffice for most block device
> > drivers.  I'm not familiar with other drivers but I don't think
> > they'll be very different.
>
> I agree with this statement.  The big question in my mind is how do we
> do it?
>
> The essential problem, and the reason why the lifetime rules are
> entangled is that fundamentally, sysfs entries are managed by kobjects.
> The things the device drivers are interested in is struct device, which
> is usually embedded in driver data structures.  Unfortunately, the
> required kobject is usually embedded in struct device meaning that the
> relevant driver structure cannot be freed while the sysfs entry still
> exists.
>
> It seems to me that the only way to Orphan the sysfs entries is to
> separate the kobject and the struct device so their lifetime rules are
> no longer entangled.  Then we can free the driver structure with the
> embedded struct device while still keeping the necessary kobject around
> to perform orphaned sysfs operations.
>
> So it seems to me that what we need to do is to give struct device its
> own kref and a pointer to a kobject.  Then we can manage the separate
> lifetimes independently.  The device would basically allocate and take a
> reference to the kobject on device_add() and relinquish it again (and
> null out the kobject pointer) on device_del().  The complexity here is
> that we now have to allocate the kobject somewhere else ... probably in
> device_add() (it doesn't come for free with the device structures).
>

If you want to manage lifetime rules independently you might want to
not embed struct device into you subsystems objects but attach them
via pointers and use device_create(). Now that we orphan sysfs access
upon unregistering device this will severe all ties from driver core
to your system once you start teardown of a device and you should be
in clear.

However there are simpler subsystems (input, serio, etc.) where there
is only one core module which provides services to device drivers and
handles registration and deregistration. For such substustems it makes
sense to embed struct devices and manage lifetime for all components
at once.

-- 
Dmitry

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 12:29 ` James Bottomley
  2007-03-30 13:15   ` Dmitry Torokhov
@ 2007-03-30 13:15   ` Dmitry Torokhov
  2007-03-30 17:58     ` James Bottomley
  2007-03-30 13:38   ` Tejun Heo
  2007-03-30 13:38   ` Tejun Heo
  3 siblings, 1 reply; 40+ messages in thread
From: Dmitry Torokhov @ 2007-03-30 13:15 UTC (permalink / raw)
  To: James Bottomley
  Cc: Tejun Heo, gregkh, hugh, cornelia.huck, oneukum, maneesh,
	rpurdie, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

Hi James,

On 3/30/07, James Bottomley <James.Bottomley@steeleye.com> wrote:
> On Fri, 2007-03-30 at 18:43 +0900, Tejun Heo wrote:
> > Orphaning sysfs nodes on unregistration is a big step in this
> > direction.  With sysfs reference counting out of the picture,
> > implementing 'disconnect immediate' interface only on a few components
> > (including request_queue) should suffice for most block device
> > drivers.  I'm not familiar with other drivers but I don't think
> > they'll be very different.
>
> I agree with this statement.  The big question in my mind is how do we
> do it?
>
> The essential problem, and the reason why the lifetime rules are
> entangled is that fundamentally, sysfs entries are managed by kobjects.
> The things the device drivers are interested in is struct device, which
> is usually embedded in driver data structures.  Unfortunately, the
> required kobject is usually embedded in struct device meaning that the
> relevant driver structure cannot be freed while the sysfs entry still
> exists.
>
> It seems to me that the only way to Orphan the sysfs entries is to
> separate the kobject and the struct device so their lifetime rules are
> no longer entangled.  Then we can free the driver structure with the
> embedded struct device while still keeping the necessary kobject around
> to perform orphaned sysfs operations.
>
> So it seems to me that what we need to do is to give struct device its
> own kref and a pointer to a kobject.  Then we can manage the separate
> lifetimes independently.  The device would basically allocate and take a
> reference to the kobject on device_add() and relinquish it again (and
> null out the kobject pointer) on device_del().  The complexity here is
> that we now have to allocate the kobject somewhere else ... probably in
> device_add() (it doesn't come for free with the device structures).
>

If you want to manage lifetime rules independently you might want to
not embed struct device into you subsystems objects but attach them
via pointers and use device_create(). Now that we orphan sysfs access
upon unregistering device this will severe all ties from driver core
to your system once you start teardown of a device and you should be
in clear.

However there are simpler subsystems (input, serio, etc.) where there
is only one core module which provides services to device drivers and
handles registration and deregistration. For such substustems it makes
sense to embed struct devices and manage lifetime for all components
at once.

-- 
Dmitry

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30  9:43 [RFD driver-core] Lifetime problems of the current driver model Tejun Heo
                   ` (2 preceding siblings ...)
  2007-03-30 13:19 ` Cornelia Huck
@ 2007-03-30 13:19 ` Cornelia Huck
  2007-03-30 13:19   ` Tejun Heo
  2007-03-30 13:19   ` Tejun Heo
  2007-03-30 17:38 ` Greg KH
  4 siblings, 2 replies; 40+ messages in thread
From: Cornelia Huck @ 2007-03-30 13:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, hugh, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

On Fri, 30 Mar 2007 18:43:02 +0900,
Tejun Heo <htejun@gmail.com> wrote:

> One way to solve this problem is to subordinate lifetime rule #b to
> rule #c.  Each kobject points to its owning module such that grabbing
> a kobject automatically grabs the module.  The problem with this
> approach is that it requires wide update and makes kobject_get
> heavier.

Shouldn't getting/putting the module refcount be solely done in
kobject.c? Grab the module reference when the kobject is created and
release the module reference in kobject_cleanup() after the release
function has been called. This doesn't make kobject_get() heavier, and
it ensures we don't delete the module until after the last kobject it is
supposed to clean up has been released.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30  9:43 [RFD driver-core] Lifetime problems of the current driver model Tejun Heo
  2007-03-30 12:29 ` James Bottomley
  2007-03-30 12:29 ` James Bottomley
@ 2007-03-30 13:19 ` Cornelia Huck
  2007-03-30 13:19 ` Cornelia Huck
  2007-03-30 17:38 ` Greg KH
  4 siblings, 0 replies; 40+ messages in thread
From: Cornelia Huck @ 2007-03-30 13:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, hugh, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

On Fri, 30 Mar 2007 18:43:02 +0900,
Tejun Heo <htejun@gmail.com> wrote:

> One way to solve this problem is to subordinate lifetime rule #b to
> rule #c.  Each kobject points to its owning module such that grabbing
> a kobject automatically grabs the module.  The problem with this
> approach is that it requires wide update and makes kobject_get
> heavier.

Shouldn't getting/putting the module refcount be solely done in
kobject.c? Grab the module reference when the kobject is created and
release the module reference in kobject_cleanup() after the release
function has been called. This doesn't make kobject_get() heavier, and
it ensures we don't delete the module until after the last kobject it is
supposed to clean up has been released.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 13:19 ` Cornelia Huck
  2007-03-30 13:19   ` Tejun Heo
@ 2007-03-30 13:19   ` Tejun Heo
  2007-03-30 13:40     ` Cornelia Huck
  2007-03-30 13:40     ` Cornelia Huck
  1 sibling, 2 replies; 40+ messages in thread
From: Tejun Heo @ 2007-03-30 13:19 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: gregkh, hugh, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

Cornelia Huck wrote:
> On Fri, 30 Mar 2007 18:43:02 +0900,
> Tejun Heo <htejun@gmail.com> wrote:
> 
>> One way to solve this problem is to subordinate lifetime rule #b to
>> rule #c.  Each kobject points to its owning module such that grabbing
>> a kobject automatically grabs the module.  The problem with this
>> approach is that it requires wide update and makes kobject_get
>> heavier.
> 
> Shouldn't getting/putting the module refcount be solely done in
> kobject.c? Grab the module reference when the kobject is created and
> release the module reference in kobject_cleanup() after the release
> function has been called. This doesn't make kobject_get() heavier, and
> it ensures we don't delete the module until after the last kobject it is
> supposed to clean up has been released.

If we do that, we wouldn't be able to unload a module if there is any
kobject referencing it even when the node has no openers, so no easy way
out there.  :-(

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 13:19 ` Cornelia Huck
@ 2007-03-30 13:19   ` Tejun Heo
  2007-03-30 13:19   ` Tejun Heo
  1 sibling, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2007-03-30 13:19 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: gregkh, hugh, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

Cornelia Huck wrote:
> On Fri, 30 Mar 2007 18:43:02 +0900,
> Tejun Heo <htejun@gmail.com> wrote:
> 
>> One way to solve this problem is to subordinate lifetime rule #b to
>> rule #c.  Each kobject points to its owning module such that grabbing
>> a kobject automatically grabs the module.  The problem with this
>> approach is that it requires wide update and makes kobject_get
>> heavier.
> 
> Shouldn't getting/putting the module refcount be solely done in
> kobject.c? Grab the module reference when the kobject is created and
> release the module reference in kobject_cleanup() after the release
> function has been called. This doesn't make kobject_get() heavier, and
> it ensures we don't delete the module until after the last kobject it is
> supposed to clean up has been released.

If we do that, we wouldn't be able to unload a module if there is any
kobject referencing it even when the node has no openers, so no easy way
out there.  :-(

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 12:29 ` James Bottomley
                     ` (2 preceding siblings ...)
  2007-03-30 13:38   ` Tejun Heo
@ 2007-03-30 13:38   ` Tejun Heo
  3 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2007-03-30 13:38 UTC (permalink / raw)
  To: James Bottomley
  Cc: gregkh, hugh, cornelia.huck, dmitry.torokhov, oneukum, maneesh,
	rpurdie, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

James Bottomley wrote:
> On Fri, 2007-03-30 at 18:43 +0900, Tejun Heo wrote:
>> Orphaning sysfs nodes on unregistration is a big step in this
>> direction.  With sysfs reference counting out of the picture,
>> implementing 'disconnect immediate' interface only on a few components
>> (including request_queue) should suffice for most block device
>> drivers.  I'm not familiar with other drivers but I don't think
>> they'll be very different.
> 
> I agree with this statement.  The big question in my mind is how do we
> do it?
> 
> The essential problem, and the reason why the lifetime rules are
> entangled is that fundamentally, sysfs entries are managed by kobjects.
> The things the device drivers are interested in is struct device, which
> is usually embedded in driver data structures.  Unfortunately, the
> required kobject is usually embedded in struct device meaning that the
> relevant driver structure cannot be freed while the sysfs entry still
> exists.
> 
> It seems to me that the only way to Orphan the sysfs entries is to
> separate the kobject and the struct device so their lifetime rules are
> no longer entangled.  Then we can free the driver structure with the
> embedded struct device while still keeping the necessary kobject around
> to perform orphaned sysfs operations.
> 
> So it seems to me that what we need to do is to give struct device its
> own kref and a pointer to a kobject.  Then we can manage the separate
> lifetimes independently.  The device would basically allocate and take a
> reference to the kobject on device_add() and relinquish it again (and
> null out the kobject pointer) on device_del().  The complexity here is
> that we now have to allocate the kobject somewhere else ... probably in
> device_add() (it doesn't come for free with the device structures).

My plan was to make sysfs more independent from struct device/kobject.
e.g. Something like the following.

* kobject_get() on attr/symlink creation
* open() doesn't need extra kobject reference
* deleting a node makes it release the kobject reference and the kobject
pointer is nullified.

This way the sysfs reference counting can be put completely out of
picture without impacting the rest of code.  Handling symlink and
suicidal attributes might need some extra attention but I think they can
be done.

In the long term, I think sysfs should be independent from driver model
and kobject such that an entity which wants to use sysfs but is not a
device doesn't have to dance with kobject just to use sysfs.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 12:29 ` James Bottomley
  2007-03-30 13:15   ` Dmitry Torokhov
  2007-03-30 13:15   ` Dmitry Torokhov
@ 2007-03-30 13:38   ` Tejun Heo
  2007-03-30 17:41     ` Greg KH
  2007-03-30 18:19     ` James Bottomley
  2007-03-30 13:38   ` Tejun Heo
  3 siblings, 2 replies; 40+ messages in thread
From: Tejun Heo @ 2007-03-30 13:38 UTC (permalink / raw)
  To: James Bottomley
  Cc: gregkh, hugh, cornelia.huck, dmitry.torokhov, oneukum, maneesh,
	rpurdie, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

James Bottomley wrote:
> On Fri, 2007-03-30 at 18:43 +0900, Tejun Heo wrote:
>> Orphaning sysfs nodes on unregistration is a big step in this
>> direction.  With sysfs reference counting out of the picture,
>> implementing 'disconnect immediate' interface only on a few components
>> (including request_queue) should suffice for most block device
>> drivers.  I'm not familiar with other drivers but I don't think
>> they'll be very different.
> 
> I agree with this statement.  The big question in my mind is how do we
> do it?
> 
> The essential problem, and the reason why the lifetime rules are
> entangled is that fundamentally, sysfs entries are managed by kobjects.
> The things the device drivers are interested in is struct device, which
> is usually embedded in driver data structures.  Unfortunately, the
> required kobject is usually embedded in struct device meaning that the
> relevant driver structure cannot be freed while the sysfs entry still
> exists.
> 
> It seems to me that the only way to Orphan the sysfs entries is to
> separate the kobject and the struct device so their lifetime rules are
> no longer entangled.  Then we can free the driver structure with the
> embedded struct device while still keeping the necessary kobject around
> to perform orphaned sysfs operations.
> 
> So it seems to me that what we need to do is to give struct device its
> own kref and a pointer to a kobject.  Then we can manage the separate
> lifetimes independently.  The device would basically allocate and take a
> reference to the kobject on device_add() and relinquish it again (and
> null out the kobject pointer) on device_del().  The complexity here is
> that we now have to allocate the kobject somewhere else ... probably in
> device_add() (it doesn't come for free with the device structures).

My plan was to make sysfs more independent from struct device/kobject.
e.g. Something like the following.

* kobject_get() on attr/symlink creation
* open() doesn't need extra kobject reference
* deleting a node makes it release the kobject reference and the kobject
pointer is nullified.

This way the sysfs reference counting can be put completely out of
picture without impacting the rest of code.  Handling symlink and
suicidal attributes might need some extra attention but I think they can
be done.

In the long term, I think sysfs should be independent from driver model
and kobject such that an entity which wants to use sysfs but is not a
device doesn't have to dance with kobject just to use sysfs.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 13:19   ` Tejun Heo
@ 2007-03-30 13:40     ` Cornelia Huck
  2007-03-30 13:58       ` Tejun Heo
  2007-03-30 13:58       ` Tejun Heo
  2007-03-30 13:40     ` Cornelia Huck
  1 sibling, 2 replies; 40+ messages in thread
From: Cornelia Huck @ 2007-03-30 13:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, hugh, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

On Fri, 30 Mar 2007 22:19:52 +0900,
Tejun Heo <htejun@gmail.com> wrote:

> > Shouldn't getting/putting the module refcount be solely done in
> > kobject.c? Grab the module reference when the kobject is created and
> > release the module reference in kobject_cleanup() after the release
> > function has been called. This doesn't make kobject_get() heavier, and
> > it ensures we don't delete the module until after the last kobject it is
> > supposed to clean up has been released.
> 
> If we do that, we wouldn't be able to unload a module if there is any
> kobject referencing it even when the node has no openers, so no easy way
> out there.  :-(

But we must not unload the module when there is still a kobject around
referencing a release function in the module, or we will oops if the
kobject is finally released. If needed, the module must clean up its
kobjects in its exit function.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 13:19   ` Tejun Heo
  2007-03-30 13:40     ` Cornelia Huck
@ 2007-03-30 13:40     ` Cornelia Huck
  1 sibling, 0 replies; 40+ messages in thread
From: Cornelia Huck @ 2007-03-30 13:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, hugh, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

On Fri, 30 Mar 2007 22:19:52 +0900,
Tejun Heo <htejun@gmail.com> wrote:

> > Shouldn't getting/putting the module refcount be solely done in
> > kobject.c? Grab the module reference when the kobject is created and
> > release the module reference in kobject_cleanup() after the release
> > function has been called. This doesn't make kobject_get() heavier, and
> > it ensures we don't delete the module until after the last kobject it is
> > supposed to clean up has been released.
> 
> If we do that, we wouldn't be able to unload a module if there is any
> kobject referencing it even when the node has no openers, so no easy way
> out there.  :-(

But we must not unload the module when there is still a kobject around
referencing a release function in the module, or we will oops if the
kobject is finally released. If needed, the module must clean up its
kobjects in its exit function.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 13:40     ` Cornelia Huck
  2007-03-30 13:58       ` Tejun Heo
@ 2007-03-30 13:58       ` Tejun Heo
  2007-03-30 14:52         ` Cornelia Huck
  2007-03-30 14:52         ` Cornelia Huck
  1 sibling, 2 replies; 40+ messages in thread
From: Tejun Heo @ 2007-03-30 13:58 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: gregkh, hugh, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

Cornelia Huck wrote:
> On Fri, 30 Mar 2007 22:19:52 +0900,
> Tejun Heo <htejun@gmail.com> wrote:
> 
>>> Shouldn't getting/putting the module refcount be solely done in
>>> kobject.c? Grab the module reference when the kobject is created and
>>> release the module reference in kobject_cleanup() after the release
>>> function has been called. This doesn't make kobject_get() heavier, and
>>> it ensures we don't delete the module until after the last kobject it is
>>> supposed to clean up has been released.
>> If we do that, we wouldn't be able to unload a module if there is any
>> kobject referencing it even when the node has no openers, so no easy way
>> out there.  :-(
> 
> But we must not unload the module when there is still a kobject around
> referencing a release function in the module, or we will oops if the
> kobject is finally released. If needed, the module must clean up its
> kobjects in its exit function.

It's a little bit more convoluted than that.  Module reference count of
zero doesn't indicate that there is no one referencing the module.  It
just means that the module can be unloaded.  ie. There still can be any
number of kobjects with release function backed by the module but as
long as all of them can be deleted and released by module exit function,
the module is unloadable at that point.

IOW, module reference count does not count number of objects depending
on the module.  It counts the number of active usages of those objects.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 13:40     ` Cornelia Huck
@ 2007-03-30 13:58       ` Tejun Heo
  2007-03-30 13:58       ` Tejun Heo
  1 sibling, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2007-03-30 13:58 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: gregkh, hugh, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

Cornelia Huck wrote:
> On Fri, 30 Mar 2007 22:19:52 +0900,
> Tejun Heo <htejun@gmail.com> wrote:
> 
>>> Shouldn't getting/putting the module refcount be solely done in
>>> kobject.c? Grab the module reference when the kobject is created and
>>> release the module reference in kobject_cleanup() after the release
>>> function has been called. This doesn't make kobject_get() heavier, and
>>> it ensures we don't delete the module until after the last kobject it is
>>> supposed to clean up has been released.
>> If we do that, we wouldn't be able to unload a module if there is any
>> kobject referencing it even when the node has no openers, so no easy way
>> out there.  :-(
> 
> But we must not unload the module when there is still a kobject around
> referencing a release function in the module, or we will oops if the
> kobject is finally released. If needed, the module must clean up its
> kobjects in its exit function.

It's a little bit more convoluted than that.  Module reference count of
zero doesn't indicate that there is no one referencing the module.  It
just means that the module can be unloaded.  ie. There still can be any
number of kobjects with release function backed by the module but as
long as all of them can be deleted and released by module exit function,
the module is unloadable at that point.

IOW, module reference count does not count number of objects depending
on the module.  It counts the number of active usages of those objects.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 13:58       ` Tejun Heo
@ 2007-03-30 14:52         ` Cornelia Huck
  2007-03-30 15:08           ` Tejun Heo
  2007-03-30 15:08           ` Tejun Heo
  2007-03-30 14:52         ` Cornelia Huck
  1 sibling, 2 replies; 40+ messages in thread
From: Cornelia Huck @ 2007-03-30 14:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, hugh, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

On Fri, 30 Mar 2007 22:58:39 +0900,
Tejun Heo <htejun@gmail.com> wrote:

> It's a little bit more convoluted than that.  Module reference count of
> zero doesn't indicate that there is no one referencing the module.  It
> just means that the module can be unloaded.  ie. There still can be any
> number of kobjects with release function backed by the module but as
> long as all of them can be deleted and released by module exit function,
> the module is unloadable at that point.
> 
> IOW, module reference count does not count number of objects depending
> on the module.  It counts the number of active usages of those objects.

We must make sure that the module is never deleted while there may be
calls to ->release functions - the exit function can only return when
all ->release calls have returned. This can be guaranteed if we (1)
don't allow the module to unload if there are outstanding kobjects (we
may need a "self destruct" knob then) or (2) make sure the ->release
functions are outside of the module (see, for example,
drivers/s390/s390_rdev.c).

(Gah, that stuff is always giving me headaches. Sorry if I'm not making
sense...)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 13:58       ` Tejun Heo
  2007-03-30 14:52         ` Cornelia Huck
@ 2007-03-30 14:52         ` Cornelia Huck
  1 sibling, 0 replies; 40+ messages in thread
From: Cornelia Huck @ 2007-03-30 14:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, hugh, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

On Fri, 30 Mar 2007 22:58:39 +0900,
Tejun Heo <htejun@gmail.com> wrote:

> It's a little bit more convoluted than that.  Module reference count of
> zero doesn't indicate that there is no one referencing the module.  It
> just means that the module can be unloaded.  ie. There still can be any
> number of kobjects with release function backed by the module but as
> long as all of them can be deleted and released by module exit function,
> the module is unloadable at that point.
> 
> IOW, module reference count does not count number of objects depending
> on the module.  It counts the number of active usages of those objects.

We must make sure that the module is never deleted while there may be
calls to ->release functions - the exit function can only return when
all ->release calls have returned. This can be guaranteed if we (1)
don't allow the module to unload if there are outstanding kobjects (we
may need a "self destruct" knob then) or (2) make sure the ->release
functions are outside of the module (see, for example,
drivers/s390/s390_rdev.c).

(Gah, that stuff is always giving me headaches. Sorry if I'm not making
sense...)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 14:52         ` Cornelia Huck
  2007-03-30 15:08           ` Tejun Heo
@ 2007-03-30 15:08           ` Tejun Heo
  2007-03-30 19:31             ` Cornelia Huck
  1 sibling, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2007-03-30 15:08 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: gregkh, hugh, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

Cornelia Huck wrote:
> On Fri, 30 Mar 2007 22:58:39 +0900,
> Tejun Heo <htejun@gmail.com> wrote:
> 
>> It's a little bit more convoluted than that.  Module reference count of
>> zero doesn't indicate that there is no one referencing the module.  It
>> just means that the module can be unloaded.  ie. There still can be any
>> number of kobjects with release function backed by the module but as
>> long as all of them can be deleted and released by module exit function,
>> the module is unloadable at that point.
>>
>> IOW, module reference count does not count number of objects depending
>> on the module.  It counts the number of active usages of those objects.
> 
> We must make sure that the module is never deleted while there may be
> calls to ->release functions - the exit function can only return when
> all ->release calls have returned. This can be guaranteed if we (1)
> don't allow the module to unload if there are outstanding kobjects (we
> may need a "self destruct" knob then) or (2) make sure the ->release
> functions are outside of the module (see, for example,
> drivers/s390/s390_rdev.c).

(3) make sure all existing kobjects are released by module exit function.

For example, let's say there is a hypothetical disk device /dev/dk0
driven by a hypothetical driver mydrv.  /dev/dk0 is represented like the
following in the sysfs tree.

/sys/devices/pci0000:00/0000:00:1f.0/dk0/{myknob0,myknob1}

Owner of both attrs myknob0 and myknob1 is mydrv and opening either
increases the reference counts of dk0 and mydrv and closing does the
opposite.

* When there is no opener of either knob and the /dev/dk0 isn't used by
anyone.  Reference count of dk0 is 1, mydrv 0.

* User issues rmmod mydrv.  As mydrv's reference count is zero, unload
proceeds and mydrv's exit function is called.

* mydrv's exit function looks like the following.

  mydrv_exit()
  {
	sysfs_remove_file(dk0, myknob0);
	sysfs_remove_file(dk1, myknob1);
	device_del(dk0);
	deinit controller;
	release all resources;
  }

The device_del(dk0) drops dk0's reference count to zero and its
->release is invoked immediately.

This method is widely used to allow modules to be unloaded even when
there still are valid objects if there's no active user.

> (Gah, that stuff is always giving me headaches. Sorry if I'm not making
> sense...)

Yeap, this is confusing.  Hope my explanation makes sense.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 14:52         ` Cornelia Huck
@ 2007-03-30 15:08           ` Tejun Heo
  2007-03-30 15:08           ` Tejun Heo
  1 sibling, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2007-03-30 15:08 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: gregkh, hugh, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

Cornelia Huck wrote:
> On Fri, 30 Mar 2007 22:58:39 +0900,
> Tejun Heo <htejun@gmail.com> wrote:
> 
>> It's a little bit more convoluted than that.  Module reference count of
>> zero doesn't indicate that there is no one referencing the module.  It
>> just means that the module can be unloaded.  ie. There still can be any
>> number of kobjects with release function backed by the module but as
>> long as all of them can be deleted and released by module exit function,
>> the module is unloadable at that point.
>>
>> IOW, module reference count does not count number of objects depending
>> on the module.  It counts the number of active usages of those objects.
> 
> We must make sure that the module is never deleted while there may be
> calls to ->release functions - the exit function can only return when
> all ->release calls have returned. This can be guaranteed if we (1)
> don't allow the module to unload if there are outstanding kobjects (we
> may need a "self destruct" knob then) or (2) make sure the ->release
> functions are outside of the module (see, for example,
> drivers/s390/s390_rdev.c).

(3) make sure all existing kobjects are released by module exit function.

For example, let's say there is a hypothetical disk device /dev/dk0
driven by a hypothetical driver mydrv.  /dev/dk0 is represented like the
following in the sysfs tree.

/sys/devices/pci0000:00/0000:00:1f.0/dk0/{myknob0,myknob1}

Owner of both attrs myknob0 and myknob1 is mydrv and opening either
increases the reference counts of dk0 and mydrv and closing does the
opposite.

* When there is no opener of either knob and the /dev/dk0 isn't used by
anyone.  Reference count of dk0 is 1, mydrv 0.

* User issues rmmod mydrv.  As mydrv's reference count is zero, unload
proceeds and mydrv's exit function is called.

* mydrv's exit function looks like the following.

  mydrv_exit()
  {
	sysfs_remove_file(dk0, myknob0);
	sysfs_remove_file(dk1, myknob1);
	device_del(dk0);
	deinit controller;
	release all resources;
  }

The device_del(dk0) drops dk0's reference count to zero and its
->release is invoked immediately.

This method is widely used to allow modules to be unloaded even when
there still are valid objects if there's no active user.

> (Gah, that stuff is always giving me headaches. Sorry if I'm not making
> sense...)

Yeap, this is confusing.  Hope my explanation makes sense.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30  9:43 [RFD driver-core] Lifetime problems of the current driver model Tejun Heo
                   ` (3 preceding siblings ...)
  2007-03-30 13:19 ` Cornelia Huck
@ 2007-03-30 17:38 ` Greg KH
  4 siblings, 0 replies; 40+ messages in thread
From: Greg KH @ 2007-03-30 17:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: hugh, cornelia.huck, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

On Fri, Mar 30, 2007 at 06:43:02PM +0900, Tejun Heo wrote:
> Hello, all.
> 
> This document tries to describe lifetime problems of the current
> device driver model primarily from the point view of device drivers
> and establish consensus, or at least, start discussion about how to
> solve these problems.  This is primarily based on my experience with
> IDE and SCSI layers and my knowledge on other drivers is limited, so I
> might have generalized too aggressively.  Feel free to point out.

Very nice summary of the problems we are trying to address here,
thanks for doing this.

> 2. How do we solve this?
> 
> 2-1. Rule #a and #b
> 
> If you think about it, the impedance matching between rule #a and #b
> should be done somewhere.  Maybe doing it in each low level driver is
> what's intended when the driver model is designed but it's
> impractically painful and the reality reflects that by doing it in
> midlayers.
> 
> It's better if we do it in higher layers.  For example, SCSI has a
> mechanism to reject new requests and drain request_queue to allow low
> level driver to just unregister an existing device.  IDE currently
> doesn't have such feature but it would need to do almost the same
> thing to support hotplug.  If request_queue just had a feature to
> drain and kill itself, both SCSI and IDE could use it.  It would be
> simpler and less error-prone.  On the other hand, if it's pushed
> downward, it will cause much more pain in all the low level drivers.

No, we should not push this kind of thing down to the lower drivers.

In fact, my goal is to try to keep a "normal" driver writer from having
to know anything about the driver model, except for how to add and
remove a sysfs file that is bound to the device controlled by the
driver.

And for the majority of the busses, this is true, it's only people like
you who are mucking around in the bus specific code that know how hard
this whole thing is :)

> Orphaning sysfs nodes on unregistration is a big step in this
> direction.  With sysfs reference counting out of the picture,
> implementing 'disconnect immediate' interface only on a few components
> (including request_queue) should suffice for most block device
> drivers.  I'm not familiar with other drivers but I don't think
> they'll be very different.
> 
> All in all, I'm hoping something like the following can be done in
> device drivers, midlayer or low level.
> 
> * For binding
> 
>  alloc resources;
>  init controller;
>  register to upper layers;
> 
> * For unbinding
> 
>  unregister from upper layers; (no lingering references or objects)
>  deinit controller;
>  release resources;
> 
> This basically nullifies lifetime rule #b from the POV of drivers.

I agree, that should be our goal.  And for almost all busses, this is
true today, right?

> 2-2. Rule #b and #c
> 
> One way to solve this problem is to subordinate lifetime rule #b to
> rule #c.  Each kobject points to its owning module such that grabbing
> a kobject automatically grabs the module.  The problem with this
> approach is that it requires wide update and makes kobject_get
> heavier.

Why would kobject_get become heavier?

> Another more appealing approach is to do nothing.  Solving the problem
> between rule #a and #b in the way described above means virtually
> nullifying rule #b.  With rule #b gone, rule #c can't conflict with
> it.  IOW, no reference from upper layer would be remaining after a
> driver unregisters itself from the upper layer - the module can be
> safely unloaded.

Yes, but you have to watch out for devices that add their own files with
callbacks to that driver.  That is a problem today that Oliver has tried
to address with his recent patches.

> 3. How do we get there from here?
> 
> The current device driver model is used by most device drivers - huge
> amount of code.  We can't update them at once.

No, the majority of the logic is in the busses, and we can update them
pretty easily.

> Fortunately, the
> suggested solution can easily be implemented gradually.  We can add
> 'disconnect now' interface to upper driver interface one-by-one and
> convert its users gradually.  The existing reference counting
> mechanism can be left alone and used to verify that the conversion is
> correct by verifying reference count is 0 after unregistering.
> 
> Once the conversion is complete (maybe after a year), we can remove
> the reference counting mechanism from device driver interface.

I'd be interested in seeing how we can do this.  Do you have any
specific patches started in this area?

> 4. Some afterthoughts
> 
> * Doing this would make module reference counting more flexible.
>  Instead of being confined by implementation, it can be used to
>  define when to allow and disallow module unloading.  (you can't
>  unload a block driver module if a fs on it is mounted but sysfs
>  access doesn't matter.)

Well, remember, module unloading is racy and unrecommended to use anyway
:)

Also, look at the network drivers who all have their own special type of
lock, with no module reference count increment when they are in use.  Is
this what you want to do for everything else?

> * Outside of device driver model, kobject is currently used to supply
>  something sysfs nodes can be attached to, such that both the
>  information supplier and node users have something common to
>  reference count.  With orphaning added, there is no need to use
>  kobject this way.  IMHO, kobject should have been implementation
>  detail inside the device model proper in the first place.

Well, kobject was created because Al Viro wanted to use them for the
block devices and it didn't make sense to use a 'struct device' for them
at the time.  Also, the class code wasn't even around then (this is
_very_ early 2.5 timeframe.

So yes, any cleanups and refactoring that we can do in this area would
be very good and welcome.  Kay has had some ideas for how we can clean
up the whole subsystem/kset mess too.  Hopefully something comes of that
soon.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 13:38   ` Tejun Heo
@ 2007-03-30 17:41     ` Greg KH
  2007-03-30 18:19     ` James Bottomley
  1 sibling, 0 replies; 40+ messages in thread
From: Greg KH @ 2007-03-30 17:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: James Bottomley, hugh, cornelia.huck, dmitry.torokhov, oneukum,
	maneesh, rpurdie, Jeff Garzik, lkml, linux-ide,
	SCSI Mailing List

On Fri, Mar 30, 2007 at 10:38:00PM +0900, Tejun Heo wrote:
> James Bottomley wrote:
> > On Fri, 2007-03-30 at 18:43 +0900, Tejun Heo wrote:
> >> Orphaning sysfs nodes on unregistration is a big step in this
> >> direction.  With sysfs reference counting out of the picture,
> >> implementing 'disconnect immediate' interface only on a few components
> >> (including request_queue) should suffice for most block device
> >> drivers.  I'm not familiar with other drivers but I don't think
> >> they'll be very different.
> > 
> > I agree with this statement.  The big question in my mind is how do we
> > do it?
> > 
> > The essential problem, and the reason why the lifetime rules are
> > entangled is that fundamentally, sysfs entries are managed by kobjects.
> > The things the device drivers are interested in is struct device, which
> > is usually embedded in driver data structures.  Unfortunately, the
> > required kobject is usually embedded in struct device meaning that the
> > relevant driver structure cannot be freed while the sysfs entry still
> > exists.
> > 
> > It seems to me that the only way to Orphan the sysfs entries is to
> > separate the kobject and the struct device so their lifetime rules are
> > no longer entangled.  Then we can free the driver structure with the
> > embedded struct device while still keeping the necessary kobject around
> > to perform orphaned sysfs operations.
> > 
> > So it seems to me that what we need to do is to give struct device its
> > own kref and a pointer to a kobject.  Then we can manage the separate
> > lifetimes independently.  The device would basically allocate and take a
> > reference to the kobject on device_add() and relinquish it again (and
> > null out the kobject pointer) on device_del().  The complexity here is
> > that we now have to allocate the kobject somewhere else ... probably in
> > device_add() (it doesn't come for free with the device structures).
> 
> My plan was to make sysfs more independent from struct device/kobject.
> e.g. Something like the following.
> 
> * kobject_get() on attr/symlink creation
> * open() doesn't need extra kobject reference
> * deleting a node makes it release the kobject reference and the kobject
> pointer is nullified.

Hm, this sounds good, but I think it will be racy.

Although separating the very tight tie between sysfs and kobject would
be very good to have.  Pat originally wanted to do that years ago,
but he's now lost to the land of a million sheep...

So I don't object to this goal at all.

> This way the sysfs reference counting can be put completely out of
> picture without impacting the rest of code.  Handling symlink and
> suicidal attributes might need some extra attention but I think they can
> be done.

Ok, I'll but I really want to see that code :)

> In the long term, I think sysfs should be independent from driver model
> and kobject such that an entity which wants to use sysfs but is not a
> device doesn't have to dance with kobject just to use sysfs.

I agree.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 13:15   ` Dmitry Torokhov
@ 2007-03-30 17:58     ` James Bottomley
  2007-03-30 18:18       ` Dmitry Torokhov
  0 siblings, 1 reply; 40+ messages in thread
From: James Bottomley @ 2007-03-30 17:58 UTC (permalink / raw)
  To: Dmitry Torokhov
  Cc: Tejun Heo, gregkh, hugh, cornelia.huck, oneukum, maneesh,
	rpurdie, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

On Fri, 2007-03-30 at 09:15 -0400, Dmitry Torokhov wrote:
> If you want to manage lifetime rules independently you might want to
> not embed struct device into you subsystems objects but attach them
> via pointers and use device_create(). Now that we orphan sysfs access
> upon unregistering device this will severe all ties from driver core
> to your system once you start teardown of a device and you should be
> in clear.

But that wouldn't really help ... all objects have lifetimes.  What
Tejun is pointing out is that we have two different lifetime
requirements:  driver internal (and subsystem) objects and sysfs
objects ... we still need something embedded in the driver objects, so
it may as well be struct device ... trying to manage struct device
outside of the driver objects would turn into another nasty refcounting
problem.  The struct device is usually the generic abstraction of the
specific structure it's embedded in, so I think it really does make
sense to keep these pieces together.

> However there are simpler subsystems (input, serio, etc.) where there
> is only one core module which provides services to device drivers and
> handles registration and deregistration. For such substustems it makes
> sense to embed struct devices and manage lifetime for all components
> at once.

James



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 17:58     ` James Bottomley
@ 2007-03-30 18:18       ` Dmitry Torokhov
  0 siblings, 0 replies; 40+ messages in thread
From: Dmitry Torokhov @ 2007-03-30 18:18 UTC (permalink / raw)
  To: James Bottomley
  Cc: Tejun Heo, gregkh, hugh, cornelia.huck, oneukum, maneesh,
	rpurdie, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

On 3/30/07, James Bottomley <James.Bottomley@steeleye.com> wrote:
> On Fri, 2007-03-30 at 09:15 -0400, Dmitry Torokhov wrote:
> > If you want to manage lifetime rules independently you might want to
> > not embed struct device into you subsystems objects but attach them
> > via pointers and use device_create(). Now that we orphan sysfs access
> > upon unregistering device this will severe all ties from driver core
> > to your system once you start teardown of a device and you should be
> > in clear.
>
> But that wouldn't really help ... all objects have lifetimes.  What
> Tejun is pointing out is that we have two different lifetime
> requirements:  driver internal (and subsystem) objects and sysfs
> objects ...

Exactly. If driver-private data structures have different lifetime
than device abstractions (like they seem to have with scsi and ide)
then you split them and refcount separately. We are just arguing where
to put the line. You want to split devices and kobjects and rework
infrastructure and I am saying that we already have infrastructure we
need and that split shoud be between generic device structure and
driver-managed device structure.

-- 
Dmitry

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 13:38   ` Tejun Heo
  2007-03-30 17:41     ` Greg KH
@ 2007-03-30 18:19     ` James Bottomley
  2007-04-01 19:59       ` Tejun Heo
  2007-04-02 19:33       ` Luben Tuikov
  1 sibling, 2 replies; 40+ messages in thread
From: James Bottomley @ 2007-03-30 18:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, hugh, cornelia.huck, dmitry.torokhov, oneukum, maneesh,
	rpurdie, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

On Fri, 2007-03-30 at 22:38 +0900, Tejun Heo wrote:
> My plan was to make sysfs more independent from struct device/kobject.
> e.g. Something like the following.

That's sort of what I was reaching for too ... it just looks to me that
all the sysfs glue is in kobject, so they make a good candidate for the
pure sysfs objects.  Whatever we do, there has to be breakable cross
linking between the driver's tree and the sysfs representation ... of
course, nasty things like ksets get in the way of considering the
objects as separate, sigh.

> * kobject_get() on attr/symlink creation
> * open() doesn't need extra kobject reference
> * deleting a node makes it release the kobject reference and the kobject
> pointer is nullified.

To do this, you'll have to move at least the dentry out of the kobject,
I think.

> This way the sysfs reference counting can be put completely out of
> picture without impacting the rest of code.  Handling symlink and
> suicidal attributes might need some extra attention but I think they can
> be done.

True ... but you'll also have to implement something within sysfs to do
refcounting ... it needs to know how long to keep its nodes around.

> In the long term, I think sysfs should be independent from driver model
> and kobject such that an entity which wants to use sysfs but is not a
> device doesn't have to dance with kobject just to use sysfs.

I agree ... the question is just how to do it.

I'd favour trying to separate kobject and struct device for this ...
move all the sysfs stuff into kobject and device only stuff into struct
device ... but that would get us into disentangling the ksets, which, on
balance, isn't going to be fun ...

James



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 15:08           ` Tejun Heo
@ 2007-03-30 19:31             ` Cornelia Huck
  2007-03-31  3:12               ` Tejun Heo
  0 siblings, 1 reply; 40+ messages in thread
From: Cornelia Huck @ 2007-03-30 19:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, hugh, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

On Sat, 31 Mar 2007 00:08:19 +0900,
Tejun Heo <htejun@gmail.com> wrote:

> (3) make sure all existing kobjects are released by module exit function.
> 
> For example, let's say there is a hypothetical disk device /dev/dk0
> driven by a hypothetical driver mydrv.  /dev/dk0 is represented like the
> following in the sysfs tree.
> 
> /sys/devices/pci0000:00/0000:00:1f.0/dk0/{myknob0,myknob1}
> 
> Owner of both attrs myknob0 and myknob1 is mydrv and opening either
> increases the reference counts of dk0 and mydrv and closing does the
> opposite.
> 
> * When there is no opener of either knob and the /dev/dk0 isn't used by
> anyone.  Reference count of dk0 is 1, mydrv 0.

Hm, but as long as dk0 is registered, it can be looked up and someone
could get a reference on it.

> 
> * User issues rmmod mydrv.  As mydrv's reference count is zero, unload
> proceeds and mydrv's exit function is called.
> 
> * mydrv's exit function looks like the following.
> 
>   mydrv_exit()
>   {
> 	sysfs_remove_file(dk0, myknob0);
> 	sysfs_remove_file(dk1, myknob1);
> 	device_del(dk0);
> 	deinit controller;
> 	release all resources;
>   }
> 
> The device_del(dk0) drops dk0's reference count to zero and its
> ->release is invoked immediately.

And here is the problem if someone else still has a reference. The
module will be unloaded, but ->release will not be called until the
"someone else" gives up the reference...

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 19:31             ` Cornelia Huck
@ 2007-03-31  3:12               ` Tejun Heo
  2007-03-31  3:15                 ` Tejun Heo
                                   ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Tejun Heo @ 2007-03-31  3:12 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: gregkh, hugh, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

Cornelia Huck wrote:
> On Sat, 31 Mar 2007 00:08:19 +0900,
> Tejun Heo <htejun@gmail.com> wrote:
> 
>> (3) make sure all existing kobjects are released by module exit function.
>>
>> For example, let's say there is a hypothetical disk device /dev/dk0
>> driven by a hypothetical driver mydrv.  /dev/dk0 is represented like the
>> following in the sysfs tree.
>>
>> /sys/devices/pci0000:00/0000:00:1f.0/dk0/{myknob0,myknob1}
>>
>> Owner of both attrs myknob0 and myknob1 is mydrv and opening either
>> increases the reference counts of dk0 and mydrv and closing does the
>> opposite.
>>
>> * When there is no opener of either knob and the /dev/dk0 isn't used by
>> anyone.  Reference count of dk0 is 1, mydrv 0.
> 
> Hm, but as long as dk0 is registered, it can be looked up and someone
> could get a reference on it.

Yeah, exactly.  That's why any getting any kobject reference backed by a
module must be accompanied by try_module_get().

int mydrv_get_dk(struct dk *dk)
{
	rc = try_module_get(mydrv);
	if (rc)
		return rc;
	kobject_get(&dk->kobj);
	return 0;
}

>> * User issues rmmod mydrv.  As mydrv's reference count is zero, unload
>> proceeds and mydrv's exit function is called.
>>
>> * mydrv's exit function looks like the following.
>>
>>   mydrv_exit()
>>   {
>> 	sysfs_remove_file(dk0, myknob0);
>> 	sysfs_remove_file(dk1, myknob1);
>> 	device_del(dk0);
>> 	deinit controller;
>> 	release all resources;
>>   }
>>
>> The device_del(dk0) drops dk0's reference count to zero and its
>> ->release is invoked immediately.
> 
> And here is the problem if someone else still has a reference. The
> module will be unloaded, but ->release will not be called until the
> "someone else" gives up the reference...

Exactly, in that case, module reference count must not be zero.  You and
I are saying the same thing.  Why are we running in circle?

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-31  3:12               ` Tejun Heo
@ 2007-03-31  3:15                 ` Tejun Heo
  2007-03-31 16:08                 ` Cornelia Huck
  2007-04-02 19:24                 ` Luben Tuikov
  2 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2007-03-31  3:15 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: gregkh, hugh, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

Tejun Heo wrote:
> Cornelia Huck wrote:
>> On Sat, 31 Mar 2007 00:08:19 +0900,
>> Tejun Heo <htejun@gmail.com> wrote:
>>
>>> (3) make sure all existing kobjects are released by module exit function.
>>>
>>> For example, let's say there is a hypothetical disk device /dev/dk0
>>> driven by a hypothetical driver mydrv.  /dev/dk0 is represented like the
>>> following in the sysfs tree.
>>>
>>> /sys/devices/pci0000:00/0000:00:1f.0/dk0/{myknob0,myknob1}
>>>
>>> Owner of both attrs myknob0 and myknob1 is mydrv and opening either
>>> increases the reference counts of dk0 and mydrv and closing does the
>>> opposite.
>>>
>>> * When there is no opener of either knob and the /dev/dk0 isn't used by
>>> anyone.  Reference count of dk0 is 1, mydrv 0.
>> Hm, but as long as dk0 is registered, it can be looked up and someone
>> could get a reference on it.
> 
> Yeah, exactly.  That's why any getting any kobject reference backed by a
> module must be accompanied by try_module_get().
> 
> int mydrv_get_dk(struct dk *dk)
> {
> 	rc = try_module_get(mydrv);
> 	if (rc)
> 		return rc;
> 	kobject_get(&dk->kobj);
> 	return 0;
> }

And one more thing just in case.  In the above code, try_module_get()
and kobject_get() must be and is atomic w.r.t. try_stop_module().
That's why we do the following.

  stop_machine_run(__try_stop_module, &sref, NR_CPUS);.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-31  3:12               ` Tejun Heo
  2007-03-31  3:15                 ` Tejun Heo
@ 2007-03-31 16:08                 ` Cornelia Huck
  2007-03-31 16:14                   ` Tejun Heo
  2007-04-02 19:24                 ` Luben Tuikov
  2 siblings, 1 reply; 40+ messages in thread
From: Cornelia Huck @ 2007-03-31 16:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, hugh, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

On Sat, 31 Mar 2007 12:12:48 +0900,
Tejun Heo <htejun@gmail.com> wrote:

> > Hm, but as long as dk0 is registered, it can be looked up and someone
> > could get a reference on it.
> 
> Yeah, exactly.  That's why any getting any kobject reference backed by a
> module must be accompanied by try_module_get().
> 
> int mydrv_get_dk(struct dk *dk)
> {
> 	rc = try_module_get(mydrv);
> 	if (rc)
> 		return rc;
> 	kobject_get(&dk->kobj);
> 	return 0;
> }

This works if the caller always knows which module to grab (I was
thinking about some tree-walking code).

> Exactly, in that case, module reference count must not be zero.  You and
> I are saying the same thing.  Why are we running in circle?

I hope we're not, it just makes one dizzy :)

I'll think some more about it...

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-31 16:08                 ` Cornelia Huck
@ 2007-03-31 16:14                   ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2007-03-31 16:14 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: gregkh, hugh, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

Cornelia Huck wrote:
> On Sat, 31 Mar 2007 12:12:48 +0900,
> Tejun Heo <htejun@gmail.com> wrote:
> 
>>> Hm, but as long as dk0 is registered, it can be looked up and someone
>>> could get a reference on it.
>> Yeah, exactly.  That's why any getting any kobject reference backed by a
>> module must be accompanied by try_module_get().
>>
>> int mydrv_get_dk(struct dk *dk)
>> {
>> 	rc = try_module_get(mydrv);
>> 	if (rc)
>> 		return rc;
>> 	kobject_get(&dk->kobj);
>> 	return 0;
>> }
> 
> This works if the caller always knows which module to grab (I was
> thinking about some tree-walking code).

Yeah, that's why we have so many ->owner's and getting it wrong results 
in jumping to non-existent address.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 18:19     ` James Bottomley
@ 2007-04-01 19:59       ` Tejun Heo
  2007-04-02  9:20         ` Cornelia Huck
                           ` (2 more replies)
  2007-04-02 19:33       ` Luben Tuikov
  1 sibling, 3 replies; 40+ messages in thread
From: Tejun Heo @ 2007-04-01 19:59 UTC (permalink / raw)
  To: James Bottomley
  Cc: gregkh, hugh, cornelia.huck, dmitry.torokhov, oneukum, maneesh,
	rpurdie, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

Hello, James, Greg.

On Fri, Mar 30, 2007 at 01:19:34PM -0500, James Bottomley wrote:
> That's sort of what I was reaching for too ... it just looks to me that
> all the sysfs glue is in kobject, so they make a good candidate for the
> pure sysfs objects.  Whatever we do, there has to be breakable cross
> linking between the driver's tree and the sysfs representation ... of
> course, nasty things like ksets get in the way of considering the
> objects as separate, sigh.

The dependency between kobject and sysfs doesn't seem to be that
tight.  I managed to break them such that sysfs can disconnect from
its kobject on deletion _without_ changing sysfs/kobject API.

> > This way the sysfs reference counting can be put completely out of
> > picture without impacting the rest of code.  Handling symlink and
> > suicidal attributes might need some extra attention but I think they can
> > be done.
> 
> True ... but you'll also have to implement something within sysfs to do
> refcounting ... it needs to know how long to keep its nodes around.

sysfs_dirent already had reference counter and plays that role quite
nicely.

Okay, here's preliminary patch.  I've tested it very lightly so it's
likely to contain bugs and definitely needs to be splitted.

Dependencies between sysfs/kobject objects are clearer now and I think
I got the locking and referencing correct.  This patch immediately
fixes 'sysfs attr grabbing the wrong kobject and module' problem -
sysfs and module lifetimes are unrelated now.  We can drop
half-working attr->owner.

* A sysfs node no longer hold reference to its kobject.  It just
  attaches itself to the kobject on creation and detaches on deletion.

* For a dir node, sysfs_dirent no longer directly points to kobject.
  It points to sysfs_dir which contains pointer to kobject and a rwsem
  to protect it.

* An open file doesn't hold a reference to kobject.  It holds a
  reference to sysfs_dirent.  kobject pointer is verified and
  show/store are performed with rwsem read-locked.  Deletion
  disconnects the sysfs from its kobject while the rwsem is
  write-locked.  This mechanism replaces buffer orphaning and kobject
  validation during open.

* attr ops is determined on sysfs node creation not on open.  This is
  a step toward making kobject opaque in sysfs.

* bin files are handled similarly but mmap makes the open file hold
  reference to the kobject till it's closed.  Any better ideas?

* symlink is reimplemented in terms of sysfs_dirents instead of
  kobjects.  sysfs_dirent->s_parent is added and each sysfs_dirent
  holds reference to its parent.  Currently walking up the tree
  requires read locking and unlocking sysfs_dir at each level.  This
  can be removed if name is added to sysfs_dirent.

* As kobject can be disconnected anytime and sysfs code still needs to
  look follow dentry link in kobject, kobject->dentry is protected by
  dcache_lock.  Once kobject becomes opaque to sysfs, this hack can go
  away.  All in all, making kobject completely opaque in sysfs isn't
  too far away after this patch although it would require mass code
  update.

What do you think about this approach?  If it's acceptable, I'll test
further and split the patch into logical steps to get it reviewed
better.

Thanks.

---
 fs/sysfs/bin.c     |  121 +++++++++++++++++++++++++++++++---------
 fs/sysfs/dir.c     |  134 ++++++++++++++++++++++++++++++++++++++------
 fs/sysfs/file.c    |  158 ++++++++++++++++++++++-------------------------------
 fs/sysfs/inode.c   |   25 --------
 fs/sysfs/mount.c   |    9 ---
 fs/sysfs/symlink.c |  115 +++++++++++++++++++++++---------------
 fs/sysfs/sysfs.h   |   75 +++++--------------------
 7 files changed, 366 insertions(+), 271 deletions(-)

diff --git a/fs/sysfs/bin.c b/fs/sysfs/bin.c
index d3b9f5f..0c4671b 100644
--- a/fs/sysfs/bin.c
+++ b/fs/sysfs/bin.c
@@ -20,26 +20,41 @@
 
 #include "sysfs.h"
 
+struct bin_buffer {
+	struct mutex	mutex;
+	void		*buffer;
+	int		mmapped;
+};
+
 static int
 fill_read(struct dentry *dentry, char *buffer, loff_t off, size_t count)
 {
 	struct bin_attribute * attr = to_bin_attr(dentry);
-	struct kobject * kobj = to_kobj(dentry->d_parent);
+	struct sysfs_dirent * sd = dentry->d_parent->d_fsdata;
+	struct kobject * kobj;
+	int rc;
 
 	if (!attr->read)
 		return -EIO;
 
-	return attr->read(kobj, buffer, off, count);
+	kobj = sysfs_get_kobj(sd);
+	if (!kobj)
+		return -ENODEV;
+
+	rc = attr->read(kobj, buffer, off, count);
+
+	sysfs_put_kobj(sd);
+
+	return rc;
 }
 
 static ssize_t
 read(struct file * file, char __user * userbuf, size_t count, loff_t * off)
 {
-	char *buffer = file->private_data;
+	struct bin_buffer *bb = file->private_data;
 	struct dentry *dentry = file->f_path.dentry;
 	int size = dentry->d_inode->i_size;
 	loff_t offs = *off;
-	int ret;
 
 	if (count > PAGE_SIZE)
 		count = PAGE_SIZE;
@@ -51,18 +66,23 @@ read(struct file * file, char __user * userbuf, size_t count, loff_t * off)
 			count = size - offs;
 	}
 
-	ret = fill_read(dentry, buffer, offs, count);
-	if (ret < 0) 
-		return ret;
-	count = ret;
+	mutex_lock(&bb->mutex);
 
-	if (copy_to_user(userbuf, buffer, count))
-		return -EFAULT;
+	count = fill_read(dentry, bb->buffer, offs, count);
+	if (count < 0)
+		goto out_unlock;
+
+	if (copy_to_user(userbuf, bb->buffer, count)) {
+		count = -EFAULT;
+		goto out_unlock;
+	}
 
 	pr_debug("offs = %lld, *off = %lld, count = %zd\n", offs, *off, count);
 
 	*off = offs + count;
 
+ out_unlock:
+	mutex_unlock(&bb->mutex);
 	return count;
 }
 
@@ -70,18 +90,28 @@ static int
 flush_write(struct dentry *dentry, char *buffer, loff_t offset, size_t count)
 {
 	struct bin_attribute *attr = to_bin_attr(dentry);
-	struct kobject *kobj = to_kobj(dentry->d_parent);
+	struct sysfs_dirent * sd = dentry->d_parent->d_fsdata;
+	struct kobject * kobj;
+	int rc;
 
 	if (!attr->write)
 		return -EIO;
 
-	return attr->write(kobj, buffer, offset, count);
+	kobj = sysfs_get_kobj(sd);
+	if (!kobj)
+		return -ENODEV;
+
+	rc = attr->write(kobj, buffer, offset, count);
+
+	sysfs_put_kobj(sd);
+
+	return rc;
 }
 
 static ssize_t write(struct file * file, const char __user * userbuf,
 		     size_t count, loff_t * off)
 {
-	char *buffer = file->private_data;
+	struct bin_buffer *bb = file->private_data;
 	struct dentry *dentry = file->f_path.dentry;
 	int size = dentry->d_inode->i_size;
 	loff_t offs = *off;
@@ -95,34 +125,60 @@ static ssize_t write(struct file * file, const char __user * userbuf,
 			count = size - offs;
 	}
 
-	if (copy_from_user(buffer, userbuf, count))
-		return -EFAULT;
+	mutex_lock(&bb->mutex);
 
-	count = flush_write(dentry, buffer, offs, count);
+	if (copy_from_user(bb->buffer, userbuf, count)) {
+		count = -EFAULT;
+		goto out_unlock;
+	}
+
+	count = flush_write(dentry, bb->buffer, offs, count);
 	if (count > 0)
 		*off = offs + count;
+
+ out_unlock:
+	mutex_unlock(&bb->mutex);
 	return count;
 }
 
 static int mmap(struct file *file, struct vm_area_struct *vma)
 {
+	struct bin_buffer *bb = file->private_data;
 	struct dentry *dentry = file->f_path.dentry;
 	struct bin_attribute *attr = to_bin_attr(dentry);
-	struct kobject *kobj = to_kobj(dentry->d_parent);
+	struct sysfs_dirent *sd = dentry->d_parent->d_fsdata;
+	struct kobject *kobj;
+	int rc;
 
 	if (!attr->mmap)
 		return -EINVAL;
 
-	return attr->mmap(kobj, attr, vma);
+	kobj = sysfs_get_kobj(sd);
+	if (!kobj)
+		return -ENODEV;
+
+	mutex_lock(&bb->mutex);
+
+	rc = attr->mmap(kobj, attr, vma);
+
+	if (rc == 0 && !bb->mmapped)
+		bb->mmapped = 1;
+	else
+		sysfs_put_kobj(sd);
+
+	mutex_unlock(&bb->mutex);
+
+	return rc;
 }
 
 static int open(struct inode * inode, struct file * file)
 {
-	struct kobject *kobj = sysfs_get_kobject(file->f_path.dentry->d_parent);
+	struct sysfs_dirent *sd = sysfs_get(file->f_path.dentry->d_parent->d_fsdata);
 	struct bin_attribute * attr = to_bin_attr(file->f_path.dentry);
+	struct bin_buffer *bb = NULL;
 	int error = -EINVAL;
 
-	if (!kobj || !attr)
+	if (!attr)
 		goto Done;
 
 	/* Grab the module reference for this attribute if we have one */
@@ -137,30 +193,41 @@ static int open(struct inode * inode, struct file * file)
 		goto Error;
 
 	error = -ENOMEM;
-	file->private_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
-	if (!file->private_data)
+	bb = kzalloc(sizeof(*bb), GFP_KERNEL);
+	if (!bb)
 		goto Error;
 
+	bb->buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!bb->buffer)
+		goto Error;
+
+	mutex_init(&bb->mutex);
+	file->private_data = bb;
+
 	error = 0;
     goto Done;
 
  Error:
+	kfree(bb);
 	module_put(attr->attr.owner);
  Done:
 	if (error)
-		kobject_put(kobj);
+		sysfs_put(sd);
 	return error;
 }
 
 static int release(struct inode * inode, struct file * file)
 {
-	struct kobject * kobj = to_kobj(file->f_path.dentry->d_parent);
+	struct sysfs_dirent * sd = file->f_path.dentry->d_parent->d_fsdata;
 	struct bin_attribute * attr = to_bin_attr(file->f_path.dentry);
-	u8 * buffer = file->private_data;
+	struct bin_buffer *bb = file->private_data;
 
-	kobject_put(kobj);
+	if (bb->mmapped)
+		sysfs_put_kobj(sd);
+	sysfs_put(sd);
 	module_put(attr->attr.owner);
-	kfree(buffer);
+	kfree(bb->buffer);
+	kfree(bb);
 	return 0;
 }
 
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 85a6686..e6e38f8 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -30,10 +30,60 @@ static struct dentry_operations sysfs_dentry_ops = {
 	.d_iput		= sysfs_d_iput,
 };
 
+void release_sysfs_dirent(struct sysfs_dirent * sd)
+{
+	struct sysfs_dirent *parent_sd;
+
+ repeat:
+	parent_sd = sd->s_parent;
+	if (sd->s_type & SYSFS_KOBJ_LINK) {
+		struct sysfs_symlink * sl = sd->s_element;
+		kfree(sl->link_name);
+		sysfs_put(sl->target_sd);
+	}
+	kfree(sd->s_element);
+	kfree(sd->s_iattr);
+	kmem_cache_free(sysfs_dir_cachep, sd);
+
+	sd = parent_sd;
+	if (sd && atomic_dec_and_test(&sd->s_count))
+		goto repeat;
+}
+
+struct kobject *sysfs_get_kobj(struct sysfs_dirent *sd)
+{
+	struct sysfs_dir *sdir = sd->s_element;
+
+	if (unlikely(!sd))
+		return NULL;
+
+	BUG_ON(!(sd->s_type & SYSFS_DIR));
+
+	down_read(&sdir->rwsem);
+	if (sdir->kobj)
+		return sdir->kobj;
+	up_read(&sdir->rwsem);
+
+	return NULL;
+}
+
+void sysfs_put_kobj(struct sysfs_dirent *sd)
+{
+	struct sysfs_dir *sdir = sd->s_element;
+
+	if (unlikely(!sd))
+		return;
+
+	BUG_ON(!(sd->s_type & SYSFS_DIR) || !sdir->kobj);
+
+	up_read(&sdir->rwsem);
+}
+
 /*
  * Allocates a new sysfs_dirent and links it to the parent sysfs_dirent
  */
-static struct sysfs_dirent * __sysfs_new_dirent(void * element)
+static struct sysfs_dirent * __sysfs_new_dirent(struct sysfs_dirent *parent_sd,
+						void * element)
 {
 	struct sysfs_dirent * sd;
 
@@ -46,6 +96,7 @@ static struct sysfs_dirent * __sysfs_new_dirent(void * element)
 	INIT_LIST_HEAD(&sd->s_children);
 	INIT_LIST_HEAD(&sd->s_sibling);
 	sd->s_element = element;
+	sd->s_parent = sysfs_get(parent_sd);
 
 	return sd;
 }
@@ -61,7 +112,7 @@ static struct sysfs_dirent * sysfs_new_dirent(struct sysfs_dirent *parent_sd,
 						void * element)
 {
 	struct sysfs_dirent *sd;
-	sd = __sysfs_new_dirent(element);
+	sd = __sysfs_new_dirent(parent_sd, element);
 	__sysfs_list_dirent(parent_sd, sd);
 	return sd;
 }
@@ -93,11 +144,12 @@ int sysfs_dirent_exist(struct sysfs_dirent *parent_sd,
 
 
 static struct sysfs_dirent *
-__sysfs_make_dirent(struct dentry *dentry, void *element, mode_t mode, int type)
+__sysfs_make_dirent(struct sysfs_dirent *parent_sd, struct dentry *dentry,
+		    void *element, mode_t mode, int type)
 {
 	struct sysfs_dirent * sd;
 
-	sd = __sysfs_new_dirent(element);
+	sd = __sysfs_new_dirent(parent_sd, element);
 	if (!sd)
 		goto out;
 
@@ -118,7 +170,7 @@ int sysfs_make_dirent(struct sysfs_dirent * parent_sd, struct dentry * dentry,
 {
 	struct sysfs_dirent *sd;
 
-	sd = __sysfs_make_dirent(dentry, element, mode, type);
+	sd = __sysfs_make_dirent(parent_sd, dentry, element, mode, type);
 	__sysfs_list_dirent(parent_sd, sd);
 
 	return sd ? 0 : -ENOMEM;
@@ -147,20 +199,42 @@ static int init_symlink(struct inode * inode)
 	return 0;
 }
 
-static int create_dir(struct kobject * k, struct dentry * p,
+static int create_dir(struct kobject * kobj, struct dentry * p,
 		      const char * n, struct dentry ** d)
 {
+	struct sysfs_ops *ops = NULL;
+	struct sysfs_dir *sdir;
 	int error;
 	umode_t mode = S_IFDIR| S_IRWXU | S_IRUGO | S_IXUGO;
 
+	/* if the kobject has no ktype, then we assume that it is a
+	 * subsystem itself, and use ops for it.
+	 */
+	if (kobj->kset && kobj->kset->ktype)
+		ops = kobj->kset->ktype->sysfs_ops;
+	else if (kobj->ktype)
+		ops = kobj->ktype->sysfs_ops;
+	else
+		ops = &subsys_sysfs_ops;
+
+	/* allocate and initialize sysfs_dir */
+	sdir = kzalloc(sizeof(*sdir), GFP_KERNEL);
+	if (!sdir)
+		return -ENOMEM;
+
+	init_rwsem(&sdir->rwsem);
+	sdir->kobj = kobj;
+	sdir->ops = ops;
+
+	/* allocate and link sysfs_dirent */
 	mutex_lock(&p->d_inode->i_mutex);
 	*d = lookup_one_len(n, p, strlen(n));
 	if (!IS_ERR(*d)) {
  		if (sysfs_dirent_exist(p->d_fsdata, n))
   			error = -EEXIST;
   		else
-			error = sysfs_make_dirent(p->d_fsdata, *d, k, mode,
-								SYSFS_DIR);
+			error = sysfs_make_dirent(p->d_fsdata, *d, sdir, mode,
+						  SYSFS_DIR);
 		if (!error) {
 			error = sysfs_create(*d, mode, init_dir);
 			if (!error) {
@@ -181,6 +255,9 @@ static int create_dir(struct kobject * k, struct dentry * p,
 	} else
 		error = PTR_ERR(*d);
 	mutex_unlock(&p->d_inode->i_mutex);
+
+	if (error)
+		kfree(sdir);
 	return error;
 }
 
@@ -304,11 +381,21 @@ const struct inode_operations sysfs_dir_inode_operations = {
 static void remove_dir(struct dentry * d)
 {
 	struct dentry * parent = dget(d->d_parent);
-	struct sysfs_dirent * sd;
+	struct sysfs_dirent * sd = d->d_fsdata;
+	struct sysfs_dir * sdir = sd->s_element;
+
+	down_write(&sdir->rwsem);
+	/* clear kobject->dentry only if @d is not a shadow */
+	if (sdir->kobj->dentry == d) {
+		spin_lock(&dcache_lock);
+		sdir->kobj->dentry = NULL;
+		spin_unlock(&dcache_lock);
+	}
+	sdir->kobj = NULL;
+	up_write(&sdir->rwsem);
 
 	mutex_lock(&parent->d_inode->i_mutex);
 	d_delete(d);
-	sd = d->d_fsdata;
  	list_del_init(&sd->s_sibling);
 	sysfs_put(sd);
 	if (d->d_inode)
@@ -367,7 +454,6 @@ static void __sysfs_remove_dir(struct dentry *dentry)
 void sysfs_remove_dir(struct kobject * kobj)
 {
 	__sysfs_remove_dir(kobj->dentry);
-	kobj->dentry = NULL;
 }
 
 int sysfs_rename_dir(struct kobject * kobj, struct dentry *new_parent,
@@ -636,13 +722,14 @@ int sysfs_make_shadowed_dir(struct kobject *kobj,
 
 struct dentry *sysfs_create_shadow_dir(struct kobject *kobj)
 {
-	struct sysfs_dirent *sd;
-	struct dentry *parent, *dir, *shadow;
-	struct inode *inode;
+	struct dentry *dir = kobj->dentry;
+	struct inode *inode = dir->d_inode;
+	struct dentry *parent = dir->d_parent;
+	struct sysfs_dirent *parent_sd = parent->d_fsdata;
+	struct sysfs_dir *parent_sdir = parent_sd->s_element;
+	struct dentry *shadow = NULL;
+	struct sysfs_dir *sdir = NULL;
 
-	dir = kobj->dentry;
-	inode = dir->d_inode;
-	parent = dir->d_parent;
 	shadow = ERR_PTR(-EINVAL);
 	if (!sysfs_is_shadowed_inode(inode))
 		goto out;
@@ -651,8 +738,16 @@ struct dentry *sysfs_create_shadow_dir(struct kobject *kobj)
 	if (!shadow)
 		goto nomem;
 
-	sd = __sysfs_make_dirent(shadow, kobj, inode->i_mode, SYSFS_DIR);
-	if (!sd)
+	sdir = kzalloc(sizeof(*sdir), GFP_KERNEL);
+	if (!sdir)
+		goto nomem;
+
+	init_rwsem(&sdir->rwsem);
+	sdir->kobj = kobj;
+	sdir->ops = parent_sdir->ops;
+
+	if (!__sysfs_make_dirent(parent_sd, shadow, sdir, inode->i_mode,
+				 SYSFS_DIR))
 		goto nomem;
 
 	d_instantiate(shadow, igrab(inode));
@@ -666,6 +761,7 @@ out:
 	return shadow;
 nomem:
 	dput(shadow);
+	kfree(sdir);
 	shadow = ERR_PTR(-ENOMEM);
 	goto out;
 }
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index fc46333..88a1d94 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -46,34 +46,19 @@ subsys_attr_store(struct kobject * kobj, struct attribute * attr,
 	return ret;
 }
 
-static struct sysfs_ops subsys_sysfs_ops = {
+struct sysfs_ops subsys_sysfs_ops = {
 	.show	= subsys_attr_show,
 	.store	= subsys_attr_store,
 };
 
-/**
- *	add_to_collection - add buffer to a collection
- *	@buffer:	buffer to be added
- *	@node:		inode of set to add to
- */
-
-static inline void
-add_to_collection(struct sysfs_buffer *buffer, struct inode *node)
-{
-	struct sysfs_buffer_collection *set = node->i_private;
-
-	mutex_lock(&node->i_mutex);
-	list_add(&buffer->associates, &set->associates);
-	mutex_unlock(&node->i_mutex);
-}
-
-static inline void
-remove_from_collection(struct sysfs_buffer *buffer, struct inode *node)
-{
-	mutex_lock(&node->i_mutex);
-	list_del(&buffer->associates);
-	mutex_unlock(&node->i_mutex);
-}
+struct sysfs_buffer {
+	size_t				count;
+	loff_t				pos;
+	char				* page;
+	struct semaphore		sem;
+	int				needs_read_fill;
+	int				event;
+};
 
 /**
  *	fill_read_buffer - allocate and fill buffer from object.
@@ -88,10 +73,11 @@ remove_from_collection(struct sysfs_buffer *buffer, struct inode *node)
  */
 static int fill_read_buffer(struct dentry * dentry, struct sysfs_buffer * buffer)
 {
-	struct sysfs_dirent * sd = dentry->d_fsdata;
+	struct sysfs_dirent * attr_sd = dentry->d_fsdata;
+	struct sysfs_dirent * sd = dentry->d_parent->d_fsdata;
+	struct sysfs_dir * sdir = sd->s_element;
 	struct attribute * attr = to_attr(dentry);
-	struct kobject * kobj = to_kobj(dentry->d_parent);
-	struct sysfs_ops * ops = buffer->ops;
+	struct kobject * kobj;
 	int ret = 0;
 	ssize_t count;
 
@@ -100,8 +86,15 @@ static int fill_read_buffer(struct dentry * dentry, struct sysfs_buffer * buffer
 	if (!buffer->page)
 		return -ENOMEM;
 
-	buffer->event = atomic_read(&sd->s_event);
-	count = ops->show(kobj,attr,buffer->page);
+	kobj = sysfs_get_kobj(sd);
+	if (!kobj)
+		return -ENODEV;
+
+	buffer->event = atomic_read(&attr_sd->s_event);
+	count = sdir->ops->show(kobj,attr,buffer->page);
+
+	sysfs_put_kobj(sd);
+
 	BUG_ON(count > (ssize_t)PAGE_SIZE);
 	if (count >= 0) {
 		buffer->needs_read_fill = 0;
@@ -169,10 +162,7 @@ sysfs_read_file(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 
 	down(&buffer->sem);
 	if (buffer->needs_read_fill) {
-		if (buffer->orphaned)
-			retval = -ENODEV;
-		else
-			retval = fill_read_buffer(file->f_path.dentry,buffer);
+		retval = fill_read_buffer(file->f_path.dentry,buffer);
 		if (retval)
 			goto out;
 	}
@@ -229,11 +219,21 @@ fill_write_buffer(struct sysfs_buffer * buffer, const char __user * buf, size_t
 static int 
 flush_write_buffer(struct dentry * dentry, struct sysfs_buffer * buffer, size_t count)
 {
+	struct sysfs_dirent * sd = dentry->d_parent->d_fsdata;
+	struct sysfs_dir * sdir = sd->s_element;
 	struct attribute * attr = to_attr(dentry);
-	struct kobject * kobj = to_kobj(dentry->d_parent);
-	struct sysfs_ops * ops = buffer->ops;
+	struct kobject * kobj;
+	int rc;
+
+	kobj = sysfs_get_kobj(sd);
+	if (!kobj)
+		return -ENODEV;
+
+	rc = sdir->ops->store(kobj,attr,buffer->page,count);
+
+	sysfs_put_kobj(sd);
 
-	return ops->store(kobj,attr,buffer->page,count);
+	return rc;
 }
 
 
@@ -261,65 +261,41 @@ sysfs_write_file(struct file *file, const char __user *buf, size_t count, loff_t
 	ssize_t len;
 
 	down(&buffer->sem);
-	if (buffer->orphaned) {
-		len = -ENODEV;
-		goto out;
-	}
+
 	len = fill_write_buffer(buffer, buf, count);
 	if (len > 0)
 		len = flush_write_buffer(file->f_path.dentry, buffer, len);
 	if (len > 0)
 		*ppos += len;
-out:
+
 	up(&buffer->sem);
 	return len;
 }
 
 static int sysfs_open_file(struct inode *inode, struct file *file)
 {
-	struct kobject *kobj = sysfs_get_kobject(file->f_path.dentry->d_parent);
+	struct sysfs_dirent *sd = sysfs_get(file->f_path.dentry->d_parent->d_fsdata);
 	struct attribute * attr = to_attr(file->f_path.dentry);
-	struct sysfs_buffer_collection *set;
+	struct sysfs_dir *sdir;
 	struct sysfs_buffer * buffer;
-	struct sysfs_ops * ops = NULL;
 	int error = 0;
 
-	if (!kobj || !attr)
+	if (!attr)
 		goto Einval;
 
-	/* Grab the module reference for this attribute if we have one */
-	if (!try_module_get(attr->owner)) {
-		error = -ENODEV;
-		goto Done;
-	}
-
-	/* if the kobject has no ktype, then we assume that it is a subsystem
-	 * itself, and use ops for it.
-	 */
-	if (kobj->kset && kobj->kset->ktype)
-		ops = kobj->kset->ktype->sysfs_ops;
-	else if (kobj->ktype)
-		ops = kobj->ktype->sysfs_ops;
-	else
-		ops = &subsys_sysfs_ops;
+	sdir = sd->s_element;
 
 	/* No sysfs operations, either from having no subsystem,
 	 * or the subsystem have no operations.
 	 */
-	if (!ops)
-		goto Eaccess;
-
-	/* make sure we have a collection to add our buffers to */
-	mutex_lock(&inode->i_mutex);
-	if (!(set = inode->i_private)) {
-		if (!(set = inode->i_private = kmalloc(sizeof(struct sysfs_buffer_collection), GFP_KERNEL))) {
-			error = -ENOMEM;
-			goto Done;
-		} else {
-			INIT_LIST_HEAD(&set->associates);
-		}
+	if (!sdir->ops)
+		return -EINVAL;
+
+	/* Grab the module reference for this attribute if we have one */
+	if (!try_module_get(attr->owner)) {
+		error = -ENODEV;
+		goto Done;
 	}
-	mutex_unlock(&inode->i_mutex);
 
 	/* File needs write support.
 	 * The inode's perms must say it's ok, 
@@ -327,7 +303,7 @@ static int sysfs_open_file(struct inode *inode, struct file *file)
 	 */
 	if (file->f_mode & FMODE_WRITE) {
 
-		if (!(inode->i_mode & S_IWUGO) || !ops->store)
+		if (!(inode->i_mode & S_IWUGO) || !sdir->ops->store)
 			goto Eaccess;
 
 	}
@@ -337,7 +313,7 @@ static int sysfs_open_file(struct inode *inode, struct file *file)
 	 * must be a show method for it.
 	 */
 	if (file->f_mode & FMODE_READ) {
-		if (!(inode->i_mode & S_IRUGO) || !ops->show)
+		if (!(inode->i_mode & S_IRUGO) || !sdir->ops->show)
 			goto Eaccess;
 	}
 
@@ -346,11 +322,8 @@ static int sysfs_open_file(struct inode *inode, struct file *file)
 	 */
 	buffer = kzalloc(sizeof(struct sysfs_buffer), GFP_KERNEL);
 	if (buffer) {
-		INIT_LIST_HEAD(&buffer->associates);
 		init_MUTEX(&buffer->sem);
 		buffer->needs_read_fill = 1;
-		buffer->ops = ops;
-		add_to_collection(buffer, inode);
 		file->private_data = buffer;
 	} else
 		error = -ENOMEM;
@@ -364,20 +337,18 @@ static int sysfs_open_file(struct inode *inode, struct file *file)
 	module_put(attr->owner);
  Done:
 	if (error)
-		kobject_put(kobj);
+		sysfs_put(sd);
 	return error;
 }
 
 static int sysfs_release(struct inode * inode, struct file * filp)
 {
-	struct kobject * kobj = to_kobj(filp->f_path.dentry->d_parent);
+	struct sysfs_dirent * sd = filp->f_path.dentry->d_parent->d_fsdata;
 	struct attribute * attr = to_attr(filp->f_path.dentry);
 	struct module * owner = attr->owner;
 	struct sysfs_buffer * buffer = filp->private_data;
 
-	if (buffer)
-		remove_from_collection(buffer, inode);
-	kobject_put(kobj);
+	sysfs_put(sd);
 	/* After this point, attr should not be accessed. */
 	module_put(owner);
 
@@ -406,18 +377,25 @@ static int sysfs_release(struct inode * inode, struct file * filp)
 static unsigned int sysfs_poll(struct file *filp, poll_table *wait)
 {
 	struct sysfs_buffer * buffer = filp->private_data;
-	struct kobject * kobj = to_kobj(filp->f_path.dentry->d_parent);
 	struct sysfs_dirent * sd = filp->f_path.dentry->d_fsdata;
-	int res = 0;
+	struct kobject *kobj;
+
+	kobj = sysfs_get_kobj(sd);
+	if (!kobj)
+		goto trigger;
 
 	poll_wait(filp, &kobj->poll, wait);
 
-	if (buffer->event != atomic_read(&sd->s_event)) {
-		res = POLLERR|POLLPRI;
-		buffer->needs_read_fill = 1;
-	}
+	sysfs_put_kobj(sd);
 
-	return res;
+	if (buffer->event != atomic_read(&sd->s_event))
+		goto trigger;
+
+	return 0;
+
+ trigger:
+	buffer->needs_read_fill = 1;
+	return POLLERR|POLLPRI;
 }
 
 
diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
index 4de5c6b..441a757 100644
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -220,24 +220,6 @@ const unsigned char * sysfs_get_name(struct sysfs_dirent *sd)
 	return NULL;
 }
 
-static inline void orphan_all_buffers(struct inode *node)
-{
-	struct sysfs_buffer_collection *set;
-	struct sysfs_buffer *buf;
-
-	mutex_lock_nested(&node->i_mutex, I_MUTEX_CHILD);
-	set = node->i_private;
-	if (set) {
-		list_for_each_entry(buf, &set->associates, associates) {
-			down(&buf->sem);
-			buf->orphaned = 1;
-			up(&buf->sem);
-		}
-	}
-	mutex_unlock(&node->i_mutex);
-}
-
-
 /*
  * Unhashes the dentry corresponding to given sysfs_dirent
  * Called with parent inode's i_mutex held.
@@ -245,23 +227,16 @@ static inline void orphan_all_buffers(struct inode *node)
 void sysfs_drop_dentry(struct sysfs_dirent * sd, struct dentry * parent)
 {
 	struct dentry * dentry = sd->s_dentry;
-	struct inode *inode;
 
 	if (dentry) {
 		spin_lock(&dcache_lock);
 		spin_lock(&dentry->d_lock);
 		if (!(d_unhashed(dentry) && dentry->d_inode)) {
-			inode = dentry->d_inode;
-			spin_lock(&inode->i_lock);
-			__iget(inode);
-			spin_unlock(&inode->i_lock);
 			dget_locked(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
 			simple_unlink(parent->d_inode, dentry);
-			orphan_all_buffers(inode);
-			iput(inode);
 		} else {
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 23a48a3..5541041 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -19,15 +19,13 @@ struct vfsmount *sysfs_mount;
 struct super_block * sysfs_sb = NULL;
 struct kmem_cache *sysfs_dir_cachep;
 
-static void sysfs_clear_inode(struct inode *inode);
-
 static const struct super_operations sysfs_ops = {
 	.statfs		= simple_statfs,
 	.drop_inode	= sysfs_delete_inode,
-	.clear_inode	= sysfs_clear_inode,
 };
 
 static struct sysfs_dirent sysfs_root = {
+	.s_count	= ATOMIC_INIT(1),
 	.s_sibling	= LIST_HEAD_INIT(sysfs_root.s_sibling),
 	.s_children	= LIST_HEAD_INIT(sysfs_root.s_children),
 	.s_element	= NULL,
@@ -35,11 +33,6 @@ static struct sysfs_dirent sysfs_root = {
 	.s_iattr	= NULL,
 };
 
-static void sysfs_clear_inode(struct inode *inode)
-{
-	kfree(inode->i_private);
-}
-
 static int sysfs_fill_super(struct super_block *sb, void *data, int silent)
 {
 	struct inode *inode;
diff --git a/fs/sysfs/symlink.c b/fs/sysfs/symlink.c
index 7b9c5bf..ddc067d 100644
--- a/fs/sysfs/symlink.c
+++ b/fs/sysfs/symlink.c
@@ -11,43 +11,65 @@
 
 #include "sysfs.h"
 
-static int object_depth(struct kobject * kobj)
+static int object_depth(struct sysfs_dirent * sd)
 {
-	struct kobject * p = kobj;
+	struct sysfs_dirent * p;
 	int depth = 0;
-	do { depth++; } while ((p = p->parent));
+
+	for (p = sd; p->s_parent; p = p->s_parent)
+		depth++;
+
 	return depth;
 }
 
-static int object_path_length(struct kobject * kobj)
+static int object_path_length(struct sysfs_dirent * sd)
 {
-	struct kobject * p = kobj;
+	struct sysfs_dirent * p;
 	int length = 1;
-	do {
-		length += strlen(kobject_name(p)) + 1;
-		p = p->parent;
-	} while (p);
+
+	for (p = sd; p->s_parent; p = p->s_parent) {
+		struct kobject *kobj;
+
+		kobj = sysfs_get_kobj(p);
+		if (!kobj)
+			return -ENOENT;
+
+		length += strlen(kobject_name(kobj)) + 1;
+
+		sysfs_put_kobj(p);
+	}
+
 	return length;
 }
 
-static void fill_object_path(struct kobject * kobj, char * buffer, int length)
+static int fill_object_path(struct sysfs_dirent * sd, char * buffer, int length)
 {
-	struct kobject * p;
+	struct sysfs_dirent * p;
 
 	--length;
-	for (p = kobj; p; p = p->parent) {
-		int cur = strlen(kobject_name(p));
+	for (p = sd; p->s_parent; p = p->s_parent) {
+		struct kobject *kobj = sysfs_get_kobj(p);
+		int cur;
+
+		if (!kobj)
+			return -ENOENT;
+
+		cur = strlen(kobject_name(kobj));
 
 		/* back up enough to print this bus id with '/' */
 		length -= cur;
-		strncpy(buffer + length,kobject_name(p),cur);
+		strncpy(buffer + length, kobject_name(kobj), cur);
 		*(buffer + --length) = '/';
+
+		sysfs_put_kobj(p);
 	}
+
+	return 0;
 }
 
-static int sysfs_add_link(struct dentry * parent, const char * name, struct kobject * target)
+static int sysfs_add_link(struct sysfs_dirent * parent_sd, const char * name,
+			  struct sysfs_dirent * target_sd)
 {
-	struct sysfs_dirent * parent_sd = parent->d_fsdata;
 	struct sysfs_symlink * sl;
 	int error = 0;
 
@@ -61,14 +83,13 @@ static int sysfs_add_link(struct dentry * parent, const char * name, struct kobj
 		goto exit2;
 
 	strcpy(sl->link_name, name);
-	sl->target_kobj = kobject_get(target);
+	sl->target_sd = sysfs_get(target_sd);
 
 	error = sysfs_make_dirent(parent_sd, NULL, sl, S_IFLNK|S_IRWXUGO,
 				SYSFS_KOBJ_LINK);
 	if (!error)
 		return 0;
 
-	kobject_put(target);
 	kfree(sl->link_name);
 exit2:
 	kfree(sl);
@@ -85,6 +106,8 @@ exit1:
 int sysfs_create_link(struct kobject * kobj, struct kobject * target, const char * name)
 {
 	struct dentry *dentry = NULL;
+	struct sysfs_dirent *parent_sd = NULL;
+	struct sysfs_dirent *target_sd = NULL;
 	int error = -EEXIST;
 
 	BUG_ON(!name);
@@ -97,11 +120,26 @@ int sysfs_create_link(struct kobject * kobj, struct kobject * target, const char
 
 	if (!dentry)
 		return -EFAULT;
+	parent_sd = dentry->d_fsdata;
+
+	/* target->dentry can go away beneath us but is protected with
+	 * dcache_lock.  Fetch target_sd from it.
+	 */
+	spin_lock(&dcache_lock);
+	if (target->dentry)
+		target_sd = sysfs_get(target->dentry->d_fsdata);
+	spin_unlock(&dcache_lock);
+
+	if (!target_sd)
+		return -ENOENT;
 
 	mutex_lock(&dentry->d_inode->i_mutex);
 	if (!sysfs_dirent_exist(dentry->d_fsdata, name))
-		error = sysfs_add_link(dentry, name, target);
+		error = sysfs_add_link(parent_sd, name, target_sd);
 	mutex_unlock(&dentry->d_inode->i_mutex);
+
+	sysfs_put(target_sd);
+
 	return error;
 }
 
@@ -117,14 +155,14 @@ void sysfs_remove_link(struct kobject * kobj, const char * name)
 	sysfs_hash_and_remove(kobj->dentry,name);
 }
 
-static int sysfs_get_target_path(struct kobject * kobj, struct kobject * target,
-				 char *path)
+static int sysfs_get_target_path(struct sysfs_dirent * parent_sd,
+				 struct sysfs_dirent * target_sd, char *path)
 {
 	char * s;
-	int depth, size;
+	int depth, size, rc;
 
-	depth = object_depth(kobj);
-	size = object_path_length(target) + depth * 3 - 1;
+	depth = object_depth(parent_sd);
+	size = object_path_length(target_sd) + depth * 3 - 1;
 	if (size > PATH_MAX)
 		return -ENAMETOOLONG;
 
@@ -133,35 +171,24 @@ static int sysfs_get_target_path(struct kobject * kobj, struct kobject * target,
 	for (s = path; depth--; s += 3)
 		strcpy(s,"../");
 
-	fill_object_path(target, path, size);
-	pr_debug("%s: path = '%s'\n", __FUNCTION__, path);
+	rc = fill_object_path(target_sd, path, size);
+	pr_debug("%s: rc = %d path = '%s'\n", __FUNCTION__, rc, path);
 
-	return 0;
+	return rc;
 }
 
 static int sysfs_getlink(struct dentry *dentry, char * path)
 {
-	struct kobject *kobj, *target_kobj;
-	int error = 0;
-
-	kobj = sysfs_get_kobject(dentry->d_parent);
-	if (!kobj)
-		return -EINVAL;
-
-	target_kobj = sysfs_get_kobject(dentry);
-	if (!target_kobj) {
-		kobject_put(kobj);
-		return -EINVAL;
-	}
+	struct sysfs_dirent *parent_sd = dentry->d_parent->d_fsdata;
+	struct sysfs_dirent *sd = dentry->d_fsdata;
+	struct sysfs_symlink *sl = sd->s_element;
+	int error;
 
 	down_read(&sysfs_rename_sem);
-	error = sysfs_get_target_path(kobj, target_kobj, path);
+	error = sysfs_get_target_path(parent_sd, sl->target_sd, path);
 	up_read(&sysfs_rename_sem);
-	
-	kobject_put(kobj);
-	kobject_put(target_kobj);
-	return error;
 
+	return error;
 }
 
 static void *sysfs_follow_link(struct dentry *dentry, struct nameidata *nd)
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index a77c57e..912a6e9 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -1,5 +1,6 @@
 struct sysfs_dirent {
 	atomic_t		s_count;
+	struct sysfs_dirent	* s_parent;
 	struct list_head	s_sibling;
 	struct list_head	s_children;
 	void 			* s_element;
@@ -10,13 +11,28 @@ struct sysfs_dirent {
 	atomic_t		s_event;
 };
 
+struct sysfs_dir {
+	struct rw_semaphore	rwsem;	/* protects kobj and kobj->dentry */
+	struct kobject		* kobj;	/* associated kobj */
+	struct sysfs_ops	* ops;
+};
+
+struct sysfs_symlink {
+	char			* link_name;
+	struct sysfs_dirent	* target_sd;
+};
+
 extern struct vfsmount * sysfs_mount;
 extern struct kmem_cache *sysfs_dir_cachep;
+extern struct sysfs_ops subsys_sysfs_ops;
 
 extern void sysfs_delete_inode(struct inode *inode);
 extern struct inode * sysfs_new_inode(mode_t mode, struct sysfs_dirent *);
 extern int sysfs_create(struct dentry *, int mode, int (*init)(struct inode *));
 
+extern void release_sysfs_dirent(struct sysfs_dirent * sd);
+extern struct kobject *sysfs_get_kobj(struct sysfs_dirent *sd);
+extern void sysfs_put_kobj(struct sysfs_dirent *sd);
 extern int sysfs_dirent_exist(struct sysfs_dirent *, const unsigned char *);
 extern int sysfs_make_dirent(struct sysfs_dirent *, struct dentry *, void *,
 				umode_t, int);
@@ -40,33 +56,6 @@ extern const struct file_operations bin_fops;
 extern const struct inode_operations sysfs_dir_inode_operations;
 extern const struct inode_operations sysfs_symlink_inode_operations;
 
-struct sysfs_symlink {
-	char * link_name;
-	struct kobject * target_kobj;
-};
-
-struct sysfs_buffer {
-	struct list_head		associates;
-	size_t				count;
-	loff_t				pos;
-	char				* page;
-	struct sysfs_ops		* ops;
-	struct semaphore		sem;
-	int				orphaned;
-	int				needs_read_fill;
-	int				event;
-};
-
-struct sysfs_buffer_collection {
-	struct list_head	associates;
-};
-
-static inline struct kobject * to_kobj(struct dentry * dentry)
-{
-	struct sysfs_dirent * sd = dentry->d_fsdata;
-	return ((struct kobject *) sd->s_element);
-}
-
 static inline struct attribute * to_attr(struct dentry * dentry)
 {
 	struct sysfs_dirent * sd = dentry->d_fsdata;
@@ -79,36 +68,6 @@ static inline struct bin_attribute * to_bin_attr(struct dentry * dentry)
 	return ((struct bin_attribute *) sd->s_element);
 }
 
-static inline struct kobject *sysfs_get_kobject(struct dentry *dentry)
-{
-	struct kobject * kobj = NULL;
-
-	spin_lock(&dcache_lock);
-	if (!d_unhashed(dentry)) {
-		struct sysfs_dirent * sd = dentry->d_fsdata;
-		if (sd->s_type & SYSFS_KOBJ_LINK) {
-			struct sysfs_symlink * sl = sd->s_element;
-			kobj = kobject_get(sl->target_kobj);
-		} else
-			kobj = kobject_get(sd->s_element);
-	}
-	spin_unlock(&dcache_lock);
-
-	return kobj;
-}
-
-static inline void release_sysfs_dirent(struct sysfs_dirent * sd)
-{
-	if (sd->s_type & SYSFS_KOBJ_LINK) {
-		struct sysfs_symlink * sl = sd->s_element;
-		kfree(sl->link_name);
-		kobject_put(sl->target_kobj);
-		kfree(sl);
-	}
-	kfree(sd->s_iattr);
-	kmem_cache_free(sysfs_dir_cachep, sd);
-}
-
 static inline struct sysfs_dirent * sysfs_get(struct sysfs_dirent * sd)
 {
 	if (sd) {
@@ -120,7 +79,7 @@ static inline struct sysfs_dirent * sysfs_get(struct sysfs_dirent * sd)
 
 static inline void sysfs_put(struct sysfs_dirent * sd)
 {
-	if (atomic_dec_and_test(&sd->s_count))
+	if (sd && atomic_dec_and_test(&sd->s_count))
 		release_sysfs_dirent(sd);
 }
 

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-04-01 19:59       ` Tejun Heo
@ 2007-04-02  9:20         ` Cornelia Huck
  2007-04-02 15:34           ` Cornelia Huck
  2007-04-02  9:33         ` Greg KH
  2007-04-02 12:10         ` Maneesh Soni
  2 siblings, 1 reply; 40+ messages in thread
From: Cornelia Huck @ 2007-04-02  9:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: James Bottomley, gregkh, hugh, dmitry.torokhov, oneukum, maneesh,
	rpurdie, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

On Mon, 2 Apr 2007 04:59:43 +0900,
Tejun Heo <htejun@gmail.com> wrote:

> Okay, here's preliminary patch.  I've tested it very lightly so it's
> likely to contain bugs and definitely needs to be splitted.

Cool. However, there's something fishy there (not sure whether it's in
your patch or a latent bug in the ccw bus code that just has been
uncovered):

------------[ cut here ]------------
kernel BUG at mm/slab.c:607!
illegal operation: 0001 [#1]
CPU:    1    Not tainted
Process kslowcrw (pid: 36, task: 0000000001a80cc0, ksp:
0000000001a87ce0) Krnl PSW : 0404000180000000 00000000000b21b0
(kfree+0x144/0x154) R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0
EA:3 Krnl GPRS: 0000000000000000 0000000000000000 00000000009836d8
000003fffd22ffe0 00000000000b20aa 0000000000396ba8 0000000001a6d100
0000000000986c88 0700000001a879d8 0000000000555428 000000000344a5b0
0000000001a879c8 000000000344a500 0000000000390ca8 00000000000b20aa
0000000001a879c8 Krnl Code: 00000000000b21a2: e392c0480024
stg     %r9,72(%r2,%r12) 00000000000b21a8: a7f4ffc9           brc
15,1000b213a 00000000000b21ac: a7f40001           brc     15,b21ae
          >00000000000b21b0: a7f4ff9d           brc     15,1000b20ea
           00000000000b21b4: e33030100004       lg      %r3,16(%r3)
           00000000000b21ba: a7f4ff90           brc     15,1000b20da
           00000000000b21be: 0707               bcr     0,%r7
           00000000000b21c0: eb8ff0580024       stmg
%r8,%r15,88(%r15) Call Trace:
([<00000000000b20aa>] kfree+0x3e/0x154)
 [<00000000001166f2>] release_sysfs_dirent+0x3e/0xf4
 [<0000000000114914>] sysfs_hash_and_remove+0x150/0x154
 [<0000000000118aec>] remove_files+0x48/0x68
 [<0000000000118b84>] sysfs_remove_group+0x78/0xe8
 [<00000000001cce3c>] device_remove_groups+0x48/0x68
 [<00000000001cd4c0>] device_remove_attrs+0x3c/0x7c
 [<00000000001cd70e>] device_del+0x20e/0x3a8
 [<00000000001cd8d2>] device_unregister+0x2a/0x44
 [<000000000022fb2c>] css_sch_device_unregister+0x3c/0x54
 [<0000000000230124>] css_evaluate_subchannel+0x2f0/0x400
 [<000000000023041a>] css_trigger_slow_path+0xda/0x154
 [<0000000000054c9a>] run_workqueue+0x136/0x1fc
 [<0000000000054e3a>] worker_thread+0xda/0x170
 [<000000000005b05a>] kthread+0x122/0x15c
 [<0000000000019912>] kernel_thread_starter+0x6/0xc
 [<000000000001990c>] kernel_thread_starter+0x0/0xc

This happens when I detach a device (which causes it to be
unregistered). I'll look at it when I'm fully operational.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-04-01 19:59       ` Tejun Heo
  2007-04-02  9:20         ` Cornelia Huck
@ 2007-04-02  9:33         ` Greg KH
  2007-04-02 12:10         ` Maneesh Soni
  2 siblings, 0 replies; 40+ messages in thread
From: Greg KH @ 2007-04-02  9:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: James Bottomley, hugh, cornelia.huck, dmitry.torokhov, oneukum,
	maneesh, rpurdie, Jeff Garzik, lkml, linux-ide,
	SCSI Mailing List

On Mon, Apr 02, 2007 at 04:59:43AM +0900, Tejun Heo wrote:
> 
> Dependencies between sysfs/kobject objects are clearer now and I think
> I got the locking and referencing correct.  This patch immediately
> fixes 'sysfs attr grabbing the wrong kobject and module' problem -
> sysfs and module lifetimes are unrelated now.  We can drop
> half-working attr->owner.
> 
> * A sysfs node no longer hold reference to its kobject.  It just
>   attaches itself to the kobject on creation and detaches on deletion.
> 
> * For a dir node, sysfs_dirent no longer directly points to kobject.
>   It points to sysfs_dir which contains pointer to kobject and a rwsem
>   to protect it.
> 
> * An open file doesn't hold a reference to kobject.  It holds a
>   reference to sysfs_dirent.  kobject pointer is verified and
>   show/store are performed with rwsem read-locked.  Deletion
>   disconnects the sysfs from its kobject while the rwsem is
>   write-locked.  This mechanism replaces buffer orphaning and kobject
>   validation during open.

Ah, very nice.

> * attr ops is determined on sysfs node creation not on open.  This is
>   a step toward making kobject opaque in sysfs.
> 
> * bin files are handled similarly but mmap makes the open file hold
>   reference to the kobject till it's closed.  Any better ideas?

This is probably needed, and might be acceptable, and I think we can
live with it as there is only a very small number of sysfs files that
fall into this category.

> * symlink is reimplemented in terms of sysfs_dirents instead of
>   kobjects.  sysfs_dirent->s_parent is added and each sysfs_dirent
>   holds reference to its parent.  Currently walking up the tree
>   requires read locking and unlocking sysfs_dir at each level.  This
>   can be removed if name is added to sysfs_dirent.
> 
> * As kobject can be disconnected anytime and sysfs code still needs to
>   look follow dentry link in kobject, kobject->dentry is protected by
>   dcache_lock.  Once kobject becomes opaque to sysfs, this hack can go
>   away.  All in all, making kobject completely opaque in sysfs isn't
>   too far away after this patch although it would require mass code
>   update.

What would need to be updated after this?

> What do you think about this approach?  If it's acceptable, I'll test
> further and split the patch into logical steps to get it reviewed
> better.

At first glance, I think it looks fine, but Maneesh is the one who
understands this code better than anyone, so I would like to get his
opinion on it.

I think you should start splitting it all up into steps, which will help
us review it a whole lot easier, as overall, I think this is a very
worthy goal.

thanks so much for doing this work,

greg k-h

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-04-01 19:59       ` Tejun Heo
  2007-04-02  9:20         ` Cornelia Huck
  2007-04-02  9:33         ` Greg KH
@ 2007-04-02 12:10         ` Maneesh Soni
  2 siblings, 0 replies; 40+ messages in thread
From: Maneesh Soni @ 2007-04-02 12:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: James Bottomley, gregkh, hugh, cornelia.huck, dmitry.torokhov,
	oneukum, rpurdie, Jeff Garzik, lkml, linux-ide,
	SCSI Mailing List, Andrew Morton

On Mon, Apr 02, 2007 at 04:59:43AM +0900, Tejun Heo wrote:
> Hello, James, Greg.
> 
> On Fri, Mar 30, 2007 at 01:19:34PM -0500, James Bottomley wrote:
> > That's sort of what I was reaching for too ... it just looks to me that
> > all the sysfs glue is in kobject, so they make a good candidate for the
> > pure sysfs objects.  Whatever we do, there has to be breakable cross
> > linking between the driver's tree and the sysfs representation ... of
> > course, nasty things like ksets get in the way of considering the
> > objects as separate, sigh.
> 
> The dependency between kobject and sysfs doesn't seem to be that
> tight.  I managed to break them such that sysfs can disconnect from
> its kobject on deletion _without_ changing sysfs/kobject API.
> 
> > > This way the sysfs reference counting can be put completely out of
> > > picture without impacting the rest of code.  Handling symlink and
> > > suicidal attributes might need some extra attention but I think they can
> > > be done.
> > 
> > True ... but you'll also have to implement something within sysfs to do
> > refcounting ... it needs to know how long to keep its nodes around.
> 
> sysfs_dirent already had reference counter and plays that role quite
> nicely.
> 
> Okay, here's preliminary patch.  I've tested it very lightly so it's
> likely to contain bugs and definitely needs to be splitted.
> 
> Dependencies between sysfs/kobject objects are clearer now and I think
> I got the locking and referencing correct.  This patch immediately
> fixes 'sysfs attr grabbing the wrong kobject and module' problem -
> sysfs and module lifetimes are unrelated now.  We can drop
> half-working attr->owner.
> 
> * A sysfs node no longer hold reference to its kobject.  It just
>   attaches itself to the kobject on creation and detaches on deletion.
> 

Earlier only sysfs leaf nodes held ref to kobject that is also 
only during open/close for regular files and in create/follow/destroy
symlinks.

> * For a dir node, sysfs_dirent no longer directly points to kobject.
>   It points to sysfs_dir which contains pointer to kobject and a rwsem
>   to protect it.
> 
> * An open file doesn't hold a reference to kobject.  It holds a
>   reference to sysfs_dirent.  kobject pointer is verified and
>   show/store are performed with rwsem read-locked.  Deletion
>   disconnects the sysfs from its kobject while the rwsem is
>   write-locked.  This mechanism replaces buffer orphaning and kobject
>   validation during open.
> 
> * attr ops is determined on sysfs node creation not on open.  This is
>   a step toward making kobject opaque in sysfs.
> 
> * bin files are handled similarly but mmap makes the open file hold
>   reference to the kobject till it's closed.  Any better ideas?
> > * symlink is reimplemented in terms of sysfs_dirents instead of
>   kobjects.  sysfs_dirent->s_parent is added and each sysfs_dirent
>   holds reference to its parent.  Currently walking up the tree
>   requires read locking and unlocking sysfs_dir at each level.  This
>   can be removed if name is added to sysfs_dirent.
> 
> * As kobject can be disconnected anytime and sysfs code still needs to
>   look follow dentry link in kobject, kobject->dentry is protected by
>   dcache_lock.  Once kobject becomes opaque to sysfs, this hack can go
>   away.  All in all, making kobject completely opaque in sysfs isn't
>   too far away after this patch although it would require mass code
>   update.
> 
> What do you think about this approach?  If it's acceptable, I'll test
> further and split the patch into logical steps to get it reviewed
> better.
> 

It does complicate a lot in sysfs code. The code is already quite fragile
considering the races (akpm is facing a couple of them, and which I am 
still working on) from VFS/dentry/inode angle. The previous rework of
sysfs was to make dentries/inodes corresponding to sysfs leaf nodes
reclaimable, i.e. looking at sysfs from VFS angle and IIUC you are 
looking at it from the driver layer to solve races between module/driver-
layer/sysfs, which is a valid purpose.

Following are the few rules in short to take care
1. directory dentries are pinned
2. regular files and symlink dentries are reclaimable
3. sysfs_dirent tree is protected using the parent inode's (ie directory 
   dentry's inode) i_mutex
4. sysfs rename is protected using a semaphore, sysfs_rename_sem also.
5. sysfs_dirent starts with refcount of 1 at the time of creation, refcount
   is incremented/decremented when a dentry is attached or detached. So, the
   refcount at the max can be 2. 
6. symlink creation pins the target kobject, so that the corresponding dentry
   is also pinned and removal releases the kobject.

There were few more additions like sysfs poll and shadow directory support,
for which respective authors can review and I can review the VFS related
changes but please split the patches as you like but one possible way
could be as chunks dealing with changes related to

1. data structure changes
2. creation/removal of directories
3. creation/removal of files
4. creation/removal of symlinks
5. open/read/write/close of a file

As far as testing is considered, doing insmod/rmmod in a tight loop in
parallel with working (find, open/read/writhe/close/ file) on /sys, 
makes a good enough testcase to hit races.

Thanks
Manesh

--
Maneesh Soni
Linux Technology Center,
IBM India Systems and Technology Lab, 
Bangalore, India

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-04-02  9:20         ` Cornelia Huck
@ 2007-04-02 15:34           ` Cornelia Huck
  2007-04-03  3:08             ` Tejun Heo
  0 siblings, 1 reply; 40+ messages in thread
From: Cornelia Huck @ 2007-04-02 15:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: James Bottomley, gregkh, hugh, dmitry.torokhov, oneukum, maneesh,
	rpurdie, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

On Mon, 2 Apr 2007 11:20:48 +0200,
Cornelia Huck <cornelia.huck@de.ibm.com> wrote:

> Cool. However, there's something fishy there (not sure whether it's in
> your patch or a latent bug in the ccw bus code that just has been
> uncovered):

Similar bug when loading/unloading a module that creates a driver
attribute. The winner seems to be kfree(sd->s_element) in
release_sysfs_dirent() (in case of an attribute, it will point to the
attribute structure, which is usually statically created)...

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-31  3:12               ` Tejun Heo
  2007-03-31  3:15                 ` Tejun Heo
  2007-03-31 16:08                 ` Cornelia Huck
@ 2007-04-02 19:24                 ` Luben Tuikov
  2 siblings, 0 replies; 40+ messages in thread
From: Luben Tuikov @ 2007-04-02 19:24 UTC (permalink / raw)
  To: Tejun Heo, Cornelia Huck
  Cc: gregkh, hugh, dmitry.torokhov, oneukum, maneesh, rpurdie,
	James.Bottomley, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

--- Tejun Heo <htejun@gmail.com> wrote:
> Cornelia Huck wrote:
> > On Sat, 31 Mar 2007 00:08:19 +0900,
> > Tejun Heo <htejun@gmail.com> wrote:
> > 
> >> (3) make sure all existing kobjects are released by module exit function.
> >>
> >> For example, let's say there is a hypothetical disk device /dev/dk0
> >> driven by a hypothetical driver mydrv.  /dev/dk0 is represented like the
> >> following in the sysfs tree.
> >>
> >> /sys/devices/pci0000:00/0000:00:1f.0/dk0/{myknob0,myknob1}
> >>
> >> Owner of both attrs myknob0 and myknob1 is mydrv and opening either
> >> increases the reference counts of dk0 and mydrv and closing does the
> >> opposite.
> >>
> >> * When there is no opener of either knob and the /dev/dk0 isn't used by
> >> anyone.  Reference count of dk0 is 1, mydrv 0.
> > 
> > Hm, but as long as dk0 is registered, it can be looked up and someone
> > could get a reference on it.
> 
> Yeah, exactly.  That's why any getting any kobject reference backed by a
> module must be accompanied by try_module_get().

Tejun,

This has always been the case.

By design, the kobject infrastructure is one such that allows for
different behavior implementations.  You could use it to accomplish
what you want and describe in your RFD, or you could use it to implement
what we have today or you could use it to implement completely different
(refcount) infrastructure.

    Luben


> 
> int mydrv_get_dk(struct dk *dk)
> {
> 	rc = try_module_get(mydrv);
> 	if (rc)
> 		return rc;
> 	kobject_get(&dk->kobj);
> 	return 0;
> }
> 
> >> * User issues rmmod mydrv.  As mydrv's reference count is zero, unload
> >> proceeds and mydrv's exit function is called.
> >>
> >> * mydrv's exit function looks like the following.
> >>
> >>   mydrv_exit()
> >>   {
> >> 	sysfs_remove_file(dk0, myknob0);
> >> 	sysfs_remove_file(dk1, myknob1);
> >> 	device_del(dk0);
> >> 	deinit controller;
> >> 	release all resources;
> >>   }
> >>
> >> The device_del(dk0) drops dk0's reference count to zero and its
> >> ->release is invoked immediately.
> > 
> > And here is the problem if someone else still has a reference. The
> > module will be unloaded, but ->release will not be called until the
> > "someone else" gives up the reference...
> 
> Exactly, in that case, module reference count must not be zero.  You and
> I are saying the same thing.  Why are we running in circle?
> 
> -- 
> tejun
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ide" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-03-30 18:19     ` James Bottomley
  2007-04-01 19:59       ` Tejun Heo
@ 2007-04-02 19:33       ` Luben Tuikov
  1 sibling, 0 replies; 40+ messages in thread
From: Luben Tuikov @ 2007-04-02 19:33 UTC (permalink / raw)
  To: James Bottomley, Tejun Heo
  Cc: gregkh, hugh, cornelia.huck, dmitry.torokhov, oneukum, maneesh,
	rpurdie, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

--- James Bottomley <James.Bottomley@SteelEye.com> wrote:

> I'd favour trying to separate kobject and struct device for this ...
> move all the sysfs stuff into kobject and device only stuff into struct
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently the kobject implementation is pure and well-defined.  It is
a good implementation [kobject], and I'd hate to see it lost into being
convoluted with/into another model.

Currently the infrastructure layers are well defined:
   kobject ->  (A layer with objects, their behavor and implementation)
          device ->   (--"--)
                  sysfs.   (--"--)
This isn't that bad of an infrastructure.

It is this well defined layering, i.e. objects, their behavior and
implementation, that allows different (better/worse) infrastructures
to be built on top of it.

It is this well-defined layering which will allow what Tejun wants
to be implemented.

> device ... but that would get us into disentangling the ksets, which, on
> balance, isn't going to be fun ...

     Luben


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-04-02 15:34           ` Cornelia Huck
@ 2007-04-03  3:08             ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2007-04-03  3:08 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: James Bottomley, gregkh, hugh, dmitry.torokhov, oneukum, maneesh,
	rpurdie, Jeff Garzik, lkml, linux-ide, SCSI Mailing List

Cornelia Huck wrote:
> On Mon, 2 Apr 2007 11:20:48 +0200,
> Cornelia Huck <cornelia.huck@de.ibm.com> wrote:
> 
>> Cool. However, there's something fishy there (not sure whether it's in
>> your patch or a latent bug in the ccw bus code that just has been
>> uncovered):
> 
> Similar bug when loading/unloading a module that creates a driver
> attribute. The winner seems to be kfree(sd->s_element) in
> release_sysfs_dirent() (in case of an attribute, it will point to the
> attribute structure, which is usually statically created)...

Thanks for finding it out.  I was suspecting that last minute change.
The code should be

if (dir node)
	kfree(s_element)
else if (symlink node)
	do things and kfree()

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
  2007-04-07 15:48 Alan Stern
@ 2007-04-08  2:55 ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2007-04-08  2:55 UTC (permalink / raw)
  To: Alan Stern
  Cc: Greg KH, hugh, cornelia.huck, Dmitry Torokhov, Oliver Neukum,
	maneesh, rpurdie, James Bottomley, Jeff Garzik,
	Kernel development list

Hello,

Alan Stern wrote:
>>   The problem here is that kobjec_get() in sysfs_schedule_callback()
>>   doesn't grab the module backing the kobject it's grabbing.  By the
>>   time (ss->func)(ss->kobj) runs, scsi_mod is already gone.
> 
> As the author of this routine, I wish you had included my name in your
> CC: list.  :-(

Sorry.  I'll from the next time.  I posted related patchset yesterday.
You can see it at...

  http://thread.gmane.org/gmane.linux.kernel/513334

> The problem here isn't exactly as you described.  scsi_mod needs to be
> pinned (1) because it is the owner of the kobject and hence will be
> called when the kobject is released, and (2) because it is the owner
> of the callback routine.  However this is just a detail; clearly the
> bug needs to be fixed.
> 
> One possibility would be to have scsi_mod's exit_scsi() routine call
> flush_scheduled_work().  Another would be to add such a call in
> sys_delete_module().  Neither of these is attractive.  They would add
> overhead when it's not needed, and they would deadlock if a workqueue
> routine tried to unload a module.
> 
> On balance, the patch below seems better.  Do you agree?

Agreed.  Grabbing module on function schedule should fix the problem.

> With regard to your analysis of lifetime issues, there is a whole
> aspect you did not mention.  A basic assumption of the refcounting
> approach is that once X has a reference to Y, X can freely access and
> use Y as much as it wants until it drops the reference.
> 
> However this is not true when X is a device driver and Y is a device
> structure.  Drivers can be unbound from devices.  If X has been
> unbound from Y then it must not access Y again, no matter how many
> references it possesses.  After all, some other driver may have bound
> to Y in the meantime; this other driver would not appreciate the
> interference.
> 
> Just as bad, if Y represents a hot-pluggable device then some other
> device may have been plugged in and may be using Y's old address.  We
> don't want X sending commands to a new device, thinking that it is Y!
> 
> The complications caused by this requirement affect both the subsystem
> code and device drivers.  Drivers must synchronize their release()
> methods with every action they take -- and refcounts cannot provide
> synchronization.
> 
> A similar problem afflicts the char-device subsystem, and here even
> less care has been taken to address the issues.  The race between
> open() and unregister() is resolved in many places by relying on the
> BKL!
> 
> We should be able to make things better and easier than they are.
> Orphaning open sysfs files was a move in this direction.  But I doubt
> they will ever become truly simple and clear.

Yeap, you're right.  My goal is make driver detach point the automatic
final synchronization point such that after the driver unregisters
itself from all the upper layers including sysfs, it's guaranteed that
there's no user left to the driver or the device it was driving.  Taking
sysfs out of the lifetime equation is a big step toward this goal.
Converting all subsystems and upper layers to immediately disconnect
from the device on driver detach would take some time but I don't think
it will be too difficult.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFD driver-core] Lifetime problems of the current driver model
@ 2007-04-07 15:48 Alan Stern
  2007-04-08  2:55 ` Tejun Heo
  0 siblings, 1 reply; 40+ messages in thread
From: Alan Stern @ 2007-04-07 15:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Greg KH, hugh, cornelia.huck, Dmitry Torokhov, Oliver Neukum,
	maneesh, rpurdie, James Bottomley, Jeff Garzik,
	Kernel development list

On March 30, 2007, Tejun Heo wrote:

> Hello, all.
> 
> This document tries to describe lifetime problems of the current
> device driver model primarily from the point view of device drivers
> and establish consensus, or at least, start discussion about how to
> solve these problems.  This is primarily based on my experience with
> IDE and SCSI layers and my knowledge on other drivers is limited, so I
> might have generalized too aggressively.  Feel free to point out.

...

> Example 1. sysfs_schedule_callback() not grabbing the owning module
> 
>   This function is recently added to be used by suicidal sysfs nodes
>   such that they don't deadlock when trying to unregister themselves.
> 
>  +#include <linux/delay.h>
>   static void sysfs_schedule_callback_work(struct work_struct *work)
>   {
> 	struct sysfs_schedule_callback_struct *ss = container_of(work,
> 			struct sysfs_schedule_callback_struct, work);
> 
>  +	msleep(100);
> 	(ss->func)(ss->data);
> 	kobject_put(ss->kobj);
> 	kfree(ss);
>   }
> 
>   int sysfs_schedule_callback(struct kobject *kobj, void (*func)(void *),
> 		void *data)
>   {
> 	struct sysfs_schedule_callback_struct *ss;
> 
> 	ss = kmalloc(sizeof(*ss), GFP_KERNEL);
> 	if (!ss)
> 		return -ENOMEM;
> 	kobject_get(kobj);
> 	ss->kobj = kobj;
> 	ss->func = func;
> 	ss->data = data;
> 	INIT_WORK(&ss->work, sysfs_schedule_callback_work);
> 	schedule_work(&ss->work);
> 	return 0;
>   }
> 
>   Two lines starting with '+' are inserted to make the problem
>   reliably reproducible.  With the above changes,
> 
>   # insmod drivers/scsi/scsi_mod.ko; insmod drivers/scsi/sd_mod.ko; insmod drivers/ata/libata.ko;
> insmod drivers/ata/ahci.ko
>   # echo 1 > /sys/block/sda/device/delete; rmmod ahci; rmmod libata; rmmod sd_mod; rmmod scsi_mod
> 
>   It's assumed that ahci detects /dev/sda.  The above command sequence
>   causes the following oops.
> 
>   BUG: unable to handle kernel paging request at virtual address e0984020
>   [--snip--]
>   EIP is at 0xe0984020
>   [--snip--]
>    [<c0127132>] run_workqueue+0x92/0x140
>    [<c01278c7>] worker_thread+0x137/0x160
>    [<c012a513>] kthread+0xa3/0xd0
>    [<c0104457>] kernel_thread_helper+0x7/0x10
> 
>   The problem here is that kobjec_get() in sysfs_schedule_callback()
>   doesn't grab the module backing the kobject it's grabbing.  By the
>   time (ss->func)(ss->kobj) runs, scsi_mod is already gone.

As the author of this routine, I wish you had included my name in your
CC: list.  :-(

The problem here isn't exactly as you described.  scsi_mod needs to be
pinned (1) because it is the owner of the kobject and hence will be
called when the kobject is released, and (2) because it is the owner
of the callback routine.  However this is just a detail; clearly the
bug needs to be fixed.

One possibility would be to have scsi_mod's exit_scsi() routine call
flush_scheduled_work().  Another would be to add such a call in
sys_delete_module().  Neither of these is attractive.  They would add
overhead when it's not needed, and they would deadlock if a workqueue
routine tried to unload a module.

On balance, the patch below seems better.  Do you agree?


With regard to your analysis of lifetime issues, there is a whole
aspect you did not mention.  A basic assumption of the refcounting
approach is that once X has a reference to Y, X can freely access and
use Y as much as it wants until it drops the reference.

However this is not true when X is a device driver and Y is a device
structure.  Drivers can be unbound from devices.  If X has been
unbound from Y then it must not access Y again, no matter how many
references it possesses.  After all, some other driver may have bound
to Y in the meantime; this other driver would not appreciate the
interference.

Just as bad, if Y represents a hot-pluggable device then some other
device may have been plugged in and may be using Y's old address.  We
don't want X sending commands to a new device, thinking that it is Y!

The complications caused by this requirement affect both the subsystem
code and device drivers.  Drivers must synchronize their release()
methods with every action they take -- and refcounts cannot provide
synchronization.

A similar problem afflicts the char-device subsystem, and here even
less care has been taken to address the issues.  The race between
open() and unregister() is resolved in many places by relying on the
BKL!

We should be able to make things better and easier than they are.
Orphaning open sysfs files was a move in this direction.  But I doubt
they will ever become truly simple and clear.

Alan Stern



Index: usb-2.6/drivers/base/core.c
===================================================================
--- usb-2.6.orig/drivers/base/core.c
+++ usb-2.6/drivers/base/core.c
@@ -431,9 +431,10 @@ void device_remove_bin_file(struct devic
 EXPORT_SYMBOL_GPL(device_remove_bin_file);
 
 /**
- * device_schedule_callback - helper to schedule a callback for a device
+ * device_schedule_callback_owner - helper to schedule a callback for a device
  * @dev: device.
  * @func: callback function to invoke later.
+ * @owner: module owning the callback routine
  *
  * Attribute methods must not unregister themselves or their parent device
  * (which would amount to the same thing).  Attempts to do so will deadlock,
@@ -444,20 +445,23 @@ EXPORT_SYMBOL_GPL(device_remove_bin_file
  * argument in the workqueue's process context.  @dev will be pinned until
  * @func returns.
  *
+ * This routine is usually called via the inline device_schedule_callback(),
+ * which automatically sets @owner to THIS_MODULE.
+ *
  * Returns 0 if the request was submitted, -ENOMEM if storage could not
- * be allocated.
+ * be allocated, -ENODEV if a reference to @owner isn't available.
  *
  * NOTE: This routine won't work if CONFIG_SYSFS isn't set!  It uses an
  * underlying sysfs routine (since it is intended for use by attribute
  * methods), and if sysfs isn't available you'll get nothing but -ENOSYS.
  */
-int device_schedule_callback(struct device *dev,
-		void (*func)(struct device *))
+int device_schedule_callback_owner(struct device *dev,
+		void (*func)(struct device *), struct module *owner)
 {
 	return sysfs_schedule_callback(&dev->kobj,
-			(void (*)(void *)) func, dev);
+			(void (*)(void *)) func, dev, owner);
 }
-EXPORT_SYMBOL_GPL(device_schedule_callback);
+EXPORT_SYMBOL_GPL(device_schedule_callback_owner);
 
 static void klist_children_get(struct klist_node *n)
 {
Index: usb-2.6/fs/sysfs/file.c
===================================================================
--- usb-2.6.orig/fs/sysfs/file.c
+++ usb-2.6/fs/sysfs/file.c
@@ -647,6 +647,7 @@ struct sysfs_schedule_callback_struct {
 	struct kobject 		*kobj;
 	void			(*func)(void *);
 	void			*data;
+	struct module		*owner;
 	struct work_struct	work;
 };
 
@@ -657,6 +658,7 @@ static void sysfs_schedule_callback_work
 
 	(ss->func)(ss->data);
 	kobject_put(ss->kobj);
+	module_put(ss->owner);
 	kfree(ss);
 }
 
@@ -665,6 +667,7 @@ static void sysfs_schedule_callback_work
  * @kobj: object we're acting for.
  * @func: callback function to invoke later.
  * @data: argument to pass to @func.
+ * @owner: module owning the callback code
  *
  * sysfs attribute methods must not unregister themselves or their parent
  * kobject (which would amount to the same thing).  Attempts to do so will
@@ -677,20 +680,25 @@ static void sysfs_schedule_callback_work
  * until @func returns.
  *
  * Returns 0 if the request was submitted, -ENOMEM if storage could not
- * be allocated.
+ * be allocated, -ENODEV if a reference to @owner isn't available.
  */
 int sysfs_schedule_callback(struct kobject *kobj, void (*func)(void *),
-		void *data)
+		void *data, struct module *owner)
 {
 	struct sysfs_schedule_callback_struct *ss;
 
+	if (!try_module_get(owner))
+		return -ENODEV;
 	ss = kmalloc(sizeof(*ss), GFP_KERNEL);
-	if (!ss)
+	if (!ss) {
+		module_put(owner);
 		return -ENOMEM;
+	}
 	kobject_get(kobj);
 	ss->kobj = kobj;
 	ss->func = func;
 	ss->data = data;
+	ss->owner = owner;
 	INIT_WORK(&ss->work, sysfs_schedule_callback_work);
 	schedule_work(&ss->work);
 	return 0;
Index: usb-2.6/include/linux/device.h
===================================================================
--- usb-2.6.orig/include/linux/device.h
+++ usb-2.6/include/linux/device.h
@@ -369,8 +369,14 @@ extern int __must_check device_create_bi
 					       struct bin_attribute *attr);
 extern void device_remove_bin_file(struct device *dev,
 				   struct bin_attribute *attr);
-extern int device_schedule_callback(struct device *dev,
-		void (*func)(struct device *));
+extern int device_schedule_callback_owner(struct device *dev,
+		void (*func)(struct device *), struct module *owner);
+
+static inline int device_schedule_callback(struct device *dev,
+		void (*func)(struct device *))
+{
+	device_schedule_callback_mod(dev, func, THIS_MODULE);
+}
 
 /* device resource management */
 typedef void (*dr_release_t)(struct device *dev, void *res);
Index: usb-2.6/include/linux/sysfs.h
===================================================================
--- usb-2.6.orig/include/linux/sysfs.h
+++ usb-2.6/include/linux/sysfs.h
@@ -80,7 +80,7 @@ struct sysfs_ops {
 #ifdef CONFIG_SYSFS
 
 extern int sysfs_schedule_callback(struct kobject *kobj,
-		void (*func)(void *), void *data);
+		void (*func)(void *), void *data, struct module *owner);
 
 extern int __must_check
 sysfs_create_dir(struct kobject *, struct dentry *);
@@ -138,7 +138,7 @@ extern int __must_check sysfs_init(void)
 #else /* CONFIG_SYSFS */
 
 static inline int sysfs_schedule_callback(struct kobject *kobj,
-		void (*func)(void *), void *data)
+		void (*func)(void *), void *data, struct module *owner)
 {
 	return -ENOSYS;
 }


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFD driver-core] Lifetime problems of the current driver model
@ 2007-03-30  9:43 Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2007-03-30  9:43 UTC (permalink / raw)
  To: gregkh, hugh, cornelia.huck, dmitry.torokhov, oneukum, maneesh,
	rpurdie, James.Bottomley, Jeff Garzik
  Cc: lkml, linux-ide, SCSI Mailing List

Hello, all.

This document tries to describe lifetime problems of the current
device driver model primarily from the point view of device drivers
and establish consensus, or at least, start discussion about how to
solve these problems.  This is primarily based on my experience with
IDE and SCSI layers and my knowledge on other drivers is limited, so I
might have generalized too aggressively.  Feel free to point out.


Table of Contents

1. Why is it this complex?
1-1. Rule #a and #b
1-2. Rule #b and #c
2. How do we solve this?
2-1. Rule #a and #b
2-2. Rule #b and #c
3. How do we get there from here?
4. Some afterthoughts


1. Why is it this complex?

Object lifetime management in device drivers is complex and thus
error-prone under the current driver model.  The primary reason is
that device drivers have to match impedances of three different life
time rules.

Rule a. The device itself.  Many devices are hot pluggable these days.
        Devices come when they come and go when they go.

Rule b. Driver model object lifetime rule.  Fundamental data
        structures used by device drivers are reference counted and
        their lifetime can be arbitrarily extended by userspace sysfs
        access.

Rule c. Module life time rule.  Most device drivers can be compiled as
        modules and thus module busy count should be managed properly.


1-1. Rule #a and #b

Being what it is, the primary concern of a device driver is the
'device' itself.  A device driver should create and release objects
representing the devices it manages as they come and go.  In one way
or another, this just has to be done, so this rule is pretty obvious
and intuitive to device driver writers.

The current device driver model mandates implementation of rule #b in
device drivers.  This makes the object lifetime management complicated
and, more importantly, less intuitive to driver writers.  As
implementing such lifetime management in each device driver is too
painful, most device driver midlayers handle this.  e.g. SCSI midlayer
handles driver model lifetime rules and presents simplified
register/unregister-immediately interface to all low level SCSI
drivers.  IDE tries to do similar and USB and netdev have their own
mechanisms too.

I'm not familiar with other layers, but it took quite a while to get
this right in the SCSI midlayer.  The current code works and low level
drivers can do away with most of lifetime management but the added
complexity is saddening to look at.


1-2. Rule #b and #c

On the surface, it seems that we're doing this mostly correctly but
subtle bugs caused by interaction between driver model lifetime and
module lifetime rules aren't difficult to find.

This happens because the two lifetime rules are closely related but
the relation is not properly represented in device driver model
itself.  A module cannot be unloaded unless all device driver objects
it hosts are released, but the only way to guarantee that is to always
grab module reference whenever grabbing relevant kobject reference.
This is ugly to code and too easy to get wrong.  Here are two such
examples.

Example 1. sysfs_schedule_callback() not grabbing the owning module

  This function is recently added to be used by suicidal sysfs nodes
  such that they don't deadlock when trying to unregister themselves.

 +#include <linux/delay.h>
  static void sysfs_schedule_callback_work(struct work_struct *work)
  {
	struct sysfs_schedule_callback_struct *ss = container_of(work,
			struct sysfs_schedule_callback_struct, work);

 +	msleep(100);
	(ss->func)(ss->data);
	kobject_put(ss->kobj);
	kfree(ss);
  }

  int sysfs_schedule_callback(struct kobject *kobj, void (*func)(void *),
		void *data)
  {
	struct sysfs_schedule_callback_struct *ss;

	ss = kmalloc(sizeof(*ss), GFP_KERNEL);
	if (!ss)
		return -ENOMEM;
	kobject_get(kobj);
	ss->kobj = kobj;
	ss->func = func;
	ss->data = data;
	INIT_WORK(&ss->work, sysfs_schedule_callback_work);
	schedule_work(&ss->work);
	return 0;
  }

  Two lines starting with '+' are inserted to make the problem
  reliably reproducible.  With the above changes,

  # insmod drivers/scsi/scsi_mod.ko; insmod drivers/scsi/sd_mod.ko; insmod drivers/ata/libata.ko; insmod drivers/ata/ahci.ko
  # echo 1 > /sys/block/sda/device/delete; rmmod ahci; rmmod libata; rmmod sd_mod; rmmod scsi_mod

  It's assumed that ahci detects /dev/sda.  The above command sequence
  causes the following oops.

  BUG: unable to handle kernel paging request at virtual address e0984020
  [--snip--]
  EIP is at 0xe0984020
  [--snip--]
   [<c0127132>] run_workqueue+0x92/0x140
   [<c01278c7>] worker_thread+0x137/0x160
   [<c012a513>] kthread+0xa3/0xd0
   [<c0104457>] kernel_thread_helper+0x7/0x10

  The problem here is that kobjec_get() in sysfs_schedule_callback()
  doesn't grab the module backing the kobject it's grabbing.  By the
  time (ss->func)(ss->kobj) runs, scsi_mod is already gone.

Example 2. sysfs attr grabbing the wrong kobject and module

  This bug has been there for a long time, probably from the day these
  attrs are implemented.

  # insmod drivers/scsi/scsi_mod.ko; insmod drivers/scsi/sd_mod.ko; insmod drivers/ata/libata.ko; insmod drivers/ata/ahci.ko
  # cat > /sys/block/sda/device/delete &
  # rmmod ahci; rmmod libata
  # kill %%

  BUG: unable to handle kernel paging request at virtual address e083b1e8
  [--snip--]
  EIP:    0060:[<e0983310>]    Not tainted VLI
  [--snip--]
   [<c0127649>] execute_in_process_context+0x19/0x40
   [<e09820f2>] scsi_target_reap+0x82/0xa0 [scsi_mod]
   [<e0984771>] scsi_device_dev_release_usercontext+0xe1/0x110 [scsi_mod]
   [<c0127649>] execute_in_process_context+0x19/0x40
   [<e09835a3>] scsi_device_dev_release+0x13/0x20 [scsi_mod]
   [<c0287247>] device_release+0x17/0x90
   [<c0213a69>] kobject_cleanup+0x49/0x80
   [<c0213aab>] kobject_release+0xb/0x10
   [<c021482b>] kref_put+0x2b/0x80
   [<c0213a14>] kobject_put+0x14/0x20
   [<c01969af>] sysfs_release+0x5f/0xb0
   [<c015dcb4>] __fput+0xb4/0x1b0
   [<c015de18>] fput+0x18/0x20
   [<c015b207>] filp_close+0x47/0x70
   [<c0119394>] put_files_struct+0xa4/0xc0
   [<c011a4bf>] do_exit+0x11f/0x7c0
   [<c011ab89>] do_group_exit+0x29/0x70
   [<c0123415>] get_signal_to_deliver+0x265/0x3f0
   [<c01036ea>] do_notify_resume+0x8a/0x7a0
   [<c010425a>] work_notifysig+0x13/0x19

  The delete attr's owner is scsi_mod.  However, it's created under
  the sysfs node for the SCSI device under the node for SCSI host
  which is driven by libata and ahci combination.  Opening the delete
  node grabs 1. struct device for the SCSI device which propagates to
  the SCSI host node and 2. module reference for sd_mod.  So, ahci and
  libata can go away even while the SCSI host they implement is still
  alive.

Both examples can be reproduced using any SCSI low level driver.
Attrs implemented similarly as in example #2 are everywhere.  This is
the standard way to implement such attrs.


2. How do we solve this?

2-1. Rule #a and #b

If you think about it, the impedance matching between rule #a and #b
should be done somewhere.  Maybe doing it in each low level driver is
what's intended when the driver model is designed but it's
impractically painful and the reality reflects that by doing it in
midlayers.

It's better if we do it in higher layers.  For example, SCSI has a
mechanism to reject new requests and drain request_queue to allow low
level driver to just unregister an existing device.  IDE currently
doesn't have such feature but it would need to do almost the same
thing to support hotplug.  If request_queue just had a feature to
drain and kill itself, both SCSI and IDE could use it.  It would be
simpler and less error-prone.  On the other hand, if it's pushed
downward, it will cause much more pain in all the low level drivers.

Orphaning sysfs nodes on unregistration is a big step in this
direction.  With sysfs reference counting out of the picture,
implementing 'disconnect immediate' interface only on a few components
(including request_queue) should suffice for most block device
drivers.  I'm not familiar with other drivers but I don't think
they'll be very different.

All in all, I'm hoping something like the following can be done in
device drivers, midlayer or low level.

* For binding

  alloc resources;
  init controller;
  register to upper layers;

* For unbinding

  unregister from upper layers; (no lingering references or objects)
  deinit controller;
  release resources;

This basically nullifies lifetime rule #b from the POV of drivers.


2-2. Rule #b and #c

One way to solve this problem is to subordinate lifetime rule #b to
rule #c.  Each kobject points to its owning module such that grabbing
a kobject automatically grabs the module.  The problem with this
approach is that it requires wide update and makes kobject_get
heavier.

Another more appealing approach is to do nothing.  Solving the problem
between rule #a and #b in the way described above means virtually
nullifying rule #b.  With rule #b gone, rule #c can't conflict with
it.  IOW, no reference from upper layer would be remaining after a
driver unregisters itself from the upper layer - the module can be
safely unloaded.


3. How do we get there from here?

The current device driver model is used by most device drivers - huge
amount of code.  We can't update them at once.  Fortunately, the
suggested solution can easily be implemented gradually.  We can add
'disconnect now' interface to upper driver interface one-by-one and
convert its users gradually.  The existing reference counting
mechanism can be left alone and used to verify that the conversion is
correct by verifying reference count is 0 after unregistering.

Once the conversion is complete (maybe after a year), we can remove
the reference counting mechanism from device driver interface.


4. Some afterthoughts

* Doing this would make module reference counting more flexible.
  Instead of being confined by implementation, it can be used to
  define when to allow and disallow module unloading.  (you can't
  unload a block driver module if a fs on it is mounted but sysfs
  access doesn't matter.)

* Outside of device driver model, kobject is currently used to supply
  something sysfs nodes can be attached to, such that both the
  information supplier and node users have something common to
  reference count.  With orphaning added, there is no need to use
  kobject this way.  IMHO, kobject should have been implementation
  detail inside the device model proper in the first place.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2007-04-08  2:55 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-30  9:43 [RFD driver-core] Lifetime problems of the current driver model Tejun Heo
2007-03-30 12:29 ` James Bottomley
2007-03-30 12:29 ` James Bottomley
2007-03-30 13:15   ` Dmitry Torokhov
2007-03-30 13:15   ` Dmitry Torokhov
2007-03-30 17:58     ` James Bottomley
2007-03-30 18:18       ` Dmitry Torokhov
2007-03-30 13:38   ` Tejun Heo
2007-03-30 17:41     ` Greg KH
2007-03-30 18:19     ` James Bottomley
2007-04-01 19:59       ` Tejun Heo
2007-04-02  9:20         ` Cornelia Huck
2007-04-02 15:34           ` Cornelia Huck
2007-04-03  3:08             ` Tejun Heo
2007-04-02  9:33         ` Greg KH
2007-04-02 12:10         ` Maneesh Soni
2007-04-02 19:33       ` Luben Tuikov
2007-03-30 13:38   ` Tejun Heo
2007-03-30 13:19 ` Cornelia Huck
2007-03-30 13:19 ` Cornelia Huck
2007-03-30 13:19   ` Tejun Heo
2007-03-30 13:19   ` Tejun Heo
2007-03-30 13:40     ` Cornelia Huck
2007-03-30 13:58       ` Tejun Heo
2007-03-30 13:58       ` Tejun Heo
2007-03-30 14:52         ` Cornelia Huck
2007-03-30 15:08           ` Tejun Heo
2007-03-30 15:08           ` Tejun Heo
2007-03-30 19:31             ` Cornelia Huck
2007-03-31  3:12               ` Tejun Heo
2007-03-31  3:15                 ` Tejun Heo
2007-03-31 16:08                 ` Cornelia Huck
2007-03-31 16:14                   ` Tejun Heo
2007-04-02 19:24                 ` Luben Tuikov
2007-03-30 14:52         ` Cornelia Huck
2007-03-30 13:40     ` Cornelia Huck
2007-03-30 17:38 ` Greg KH
  -- strict thread matches above, loose matches on Subject: below --
2007-04-07 15:48 Alan Stern
2007-04-08  2:55 ` Tejun Heo
2007-03-30  9:43 Tejun Heo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.