linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* refcounting drivers' data structures used in sysfs buffers
@ 2007-03-08 13:05 Oliver Neukum
  2007-03-08 16:02 ` Alan Stern
  0 siblings, 1 reply; 61+ messages in thread
From: Oliver Neukum @ 2007-03-08 13:05 UTC (permalink / raw)
  To: Dmitry Torokhov, Maneesh Soni, gregkh, Alan Stern, linux-kernel

Hi,

after a lightning bolt from high above I've been looking into refcounting
the data structures drivers use to provide the data used to refill sysfs
buffers. I've come to the following conclusion.

1. struct sysfs_buffer must have a struct kref * and probably a destructor
pointer
2. drivers must be able to pass these pointers through an extended
device_create_file()
3. Drivers must use refcounting if they want to use attributes
4. read/write/poll must do refcounting

I am not sure where to store the pointers. struct sysfs_dirent() looks
like the obvious choice. Comments?

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: 2.6.21-rc suspend regression: sysfs deadlock
@ 2007-03-10 20:44 Alan Stern
  0 siblings, 0 replies; 61+ messages in thread
From: Alan Stern @ 2007-03-10 20:44 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Linus Torvalds, Dmitry Torokhov, Maneesh Soni, gregkh,
	James Bottomley, Kernel development list

[For the start of this thread, see 
<http://marc.theaimsgroup.com/?l=linux-kernel&m=117320893726621&w=2>.]

On Wed, 7 Mar 2007, Linus Torvalds wrote:

> So you just pointed to *another* data structure that apparently violates 
> the "you MUST use refcounting" rule.
> 
> What is it with you people? It's really simple. Data structures must be 
> refcounted if you can reach them two different ways.
> 
> If you don't use refcounting, then you'd better make sure that the data 
> can be reached only one way (for example, by *not* exposing it for sysfs).
> 
> It really *is* that simple. Read the CodingStyle rules.

Linus's analysis is correct as far as it goes, but it misses some very 
important points.  The _real_ problem here, which nobody has pointed out 
so far, is not device removal or driver unloading.  It is driver 
unbinding -- with its consequent issue of access rights.

When a driver is unbound from a device, when should the driver stop trying 
to access that device?  The obvious answer is that it must stop before its 
release() method returns.  Otherwise the device might get bound to 
another driver and we would have both drivers trying to talk to it at the 
same time.

In other words, when a driver unbinds from a device, it loses its right to
access that device.  Same goes for any device-related data structures that
weren't created by the driver itself.  When you realize this, it becomes
obvious that the driver faces a synchronization problem.  All its entry
points must be synchronized with release(), to avoid races.

So there actually are two things a driver has to worry about:

	The lifetime of its private data structures (which can be solved
	using refcounts as Linus advocated);

	The race between release() and other activities (which cannot
	be solved by refcounts but needs a true synchronization technique,
	such as a mutex).

No doubt some of this sounds familiar; the race between open() and
disconnect() for char device drivers is one we have faced many times and
not always solved perfectly.  Also note that this is a fundamental
problem, affecting many facilities in addition to sysfs.


One way to solve these problems is to put all the responsibility on the 
driver.  Make it refcount its data structures and use mutexes.  This is 
not very attractive for several reasons:

	_Lots_ of drivers are affected.  Pretty much any driver which
	registers a char device or a sysfs attribute file.

	_Lots_ of code would need to be changed, adding considerable
	bloat.  Every show()/store() method would need to acquire a mutex,
	and many would need to be passed an additional argument, requiring
	a change in the sysfs API.  (I can explain why in a follow-up 
	email if anyone is interested.)

	Most importantly, doing all the refcounting and mutual exclusion
	correctly is quite hard.  It's amazingly easy to make mistakes
	in these areas.  The chances of getting it right while changing
	multiple functions in every single driver are infinitesimal.

Another approach is to put all the responsibility on the core subsystems
that handle driver registration.  They should enforce rigidly two
principles: "No driver callbacks occur after unregistration" and its
prerequisite, "Unregistration is mutually exclusive with driver
callbacks".  (This is exactly what Oliver's original patch did for sysfs.)

	The number of core subsystems affected is much smaller than the
	total number of drivers.  Sysfs, debugfs, the char device
	subsystem, maybe a few others.

	Drivers would no longer have to worry about doing their own
	synchronization or refcounts.  It would be guaranteed that a
	private data structure would never be accessed from sysfs after
	device_remove_file() returned, so the structure could safely and
	easily be deallocated as part of release().

At the expense of complicating a few central subsystems, we could simplify
a lot of drivers.  I think this is a worthwhile tradeoff.

It does have a small disadvantage; it means that an entry point would
deadlock if it tried to unregister itself.  (The example which started
this whole thread was sdev_store_delete() in the SCSI core.  Writing to
that attribute unregisters the device to which it belongs.)  Clearly the
actual unregistration would have to performed separately in a workqueue.  
I think the number of places where this occurs is pretty small.


It's true that this approach goes against the general philosophy used
elsewhere in the kernel.  Refcounting without synchronization is the
general rule.

However unbinding is a special case.  Normally with refcounting, it
doesn't matter when a driver tries to read or write a data structure.  So
long as the driver still holds a reference, the data will be there and the
access will be okay.

But not with unbinding!  After unbinding, the data will still be there but 
it might be owned by another driver.  Even worse, instead of just 
accessing data the code might try to access the device.

The basic assumption behind the refcounting approach is that a resource 
will be used for one purpose and then discarded, so accessing it will 
always be okay.  Driver unbinding violates this assumption; the devices 
and data structures that a driver binds to can be bound, unbound, and 
rebound multiple times.  Simple refcounting isn't sufficient to handle the 
situation.


In short, I think Oliver's original patch should be reinstated.  
(sdev_store_delete() can easily be rewritten to use a workqueue.)  Not
only that, it should be exanded upon.  For instance, it handles regular
sysfs files but it ignores binary files -- a clear oversight.  Debugfs and
the char device subsystem should be modified similarly.  Maybe also
procfs, and perhaps others.

Alan Stern


^ permalink raw reply	[flat|nested] 61+ messages in thread
* 2.6.21-rc suspend regression: sysfs deadlock
@ 2007-03-06 19:20 Hugh Dickins
  2007-03-06 20:16 ` Oliver Neukum
  2007-03-07  1:56 ` Linus Torvalds
  0 siblings, 2 replies; 61+ messages in thread
From: Hugh Dickins @ 2007-03-06 19:20 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Maneesh Soni, Greg Kroah-Hartman, Adrian Bunk, Linus Torvalds,
	linux-kernel

Resume from RAM on a ThinkPad T43p is now happy with Thomas' periodic
tick fix - the most unusable aspect of that for me had been how slow
repeat keys were to start repeating, but that's all fine now.

But suspend to RAM still hanging, unless I "chmod a-x /usr/sbin/docker"
on SuSE 10.2: docker undock tries to unregister /sys/block/sr0 and hangs:

60x60         D B0415080     0 10778  10771                     (NOTLB)
       e8227e04 00000086 e80c60b0 b0415080 ef3f5454 b041dc20 ef3f5430 00000001 
       e80c60b0 72af360e 00000085 00001941 e80c61bc e8227e00 b01606bf ef47d3c0 
       ed07c1dc ed07c1e4 00000246 e8227e30 b02f6ef0 e80c60b0 00000001 e80c60b0 
Call Trace:
 [<b02f6ef0>] __down+0xaa/0xb8
 [<b02f6de6>] __down_failed+0xa/0x10
 [<b0180529>] sysfs_drop_dentry+0xa2/0xda
 [<b01819b3>] __sysfs_remove_dir+0x6d/0xf8
 [<b0181a53>] sysfs_remove_dir+0x15/0x20
 [<b01d49a9>] kobject_del+0x16/0x22
 [<b0230041>] device_del+0x1c9/0x1e2
 [<b025705a>] __scsi_remove_device+0x43/0x7a
 [<b02570b0>] scsi_remove_device+0x1f/0x2b
 [<b0256a44>] sdev_store_delete+0x16/0x1b
 [<b022f0a0>] dev_attr_store+0x32/0x34
 [<b0180931>] flush_write_buffer+0x37/0x3d
 [<b0180995>] sysfs_write_file+0x5e/0x82
 [<b01507f5>] vfs_write+0xa7/0x150
 [<b0150950>] sys_write+0x47/0x6b
 [<b0103d56>] sysenter_past_esp+0x5f/0x85
              /usr/lib/dockutils/hooks/thinkpad/60x60 undock
              /usr/lib/dockutils/dockhandler undock
              /usr/sbin/docker undock
              /etc/pm/hooks/23dock suspend

This comes from Oliver's commit 94bebf4d1b8e7719f0f3944c037a21cfd99a4af7
Driver core: fix race in sysfs between sysfs_remove_file() and read()/write()
in 2.6.21-rc1.  It looks to me like sysfs_write_file downs buffer->sem
while calling flush_write_buffer, and flushing that particular write
buffer entails downing buffer->sem in orphan_all_buffers.

Suspend no longer deadlocks with the following silly patch, but I expect
this either pokes a small hole in your scheme, or renders it pointless.
Maybe that commit needs to be reverted, or maybe you can see how to fix
it up for -rc3.

Thanks,
Hugh

--- 2.6.21-rc2-git5/fs/sysfs/inode.c	2007-02-28 08:30:26.000000000 +0000
+++ linux/fs/sysfs/inode.c	2007-03-06 18:03:13.000000000 +0000
@@ -227,11 +227,8 @@ static inline void orphan_all_buffers(st
 
 	mutex_lock_nested(&node->i_mutex, I_MUTEX_CHILD);
 	if (node->i_private) {
-		list_for_each_entry(buf, &set->associates, associates) {
-			down(&buf->sem);
+		list_for_each_entry(buf, &set->associates, associates)
 			buf->orphaned = 1;
-			up(&buf->sem);
-		}
 	}
 	mutex_unlock(&node->i_mutex);
 }

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2007-03-15 19:51 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-08 13:05 refcounting drivers' data structures used in sysfs buffers Oliver Neukum
2007-03-08 16:02 ` Alan Stern
2007-03-09  0:45   ` Oliver Neukum
2007-03-09 16:32     ` Alan Stern
2007-03-09 16:44       ` Oliver Neukum
2007-03-09 17:02         ` Dmitry Torokhov
2007-03-09 17:18           ` Oliver Neukum
2007-03-09 17:34             ` Dmitry Torokhov
2007-03-09 19:32               ` Alan Stern
2007-03-09 20:05                 ` Oliver Neukum
2007-03-09 20:27                   ` Alan Stern
2007-03-09 20:39                     ` Oliver Neukum
2007-03-09 20:08               ` Alan Stern
2007-03-09 20:48                 ` Oliver Neukum
2007-03-10 19:19                   ` Alan Stern
2007-03-12  8:54                     ` Oliver Neukum
2007-03-12 14:57                       ` Alan Stern
2007-03-12 15:23                         ` Oliver Neukum
2007-03-12 15:42                           ` Dmitry Torokhov
2007-03-12 15:59                             ` Oliver Neukum
2007-03-12 16:21                               ` Alan Stern
2007-03-12 18:25                                 ` Oliver Neukum
2007-03-12 19:31                                   ` Alan Stern
2007-03-12 19:49                                     ` Oliver Neukum
2007-03-12 20:03                                       ` Alan Stern
2007-03-12 20:15                                         ` Oliver Neukum
2007-03-12 20:31                                         ` Dmitry Torokhov
2007-03-12 20:45                                           ` Alan Stern
2007-03-12 21:31                                           ` Richard Purdie
2007-03-13 15:00                                             ` 2.6.21-rc suspend regression: sysfs deadlock Alan Stern
2007-03-13 18:42                                               ` Cornelia Huck
2007-03-13 21:20                                                 ` Linus Torvalds
2007-03-14 16:12                                                   ` Alan Stern
2007-03-14 18:43                                                     ` Cornelia Huck
2007-03-14 19:23                                                       ` Alan Stern
2007-03-15 10:27                                                         ` Cornelia Huck
2007-03-15 12:31                                                           ` Hugh Dickins
2007-03-15 13:02                                                             ` Oliver Neukum
2007-03-15 13:22                                                               ` Dmitry Torokhov
2007-03-15 13:59                                                                 ` Hugh Dickins
2007-03-15 14:27                                                           ` Alan Stern
2007-03-15 15:32                                                             ` Cornelia Huck
2007-03-15 16:29                                                             ` Hugh Dickins
2007-03-15 16:51                                                               ` Linus Torvalds
2007-03-15 19:50                                                                 ` [PATCH] sysfs and driver core: add callback helper, used by SCSI and S390 Alan Stern
2007-03-15 19:51                                                                 ` [PATCH] sysfs: reinstate exclusion between method calls and attribute unregistration Alan Stern
2007-03-13 19:00                                               ` 2.6.21-rc suspend regression: sysfs deadlock Hugh Dickins
2007-03-13 20:09                                                 ` Alan Stern
2007-03-13 20:55                                                   ` Hugh Dickins
2007-03-13 21:08                                                     ` Dmitry Torokhov
2007-03-13 21:20                                                     ` Alan Stern
  -- strict thread matches above, loose matches on Subject: below --
2007-03-10 20:44 Alan Stern
2007-03-06 19:20 Hugh Dickins
2007-03-06 20:16 ` Oliver Neukum
2007-03-07  1:56 ` Linus Torvalds
2007-03-07 14:38   ` Oliver Neukum
2007-03-07 15:56   ` Dmitry Torokhov
2007-03-07 16:52     ` Linus Torvalds
2007-03-07 16:59       ` Oliver Neukum
2007-03-07 18:02         ` Linus Torvalds
2007-03-07 18:16           ` Oliver Neukum

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).