Fundamental flaw in system suspend, exposed by freezer removal

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Fundamental flaw in system suspend, exposed by freezer removal
@ 2008-02-25 15:39 Alan Stern
  2008-02-25 19:46 ` [linux-pm] " Alan Stern
                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Alan Stern @ 2008-02-25 15:39 UTC (permalink / raw)
  To: Linux-pm mailing list, Kernel development list

Ongoing efforts to remove the freezer from the system suspend and
hibernation code ("system sleep" is the proper catch-all term) have
turned up a fundamental flaw in the Power Management subsystem's
design.  In brief, we cannot handle the race between hotplug addition
of new devices and suspending all existing devices.

It's not a simple problem (and I'm going to leave out a lot of details
here).  For a comparison, think about what happens when a device is
hot-unplugged.  When device_del() calls the driver's remove method, the
driver is expected to manage all the details of synchronizing with
other threads that may be trying to add new child devices as well as
removing all existing children.

But when a system sleep begins, the PM core is expected to suspend
all the children of a device before calling the device driver's suspend
method.  If there are other threads trying to add new children at the
same time, it's the PM core's responsibility to synchronize with
them -- an impossible job, since only the device's driver knows what
those other threads are and how to stop them safely.

In the past this deficiency has been hidden by the freezer.  Other 
tasks couldn't register new children because they were frozen.  But now 
we are phasing out the freezer (already most kernel threads are not 
freezable) and the problem is starting to show up.

A change to the PM core present in 2.6.25-rc2 (but which is about to be
reverted!) has the core try to prevent these additions by acquiring the
device semaphores for every registered device.  This has turned out to
be too heavy-handed; for example, it prevents drivers from
unregistering devices during a system sleep.  There are more subtle
synchronization problems as well.

The only possible solution is to have the drivers themselves be
responsible for preventing calls to device_add() or device_register()  
during a system sleep.  (It's also necessary to prevent driver binding,
but this isn't a major issue.)  The most straightforward approach is to
add a new pair of driver methods: one to disable adding children and
one to re-enable it.  Of course this would represent a significant
addition to the Power Management driver interface.

(Note that the existing suspend and resume methods cannot be used for 
this purpose.  Drivers assume that when the suspend method is called, 
it has already been called for all the child devices.  This wouldn't be 
true if one of the purposes of the method was to prevent addition of 
new children.)

Another way of accomplishing this is to require drivers to pay
attention to pm_notifier chain and stop registering children when any
of the PM_xxx_PREPARE messages is sent.  This approach feels a lot more
awkward to me.

Comments and discussion?

Alan Stern

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-25 15:39 Fundamental flaw in system suspend, exposed by freezer removal Alan Stern
@ 2008-02-25 19:46 ` Alan Stern
  2008-02-25 22:25   ` Rafael J. Wysocki
  2008-02-25 22:24 ` Rafael J. Wysocki
  2008-02-27 20:36 ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 38+ messages in thread
From: Alan Stern @ 2008-02-25 19:46 UTC (permalink / raw)
  To: Linux-pm mailing list, Kernel development list

On Mon, 25 Feb 2008, Alan Stern wrote:

> The only possible solution is to have the drivers themselves be
> responsible for preventing calls to device_add() or device_register()  
> during a system sleep.  (It's also necessary to prevent driver binding,
> but this isn't a major issue.)  The most straightforward approach is to
> add a new pair of driver methods: one to disable adding children and
> one to re-enable it.  Of course this would represent a significant
> addition to the Power Management driver interface.
> 
> (Note that the existing suspend and resume methods cannot be used for 
> this purpose.  Drivers assume that when the suspend method is called, 
> it has already been called for all the child devices.  This wouldn't be 
> true if one of the purposes of the method was to prevent addition of 
> new children.)

On further thought maybe the existing methods can be used, with care.  
Drivers would have to assume the responsibility of synchronizing with
their helper threads and stopping addition of new children (something
they should already be doing), and they would also have to check that
all the existing children are already suspended.  They should not make
the assumption that the PM core has already suspended all the children.

The PM core could help detect errors here.  If it tries to suspend a 
device and sees that the device's parent is already suspended, then the 
parent's driver has a bug.

Alan Stern

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-25 15:39 Fundamental flaw in system suspend, exposed by freezer removal Alan Stern
  2008-02-25 19:46 ` [linux-pm] " Alan Stern
@ 2008-02-25 22:24 ` Rafael J. Wysocki
  2008-02-27 20:36 ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 38+ messages in thread
From: Rafael J. Wysocki @ 2008-02-25 22:24 UTC (permalink / raw)
  To: Alan Stern; +Cc: Linux-pm mailing list, Kernel development list

On Monday, 25 of February 2008, Alan Stern wrote:
> Ongoing efforts to remove the freezer from the system suspend and
> hibernation code ("system sleep" is the proper catch-all term) have
> turned up a fundamental flaw in the Power Management subsystem's
> design.  In brief, we cannot handle the race between hotplug addition
> of new devices and suspending all existing devices.
> 
> It's not a simple problem (and I'm going to leave out a lot of details
> here).  For a comparison, think about what happens when a device is
> hot-unplugged.  When device_del() calls the driver's remove method, the
> driver is expected to manage all the details of synchronizing with
> other threads that may be trying to add new child devices as well as
> removing all existing children.
> 
> But when a system sleep begins, the PM core is expected to suspend
> all the children of a device before calling the device driver's suspend
> method.  If there are other threads trying to add new children at the
> same time, it's the PM core's responsibility to synchronize with
> them -- an impossible job, since only the device's driver knows what
> those other threads are and how to stop them safely.

It's not a problem if new children are registered before the parent's
->suspend() is called, the PM core can handle that.  The problem is the
potential race between the suspending task and the threads registering new
children concurrently to the executing ->suspend(), because if those threads
lose the race, the resume ordering will be broken.

Since the PM core knows nothing about the drivers internals, the drivers'
->suspend() methods must be responsible for synchronizing with the other
threads used by the driver.

> In the past this deficiency has been hidden by the freezer.  Other 
> tasks couldn't register new children because they were frozen.  But now 
> we are phasing out the freezer (already most kernel threads are not 
> freezable) and the problem is starting to show up.
> 
> A change to the PM core present in 2.6.25-rc2 (but which is about to be
> reverted!) has the core try to prevent these additions by acquiring the
> device semaphores for every registered device.  This has turned out to
> be too heavy-handed; for example, it prevents drivers from
> unregistering devices during a system sleep.  There are more subtle
> synchronization problems as well.

I think we just attempted to take device semaphores too early.  We probably
can take the device semaphores _after_ suspending all devices without
much hassle.  However, it's not actually a problem if a suspended device
gets unregistered - it's removed from the list on which it is at the moment
and won't be resumed.  It also is not a problem if the device is registered
after it's master's ->resume() has run.

Besides, taking the semaphores for all _existing_ devices doesn't prevent
new devices from being added and that's we needed to take
pm_sleep_rwsem in device_add().

> The only possible solution is to have the drivers themselves be
> responsible for preventing calls to device_add() or device_register()  
> during a system sleep.  (It's also necessary to prevent driver binding,
> but this isn't a major issue.)  The most straightforward approach is to
> add a new pair of driver methods: one to disable adding children and
> one to re-enable it.  Of course this would represent a significant
> addition to the Power Management driver interface.

I'd rather not do that.

> (Note that the existing suspend and resume methods cannot be used for 
> this purpose.  Drivers assume that when the suspend method is called, 
> it has already been called for all the child devices.  This wouldn't be 
> true if one of the purposes of the method was to prevent addition of 
> new children.)
> 
> Another way of accomplishing this is to require drivers to pay
> attention to pm_notifier chain and stop registering children when any
> of the PM_xxx_PREPARE messages is sent.  This approach feels a lot more
> awkward to me.

As I said above, I don't see a problem with registering new devices before
the parent's ->suspend() is run and after it's ->resume() has run.  That
doesn't break any ordering rules and we can handle it.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-25 19:46 ` [linux-pm] " Alan Stern
@ 2008-02-25 22:25   ` Rafael J. Wysocki
  2008-02-25 23:37     ` Alan Stern
  2008-02-26  7:13     ` David Brownell
  0 siblings, 2 replies; 38+ messages in thread
From: Rafael J. Wysocki @ 2008-02-25 22:25 UTC (permalink / raw)
  To: Alan Stern; +Cc: Linux-pm mailing list, Kernel development list

On Monday, 25 of February 2008, Alan Stern wrote:
> On Mon, 25 Feb 2008, Alan Stern wrote:
> 
> > The only possible solution is to have the drivers themselves be
> > responsible for preventing calls to device_add() or device_register()  
> > during a system sleep.  (It's also necessary to prevent driver binding,
> > but this isn't a major issue.)  The most straightforward approach is to
> > add a new pair of driver methods: one to disable adding children and
> > one to re-enable it.  Of course this would represent a significant
> > addition to the Power Management driver interface.
> > 
> > (Note that the existing suspend and resume methods cannot be used for 
> > this purpose.  Drivers assume that when the suspend method is called, 
> > it has already been called for all the child devices.  This wouldn't be 
> > true if one of the purposes of the method was to prevent addition of 
> > new children.)
> 
> On further thought maybe the existing methods can be used, with care.  
> Drivers would have to assume the responsibility of synchronizing with
> their helper threads and stopping addition of new children (something
> they should already be doing), and they would also have to check that
> all the existing children are already suspended.  They should not make
> the assumption that the PM core has already suspended all the children.

IMO the device driver should assure that no new children will be registered
concurrently with the ->suspend() method (IOW, ->suspend() should wait for
all such registrations to complete and should prevent any new ones from
being started) and it should make it impossible to register any new children
after ->suspend() has run.  It's the driver's problem how to achieve that.

> The PM core could help detect errors here.  If it tries to suspend a 
> device and sees that the device's parent is already suspended, then the 
> parent's driver has a bug.

Yes, I think we ought to fail the suspend in such cases.  Still, that's not
sufficient to prevent a child from being registered after we've run
dpm_suspend().  For this reason, we could also leave dpm_suspend() with
dpm_list_mtx held and not release it until the next dpm_resume() is run.

That will potentially cause some trouble to CPU hotplug cotifiers, but we can
handle that, for example, by using the in_suspend_context() test.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-25 22:25   ` Rafael J. Wysocki
@ 2008-02-25 23:37     ` Alan Stern
  2008-02-26  0:07       ` Rafael J. Wysocki
  2008-02-26  7:13     ` David Brownell
  1 sibling, 1 reply; 38+ messages in thread
From: Alan Stern @ 2008-02-25 23:37 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Linux-pm mailing list, Kernel development list

On Mon, 25 Feb 2008, Rafael J. Wysocki wrote:

> > But when a system sleep begins, the PM core is expected to suspend
> > all the children of a device before calling the device driver's suspend
> > method.  If there are other threads trying to add new children at the
> > same time, it's the PM core's responsibility to synchronize with
> > them -- an impossible job, since only the device's driver knows what
> > those other threads are and how to stop them safely.
> 
> It's not a problem if new children are registered before the parent's
> ->suspend() is called, the PM core can handle that.  The problem is the
> potential race between the suspending task and the threads registering new
> children concurrently to the executing ->suspend(), because if those threads
> lose the race, the resume ordering will be broken.

Not only that, the new children will exist temporarily in a state where 
their parent is suspended and they are not.  Clearly we cannot allow 
children to be registered below suspended parents.

> Since the PM core knows nothing about the drivers internals, the drivers'
> ->suspend() methods must be responsible for synchronizing with the other
> threads used by the driver.

This is a new requirement.  We didn't have to worry about it with the 
freezer.  Driver maintainers need to know about it.

> I think we just attempted to take device semaphores too early.  We probably
> can take the device semaphores _after_ suspending all devices without
> much hassle.

At that stage there isn't any real need to hold the semaphores.  And it 
complicates unregistration, which should always be allowed.  At the 
moment I don't see any benefit to locking all the semaphores.

>  However, it's not actually a problem if a suspended device
> gets unregistered - it's removed from the list on which it is at the moment
> and won't be resumed.  It also is not a problem if the device is registered
> after it's master's ->resume() has run.

Correct.

> Besides, taking the semaphores for all _existing_ devices doesn't prevent
> new devices from being added and that's we needed to take
> pm_sleep_rwsem in device_add().

Yes.  We could keep it, but not acquire it until after all devices have 
been suspended.  This might catch some errors.

> IMO the device driver should assure that no new children will be registered
> concurrently with the ->suspend() method (IOW, ->suspend() should wait for
> all such registrations to complete and should prevent any new ones from
> being started) and it should make it impossible to register any new children
> after ->suspend() has run.  It's the driver's problem how to achieve that.

Exactly; this has to be added to the PM documentation.

The difficulty arises only for drivers that support hotplugging, that 
is, detecting and registering children from somewhere other than their 
probe method.

> > The PM core could help detect errors here.  If it tries to suspend a 
> > device and sees that the device's parent is already suspended, then the 
> > parent's driver has a bug.
> 
> Yes, I think we ought to fail the suspend in such cases.  Still, that's not
> sufficient to prevent a child from being registered after we've run
> dpm_suspend().  For this reason, we could also leave dpm_suspend() with
> dpm_list_mtx held and not release it until the next dpm_resume() is run.

The pm_sleep_rwsem will do a better job of catching such errors.

> That will potentially cause some trouble to CPU hotplug cotifiers, but we can
> handle that, for example, by using the in_suspend_context() test.

Do they need to register new CPUs at some point?  There ought to be a 
way to handle that.

Alan Stern


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-25 23:37     ` Alan Stern
@ 2008-02-26  0:07       ` Rafael J. Wysocki
  2008-02-26 15:49         ` Alan Stern
  0 siblings, 1 reply; 38+ messages in thread
From: Rafael J. Wysocki @ 2008-02-26  0:07 UTC (permalink / raw)
  To: Alan Stern; +Cc: Linux-pm mailing list, Kernel development list

On Tuesday, 26 of February 2008, Alan Stern wrote:
> On Mon, 25 Feb 2008, Rafael J. Wysocki wrote:
> 
> > > But when a system sleep begins, the PM core is expected to suspend
> > > all the children of a device before calling the device driver's suspend
> > > method.  If there are other threads trying to add new children at the
> > > same time, it's the PM core's responsibility to synchronize with
> > > them -- an impossible job, since only the device's driver knows what
> > > those other threads are and how to stop them safely.
> > 
> > It's not a problem if new children are registered before the parent's
> > ->suspend() is called, the PM core can handle that.  The problem is the
> > potential race between the suspending task and the threads registering new
> > children concurrently to the executing ->suspend(), because if those threads
> > lose the race, the resume ordering will be broken.
> 
> Not only that, the new children will exist temporarily in a state where 
> their parent is suspended and they are not.  Clearly we cannot allow 
> children to be registered below suspended parents.
> 
> > Since the PM core knows nothing about the drivers internals, the drivers'
> > ->suspend() methods must be responsible for synchronizing with the other
> > threads used by the driver.
> 
> This is a new requirement.  We didn't have to worry about it with the 
> freezer.  Driver maintainers need to know about it.

Certainly.  That's why I've been saying that it's not really a simple task to
get rid of the freezer. ;-)

> > I think we just attempted to take device semaphores too early.  We probably
> > can take the device semaphores _after_ suspending all devices without
> > much hassle.
> 
> At that stage there isn't any real need to hold the semaphores.  And it 
> complicates unregistration, which should always be allowed.  At the 
> moment I don't see any benefit to locking all the semaphores.

Agreed.

> >  However, it's not actually a problem if a suspended device
> > gets unregistered - it's removed from the list on which it is at the moment
> > and won't be resumed.  It also is not a problem if the device is registered
> > after it's master's ->resume() has run.
> 
> Correct.
> 
> > Besides, taking the semaphores for all _existing_ devices doesn't prevent
> > new devices from being added and that's we needed to take
> > pm_sleep_rwsem in device_add().
> 
> Yes.  We could keep it, but not acquire it until after all devices have 
> been suspended.  This might catch some errors.

Agreed.

> > IMO the device driver should assure that no new children will be registered
> > concurrently with the ->suspend() method (IOW, ->suspend() should wait for
> > all such registrations to complete and should prevent any new ones from
> > being started) and it should make it impossible to register any new children
> > after ->suspend() has run.  It's the driver's problem how to achieve that.
> 
> Exactly; this has to be added to the PM documentation.

Into Documentation/power/devices.txt, I gather?

> The difficulty arises only for drivers that support hotplugging, that 
> is, detecting and registering children from somewhere other than their 
> probe method.

Yes.

> > > The PM core could help detect errors here.  If it tries to suspend a 
> > > device and sees that the device's parent is already suspended, then the 
> > > parent's driver has a bug.
> > 
> > Yes, I think we ought to fail the suspend in such cases.  Still, that's not
> > sufficient to prevent a child from being registered after we've run
> > dpm_suspend().  For this reason, we could also leave dpm_suspend() with
> > dpm_list_mtx held and not release it until the next dpm_resume() is run.
> 
> The pm_sleep_rwsem will do a better job of catching such errors.

But we should not leave a window between releasing dpm_list_mtx and taking
pm_sleep_rwsem.  Either that, or we should make sure that dpm_active is
empty after acquiring pm_sleep_rwsem.
 
> > That will potentially cause some trouble to CPU hotplug cotifiers, but we can
> > handle that, for example, by using the in_suspend_context() test.
> 
> Do they need to register new CPUs at some point?  There ought to be a 
> way to handle that.

No, they don't, but there are some CPU-related device objects that get
uregistered/registered.  Still, all of this work is really redundant if the CPU
in question comes back up during the resume, so it should be avoided in
general.  The CPU hotplug notifiers should only unregister those objects if
the CPU hasn't gone on line during the resume and they have all information
necessary for discovering that.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-25 22:25   ` Rafael J. Wysocki
  2008-02-25 23:37     ` Alan Stern
@ 2008-02-26  7:13     ` David Brownell
  2008-02-26  8:25       ` David Newall
  1 sibling, 1 reply; 38+ messages in thread
From: David Brownell @ 2008-02-26  7:13 UTC (permalink / raw)
  To: Rafael J. Wysocki, Alan Stern; +Cc: linux-pm, Kernel development list

This "flaw" isn't a new thing, of course.  I remember pointing out the rather
annoying proclivity of the PM framework to deadlock when suspend() tried to
remove USB devices ... back around 2.6.10 or so.  Things have shuffled around
a bit, and gotten better in some cases, but not fundamentally changed.

It may be more accurate to say that now we understand some constraints on
device tree management policies ... ones we had previously assumed should
not be issues.  (But AFAICT, without actually considering the question.
Now we know the right question to ask!)

On Monday 25 February 2008, Rafael J. Wysocki wrote:
> IMO the device driver should assure that no new children will be registered
> concurrently with the ->suspend() method (IOW, ->suspend() should wait for
> all such registrations to complete and should prevent any new ones from
> being started) and it should make it impossible to register any new children
> after ->suspend() has run.  It's the driver's problem how to achieve that.

There's also the case where it's framework code that handles the additions
rather than the parent device.  That would be typical for many bridge, hub,
or adapter type drivers ... you may be thinking mostly about drivers acting
as "leaf" nodes in the device tree, at least in terms of real hardware nodes.

Yes, "require that policy from such framework code too".  Just trying to be
sure the description doesn't have gaping holes in the middle.  :)

I can think of a bunch of serial busses where framework code has that sort
of responsiblity.  USB, SPI, I2C ... "legacy" I2C drivers would all need
to be taught not to create/remove children during those intervals, lacking
"new style" conversion (which makes them work more like normal drivers).

- Dave

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-26  7:13     ` David Brownell
@ 2008-02-26  8:25       ` David Newall
  2008-02-26  9:16         ` David Brownell
  0 siblings, 1 reply; 38+ messages in thread
From: David Newall @ 2008-02-26  8:25 UTC (permalink / raw)
  To: David Brownell
  Cc: Rafael J. Wysocki, Alan Stern, linux-pm, Kernel development list

David Brownell wrote:
> This "flaw" isn't a new thing, of course.  I remember pointing out the rather
> annoying proclivity of the PM framework to deadlock when suspend() tried to
> remove USB devices ... back around 2.6.10 or so.  Things have shuffled around
> a bit, and gotten better in some cases, but not fundamentally changed.

Hardware can be inserted and removed while we're in a suspend state; and
there's nothing that we can do about it until we resume.  Is it fair to
say, then, that having started suspend, we could reasonably ignore any
device insertion and removal, and handle it on resume?  Presumably we
need to scan for hardware changes on resume.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-26  8:25       ` David Newall
@ 2008-02-26  9:16         ` David Brownell
  2008-02-26 13:36           ` David Newall
  0 siblings, 1 reply; 38+ messages in thread
From: David Brownell @ 2008-02-26  9:16 UTC (permalink / raw)
  To: David Newall
  Cc: Rafael J. Wysocki, Alan Stern, linux-pm, Kernel development list

On Tuesday 26 February 2008, David Newall wrote:
> 
> Hardware can be inserted and removed while we're in a suspend state; and
> there's nothing that we can do about it until we resume.  Is it fair to
> say, then, that having started suspend, we could reasonably ignore any
> device insertion and removal, and handle it on resume?

"Ignore" seems a bit strong; those events may be wakeup triggers,
which would cause the hardware to make it a very short suspend state.

"Defer handling" is more to the point, be it by hardware or software.

> Presumably we need to scan for hardware changes on resume.

Not on most busses I work with; the hardware issues notifications
whenever the devices are removable.

Which is a Good Thing ... scanning, as you suggest, is inherently
not reliable.  If a mouse is swapped, it likely doesn't matter a
lot whether it's the "same" or not.

But ... how about if some removable storage media was taken out,
updated on a different system, then restored before the resume?
Not many filesystems handle that sanely.  You *really* want to
have hardware which reports disconnect/reconnect events, or which
physically prevents them (locked case etc).

- Dave

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-26  9:16         ` David Brownell
@ 2008-02-26 13:36           ` David Newall
  2008-02-26 15:58             ` Alan Stern
  0 siblings, 1 reply; 38+ messages in thread
From: David Newall @ 2008-02-26 13:36 UTC (permalink / raw)
  To: David Brownell
  Cc: Rafael J. Wysocki, Alan Stern, linux-pm, Kernel development list

David Brownell wrote:
> On Tuesday 26 February 2008, David Newall wrote:
>   
>> Hardware can be inserted and removed while we're in a suspend state; and
>> there's nothing that we can do about it until we resume.  Is it fair to
>> say, then, that having started suspend, we could reasonably ignore any
>> device insertion and removal, and handle it on resume?
>>     
>
> "Ignore" seems a bit strong; those events may be wakeup triggers,
> which would cause the hardware to make it a very short suspend state.
>
> "Defer handling" is more to the point, be it by hardware or software.
>
>   

Of course, "defer".  The insertion has to be handled eventually.  What
I'm wondering is if we can ignore it, and catch it on the resume.


>> Presumably we need to scan for hardware changes on resume.
>>     
>
> Not on most busses I work with; the hardware issues notifications
> whenever the devices are removable.
>   

There's no notification while we're suspended.  Isn't it necessary to
scan all busses on resume, just to know what's on them?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-26  0:07       ` Rafael J. Wysocki
@ 2008-02-26 15:49         ` Alan Stern
  2008-02-26 23:17           ` Rafael J. Wysocki
  0 siblings, 1 reply; 38+ messages in thread
From: Alan Stern @ 2008-02-26 15:49 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Linux-pm mailing list, Kernel development list

On Tue, 26 Feb 2008, Rafael J. Wysocki wrote:

> > > IMO the device driver should assure that no new children will be registered
> > > concurrently with the ->suspend() method (IOW, ->suspend() should wait for
> > > all such registrations to complete and should prevent any new ones from
> > > being started) and it should make it impossible to register any new children
> > > after ->suspend() has run.  It's the driver's problem how to achieve that.
> > 
> > Exactly; this has to be added to the PM documentation.
> 
> Into Documentation/power/devices.txt, I gather?

Yes.

> > > > The PM core could help detect errors here.  If it tries to suspend a 
> > > > device and sees that the device's parent is already suspended, then the 
> > > > parent's driver has a bug.
> > > 
> > > Yes, I think we ought to fail the suspend in such cases.  Still, that's not
> > > sufficient to prevent a child from being registered after we've run
> > > dpm_suspend().  For this reason, we could also leave dpm_suspend() with
> > > dpm_list_mtx held and not release it until the next dpm_resume() is run.
> > 
> > The pm_sleep_rwsem will do a better job of catching such errors.
> 
> But we should not leave a window between releasing dpm_list_mtx and taking
> pm_sleep_rwsem.  Either that, or we should make sure that dpm_active is
> empty after acquiring pm_sleep_rwsem.

I've got some ideas on how to implement this.

We can add a new field "suspend_called" to dev->power.  It would be
owned by the PM core (protect by dpm_list_mtx) and read-only to
drivers.  Normally it will contain 0, but when the suspend method is
running we set it to SUSPEND_RUNNING and when the method returns
successfully we set it to SUSPEND_DONE.  Before calling the resume
method we set it back to 0.  Drivers can use this field as an easy way
of checking that all the child devices have been suspended.

When a new device is registered we check its parent's suspend_called
value.  If it is SUSPEND_DONE then the caller has a bug and we have to
fail the registration.  If it is SUSPEND_RUNNING then the registration
is legal, but we remember what happened.  Then when the
currently-running suspend method returns and we reacquire the
dpm_list_mtx, we will realize that a race was lost.  If the method
completed successfully (which it shouldn't) we can resume that device
immediately without ever taking it off the dpm_active list; but either
way we should continue the suspend loop.  Now the new child will be at
the end of the dpm_active_list, so it will be suspended before the
parent is reached again.

This way we can recover from drivers that are willing to suspend their 
device even though there are unsuspended children.  The only drawback 
will be that for a short time the child will be active while its parent 
is suspended.

We should not abort the entire sleep transition simply because we lost 
a race.  With this scheme we won't even need the pm_sleep_rwsem; the 
dpm_list_mtx will provide all the necessary protection.

This is more intricate than it should be.  It would have been better to
have had "disable_new_children" and "enable_new_children" methods from
the beginning; then there wouldn't be any races at all.  That's life...

The one tricky thing to watch out for is when a suspend or resume 
method wants to unregister the device being suspended or resumed.  Even 
that should be doable (set suspend_called to UNREGISTERED or something 
like that).

> > > That will potentially cause some trouble to CPU hotplug cotifiers, but we can
> > > handle that, for example, by using the in_suspend_context() test.
> > 
> > Do they need to register new CPUs at some point?  There ought to be a 
> > way to handle that.
> 
> No, they don't, but there are some CPU-related device objects that get
> uregistered/registered.  Still, all of this work is really redundant if the CPU
> in question comes back up during the resume, so it should be avoided in
> general.  The CPU hotplug notifiers should only unregister those objects if
> the CPU hasn't gone on line during the resume and they have all information
> necessary for discovering that.

Unregistration should always be allowed, and registration should be 
allowed whenever the parent isn't suspended.  For devices with no 
parent, we can imagine there is a fictitious parent at the root of the 
device tree.  Conceptually it gets suspended after every real device 
and resumed before.  Maybe even before dpm_power_up(), meaning that 
devices with no parent could be registered by a resume_early method.

When your lock-removal stuff gets into Greg's tree, I'll write all 
this.  Sound good?

Alan Stern

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-26 13:36           ` David Newall
@ 2008-02-26 15:58             ` Alan Stern
  0 siblings, 0 replies; 38+ messages in thread
From: Alan Stern @ 2008-02-26 15:58 UTC (permalink / raw)
  To: David Newall
  Cc: David Brownell, Rafael J. Wysocki, linux-pm, Kernel development list

On Wed, 27 Feb 2008, David Newall wrote:

> David Brownell wrote:
> > On Tuesday 26 February 2008, David Newall wrote:
> >   
> >> Hardware can be inserted and removed while we're in a suspend state; and
> >> there's nothing that we can do about it until we resume.  Is it fair to
> >> say, then, that having started suspend, we could reasonably ignore any
> >> device insertion and removal, and handle it on resume?
> >>     
> >
> > "Ignore" seems a bit strong; those events may be wakeup triggers,
> > which would cause the hardware to make it a very short suspend state.
> >
> > "Defer handling" is more to the point, be it by hardware or software.
> >
> >   
> 
> Of course, "defer".  The insertion has to be handled eventually.  What
> I'm wondering is if we can ignore it, and catch it on the resume.

Certainly.  If hardware-change events can get lost because of the
system sleep, the resume method should make every effort to verify that 
what it remembers of the hardware state matches the current reality.

> >> Presumably we need to scan for hardware changes on resume.
> >>     
> >
> > Not on most busses I work with; the hardware issues notifications
> > whenever the devices are removable.
> >   
> 
> There's no notification while we're suspended.  Isn't it necessary to
> scan all busses on resume, just to know what's on them?

It depends on the bus.  If the bus doesn't support hotplugging then 
scanning isn't necessary.  If the bus does support hotplugging then 
scanning after suspend may or may not be necessary, depending on 
whether or not the bus controller remained powered during the suspend.  
For hotpluggable buses, scanning after hibernation is always necessary.

Alan Stern

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-26 15:49         ` Alan Stern
@ 2008-02-26 23:17           ` Rafael J. Wysocki
  2008-02-27 16:03             ` Alan Stern
  0 siblings, 1 reply; 38+ messages in thread
From: Rafael J. Wysocki @ 2008-02-26 23:17 UTC (permalink / raw)
  To: Alan Stern; +Cc: Linux-pm mailing list, Kernel development list

On Tuesday, 26 of February 2008, Alan Stern wrote:
> On Tue, 26 Feb 2008, Rafael J. Wysocki wrote:
> 
> > > > IMO the device driver should assure that no new children will be registered
> > > > concurrently with the ->suspend() method (IOW, ->suspend() should wait for
> > > > all such registrations to complete and should prevent any new ones from
> > > > being started) and it should make it impossible to register any new children
> > > > after ->suspend() has run.  It's the driver's problem how to achieve that.
> > > 
> > > Exactly; this has to be added to the PM documentation.
> > 
> > Into Documentation/power/devices.txt, I gather?
> 
> Yes.
> 
> > > > > The PM core could help detect errors here.  If it tries to suspend a 
> > > > > device and sees that the device's parent is already suspended, then the 
> > > > > parent's driver has a bug.
> > > > 
> > > > Yes, I think we ought to fail the suspend in such cases.  Still, that's not
> > > > sufficient to prevent a child from being registered after we've run
> > > > dpm_suspend().  For this reason, we could also leave dpm_suspend() with
> > > > dpm_list_mtx held and not release it until the next dpm_resume() is run.
> > > 
> > > The pm_sleep_rwsem will do a better job of catching such errors.
> > 
> > But we should not leave a window between releasing dpm_list_mtx and taking
> > pm_sleep_rwsem.  Either that, or we should make sure that dpm_active is
> > empty after acquiring pm_sleep_rwsem.
> 
> I've got some ideas on how to implement this.
> 
> We can add a new field "suspend_called" to dev->power.

I'd call it "sleeping" or something like this, for it will also be used by
hibernation callbacks.

> It would be owned by the PM core (protect by dpm_list_mtx) and read-only to
> drivers.  Normally it will contain 0, but when the suspend method is
> running we set it to SUSPEND_RUNNING and when the method returns
> successfully we set it to SUSPEND_DONE.  Before calling the resume
> method we set it back to 0.

Why before?  I'd think that any non-suspended children should not be visible
by the partent's ->resume().

> Drivers can use this field as an easy way of checking that all the child
> devices have been suspended. 
> 
> When a new device is registered we check its parent's suspend_called
> value.  If it is SUSPEND_DONE then the caller has a bug and we have to
> fail the registration.  If it is SUSPEND_RUNNING then the registration
> is legal, but we remember what happened.

This seems to require some trickery.  Namely, device_add() will notice that
the registration is done concurrently with the running ->suspend() of the
parent and will have to communicate that to dpm_suspend() which is supposed
to resume the master in the next step.

> Then when the currently-running suspend method returns and we reacquire the
> dpm_list_mtx, we will realize that a race was lost.

How exactly do you want to check that?

> If the method completed successfully (which it shouldn't) we can resume that
> device immediately without ever taking it off the dpm_active list; but either
> way we should continue the suspend loop.  Now the new child will be at
> the end of the dpm_active_list, so it will be suspended before the
> parent is reached again.
> 
> This way we can recover from drivers that are willing to suspend their 
> device even though there are unsuspended children.  The only drawback 
> will be that for a short time the child will be active while its parent 
> is suspended.

Well, if the parent is a bus, that will be a problem.

> We should not abort the entire sleep transition simply because we lost 
> a race.

I don't agree here.  If we require drivers to prevent such races from happening
and they don't comply, we can give up instead of trying to work around the
non-compilance.

> With this scheme we won't even need the pm_sleep_rwsem; the  
> dpm_list_mtx will provide all the necessary protection.
> 
> This is more intricate than it should be.  It would have been better to
> have had "disable_new_children" and "enable_new_children" methods from
> the beginning; then there wouldn't be any races at all.  That's life...
> 
> The one tricky thing to watch out for is when a suspend or resume 
> method wants to unregister the device being suspended or resumed.

That can't happen, because dev->sem is taken by suspend_device() and
device_del() would lock up attempting to acquire it once again.

> Even  that should be doable (set suspend_called to UNREGISTERED or something 
> like that).
> 
> > > > That will potentially cause some trouble to CPU hotplug cotifiers, but we can
> > > > handle that, for example, by using the in_suspend_context() test.
> > > 
> > > Do they need to register new CPUs at some point?  There ought to be a 
> > > way to handle that.
> > 
> > No, they don't, but there are some CPU-related device objects that get
> > uregistered/registered.  Still, all of this work is really redundant if the CPU
> > in question comes back up during the resume, so it should be avoided in
> > general.  The CPU hotplug notifiers should only unregister those objects if
> > the CPU hasn't gone on line during the resume and they have all information
> > necessary for discovering that.
> 
> Unregistration should always be allowed, and registration should be 
> allowed whenever the parent isn't suspended.

I'm still thinking that registering while the parent is suspending should not
be allowed.

> For devices with no parent, we can imagine there is a fictitious parent at
> the root of the device tree.  Conceptually it gets suspended after every real
> device and resumed before.  Maybe even before dpm_power_up(), meaning that 
> devices with no parent could be registered by a resume_early method.
> 
> When your lock-removal stuff gets into Greg's tree, I'll write all 
> this.  Sound good?

The direction seems to be fine, but the details need a bit more clarification,
as far as I'm concerned.  Having a patch to discuss will certainly help a lot,
though. ;-)

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-26 23:17           ` Rafael J. Wysocki
@ 2008-02-27 16:03             ` Alan Stern
  2008-02-27 19:50               ` Rafael J. Wysocki
  0 siblings, 1 reply; 38+ messages in thread
From: Alan Stern @ 2008-02-27 16:03 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Linux-pm mailing list, Kernel development list

On Wed, 27 Feb 2008, Rafael J. Wysocki wrote:

> > I've got some ideas on how to implement this.
> > 
> > We can add a new field "suspend_called" to dev->power.
> 
> I'd call it "sleeping" or something like this, for it will also be used by
> hibernation callbacks.

The name refers to the "suspend" method, not the type of sleep being
carried out.  We use the same method for both suspend and hibernation.
But maybe "sleeping" would be better.

> > It would be owned by the PM core (protect by dpm_list_mtx) and read-only to
> > drivers.  Normally it will contain 0, but when the suspend method is
> > running we set it to SUSPEND_RUNNING and when the method returns
> > successfully we set it to SUSPEND_DONE.  Before calling the resume
> > method we set it back to 0.
> 
> Why before?  I'd think that any non-suspended children should not be visible
> by the partent's ->resume().

All right, we can set it to RESUME_RUNNING before calling the resume
method and then set it to 0 afterwards.  The point is that the value
shouldn't remain SUSPEND_DONE while resume runs, because it should be
legal for resume to register new children.

> > When a new device is registered we check its parent's suspend_called
> > value.  If it is SUSPEND_DONE then the caller has a bug and we have to
> > fail the registration.  If it is SUSPEND_RUNNING then the registration
> > is legal, but we remember what happened.
> 
> This seems to require some trickery.  Namely, device_add() will notice that
> the registration is done concurrently with the running ->suspend() of the
> parent and will have to communicate that to dpm_suspend() which is supposed
> to resume the master in the next step.

It will get noticed in device_pm_add() while holding dpm_list_mtx.  
The information can be stored in a static private flag
"child_added_while_parent_suspends" (or maybe something more terse!).

> > Then when the currently-running suspend method returns and we reacquire the
> > dpm_list_mtx, we will realize that a race was lost.
> 
> How exactly do you want to check that?

Check whether child_added_while_parent_suspends is nonzero.

> > If the method completed successfully (which it shouldn't) we can resume that
> > device immediately without ever taking it off the dpm_active list; but either
> > way we should continue the suspend loop.  Now the new child will be at
> > the end of the dpm_active_list, so it will be suspended before the
> > parent is reached again.
> > 
> > This way we can recover from drivers that are willing to suspend their 
> > device even though there are unsuspended children.  The only drawback 
> > will be that for a short time the child will be active while its parent 
> > is suspended.
> 
> Well, if the parent is a bus, that will be a problem.

Sure.  But it won't be the PM core's problem; it will be a bug in the
bus's driver.  We will print a warning in the log so the bug can be 
tracked down.

> > We should not abort the entire sleep transition simply because we lost 
> > a race.
> 
> I don't agree here.  If we require drivers to prevent such races from happening
> and they don't comply, we can give up instead of trying to work around the
> non-compilance.

You misunderstand.  We can't require drivers to prevent these races 
entirely.  As an example, a properly-written, compliant driver might 
work like this:

	Task 0				Task 1
	------				------
	dev->power.sleeping =
	  SUSPEND_RUNNING;
	Call (drv->suspend)(dev)
					Register a child below dev
	suspend method prevents new
	  child registrations
	suspend method waits for
	  existing registration to
	  finish
					Check dev->power.sleeping and set
					  child_added_while_parent_suspends
					Registration completes successfully
	suspend method sees there is
	  an unsuspended child and
	  returns -EBUSY

	Check child_added_while_parent_suspends
	  and realize that we lost the race

There's nothing illegal about this; it's just an accident of timing.  
Nothing has gone wrong and we shouldn't abort the sleep.  We should
continue where we left off, by suspending the new child and then trying
to suspend the parent again.

> > With this scheme we won't even need the pm_sleep_rwsem; the  
> > dpm_list_mtx will provide all the necessary protection.
> > 
> > This is more intricate than it should be.  It would have been better to
> > have had "disable_new_children" and "enable_new_children" methods from
> > the beginning; then there wouldn't be any races at all.  That's life...
> > 
> > The one tricky thing to watch out for is when a suspend or resume 
> > method wants to unregister the device being suspended or resumed.
> 
> That can't happen, because dev->sem is taken by suspend_device() and
> device_del() would lock up attempting to acquire it once again.

We'll have to fix device_del() to prevent that from happening.  Your 
in_sleep_context() approach should work.

> > Unregistration should always be allowed, and registration should be 
> > allowed whenever the parent isn't suspended.
> 
> I'm still thinking that registering while the parent is suspending should not
> be allowed.

Unfortunately the lack of "prevent_new_children" and 
"allow_new_children" methods gives us no choice.  The example above 
shows why.

Alan Stern


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-27 16:03             ` Alan Stern
@ 2008-02-27 19:50               ` Rafael J. Wysocki
  2008-02-27 20:15                 ` Alan Stern
  2008-02-28 22:49                 ` Alan Stern
  0 siblings, 2 replies; 38+ messages in thread
From: Rafael J. Wysocki @ 2008-02-27 19:50 UTC (permalink / raw)
  To: Alan Stern; +Cc: Linux-pm mailing list, Kernel development list

On Wednesday, 27 of February 2008, Alan Stern wrote:
> On Wed, 27 Feb 2008, Rafael J. Wysocki wrote:
> 
> > > I've got some ideas on how to implement this.
> > > 
> > > We can add a new field "suspend_called" to dev->power.
> > 
> > I'd call it "sleeping" or something like this, for it will also be used by
> > hibernation callbacks.
> 
> The name refers to the "suspend" method, not the type of sleep being
> carried out.  We use the same method for both suspend and hibernation.

We won't in the future.

> But maybe "sleeping" would be better.
> 
> > > It would be owned by the PM core (protect by dpm_list_mtx) and read-only to
> > > drivers.  Normally it will contain 0, but when the suspend method is
> > > running we set it to SUSPEND_RUNNING and when the method returns
> > > successfully we set it to SUSPEND_DONE.  Before calling the resume
> > > method we set it back to 0.
> > 
> > Why before?  I'd think that any non-suspended children should not be visible
> > by the partent's ->resume().
> 
> All right, we can set it to RESUME_RUNNING before calling the resume
> method and then set it to 0 afterwards.  The point is that the value
> shouldn't remain SUSPEND_DONE while resume runs, because it should be
> legal for resume to register new children.

I'm not sure.  The core moves the device to dpm_active only after ->resume()
has run.  Thus, if ->resume() registers new children, the ordering of
dpm_active will be wrong.

> > > When a new device is registered we check its parent's suspend_called
> > > value.  If it is SUSPEND_DONE then the caller has a bug and we have to
> > > fail the registration.  If it is SUSPEND_RUNNING then the registration
> > > is legal, but we remember what happened.
> > 
> > This seems to require some trickery.  Namely, device_add() will notice that
> > the registration is done concurrently with the running ->suspend() of the
> > parent and will have to communicate that to dpm_suspend() which is supposed
> > to resume the master in the next step.
> 
> It will get noticed in device_pm_add() while holding dpm_list_mtx.  
> The information can be stored in a static private flag
> "child_added_while_parent_suspends" (or maybe something more terse!).

Hmm, yes, we can do it this way.

> > > Then when the currently-running suspend method returns and we reacquire the
> > > dpm_list_mtx, we will realize that a race was lost.
> > 
> > How exactly do you want to check that?
> 
> Check whether child_added_while_parent_suspends is nonzero.
> 
> > > If the method completed successfully (which it shouldn't) we can resume that
> > > device immediately without ever taking it off the dpm_active list; but either
> > > way we should continue the suspend loop.  Now the new child will be at
> > > the end of the dpm_active_list, so it will be suspended before the
> > > parent is reached again.
> > > 
> > > This way we can recover from drivers that are willing to suspend their 
> > > device even though there are unsuspended children.  The only drawback 
> > > will be that for a short time the child will be active while its parent 
> > > is suspended.
> > 
> > Well, if the parent is a bus, that will be a problem.
> 
> Sure.  But it won't be the PM core's problem; it will be a bug in the
> bus's driver.  We will print a warning in the log so the bug can be 
> tracked down.
> 
> > > We should not abort the entire sleep transition simply because we lost 
> > > a race.
> > 
> > I don't agree here.  If we require drivers to prevent such races from happening
> > and they don't comply, we can give up instead of trying to work around the
> > non-compilance.
> 
> You misunderstand.

Well, I misunderstood indeed.

> We can't require drivers to prevent these races entirely.  As an example, a
> properly-written, compliant driver might work like this:
> 
> 	Task 0				Task 1
> 	------				------
> 	dev->power.sleeping =
> 	  SUSPEND_RUNNING;
> 	Call (drv->suspend)(dev)
> 					Register a child below dev
> 	suspend method prevents new
> 	  child registrations
> 	suspend method waits for
> 	  existing registration to
> 	  finish
> 					Check dev->power.sleeping and set
> 					  child_added_while_parent_suspends
> 					Registration completes successfully
> 	suspend method sees there is
> 	  an unsuspended child and
> 	  returns -EBUSY
> 
> 	Check child_added_while_parent_suspends
> 	  and realize that we lost the race
> 
> There's nothing illegal about this; it's just an accident of timing.  
> Nothing has gone wrong and we shouldn't abort the sleep.  We should
> continue where we left off, by suspending the new child and then trying
> to suspend the parent again.
> 
> > > With this scheme we won't even need the pm_sleep_rwsem; the  
> > > dpm_list_mtx will provide all the necessary protection.
> > > 
> > > This is more intricate than it should be.  It would have been better to
> > > have had "disable_new_children" and "enable_new_children" methods from
> > > the beginning; then there wouldn't be any races at all.  That's life...
> > > 
> > > The one tricky thing to watch out for is when a suspend or resume 
> > > method wants to unregister the device being suspended or resumed.
> > 
> > That can't happen, because dev->sem is taken by suspend_device() and
> > device_del() would lock up attempting to acquire it once again.
> 
> We'll have to fix device_del() to prevent that from happening.  Your 
> in_sleep_context() approach should work.

I'm not sure if we need to do it.  It's always been like this, so the current
drivers' ->suspend() and ->resume() don't unregister the device they're called
for.  I don't see any advantage from doing that for future drivers.

> > > Unregistration should always be allowed, and registration should be 
> > > allowed whenever the parent isn't suspended.
> > 
> > I'm still thinking that registering while the parent is suspending should not
> > be allowed.
> 
> Unfortunately the lack of "prevent_new_children" and 
> "allow_new_children" methods gives us no choice.  The example above 
> shows why.

Yes, it does.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-27 19:50               ` Rafael J. Wysocki
@ 2008-02-27 20:15                 ` Alan Stern
  2008-02-28 22:49                 ` Alan Stern
  1 sibling, 0 replies; 38+ messages in thread
From: Alan Stern @ 2008-02-27 20:15 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Linux-pm mailing list, Kernel development list

On Wed, 27 Feb 2008, Rafael J. Wysocki wrote:

> > All right, we can set it to RESUME_RUNNING before calling the resume
> > method and then set it to 0 afterwards.  The point is that the value
> > shouldn't remain SUSPEND_DONE while resume runs, because it should be
> > legal for resume to register new children.
> 
> I'm not sure.  The core moves the device to dpm_active only after ->resume()
> has run.

Actually the move is done before the method is called.  So this isn't a
problem.

...
> > > > The one tricky thing to watch out for is when a suspend or resume 
> > > > method wants to unregister the device being suspended or resumed.
> > > 
> > > That can't happen, because dev->sem is taken by suspend_device() and
> > > device_del() would lock up attempting to acquire it once again.
> > 
> > We'll have to fix device_del() to prevent that from happening.  Your 
> > in_sleep_context() approach should work.
> 
> I'm not sure if we need to do it.  It's always been like this, so the current
> drivers' ->suspend() and ->resume() don't unregister the device they're called
> for.  I don't see any advantage from doing that for future drivers.

All right, if it doesn't happen now then we don't need to allow for it.  
That makes life a little simpler.

Alan Stern


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-25 15:39 Fundamental flaw in system suspend, exposed by freezer removal Alan Stern
  2008-02-25 19:46 ` [linux-pm] " Alan Stern
  2008-02-25 22:24 ` Rafael J. Wysocki
@ 2008-02-27 20:36 ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2008-02-27 20:36 UTC (permalink / raw)
  To: Alan Stern; +Cc: Linux-pm mailing list, Kernel development list


On Mon, 2008-02-25 at 10:39 -0500, Alan Stern wrote:
> Ongoing efforts to remove the freezer from the system suspend and
> hibernation code ("system sleep" is the proper catch-all term) have
> turned up a fundamental flaw in the Power Management subsystem's
> design.  In brief, we cannot handle the race between hotplug addition
> of new devices and suspending all existing devices.

 .../...

Yup, old problem, I think I've said a long time ago that it's the
reponsibility of bus drivers (such as USB khub) to stop issuing device
additions when suspend is in progress. It might be possible to still do
removal tho.

> The only possible solution is to have the drivers themselves be
> responsible for preventing calls to device_add() or device_register()  
> during a system sleep.  (It's also necessary to prevent driver binding,
> but this isn't a major issue.) 

Agreed.

>  The most straightforward approach is to
> add a new pair of driver methods: one to disable adding children and
> one to re-enable it.  Of course this would represent a significant
> addition to the Power Management driver interface.

No, I think the global notifier is plenty for now. Later on, we can
re-add early-suspend, late-resume, or we can finally turn the whole
thing into a recursive call tree where the parents are the ones calling
the children suspend :-) But for now, I think the global notifier is
enough.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-27 19:50               ` Rafael J. Wysocki
  2008-02-27 20:15                 ` Alan Stern
@ 2008-02-28 22:49                 ` Alan Stern
  2008-02-29  0:01                   ` Rafael J. Wysocki
  2008-02-29 14:26                   ` Rafael J. Wysocki
  1 sibling, 2 replies; 38+ messages in thread
From: Alan Stern @ 2008-02-28 22:49 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Linux-pm mailing list, Kernel development list

Rafael:

Here's my patch.  It doesn't include the timers for deadlock debugging, 
but it does include all the other stuff we've been talking about.  My 
base probably isn't quite in sync with yours, so this may not apply 
cleanly on your system.  But the divergences should be small.

Incidentally, there seemed to be a bug in your dpm_suspend() -- the 
dpm_list_mtx needs to be reacquired before the error checking.  This
patch fixes that.  It also removes pm_sleep_rwsem, which isn't used 
any more.

We should think about device_pm_schedule_removal().  It won't work
right if a suspend method calls it for the device being suspended, 
because the device gets moved to the dpm_off list after the method 
runs.

Alan Stern



Index: usb-2.6/Documentation/power/devices.txt
===================================================================
--- usb-2.6.orig/Documentation/power/devices.txt
+++ usb-2.6/Documentation/power/devices.txt
@@ -192,11 +192,27 @@ used to resume those devices.
 
 The ordering of the device tree is defined by the order in which devices
 get registered:  a child can never be registered, probed or resumed before
-its parent; and can't be removed or suspended after that parent.
+its parent; and can't be removed or suspended after that parent.  This
+means that special care is needed when calling device_move(); currently
+the kernel does not adjust its suspend and resume ordering to match the
+new device tree.
 
 The policy is that the device tree should match hardware bus topology.
 (Or at least the control bus, for devices which use multiple busses.)
 
+There is an unavoidable race between the PM core suspending all the
+children of a device and the device's driver registering new children.
+As a result, it is possible that the core may try to suspend a device
+without first having suspended all of the device's children.  Drivers
+must check for this; a suspend method should return -EBUSY if there are
+unsuspended children.  (The child->power.sleeping field can be used
+for this check.)  In addition, it is illegal to register a child device
+below a suspended parent; hence suspend methods must synchronize with
+other kernel threads that may attempt to add new children.  The suspend
+method must prevent new registrations and wait for concurrent registrations
+to complete before it returns.  New children may be added once more when
+the resume method runs.
+
 
 Suspending Devices
 ------------------
@@ -387,7 +403,9 @@ while the system was powered off, whenev
 PCMCIA, MMC, USB, Firewire, SCSI, and even IDE are common examples of busses
 where common Linux platforms will see such removal.  Details of how drivers
 will notice and handle such removals are currently bus-specific, and often
-involve a separate thread.
+involve a separate thread.  Currently the kernel does not allow a suspend
+or resume method to directly unregister the device being suspended or
+resumed.
 
 
 Note that the bus-specific runtime PM wakeup mechanism can exist, and might
Index: usb-2.6/include/linux/pm.h
===================================================================
--- usb-2.6.orig/include/linux/pm.h
+++ usb-2.6/include/linux/pm.h
@@ -180,8 +180,20 @@ typedef struct pm_message {
 #define PMSG_HIBERNATE	((struct pm_message){ .event = PM_EVENT_HIBERNATE, })
 #define PMSG_ON		((struct pm_message){ .event = PM_EVENT_ON, })
 
+/* This records which method calls have been made, not the device's
+ * actual power state.  It is read-only to drivers.
+ */
+enum pm_sleep_state {
+	PM_AWAKE,		/* = 0, normal situation */
+	PM_SLEEPING,		/* suspend method is running */
+	PM_ASLEEP,		/* suspend method has returned */
+	PM_WAKING,		/* resume method is running */
+	PM_GONE,		/* device has been unregistered */
+};
+
 struct dev_pm_info {
 	pm_message_t		power_state;
+	enum pm_sleep_state	sleeping;
 	unsigned		can_wakeup:1;
 #ifdef	CONFIG_PM_SLEEP
 	unsigned		should_wakeup:1;
Index: usb-2.6/drivers/base/power/power.h
===================================================================
--- usb-2.6.orig/drivers/base/power/power.h
+++ usb-2.6/drivers/base/power/power.h
@@ -11,28 +11,18 @@ static inline struct device *to_device(s
 	return container_of(entry, struct device, power.entry);
 }
 
-extern void device_pm_add(struct device *);
+extern int device_pm_add(struct device *);
 extern void device_pm_remove(struct device *);
-extern int pm_sleep_lock(void);
-extern void pm_sleep_unlock(void);
 
 #else /* CONFIG_PM_SLEEP */
 
 
-static inline void device_pm_add(struct device *dev)
-{
-}
-
-static inline void device_pm_remove(struct device *dev)
-{
-}
-
-static inline int pm_sleep_lock(void)
+static inline int device_pm_add(struct device *dev)
 {
 	return 0;
 }
 
-static inline void pm_sleep_unlock(void)
+static inline void device_pm_remove(struct device *dev)
 {
 }
 
Index: usb-2.6/drivers/base/power/main.c
===================================================================
--- usb-2.6.orig/drivers/base/power/main.c
+++ usb-2.6/drivers/base/power/main.c
@@ -54,7 +54,9 @@ static LIST_HEAD(dpm_destroy);
 
 static DEFINE_MUTEX(dpm_list_mtx);
 
-static DECLARE_RWSEM(pm_sleep_rwsem);
+/* Protected by dpm_list_mtx */
+static bool child_added_while_parent_suspends;
+static bool all_devices_asleep;
 
 int (*platform_enable_wakeup)(struct device *dev, int is_on);
 
@@ -62,14 +64,37 @@ int (*platform_enable_wakeup)(struct dev
  *	device_pm_add - add a device to the list of active devices
  *	@dev:	Device to be added to the list
  */
-void device_pm_add(struct device *dev)
+int device_pm_add(struct device *dev)
 {
+	int rc = 0;
+
 	pr_debug("PM: Adding info for %s:%s\n",
 		 dev->bus ? dev->bus->name : "No Bus",
 		 kobject_name(&dev->kobj));
 	mutex_lock(&dpm_list_mtx);
-	list_add_tail(&dev->power.entry, &dpm_active);
+	if (dev->parent) {
+		switch (dev->parent->power.sleeping) {
+		case PM_SLEEPING:
+			child_added_while_parent_suspends = true;
+			break;
+		case PM_ASLEEP:
+			dev_err(dev, "added while parent '%s' is asleep\n",
+					dev->parent->bus_id);
+			rc = -EHOSTDOWN;
+			break;
+		default:
+			break;
+		}
+	} else if (all_devices_asleep) {
+		dev_err(dev, "added while all devices are asleep\n");
+		rc = -ENETDOWN;
+	}
+	if (rc == 0)
+		list_add_tail(&dev->power.entry, &dpm_active);
+	else
+		dump_stack();
 	mutex_unlock(&dpm_list_mtx);
+	return rc;
 }
 
 /**
@@ -86,6 +111,7 @@ void device_pm_remove(struct device *dev
 	mutex_lock(&dpm_list_mtx);
 	dpm_sysfs_remove(dev);
 	list_del_init(&dev->power.entry);
+	dev->power.sleeping = PM_GONE;
 	mutex_unlock(&dpm_list_mtx);
 }
 
@@ -107,31 +133,6 @@ void device_pm_schedule_removal(struct d
 }
 EXPORT_SYMBOL_GPL(device_pm_schedule_removal);
 
-/**
- *	pm_sleep_lock - mutual exclusion for registration and suspend
- *
- *	Returns 0 if no suspend is underway and device registration
- *	may proceed, otherwise -EBUSY.
- */
-int pm_sleep_lock(void)
-{
-	if (down_read_trylock(&pm_sleep_rwsem))
-		return 0;
-
-	return -EBUSY;
-}
-
-/**
- *	pm_sleep_unlock - mutual exclusion for registration and suspend
- *
- *	This routine undoes the effect of device_pm_add_lock
- *	when a device's registration is complete.
- */
-void pm_sleep_unlock(void)
-{
-	up_read(&pm_sleep_rwsem);
-}
-
 
 /*------------------------- Resume routines -------------------------*/
 
@@ -242,14 +243,18 @@ static int resume_device(struct device *
 static void dpm_resume(void)
 {
 	mutex_lock(&dpm_list_mtx);
+	all_devices_asleep = false;
 	while(!list_empty(&dpm_off)) {
 		struct list_head *entry = dpm_off.next;
 		struct device *dev = to_device(entry);
 
 		list_move_tail(entry, &dpm_active);
+		dev->power.sleeping = PM_WAKING;
 		mutex_unlock(&dpm_list_mtx);
 		resume_device(dev);
 		mutex_lock(&dpm_list_mtx);
+		if (dev->power.sleeping != PM_GONE)
+			dev->power.sleeping = PM_AWAKE;
 	}
 	mutex_unlock(&dpm_list_mtx);
 }
@@ -285,7 +290,6 @@ void device_resume(void)
 	might_sleep();
 	dpm_resume();
 	unregister_dropped_devices();
-	up_write(&pm_sleep_rwsem);
 }
 EXPORT_SYMBOL_GPL(device_resume);
 
@@ -421,9 +425,18 @@ static int dpm_suspend(pm_message_t stat
 		struct list_head *entry = dpm_active.prev;
 		struct device *dev = to_device(entry);
 
+		dev->power.sleeping = PM_SLEEPING;
+		child_added_while_parent_suspends = false;
 		mutex_unlock(&dpm_list_mtx);
 		error = suspend_device(dev, state);
+		mutex_lock(&dpm_list_mtx);
 		if (error) {
+			if (dev->power.sleeping != PM_GONE)
+				dev->power.sleeping = PM_AWAKE;
+			if (error == -EBUSY && entry != dpm_active.prev)
+				continue;	/* A child was added before
+						 * the device could suspend
+						 */
 			printk(KERN_ERR "Could not suspend device %s: "
 					"error %d%s\n",
 					kobject_name(&dev->kobj),
@@ -433,10 +446,24 @@ static int dpm_suspend(pm_message_t stat
 					""));
 			break;
 		}
-		mutex_lock(&dpm_list_mtx);
-		if (!list_empty(&dev->power.entry))
-			list_move(&dev->power.entry, &dpm_off);
+		if (dev->power.sleeping != PM_GONE) {
+			if (child_added_while_parent_suspends) {
+				dev_err(dev, "suspended while a child "
+						"was added\n");
+				dev->power.sleeping = PM_WAKING;
+				mutex_unlock(&dpm_list_mtx);
+				resume_device(dev);
+				mutex_lock(&dpm_list_mtx);
+				if (dev->power.sleeping != PM_GONE)
+					dev->power.sleeping = PM_AWAKE;
+			} else {
+				dev->power.sleeping = PM_ASLEEP;
+				list_move(&dev->power.entry, &dpm_off);
+			}
+		}
 	}
+	if (!error)
+		all_devices_asleep = true;
 	mutex_unlock(&dpm_list_mtx);
 
 	return error;
@@ -454,7 +481,6 @@ int device_suspend(pm_message_t state)
 	int error;
 
 	might_sleep();
-	down_write(&pm_sleep_rwsem);
 	error = dpm_suspend(state);
 	if (error)
 		device_resume();
Index: usb-2.6/drivers/base/core.c
===================================================================
--- usb-2.6.orig/drivers/base/core.c
+++ usb-2.6/drivers/base/core.c
@@ -815,10 +815,12 @@ int device_add(struct device *dev)
 	error = dpm_sysfs_add(dev);
 	if (error)
 		goto PMError;
-	device_pm_add(dev);
 	error = bus_add_device(dev);
 	if (error)
 		goto BusError;
+	error = device_pm_add(dev);
+	if (error)
+		goto PMAddError;
 	kobject_uevent(&dev->kobj, KOBJ_ADD);
 	bus_attach_device(dev);
 	if (parent)
@@ -838,8 +840,9 @@ int device_add(struct device *dev)
  Done:
 	put_device(dev);
 	return error;
+ PMAddError:
+	bus_remove_device(dev);
  BusError:
-	device_pm_remove(dev);
 	dpm_sysfs_remove(dev);
  PMError:
 	if (dev->bus)


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-28 22:49                 ` Alan Stern
@ 2008-02-29  0:01                   ` Rafael J. Wysocki
  2008-02-29 14:26                   ` Rafael J. Wysocki
  1 sibling, 0 replies; 38+ messages in thread
From: Rafael J. Wysocki @ 2008-02-29  0:01 UTC (permalink / raw)
  To: Alan Stern
  Cc: Linux-pm mailing list, Kernel development list, Alexey Starikovskiy

On Thursday, 28 of February 2008, Alan Stern wrote:
> Rafael:

Hi,

> Here's my patch.  It doesn't include the timers for deadlock debugging, 
> but it does include all the other stuff we've been talking about.  My 
> base probably isn't quite in sync with yours, so this may not apply 
> cleanly on your system.  But the divergences should be small.

No big deal.  In fact we're working with Alex on a patch that modifies main.c
too, so I'll have to resolve merging conflicts anyway. ;-)

> Incidentally, there seemed to be a bug in your dpm_suspend() -- the 
> dpm_list_mtx needs to be reacquired before the error checking.

Oh, yes.  Will you please send a fix when
http://marc.info/?l=linux-acpi&m=120389632114090&w=4 is merged?

> This patch fixes that.  It also removes pm_sleep_rwsem, which isn't used 
> any more.

Yeah.  I left pm_sleep_rwsem, because I wasn't sure it would be necessary
in the future.

> We should think about device_pm_schedule_removal().  It won't work
> right if a suspend method calls it for the device being suspended, 
> because the device gets moved to the dpm_off list after the method 
> runs.

I think we should remove device_pm_schedule_removal() early in the 2.6.26
cycle.

I'll review the patch more thoroughly tomorrow, since I have some new material
for the regressions list I need to take care of.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-28 22:49                 ` Alan Stern
  2008-02-29  0:01                   ` Rafael J. Wysocki
@ 2008-02-29 14:26                   ` Rafael J. Wysocki
  2008-02-29 15:53                     ` Alan Stern
  1 sibling, 1 reply; 38+ messages in thread
From: Rafael J. Wysocki @ 2008-02-29 14:26 UTC (permalink / raw)
  To: Alan Stern; +Cc: Linux-pm mailing list, Kernel development list

On Thursday, 28 of February 2008, Alan Stern wrote:
> Rafael:

Hi,

> Here's my patch.  It doesn't include the timers for deadlock debugging, 
> but it does include all the other stuff we've been talking about.  My 
> base probably isn't quite in sync with yours, so this may not apply 
> cleanly on your system.  But the divergences should be small.
> 
> Incidentally, there seemed to be a bug in your dpm_suspend() -- the 
> dpm_list_mtx needs to be reacquired before the error checking.  This
> patch fixes that.  It also removes pm_sleep_rwsem, which isn't used 
> any more.
> 
> We should think about device_pm_schedule_removal().  It won't work
> right if a suspend method calls it for the device being suspended, 
> because the device gets moved to the dpm_off list after the method 
> runs.
> 
> Index: usb-2.6/Documentation/power/devices.txt
> ===================================================================
> --- usb-2.6.orig/Documentation/power/devices.txt
> +++ usb-2.6/Documentation/power/devices.txt
> @@ -192,11 +192,27 @@ used to resume those devices.
>  
>  The ordering of the device tree is defined by the order in which devices
>  get registered:  a child can never be registered, probed or resumed before
> -its parent; and can't be removed or suspended after that parent.
> +its parent; and can't be removed or suspended after that parent.  This
> +means that special care is needed when calling device_move(); currently
> +the kernel does not adjust its suspend and resume ordering to match the
> +new device tree.
>  
>  The policy is that the device tree should match hardware bus topology.
>  (Or at least the control bus, for devices which use multiple busses.)
>  
> +There is an unavoidable race between the PM core suspending all the
> +children of a device and the device's driver registering new children.
> +As a result, it is possible that the core may try to suspend a device
> +without first having suspended all of the device's children.  Drivers
> +must check for this; a suspend method should return -EBUSY if there are
> +unsuspended children.  (The child->power.sleeping field can be used
> +for this check.)  In addition, it is illegal to register a child device

s/illegal/invalid/

> +below a suspended parent; hence suspend methods must synchronize with
> +other kernel threads that may attempt to add new children.  The suspend
> +method must prevent new registrations and wait for concurrent registrations
> +to complete before it returns.  New children may be added once more when
> +the resume method runs.
> +
>  Suspending Devices
>  ------------------
> @@ -387,7 +403,9 @@ while the system was powered off, whenev
>  PCMCIA, MMC, USB, Firewire, SCSI, and even IDE are common examples of busses
>  where common Linux platforms will see such removal.  Details of how drivers
>  will notice and handle such removals are currently bus-specific, and often
> -involve a separate thread.
> +involve a separate thread.  Currently the kernel does not allow a suspend
> +or resume method to directly unregister the device being suspended or
> +resumed.
>  
>  
>  Note that the bus-specific runtime PM wakeup mechanism can exist, and might
> Index: usb-2.6/include/linux/pm.h
> ===================================================================
> --- usb-2.6.orig/include/linux/pm.h
> +++ usb-2.6/include/linux/pm.h
> @@ -180,8 +180,20 @@ typedef struct pm_message {
>  #define PMSG_HIBERNATE	((struct pm_message){ .event = PM_EVENT_HIBERNATE, })
>  #define PMSG_ON		((struct pm_message){ .event = PM_EVENT_ON, })
>  
> +/* This records which method calls have been made, not the device's
> + * actual power state.  It is read-only to drivers.
> + */
> +enum pm_sleep_state {

I'd call it dev_pm_state, in analogy with dev_pm_info etc.

> +	PM_AWAKE,		/* = 0, normal situation */
> +	PM_SLEEPING,		/* suspend method is running */
> +	PM_ASLEEP,		/* suspend method has returned */
> +	PM_WAKING,		/* resume method is running */
> +	PM_GONE,		/* device has been unregistered */
> +};
> +
>  struct dev_pm_info {
>  	pm_message_t		power_state;
> +	enum pm_sleep_state	sleeping;

In fact 'sleeping' doesn't look good in this context.  'pm_state' seems
better to me (although it is confusingly similar to 'power_state', we're going
to get rid of the latter anyway).

>  	unsigned		can_wakeup:1;
>  #ifdef	CONFIG_PM_SLEEP
>  	unsigned		should_wakeup:1;
> Index: usb-2.6/drivers/base/power/power.h
> ===================================================================
> --- usb-2.6.orig/drivers/base/power/power.h
> +++ usb-2.6/drivers/base/power/power.h
> @@ -11,28 +11,18 @@ static inline struct device *to_device(s
>  	return container_of(entry, struct device, power.entry);
>  }
>  
> -extern void device_pm_add(struct device *);
> +extern int device_pm_add(struct device *);
>  extern void device_pm_remove(struct device *);
> -extern int pm_sleep_lock(void);
> -extern void pm_sleep_unlock(void);
>  
>  #else /* CONFIG_PM_SLEEP */
>  
>  
> -static inline void device_pm_add(struct device *dev)
> -{
> -}
> -
> -static inline void device_pm_remove(struct device *dev)
> -{
> -}
> -
> -static inline int pm_sleep_lock(void)
> +static inline int device_pm_add(struct device *dev)
>  {
>  	return 0;
>  }
>  
> -static inline void pm_sleep_unlock(void)
> +static inline void device_pm_remove(struct device *dev)
>  {
>  }
>  
> Index: usb-2.6/drivers/base/power/main.c
> ===================================================================
> --- usb-2.6.orig/drivers/base/power/main.c
> +++ usb-2.6/drivers/base/power/main.c
> @@ -54,7 +54,9 @@ static LIST_HEAD(dpm_destroy);
>  
>  static DEFINE_MUTEX(dpm_list_mtx);
>  
> -static DECLARE_RWSEM(pm_sleep_rwsem);
> +/* Protected by dpm_list_mtx */
> +static bool child_added_while_parent_suspends; 

I don't like this name, but I have no better idea at the moment.

> +static bool all_devices_asleep;
>  
>  int (*platform_enable_wakeup)(struct device *dev, int is_on);
>  
> @@ -62,14 +64,37 @@ int (*platform_enable_wakeup)(struct dev
>   *	device_pm_add - add a device to the list of active devices
>   *	@dev:	Device to be added to the list
>   */
> -void device_pm_add(struct device *dev)
> +int device_pm_add(struct device *dev)
>  {
> +	int rc = 0;
> +
>  	pr_debug("PM: Adding info for %s:%s\n",
>  		 dev->bus ? dev->bus->name : "No Bus",
>  		 kobject_name(&dev->kobj));
>  	mutex_lock(&dpm_list_mtx);
> -	list_add_tail(&dev->power.entry, &dpm_active);
> +	if (dev->parent) {

Hmm.

Suppose we add a mutex to dev_pm_info, say pm_mtx, and require it to be:
(1) taken by suspend_device(dev) (at the beginning)
(2) released by resume_device(dev) (at the end)
(3) taken (and released) by device_pm_add() if dev is the parent of the device
    being added.

In that case, device_pm_add() will block on attepmpts to register devices whose
parents are suspended (or suspending) and we're done.  At least so it would
seem.

> +		switch (dev->parent->power.sleeping) {
> +		case PM_SLEEPING:
> +			child_added_while_parent_suspends = true;
> +			break;
> +		case PM_ASLEEP:
> +			dev_err(dev, "added while parent '%s' is asleep\n",
> +					dev->parent->bus_id);
> +			rc = -EHOSTDOWN;
> +			break;
> +		default:
> +			break;
> +		}
> +	} else if (all_devices_asleep) {
> +		dev_err(dev, "added while all devices are asleep\n");
> +		rc = -ENETDOWN;
> +	}

The error codes are a bit unusual, but whatever.

> +	if (rc == 0)
> +		list_add_tail(&dev->power.entry, &dpm_active);
> +	else
> +		dump_stack();
>  	mutex_unlock(&dpm_list_mtx);
> +	return rc;
>  }
>  
>  /**
> @@ -86,6 +111,7 @@ void device_pm_remove(struct device *dev
>  	mutex_lock(&dpm_list_mtx);
>  	dpm_sysfs_remove(dev);
>  	list_del_init(&dev->power.entry);
> +	dev->power.sleeping = PM_GONE;
>  	mutex_unlock(&dpm_list_mtx);
>  }
>  
> @@ -107,31 +133,6 @@ void device_pm_schedule_removal(struct d
>  }
>  EXPORT_SYMBOL_GPL(device_pm_schedule_removal);
>  
> -/**
> - *	pm_sleep_lock - mutual exclusion for registration and suspend
> - *
> - *	Returns 0 if no suspend is underway and device registration
> - *	may proceed, otherwise -EBUSY.
> - */
> -int pm_sleep_lock(void)
> -{
> -	if (down_read_trylock(&pm_sleep_rwsem))
> -		return 0;
> -
> -	return -EBUSY;
> -}
> -
> -/**
> - *	pm_sleep_unlock - mutual exclusion for registration and suspend
> - *
> - *	This routine undoes the effect of device_pm_add_lock
> - *	when a device's registration is complete.
> - */
> -void pm_sleep_unlock(void)
> -{
> -	up_read(&pm_sleep_rwsem);
> -}
> -
>  
>  /*------------------------- Resume routines -------------------------*/
>  
> @@ -242,14 +243,18 @@ static int resume_device(struct device *
>  static void dpm_resume(void)
>  {
>  	mutex_lock(&dpm_list_mtx);
> +	all_devices_asleep = false;
>  	while(!list_empty(&dpm_off)) {
>  		struct list_head *entry = dpm_off.next;
>  		struct device *dev = to_device(entry);
>  
>  		list_move_tail(entry, &dpm_active);
> +		dev->power.sleeping = PM_WAKING;
>  		mutex_unlock(&dpm_list_mtx);
>  		resume_device(dev);
>  		mutex_lock(&dpm_list_mtx);
> +		if (dev->power.sleeping != PM_GONE)
> +			dev->power.sleeping = PM_AWAKE;
>  	}
>  	mutex_unlock(&dpm_list_mtx);
>  }
> @@ -285,7 +290,6 @@ void device_resume(void)
>  	might_sleep();
>  	dpm_resume();
>  	unregister_dropped_devices();
> -	up_write(&pm_sleep_rwsem);
>  }
>  EXPORT_SYMBOL_GPL(device_resume);
>  
> @@ -421,9 +425,18 @@ static int dpm_suspend(pm_message_t stat
>  		struct list_head *entry = dpm_active.prev;
>  		struct device *dev = to_device(entry);
>  
> +		dev->power.sleeping = PM_SLEEPING;
> +		child_added_while_parent_suspends = false;
>  		mutex_unlock(&dpm_list_mtx);
>  		error = suspend_device(dev, state);
> +		mutex_lock(&dpm_list_mtx);
>  		if (error) {
> +			if (dev->power.sleeping != PM_GONE)
> +				dev->power.sleeping = PM_AWAKE;
> +			if (error == -EBUSY && entry != dpm_active.prev)
> +				continue;	/* A child was added before
> +						 * the device could suspend
> +						 */
>  			printk(KERN_ERR "Could not suspend device %s: "
>  					"error %d%s\n",
>  					kobject_name(&dev->kobj),
> @@ -433,10 +446,24 @@ static int dpm_suspend(pm_message_t stat
>  					""));
>  			break;
>  		}
> -		mutex_lock(&dpm_list_mtx);
> -		if (!list_empty(&dev->power.entry))
> -			list_move(&dev->power.entry, &dpm_off);
> +		if (dev->power.sleeping != PM_GONE) {
> +			if (child_added_while_parent_suspends) {
> +				dev_err(dev, "suspended while a child "
> +						"was added\n");
> +				dev->power.sleeping = PM_WAKING;
> +				mutex_unlock(&dpm_list_mtx);

This seems to be a weak spot.  The resuming of the device at this point need
not work correctly, given that the system's target state is still a sleep
state.

> +				resume_device(dev);
> +				mutex_lock(&dpm_list_mtx);
> +				if (dev->power.sleeping != PM_GONE)
> +					dev->power.sleeping = PM_AWAKE;
> +			} else {
> +				dev->power.sleeping = PM_ASLEEP;
> +				list_move(&dev->power.entry, &dpm_off);
> +			}
> +		}
>  	}
> +	if (!error)
> +		all_devices_asleep = true;
>  	mutex_unlock(&dpm_list_mtx);
>  
>  	return error;
> @@ -454,7 +481,6 @@ int device_suspend(pm_message_t state)
>  	int error;
>  
>  	might_sleep();
> -	down_write(&pm_sleep_rwsem);
>  	error = dpm_suspend(state);
>  	if (error)
>  		device_resume();
> Index: usb-2.6/drivers/base/core.c
> ===================================================================
> --- usb-2.6.orig/drivers/base/core.c
> +++ usb-2.6/drivers/base/core.c
> @@ -815,10 +815,12 @@ int device_add(struct device *dev)
>  	error = dpm_sysfs_add(dev);
>  	if (error)
>  		goto PMError;
> -	device_pm_add(dev);
>  	error = bus_add_device(dev);
>  	if (error)
>  		goto BusError;
> +	error = device_pm_add(dev);
> +	if (error)
> +		goto PMAddError;
>  	kobject_uevent(&dev->kobj, KOBJ_ADD);
>  	bus_attach_device(dev);
>  	if (parent)
> @@ -838,8 +840,9 @@ int device_add(struct device *dev)
>   Done:
>  	put_device(dev);
>  	return error;
> + PMAddError:
> +	bus_remove_device(dev);
>   BusError:
> -	device_pm_remove(dev);
>  	dpm_sysfs_remove(dev);
>   PMError:
>  	if (dev->bus)

Well, I wish it could be simpler ...

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-29 14:26                   ` Rafael J. Wysocki
@ 2008-02-29 15:53                     ` Alan Stern
  2008-02-29 17:02                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 38+ messages in thread
From: Alan Stern @ 2008-02-29 15:53 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Linux-pm mailing list, Kernel development list

On Fri, 29 Feb 2008, Rafael J. Wysocki wrote:

> > +There is an unavoidable race between the PM core suspending all the
> > +children of a device and the device's driver registering new children.
> > +As a result, it is possible that the core may try to suspend a device
> > +without first having suspended all of the device's children.  Drivers
> > +must check for this; a suspend method should return -EBUSY if there are
> > +unsuspended children.  (The child->power.sleeping field can be used
> > +for this check.)  In addition, it is illegal to register a child device
> 
> s/illegal/invalid/

How about instead: "attempts to register a child device below a 
suspended parent will fail.  Hence..."?

> > +below a suspended parent; hence suspend methods must synchronize with
> > +other kernel threads that may attempt to add new children.  The suspend
> > +method must prevent new registrations and wait for concurrent registrations
> > +to complete before it returns.  New children may be added once more when
> > +the resume method runs.

...

> > --- usb-2.6.orig/include/linux/pm.h
> > +++ usb-2.6/include/linux/pm.h
> > @@ -180,8 +180,20 @@ typedef struct pm_message {
> >  #define PMSG_HIBERNATE	((struct pm_message){ .event = PM_EVENT_HIBERNATE, })
> >  #define PMSG_ON		((struct pm_message){ .event = PM_EVENT_ON, })
> >  
> > +/* This records which method calls have been made, not the device's
> > + * actual power state.  It is read-only to drivers.
> > + */
> > +enum pm_sleep_state {
> 
> I'd call it dev_pm_state, in analogy with dev_pm_info etc.

Okay.

> > +	PM_AWAKE,		/* = 0, normal situation */
> > +	PM_SLEEPING,		/* suspend method is running */
> > +	PM_ASLEEP,		/* suspend method has returned */
> > +	PM_WAKING,		/* resume method is running */
> > +	PM_GONE,		/* device has been unregistered */
> > +};
> > +
> >  struct dev_pm_info {
> >  	pm_message_t		power_state;
> > +	enum pm_sleep_state	sleeping;
> 
> In fact 'sleeping' doesn't look good in this context.  'pm_state' seems
> better to me (although it is confusingly similar to 'power_state', we're going
> to get rid of the latter anyway).

Okay.

> > Index: usb-2.6/drivers/base/power/main.c
> > ===================================================================
> > --- usb-2.6.orig/drivers/base/power/main.c
> > +++ usb-2.6/drivers/base/power/main.c
> > @@ -54,7 +54,9 @@ static LIST_HEAD(dpm_destroy);
> >  
> >  static DEFINE_MUTEX(dpm_list_mtx);
> >  
> > -static DECLARE_RWSEM(pm_sleep_rwsem);
> > +/* Protected by dpm_list_mtx */
> > +static bool child_added_while_parent_suspends; 
> 
> I don't like this name, but I have no better idea at the moment.

It gets used in only a couple of places, so nobody should object too 
violently.  :-)

> > -void device_pm_add(struct device *dev)
> > +int device_pm_add(struct device *dev)
> >  {
> > +	int rc = 0;
> > +
> >  	pr_debug("PM: Adding info for %s:%s\n",
> >  		 dev->bus ? dev->bus->name : "No Bus",
> >  		 kobject_name(&dev->kobj));
> >  	mutex_lock(&dpm_list_mtx);
> > -	list_add_tail(&dev->power.entry, &dpm_active);
> > +	if (dev->parent) {
> 
> Hmm.
> 
> Suppose we add a mutex to dev_pm_info, say pm_mtx, and require it to be:
> (1) taken by suspend_device(dev) (at the beginning)
> (2) released by resume_device(dev) (at the end)
> (3) taken (and released) by device_pm_add() if dev is the parent of the device
>     being added.
> 
> In that case, device_pm_add() will block on attepmpts to register devices whose
> parents are suspended (or suspending) and we're done.  At least so it would
> seem.

No; this would repeat the same mistake we were struggling with last
week.  Blocking registration attempts (especially if we start _before_
calling the device's suspend method or end _after_ calling the resume
method) will lead to deadlocks while suspending or resuming.

If the blocking starts after the suspend method returns then it will be 
safer.  But what's the point?  If a registration attempt is made at 
that stage, it means there's a bug in the driver.  So failing the 
registration seems like a reasonable thing to do.

One issue: This rule doesn't allow suspend_late or resume_early methods
to register any new devices.  Will that cause problems with the CPU
hotplug or ACPI subsystems?  ACPI in particular may need to freeze the
kacpi_notify workqueue -- in fact, that might solve the problem in
Bugzilla #9874.

> > +		switch (dev->parent->power.sleeping) {
> > +		case PM_SLEEPING:
> > +			child_added_while_parent_suspends = true;
> > +			break;
> > +		case PM_ASLEEP:
> > +			dev_err(dev, "added while parent '%s' is asleep\n",
> > +					dev->parent->bus_id);
> > +			rc = -EHOSTDOWN;
> > +			break;
> > +		default:
> > +			break;
> > +		}
> > +	} else if (all_devices_asleep) {
> > +		dev_err(dev, "added while all devices are asleep\n");
> > +		rc = -ENETDOWN;
> > +	}
> 
> The error codes are a bit unusual, but whatever.

I agree.  But there aren't any -EPOWER* or -EPM* error codes defined!  
Some should be added, but this patch isn't the place to do it.

> > @@ -433,10 +446,24 @@ static int dpm_suspend(pm_message_t stat
> >  					""));
> >  			break;
> >  		}
> > -		mutex_lock(&dpm_list_mtx);
> > -		if (!list_empty(&dev->power.entry))
> > -			list_move(&dev->power.entry, &dpm_off);
> > +		if (dev->power.sleeping != PM_GONE) {
> > +			if (child_added_while_parent_suspends) {
> > +				dev_err(dev, "suspended while a child "
> > +						"was added\n");
> > +				dev->power.sleeping = PM_WAKING;
> > +				mutex_unlock(&dpm_list_mtx);
> 
> This seems to be a weak spot.  The resuming of the device at this point need
> not work correctly, given that the system's target state is still a sleep
> state.

That may be true.  But this is an error-recovery path intended to work
around a driver bug.  It doesn't have to guarantee perfect operation, 
just do its best.

Remember too that the target state is set before any devices are
suspended.  Hence, after the state is set there may still be runtime
resumes taking place.  Those _must_ not fail, which means that this
resume ought to work also.

> > +				resume_device(dev);
> > +				mutex_lock(&dpm_list_mtx);
> > +				if (dev->power.sleeping != PM_GONE)
> > +					dev->power.sleeping = PM_AWAKE;
> > +			} else {

> Well, I wish it could be simpler ...

Me too.  But until the API is changed, this seems to be the best we can 
do.  It's not quite as bad as it looks, since a fair amount of the new 
code is just for reporting on and recovering from bugs in drivers.

Alan Stern


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-29 15:53                     ` Alan Stern
@ 2008-02-29 17:02                       ` Rafael J. Wysocki
  2008-02-29 18:42                         ` Alan Stern
  0 siblings, 1 reply; 38+ messages in thread
From: Rafael J. Wysocki @ 2008-02-29 17:02 UTC (permalink / raw)
  To: Alan Stern; +Cc: Linux-pm mailing list, Kernel development list

On Friday, 29 of February 2008, Alan Stern wrote:
> On Fri, 29 Feb 2008, Rafael J. Wysocki wrote:
> 
> > > +There is an unavoidable race between the PM core suspending all the
> > > +children of a device and the device's driver registering new children.
> > > +As a result, it is possible that the core may try to suspend a device
> > > +without first having suspended all of the device's children.  Drivers
> > > +must check for this; a suspend method should return -EBUSY if there are
> > > +unsuspended children.  (The child->power.sleeping field can be used
> > > +for this check.)  In addition, it is illegal to register a child device
> > 
> > s/illegal/invalid/
> 
> How about instead: "attempts to register a child device below a 
> suspended parent will fail.  Hence..."?

OK

> > > +below a suspended parent; hence suspend methods must synchronize with
> > > +other kernel threads that may attempt to add new children.  The suspend
> > > +method must prevent new registrations and wait for concurrent registrations
> > > +to complete before it returns.  New children may be added once more when
> > > +the resume method runs.
[--snip--]
> > > -void device_pm_add(struct device *dev)
> > > +int device_pm_add(struct device *dev)
> > >  {
> > > +	int rc = 0;
> > > +
> > >  	pr_debug("PM: Adding info for %s:%s\n",
> > >  		 dev->bus ? dev->bus->name : "No Bus",
> > >  		 kobject_name(&dev->kobj));
> > >  	mutex_lock(&dpm_list_mtx);
> > > -	list_add_tail(&dev->power.entry, &dpm_active);
> > > +	if (dev->parent) {
> > 
> > Hmm.
> > 
> > Suppose we add a mutex to dev_pm_info, say pm_mtx, and require it to be:
> > (1) taken by suspend_device(dev) (at the beginning)
> > (2) released by resume_device(dev) (at the end)
> > (3) taken (and released) by device_pm_add() if dev is the parent of the device
> >     being added.
> > 
> > In that case, device_pm_add() will block on attepmpts to register devices whose
> > parents are suspended (or suspending) and we're done.  At least so it would
> > seem.
> 
> No; this would repeat the same mistake we were struggling with last
> week.

Not exactly, because in that case we blocked all attempts to register devices,
while I think we can only block those regarding the children of a suspending
(or suspended) device.

> Blocking registration attempts (especially if we start _before_ 
> calling the device's suspend method or end _after_ calling the resume
> method) will lead to deadlocks while suspending or resuming.

I'm not really convinced that it'll happen.  If we make the rule that
registering children from the device's own ->suspend() method is forbidden
(I don't really see why it should be allowed), something like this might work
(it seems to me):

Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -186,6 +186,7 @@ struct dev_pm_info {
 #ifdef	CONFIG_PM_SLEEP
 	unsigned		should_wakeup:1;
 	struct list_head	entry;
+	struct mutex		lock;
 #endif
 };
 
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -67,9 +67,14 @@ void device_pm_add(struct device *dev)
 	pr_debug("PM: Adding info for %s:%s\n",
 		 dev->bus ? dev->bus->name : "No Bus",
 		 kobject_name(&dev->kobj));
+	if (dev->parent)
+		mutex_lock(&dev->parent.power.lock);
 	mutex_lock(&dpm_list_mtx);
 	list_add_tail(&dev->power.entry, &dpm_active);
+	mutex_init(&dev->power.lock);
 	mutex_unlock(&dpm_list_mtx);
+	if (dev->parent)
+		mutex_unlock(&dev->parent.power.lock);
 }
 
 /**
@@ -249,6 +254,7 @@ static void dpm_resume(void)
 		list_move_tail(entry, &dpm_active);
 		mutex_unlock(&dpm_list_mtx);
 		resume_device(dev);
+		mutex_unlock(&dev->power.lock);
 		mutex_lock(&dpm_list_mtx);
 	}
 	mutex_unlock(&dpm_list_mtx);
@@ -427,6 +433,13 @@ static int dpm_suspend(pm_message_t stat
 		struct device *dev = to_device(entry);
 
 		mutex_unlock(&dpm_list_mtx);
+		mutex_lock(&dev->power.lock);
+		mutex_lock(&dpm_list_mtx);
+		if (dev != to_device(dpm_active.prev)) {
+			mutex_unlock(&dev->power.lock);
+			continue;
+		}
+		mutex_unlock(&dpm_list_mtx);
 		error = suspend_device(dev, state);
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
@@ -437,6 +450,7 @@ static int dpm_suspend(pm_message_t stat
 					(error == -EAGAIN ?
 					" (please convert to suspend_late)" :
 					""));
+			mutex_unlock(&dev->power.lock);
 			break;
 		}
 		if (!list_empty(&dev->power.entry))

> If the blocking starts after the suspend method returns then it will be 
> safer.  But what's the point?  If a registration attempt is made at 
> that stage, it means there's a bug in the driver.  So failing the 
> registration seems like a reasonable thing to do.

That doesn't buy us anything if drivers don't check whether the registration
succeeded.  And they don't.

> One issue: This rule doesn't allow suspend_late or resume_early methods
> to register any new devices.

Not exactly.  Just the children of suspended devices, which makes a difference
IMO. :-)

> Will that cause problems with the CPU hotplug or ACPI subsystems?  ACPI in
> particular may need to freeze the kacpi_notify workqueue -- in fact, that
> might solve the problem in Bugzilla #9874.

Well, my impression is that we do this thing to prepare for removing the
freezer in the future, so I'd rather solve issues in some other ways than just
by freezing threads that get in the way. ;-)

> > > +		switch (dev->parent->power.sleeping) {
> > > +		case PM_SLEEPING:
> > > +			child_added_while_parent_suspends = true;
> > > +			break;
> > > +		case PM_ASLEEP:
> > > +			dev_err(dev, "added while parent '%s' is asleep\n",
> > > +					dev->parent->bus_id);
> > > +			rc = -EHOSTDOWN;
> > > +			break;
> > > +		default:
> > > +			break;
> > > +		}
> > > +	} else if (all_devices_asleep) {
> > > +		dev_err(dev, "added while all devices are asleep\n");
> > > +		rc = -ENETDOWN;
> > > +	}
> > 
> > The error codes are a bit unusual, but whatever.
> 
> I agree.  But there aren't any -EPOWER* or -EPM* error codes defined!  
> Some should be added, but this patch isn't the place to do it.
> 
> > > @@ -433,10 +446,24 @@ static int dpm_suspend(pm_message_t stat
> > >  					""));
> > >  			break;
> > >  		}
> > > -		mutex_lock(&dpm_list_mtx);
> > > -		if (!list_empty(&dev->power.entry))
> > > -			list_move(&dev->power.entry, &dpm_off);
> > > +		if (dev->power.sleeping != PM_GONE) {
> > > +			if (child_added_while_parent_suspends) {
> > > +				dev_err(dev, "suspended while a child "
> > > +						"was added\n");
> > > +				dev->power.sleeping = PM_WAKING;
> > > +				mutex_unlock(&dpm_list_mtx);
> > 
> > This seems to be a weak spot.  The resuming of the device at this point need
> > not work correctly, given that the system's target state is still a sleep
> > state.
> 
> That may be true.  But this is an error-recovery path intended to work
> around a driver bug.  It doesn't have to guarantee perfect operation, 
> just do its best.
> 
> Remember too that the target state is set before any devices are
> suspended.  Hence, after the state is set there may still be runtime
> resumes taking place.  Those _must_ not fail, which means that this
> resume ought to work also.

They may be handled in a different way, not by ->resume().
 
> > > +				resume_device(dev);
> > > +				mutex_lock(&dpm_list_mtx);
> > > +				if (dev->power.sleeping != PM_GONE)
> > > +					dev->power.sleeping = PM_AWAKE;
> > > +			} else {
> 
> > Well, I wish it could be simpler ...
> 
> Me too.  But until the API is changed, this seems to be the best we can 
> do.  It's not quite as bad as it looks, since a fair amount of the new 
> code is just for reporting on and recovering from bugs in drivers.

Still, it doesn't look very nice ...

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-29 17:02                       ` Rafael J. Wysocki
@ 2008-02-29 18:42                         ` Alan Stern
  2008-02-29 21:57                           ` Rafael J. Wysocki
  0 siblings, 1 reply; 38+ messages in thread
From: Alan Stern @ 2008-02-29 18:42 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Linux-pm mailing list, Kernel development list

On Fri, 29 Feb 2008, Rafael J. Wysocki wrote:

> > > Suppose we add a mutex to dev_pm_info, say pm_mtx, and require it to be:
> > > (1) taken by suspend_device(dev) (at the beginning)
> > > (2) released by resume_device(dev) (at the end)
> > > (3) taken (and released) by device_pm_add() if dev is the parent of the device
> > >     being added.
> > > 
> > > In that case, device_pm_add() will block on attepmpts to register devices whose
> > > parents are suspended (or suspending) and we're done.  At least so it would
> > > seem.
> > 
> > No; this would repeat the same mistake we were struggling with last
> > week.
> 
> Not exactly, because in that case we blocked all attempts to register devices,
> while I think we can only block those regarding the children of a suspending
> (or suspended) device.

It's the "suspending" case that causes trouble.  Go back and look at 
the race I diagrammed in

https://lists.linux-foundation.org/pipermail/linux-pm/2008-February/016763.html

> > Blocking registration attempts (especially if we start _before_ 
> > calling the device's suspend method or end _after_ calling the resume
> > method) will lead to deadlocks while suspending or resuming.
> 
> I'm not really convinced that it'll happen.

It _did_ happen.  Precisely this race occurred in Bug #10030 (the SD
card insertion/removal), although there the window was bigger because
we blocked registrations starting from the start of the system sleep
instead of from when the parent's suspend method was called.

>  If we make the rule that
> registering children from the device's own ->suspend() method is forbidden
> (I don't really see why it should be allowed), something like this might work
> (it seems to me):

It's not that the suspend method itself will want to register children.  
The problem is that the method has to wait for other threads that may 
already have started to register a child.  If those other threads are 
blocked then suspend will deadlock.

> @@ -427,6 +433,13 @@ static int dpm_suspend(pm_message_t stat
>  		struct device *dev = to_device(entry);
>  
>  		mutex_unlock(&dpm_list_mtx);
> +		mutex_lock(&dev->power.lock);
> +		mutex_lock(&dpm_list_mtx);
> +		if (dev != to_device(dpm_active.prev)) {
> +			mutex_unlock(&dev->power.lock);
> +			continue;
> +		}
> +		mutex_unlock(&dpm_list_mtx);
>  		error = suspend_device(dev, state);
>  		mutex_lock(&dpm_list_mtx);
>  		if (error) {

This looks pretty awkward.  Won't it cause lockdep to complain about
recursive locking of dev->power.lock?

> > If the blocking starts after the suspend method returns then it will be 
> > safer.  But what's the point?  If a registration attempt is made at 
> > that stage, it means there's a bug in the driver.  So failing the 
> > registration seems like a reasonable thing to do.
> 
> That doesn't buy us anything if drivers don't check whether the registration
> succeeded.  And they don't.

It buys us one thing: The system will continue to limp along instead of 
locking up.

If drivers don't check whether registration succeeded...  What can I 
say?  It's another bug.

> > One issue: This rule doesn't allow suspend_late or resume_early methods
> > to register any new devices.
> 
> Not exactly.  Just the children of suspended devices, which makes a difference
> IMO. :-)

During suspend_late and resume_early _every_ device is suspended, 
including the fictitious "device-tree-root" device.  Hence _every_ 
registration is for a child of a suspended device.

Besides, you don't want to allow new devices to be registered during 
suspend_late, do you?  They wouldn't get suspended before the system 
went to sleep.

> > Will that cause problems with the CPU hotplug or ACPI subsystems?  ACPI in
> > particular may need to freeze the kacpi_notify workqueue -- in fact, that
> > might solve the problem in Bugzilla #9874.
> 
> Well, my impression is that we do this thing to prepare for removing the
> freezer in the future, so I'd rather solve issues in some other ways than just
> by freezing threads that get in the way. ;-)

Right now that may be the easiest solution.  In fact, it may still be 
the easiest solution even after we stop freezing user threads.

> > Remember too that the target state is set before any devices are
> > suspended.  Hence, after the state is set there may still be runtime
> > resumes taking place.  Those _must_ not fail, which means that this
> > resume ought to work also.
> 
> They may be handled in a different way, not by ->resume().

I'm not sure I understand.  Sure, autoresume may not involve calling 
the driver's resume method.  But it does involve actually setting the 
device back to full power, so what's the difference?

> > > > +				resume_device(dev);
> > > > +				mutex_lock(&dpm_list_mtx);
> > > > +				if (dev->power.sleeping != PM_GONE)
> > > > +					dev->power.sleeping = PM_AWAKE;
> > > > +			} else {
> > 
> > > Well, I wish it could be simpler ...
> > 
> > Me too.  But until the API is changed, this seems to be the best we can 
> > do.  It's not quite as bad as it looks, since a fair amount of the new 
> > code is just for reporting on and recovering from bugs in drivers.
> 
> Still, it doesn't look very nice ...

Are you still considering adding separate methods for suspend and 
hibernate (maybe also for freeze and prethaw)?  Perhaps the 
"prevent_new_children" and "allow_new_children" methods could be added 
then.  This would allow some of this complication to go away.

Alan Stern

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-29 18:42                         ` Alan Stern
@ 2008-02-29 21:57                           ` Rafael J. Wysocki
  2008-02-29 22:46                             ` Alan Stern
  0 siblings, 1 reply; 38+ messages in thread
From: Rafael J. Wysocki @ 2008-02-29 21:57 UTC (permalink / raw)
  To: Alan Stern; +Cc: Linux-pm mailing list, Kernel development list

On Friday, 29 of February 2008, Alan Stern wrote:
> On Fri, 29 Feb 2008, Rafael J. Wysocki wrote:
> 
> > > > Suppose we add a mutex to dev_pm_info, say pm_mtx, and require it to be:
> > > > (1) taken by suspend_device(dev) (at the beginning)
> > > > (2) released by resume_device(dev) (at the end)
> > > > (3) taken (and released) by device_pm_add() if dev is the parent of the device
> > > >     being added.
> > > > 
> > > > In that case, device_pm_add() will block on attepmpts to register devices whose
> > > > parents are suspended (or suspending) and we're done.  At least so it would
> > > > seem.
> > > 
> > > No; this would repeat the same mistake we were struggling with last
> > > week.
> > 
> > Not exactly, because in that case we blocked all attempts to register devices,
> > while I think we can only block those regarding the children of a suspending
> > (or suspended) device.
> 
> It's the "suspending" case that causes trouble.  Go back and look at 
> the race I diagrammed in
> 
> https://lists.linux-foundation.org/pipermail/linux-pm/2008-February/016763.html

Yes, I remember this picture.

> > > Blocking registration attempts (especially if we start _before_ 
> > > calling the device's suspend method or end _after_ calling the resume
> > > method) will lead to deadlocks while suspending or resuming.
> > 
> > I'm not really convinced that it'll happen.
> 
> It _did_ happen.  Precisely this race occurred in Bug #10030 (the SD
> card insertion/removal), although there the window was bigger because
> we blocked registrations starting from the start of the system sleep
> instead of from when the parent's suspend method was called.

I'm still not sure if this particular race would happen if only the registering
of children of already suspended partents were blocked.

> >  If we make the rule that registering children from the device's
> >  own ->suspend() method is forbidden (I don't really see why it should be
> >  allowed), something like this might work (it seems to me):
> 
> It's not that the suspend method itself will want to register children.  
> The problem is that the method has to wait for other threads that may 
> already have started to register a child.  If those other threads are 
> blocked then suspend will deadlock.
> 
> > @@ -427,6 +433,13 @@ static int dpm_suspend(pm_message_t stat
> >  		struct device *dev = to_device(entry);
> >  
> >  		mutex_unlock(&dpm_list_mtx);
> > +		mutex_lock(&dev->power.lock);
> > +		mutex_lock(&dpm_list_mtx);
> > +		if (dev != to_device(dpm_active.prev)) {
> > +			mutex_unlock(&dev->power.lock);
> > +			continue;
> > +		}
> > +		mutex_unlock(&dpm_list_mtx);
> >  		error = suspend_device(dev, state);
> >  		mutex_lock(&dpm_list_mtx);
> >  		if (error) {
> 
> This looks pretty awkward.  Won't it cause lockdep to complain about
> recursive locking of dev->power.lock?

Why would it?  It's not taken recursively at any place.

In fact we can't take dev->power.lock under dpm_list_mtx, because that would
deadlock (we have to be able to take dpm_list_mtx to complete the suspending
of devices).  For this reason, dpm_list_mtx is taken under dev->power.lock.
However, we don't know which lock to acquire (dev is unknown) until we get the
next device to suspend.  Thus, I first get the device (under dpm_list_mtx) and
release dpm_list_mtx.  Next, I take dev->power.lock and dpm_list_mtx under it
and check if the selected device is still the last on the list.  If it's not,
something has been registered in the meantime and it's better to get the new
device to suspend first (this seems to be cleaner than resuming an already
suspended device).

I could release dev->power.lock as soon as suspend_device(dev, state) has run,
but that could result in new devices being added to dpm_active in a wrong
order, so it's better not to release it until resume_device(dev).  However,
it's probably is safe to release it right prior to running resume_device(dev)
(after we've added dev back to dpm_active), because in that case the ordering
will be fine.

> > > If the blocking starts after the suspend method returns then it will be 
> > > safer.  But what's the point?  If a registration attempt is made at 
> > > that stage, it means there's a bug in the driver.  So failing the 
> > > registration seems like a reasonable thing to do.
> > 
> > That doesn't buy us anything if drivers don't check whether the registration
> > succeeded.  And they don't.
> 
> It buys us one thing: The system will continue to limp along instead of 
> locking up.

It may oops, though, if a driver attempts to use a device that it failed to
register, but didn't check.
 
> If drivers don't check whether registration succeeded...  What can I 
> say?  It's another bug.
> 
> > > One issue: This rule doesn't allow suspend_late or resume_early methods
> > > to register any new devices.
> > 
> > Not exactly.  Just the children of suspended devices, which makes a difference
> > IMO. :-)
> 
> During suspend_late and resume_early _every_ device is suspended, 
> including the fictitious "device-tree-root" device.  Hence _every_ 
> registration is for a child of a suspended device.
> 
> Besides, you don't want to allow new devices to be registered during 
> suspend_late, do you?  They wouldn't get suspended before the system 
> went to sleep.

No, I don't, but that's really along the lines of the last patch.

> > > Will that cause problems with the CPU hotplug or ACPI subsystems?  ACPI in
> > > particular may need to freeze the kacpi_notify workqueue -- in fact, that
> > > might solve the problem in Bugzilla #9874.
> > 
> > Well, my impression is that we do this thing to prepare for removing the
> > freezer in the future, so I'd rather solve issues in some other ways than just
> > by freezing threads that get in the way. ;-)
> 
> Right now that may be the easiest solution.  In fact, it may still be 
> the easiest solution even after we stop freezing user threads.

Well, people want to remove the freezing of tasks altogether from the suspend
code path.  Do you think it's not doable in the long run?

> > > Remember too that the target state is set before any devices are
> > > suspended.  Hence, after the state is set there may still be runtime
> > > resumes taking place.  Those _must_ not fail, which means that this
> > > resume ought to work also.
> > 
> > They may be handled in a different way, not by ->resume().
> 
> I'm not sure I understand.  Sure, autoresume may not involve calling 
> the driver's resume method.  But it does involve actually setting the 
> device back to full power, so what's the difference?

In fact, that's the matter of how we are going to handle the runtime PM vs
the system-wide suspend.

> > > > > +				resume_device(dev);
> > > > > +				mutex_lock(&dpm_list_mtx);
> > > > > +				if (dev->power.sleeping != PM_GONE)
> > > > > +					dev->power.sleeping = PM_AWAKE;
> > > > > +			} else {
> > > 
> > > > Well, I wish it could be simpler ...
> > > 
> > > Me too.  But until the API is changed, this seems to be the best we can 
> > > do.  It's not quite as bad as it looks, since a fair amount of the new 
> > > code is just for reporting on and recovering from bugs in drivers.
> > 
> > Still, it doesn't look very nice ...
> 
> Are you still considering adding separate methods for suspend and 
> hibernate (maybe also for freeze and prethaw)?

Yes, I am.  In fact I have a patch for that I'm going to send in a while. :-)

> Perhaps the "prevent_new_children" and "allow_new_children" methods could be
> added then.  This would allow some of this complication to go away.

I wonder how that would be different from using dev->power.lock for blocking
the registration of new children.  The only practical difference I see is that
the driver will have to block the registrations, this way or another, instead
of the core.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-29 21:57                           ` Rafael J. Wysocki
@ 2008-02-29 22:46                             ` Alan Stern
  2008-03-01  0:13                               ` Rafael J. Wysocki
  0 siblings, 1 reply; 38+ messages in thread
From: Alan Stern @ 2008-02-29 22:46 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Linux-pm mailing list, Kernel development list

On Fri, 29 Feb 2008, Rafael J. Wysocki wrote:

> I'm still not sure if this particular race would happen if only the registering
> of children of already suspended partents were blocked.

That's different.  Before you were talking about acquiring
dev->power.lock _before_ calling the suspend method.  Now you're
talking about blocking child registration _after_ the parent is already
suspended.

It might work if you did it that way.  In theory it _should_ work,
since nobody should ever try to register a child below a suspended
parent.

Given that this is merely a way of preventing something which should
never happen in the first place, is it really necessary to add the
extra lock?  Certainly it's simpler just to fail the registration.  If
it turns out later that we'd be better off blocking it instead, we
can add the lock.

> > > @@ -427,6 +433,13 @@ static int dpm_suspend(pm_message_t stat
> > >  		struct device *dev = to_device(entry);
> > >  
> > >  		mutex_unlock(&dpm_list_mtx);
> > > +		mutex_lock(&dev->power.lock);
> > > +		mutex_lock(&dpm_list_mtx);
> > > +		if (dev != to_device(dpm_active.prev)) {
> > > +			mutex_unlock(&dev->power.lock);
> > > +			continue;
> > > +		}
> > > +		mutex_unlock(&dpm_list_mtx);
> > >  		error = suspend_device(dev, state);
> > >  		mutex_lock(&dpm_list_mtx);
> > >  		if (error) {
> > 
> > This looks pretty awkward.  Won't it cause lockdep to complain about
> > recursive locking of dev->power.lock?
> 
> Why would it?  It's not taken recursively at any place.

It is as far as lockdep is concerned.  You acquire power.lock for the 
first device, then you acquire it for the second device.  Lockdep 
doesn't know the two devices are different; all it knows is that you 
have tried to acquire a lock while already holding an instance of that 
same lock.  It's the same problem that affects attempts to convert 
dev->sem to a mutex.

As for the ordering of the lock and moving the device to dpm_off -- 
it's less of a problem if you don't acquire the lock until after the 
suspend method returns.  You can lock it just before reacquiring 
dpm_list_mtx, while the device is still on dpm_active.

> > > That doesn't buy us anything if drivers don't check whether the registration
> > > succeeded.  And they don't.
> > 
> > It buys us one thing: The system will continue to limp along instead of 
> > locking up.
> 
> It may oops, though, if a driver attempts to use a device that it failed to
> register, but didn't check.

Which is better, an oops or a hang?  As far as the user is concerned, 
either one is useless.  For kernel developers, an oops is easier to 
debug.

In the end we should just try it and see what happens.  I don't think 
we can decide which will work out better without some real-world 
experience.

> > > > Will that cause problems with the CPU hotplug or ACPI subsystems?  ACPI in
> > > > particular may need to freeze the kacpi_notify workqueue -- in fact, that
> > > > might solve the problem in Bugzilla #9874.
> > > 
> > > Well, my impression is that we do this thing to prepare for removing the
> > > freezer in the future, so I'd rather solve issues in some other ways than just
> > > by freezing threads that get in the way. ;-)
> > 
> > Right now that may be the easiest solution.  In fact, it may still be 
> > the easiest solution even after we stop freezing user threads.
> 
> Well, people want to remove the freezing of tasks altogether from the suspend
> code path.  Do you think it's not doable in the long run?

That's not what I mean.  In the long run it will turn out that certain
kernel threads _want_ to be frozen.  That is, if allowed to run during
a system sleep transition they would mess things up, and their
subsystem is designed so that it can carry out a sleep transition
perfectly well without the thread running.  (An example of such a
thread is khubd.)

To accomodate these threads we can freeze them -- that's easy since the
freezer already exists.  Or we can remove the freezer and provide a new
way for these threads to block until the system wakes up.  IMO using
the existing code is better than writing new code.

All the objections to the freezer have been about using it on arbitrary
kernel threads and on all user tasks.  But if it gets used on only
those kernel threads which request it, and on no user tasks, there
shouldn't be any objections.

> In fact, that's the matter of how we are going to handle the runtime PM vs
> the system-wide suspend.

This is an interesting matter.  My view is that runtime PM should be
almost completely disabled when the PM core calls the device's suspend
method.  The only exception is that remote wakeup may be enabled.  If a
remote wakeup event occurs and the device resumes, then its parent's
suspend method will realize what has happened when it sees that the
device is no longer suspended.  So the parent's suspend method will
return -EBUSY and the sleep will be aborted.

Right now USB does not disable runtime PM during a system sleep.  It
hasn't been necessary, thanks to the freezer.  But when we stop
freezing user tasks it will become necessary.  When that time arrives I
intend to put user threads doing runtime resume into the "icebox"  
(remember that?).  Khubd and other kernel threads could go into the
icebox also, instead of the freezer; in this way the freezer could be
removed completely.

> > Perhaps the "prevent_new_children" and "allow_new_children" methods could be
> > added then.  This would allow some of this complication to go away.
> 
> I wonder how that would be different from using dev->power.lock for blocking
> the registration of new children.  The only practical difference I see is that
> the driver will have to block the registrations, this way or another, instead
> of the core.

That is indeed the difference, and it's an important difference.  The
driver knows what other threads may be carrying out registrations, and
it knows which ones should be waited for and which can safely be
blocked or disabled.  The PM core doesn't know any of these things; all
it can do is blindly block everything.  That is dangerous and can lead
to deadlocks.

Alan Stern

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-02-29 22:46                             ` Alan Stern
@ 2008-03-01  0:13                               ` Rafael J. Wysocki
  2008-03-01 15:30                                 ` Alan Stern
  0 siblings, 1 reply; 38+ messages in thread
From: Rafael J. Wysocki @ 2008-03-01  0:13 UTC (permalink / raw)
  To: Alan Stern; +Cc: Linux-pm mailing list, Kernel development list

On Friday, 29 of February 2008, Alan Stern wrote:
> On Fri, 29 Feb 2008, Rafael J. Wysocki wrote:
> 
> > I'm still not sure if this particular race would happen if only the registering
> > of children of already suspended partents were blocked.
> 
> That's different.  Before you were talking about acquiring
> dev->power.lock _before_ calling the suspend method.  Now you're
> talking about blocking child registration _after_ the parent is already
> suspended.

Hm, perhaps I don't understand the issue correctly, so please let me restate it.

We have the design issue that it's possible to register a child of a device
that was taken for suspend or even suspended which, in turn, leads to a wrong
ordering of devices on dpm_active.  Namely, suppose that device dev is taken
from the end of dpm_active and suspend_device(dev, state) is going to be called
If at the same time a child of dev is being registered and
suspend_device(dev, state) returns 0, dev will be taken out of dpm_active, but
the new device (its child) will be added to the end of dpm_active and
subsequently we'll attempt to suspend it.  That will be wrong.

Also, if a dev's child is registered after suspending all devices, it will go
to the end of dpm_active, so after the subsequent resume dev will end up
closer to the end of the list than the new child.  Thus, during the next
suspend it will be taken for suspending before this child and that will be
wrong either.  Note that this situation need not look dangerously from the
driver's perspective, since it doesn't know of the dpm_active ordering issue.

One of the possible solutions is to require suspend_device(dev, state) to
return an error if it detects a concurrent child registration.   This, however
is not sufficient, because the registration of a child may go unseen, right
after suspend_device(dev, state) returns and before dpm_list_mtx is reacquired.
For this reason, we need an additional mechanism to detect such situations
and work around them.  [Hence, the question arises whether it's really
necessary to require suspend_device(dev, state) to detect concurrent
registrations of children (we need to detect them independently anyway) etc.]
There also would have to be a mechanism for detecting registrations when
all devices have been suspended (we discussed that approach previously).

The other possible solution, and that's the one I'm considering, is to use
locking to prevent the registration of children of suspended or suspending
devices from happening.  [This is to protect _our_ data structure, which is
the dpm_active list, from corruption.]

Now, to this end, we'd need an additional lock for each device, because using
dev->sem for this purpose will be prone to deadlocks.  My idea is to take this
lock (call it dev->power.lock) as soon as dev is selected for suspending
and require that it be acquired for registering any new children of dev.  In
that case, the only open window is when a new child of dev is registered right
after we've found dev at the end of dpm_active and right prior to taking
dev->power.lock.  However, we may avoid this race by (1) selecting dev
for suspending under dpm_list_mtx and (2) checking if it's still at the end of
dpm_active under dev->power.lock and dpm_list_mtx (the selection of a device
for suspending is repeated if this check fails).

The patch I sent previously implemented this idea, but it released
dev->power.lock after calling resume_device(dev), which is not really necessary
(dev->power.lock may be released as soon as dev is put back on dpm_active).
Below is another version of this patch that releases dev->power.lock earlier
and uses semaphores instead of mutexes.  [Note that it's trivial to rework it
so that the registrations of children considered as invalid will fail instead
of being blocked.]

> It might work if you did it that way.  In theory it _should_ work,
> since nobody should ever try to register a child below a suspended
> parent.

The appended patch may be reworked to release dev->power.lock for all devices
once they all have been suspended, but since nobody should ever try to register
a child below a suspended parent, that shouldn't be necessary.

> Given that this is merely a way of preventing something which should
> never happen in the first place, is it really necessary to add the
> extra lock?  Certainly it's simpler just to fail the registration.

That actually requires quite a bit of entangled code, new flags etc. :-)

> If it turns out later that we'd be better off blocking it instead, we can add
> the lock. 

We can use the lock for failing registrations just as well.

> > > > @@ -427,6 +433,13 @@ static int dpm_suspend(pm_message_t stat
> > > >  		struct device *dev = to_device(entry);
> > > >  
> > > >  		mutex_unlock(&dpm_list_mtx);
> > > > +		mutex_lock(&dev->power.lock);
> > > > +		mutex_lock(&dpm_list_mtx);
> > > > +		if (dev != to_device(dpm_active.prev)) {
> > > > +			mutex_unlock(&dev->power.lock);
> > > > +			continue;
> > > > +		}
> > > > +		mutex_unlock(&dpm_list_mtx);
> > > >  		error = suspend_device(dev, state);
> > > >  		mutex_lock(&dpm_list_mtx);
> > > >  		if (error) {
> > > 
> > > This looks pretty awkward.  Won't it cause lockdep to complain about
> > > recursive locking of dev->power.lock?
> > 
> > Why would it?  It's not taken recursively at any place.
> 
> It is as far as lockdep is concerned.  You acquire power.lock for the 
> first device, then you acquire it for the second device.  Lockdep 
> doesn't know the two devices are different; all it knows is that you 
> have tried to acquire a lock while already holding an instance of that 
> same lock.  It's the same problem that affects attempts to convert 
> dev->sem to a mutex.

Well, let the new lock be a semaphore, then. ;-)

> As for the ordering of the lock and moving the device to dpm_off -- 
> it's less of a problem if you don't acquire the lock until after the 
> suspend method returns.  You can lock it just before reacquiring 
> dpm_list_mtx, while the device is still on dpm_active.

The problem is that the new children should really be suspended _before_ the
parent.  One solution is to resume the parent in case we've detected new
children registered, the other one is not to suspend the parent at all in case
any new children appear, and that's what I'm attempting to achieve.

> > > > That doesn't buy us anything if drivers don't check whether the registration
> > > > succeeded.  And they don't.
> > > 
> > > It buys us one thing: The system will continue to limp along instead of 
> > > locking up.
> > 
> > It may oops, though, if a driver attempts to use a device that it failed to
> > register, but didn't check.
> 
> Which is better, an oops or a hang?  As far as the user is concerned, 
> either one is useless.  For kernel developers, an oops is easier to 
> debug.
> 
> In the end we should just try it and see what happens.  I don't think 
> we can decide which will work out better without some real-world 
> experience.

Agreed.

> > > > > Will that cause problems with the CPU hotplug or ACPI subsystems?  ACPI in
> > > > > particular may need to freeze the kacpi_notify workqueue -- in fact, that
> > > > > might solve the problem in Bugzilla #9874.
> > > > 
> > > > Well, my impression is that we do this thing to prepare for removing the
> > > > freezer in the future, so I'd rather solve issues in some other ways than just
> > > > by freezing threads that get in the way. ;-)
> > > 
> > > Right now that may be the easiest solution.  In fact, it may still be 
> > > the easiest solution even after we stop freezing user threads.
> > 
> > Well, people want to remove the freezing of tasks altogether from the suspend
> > code path.  Do you think it's not doable in the long run?
> 
> That's not what I mean.  In the long run it will turn out that certain
> kernel threads _want_ to be frozen.  That is, if allowed to run during
> a system sleep transition they would mess things up, and their
> subsystem is designed so that it can carry out a sleep transition
> perfectly well without the thread running.  (An example of such a
> thread is khubd.)
> 
> To accomodate these threads we can freeze them -- that's easy since the
> freezer already exists.  Or we can remove the freezer and provide a new
> way for these threads to block until the system wakes up.  IMO using
> the existing code is better than writing new code.

Well, it surely is reasonable. ;-)

> All the objections to the freezer have been about using it on arbitrary
> kernel threads and on all user tasks.  But if it gets used on only
> those kernel threads which request it, and on no user tasks, there
> shouldn't be any objections.

You may be right.

> > In fact, that's the matter of how we are going to handle the runtime PM vs
> > the system-wide suspend.
> 
> This is an interesting matter.  My view is that runtime PM should be
> almost completely disabled when the PM core calls the device's suspend
> method.  The only exception is that remote wakeup may be enabled.  If a
> remote wakeup event occurs and the device resumes, then its parent's
> suspend method will realize what has happened when it sees that the
> device is no longer suspended.  So the parent's suspend method will
> return -EBUSY and the sleep will be aborted.

I think that we may have to disable runtime PM as soon even earlier, just prior
to starting a transition to a sleep state.  There are systems which may crash
if we don't do that.

> Right now USB does not disable runtime PM during a system sleep.  It
> hasn't been necessary, thanks to the freezer.  But when we stop
> freezing user tasks it will become necessary.  When that time arrives I
> intend to put user threads doing runtime resume into the "icebox"  
> (remember that?).  Khubd and other kernel threads could go into the
> icebox also, instead of the freezer; in this way the freezer could be
> removed completely.

Yes.

In fact I'd like to work out some generic guidance for all device drivers
describing how to do such things.

> > > Perhaps the "prevent_new_children" and "allow_new_children" methods could be
> > > added then.  This would allow some of this complication to go away.
> > 
> > I wonder how that would be different from using dev->power.lock for blocking
> > the registration of new children.  The only practical difference I see is that
> > the driver will have to block the registrations, this way or another, instead
> > of the core.
> 
> That is indeed the difference, and it's an important difference.  The
> driver knows what other threads may be carrying out registrations, and
> it knows which ones should be waited for and which can safely be
> blocked or disabled.  The PM core doesn't know any of these things; all
> it can do is blindly block everything.  That is dangerous and can lead
> to deadlocks.

OTOH, the core should be allowed to protect it's data structures ...

Thanks,
Rafael

---
 drivers/base/power/main.c |   14 ++++++++++++++
 include/linux/pm.h        |    2 ++
 2 files changed, 16 insertions(+)

Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -26,6 +26,7 @@
 #include <linux/list.h>
 #include <asm/atomic.h>
 #include <asm/errno.h>
+#include <asm/semaphore.h>

 /*
  * Power management requests... these are passed to pm_send_all() and friends.
@@ -186,6 +187,7 @@ struct dev_pm_info {
 #ifdef	CONFIG_PM_SLEEP
 	unsigned		should_wakeup:1;
 	struct list_head	entry;
+	struct semaphore	lock;
 #endif
 };

Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -67,9 +67,14 @@ void device_pm_add(struct device *dev)
 	pr_debug("PM: Adding info for %s:%s\n",
 		 dev->bus ? dev->bus->name : "No Bus",
 		 kobject_name(&dev->kobj));
+	init_MUTEX(&dev->power.lock);
+	if (dev->parent)
+		down(&dev->parent->power.lock);
 	mutex_lock(&dpm_list_mtx);
 	list_add_tail(&dev->power.entry, &dpm_active);
 	mutex_unlock(&dpm_list_mtx);
+	if (dev->parent)
+		up(&dev->parent->power.lock);
 }

 /**
@@ -248,6 +253,7 @@ static void dpm_resume(void)

 		list_move_tail(entry, &dpm_active);
 		mutex_unlock(&dpm_list_mtx);
+		up(&dev->power.lock);
 		resume_device(dev);
 		mutex_lock(&dpm_list_mtx);
 	}
@@ -427,6 +433,13 @@ static int dpm_suspend(pm_message_t stat
 		struct device *dev = to_device(entry);

 		mutex_unlock(&dpm_list_mtx);
+		down(&dev->power.lock);
+		mutex_lock(&dpm_list_mtx);
+		if (dev != to_device(dpm_active.prev)) {
+			up(&dev->power.lock);
+			continue;
+		}
+		mutex_unlock(&dpm_list_mtx);
 		error = suspend_device(dev, state);
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
@@ -437,6 +450,7 @@ static int dpm_suspend(pm_message_t stat
 					(error == -EAGAIN ?
 					" (please convert to suspend_late)" :
 					""));
+			up(&dev->power.lock);
 			break;
 		}
 		if (!list_empty(&dev->power.entry))

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-03-01  0:13                               ` Rafael J. Wysocki
@ 2008-03-01 15:30                                 ` Alan Stern
  2008-03-02 13:37                                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 38+ messages in thread
From: Alan Stern @ 2008-03-01 15:30 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Linux-pm mailing list, Kernel development list

On Sat, 1 Mar 2008, Rafael J. Wysocki wrote:

> On Friday, 29 of February 2008, Alan Stern wrote:
> > On Fri, 29 Feb 2008, Rafael J. Wysocki wrote:
> > 
> > > I'm still not sure if this particular race would happen if only the registering
> > > of children of already suspended partents were blocked.
> > 
> > That's different.  Before you were talking about acquiring
> > dev->power.lock _before_ calling the suspend method.  Now you're
> > talking about blocking child registration _after_ the parent is already
> > suspended.
> 
> Hm, perhaps I don't understand the issue correctly, so please let me restate it.
> 
> We have the design issue that it's possible to register a child of a device
> that was taken for suspend or even suspended which, in turn, leads to a wrong
> ordering of devices on dpm_active.  Namely, suppose that device dev is taken
> from the end of dpm_active and suspend_device(dev, state) is going to be called
> If at the same time a child of dev is being registered and
> suspend_device(dev, state) returns 0, dev will be taken out of dpm_active, but
> the new device (its child) will be added to the end of dpm_active and
> subsequently we'll attempt to suspend it.  That will be wrong.

That's right.

> Also, if a dev's child is registered after suspending all devices, it will go
> to the end of dpm_active, so after the subsequent resume dev will end up
> closer to the end of the list than the new child.  Thus, during the next
> suspend it will be taken for suspending before this child and that will be
> wrong either.  Note that this situation need not look dangerously from the
> driver's perspective, since it doesn't know of the dpm_active ordering issue.

Yes, but it's still wrong.  It's also wrong to register a new device 
(even one that has no parent) during the suspend_late stage, because 
this device wouldn't get suspended before the system went to sleep.

> One of the possible solutions is to require suspend_device(dev, state) to
> return an error if it detects a concurrent child registration.   This, however
> is not sufficient, because the registration of a child may go unseen, right
> after suspend_device(dev, state) returns and before dpm_list_mtx is reacquired.

It's worse than you describe, because suspend_device() is unable to
tell whether a concurrent child registration is valid or invalid.  By
"valid", I mean that it took place during the window before the suspend
method was called or before the method was able to prevent new child
registrations.

Valid child registrations must be allowed to proceed.  There's no
reason to fail them, because the driver has done nothing wrong -- and
it wouldn't be safe since drivers don't check for failure.  Likewise,
blocking them is likely to cause the suspend method to deadlock; see 
below.

But since suspend_device() can't tell which concurrent child
registrations are valid, it has no choice but to allow all of them.  
This means our only option for recovering from a concurrent child
registration is to resume the parent (don't move it to dpm_off) and 
continue.  That way the list ordering will remain correct.

Once the parent is fully suspended, suspend_device() has returned, and
the parent has been moved to dpm_off, the situation is different.  
Then we _know_ that child registrations are invalid, so either blocking
or failing them would be okay.

> For this reason, we need an additional mechanism to detect such situations
> and work around them.  [Hence, the question arises whether it's really
> necessary to require suspend_device(dev, state) to detect concurrent
> registrations of children (we need to detect them independently anyway) etc.]
> There also would have to be a mechanism for detecting registrations when
> all devices have been suspended (we discussed that approach previously).
> 
> The other possible solution, and that's the one I'm considering, is to use
> locking to prevent the registration of children of suspended or suspending
> devices from happening.  [This is to protect _our_ data structure, which is
> the dpm_active list, from corruption.]

In view of my comments above, we _must not_ prevent registration of
children of suspending devices.  It's okay to prevent registration of
suspended devices.  There's nothing wrong with using a lock for this
purpose, but the lock should not be acquired until after
suspend_device() has returned and we have verified that no children
were added while suspend_device() was running.

> Now, to this end, we'd need an additional lock for each device, because using
> dev->sem for this purpose will be prone to deadlocks.  My idea is to take this
> lock (call it dev->power.lock) as soon as dev is selected for suspending
> and require that it be acquired for registering any new children of dev.  In

Consider a situation where a kernel thread is used by a driver for 
several purposes, including registering new children.  The suspend 
method will have to synchronize with the thread for obvious reasons: 
you don't want the thread to be using the device at the same time as 
the device is being suspended.

A typical synchronization approach is for the suspend method to wait
until the other thread is idle (the sort of thing that
flush_scheduled_work() does).  But if the other thread is blocked on
dev->power.lock, it will never become idle and the suspend will
deadlock.

A better approach IMO is for the other thread to always acquire 
dev->sem before doing anything.  (That's how khubd works.)  Then it is 
automatically mutually exclusive with the suspend method, with no need 
to add dev->power.lock at all.  In fact, I have long believed that any 
thread adding a child device should hold the parent's semaphore.

But until all drivers are carefully designed in accordance with these
ideas, we have to assume it is dangerous to block child registrations
before the parent is fully suspended.

> that case, the only open window is when a new child of dev is registered right
> after we've found dev at the end of dpm_active and right prior to taking
> dev->power.lock.  However, we may avoid this race by (1) selecting dev
> for suspending under dpm_list_mtx and (2) checking if it's still at the end of
> dpm_active under dev->power.lock and dpm_list_mtx (the selection of a device
> for suspending is repeated if this check fails).
> 
> The patch I sent previously implemented this idea, but it released
> dev->power.lock after calling resume_device(dev), which is not really necessary
> (dev->power.lock may be released as soon as dev is put back on dpm_active).
> Below is another version of this patch that releases dev->power.lock earlier
> and uses semaphores instead of mutexes.  [Note that it's trivial to rework it
> so that the registrations of children considered as invalid will fail instead
> of being blocked.]

> The appended patch may be reworked to release dev->power.lock for all devices
> once they all have been suspended, but since nobody should ever try to register
> a child below a suspended parent, that shouldn't be necessary.

That's not the issue.  The only problem I have with your approach is
that it blocks child registrations before the parent is fully
suspended.

> > As for the ordering of the lock and moving the device to dpm_off -- 
> > it's less of a problem if you don't acquire the lock until after the 
> > suspend method returns.  You can lock it just before reacquiring 
> > dpm_list_mtx, while the device is still on dpm_active.
> 
> The problem is that the new children should really be suspended _before_ the
> parent.  One solution is to resume the parent in case we've detected new
> children registered, the other one is not to suspend the parent at all in case
> any new children appear, and that's what I'm attempting to achieve.

I believe it isn't achievable without modifying the drivers.  That was 
the case with the MMC subsystem, for instance.

> > This is an interesting matter.  My view is that runtime PM should be
> > almost completely disabled when the PM core calls the device's suspend
> > method.  The only exception is that remote wakeup may be enabled.  If a
> > remote wakeup event occurs and the device resumes, then its parent's
> > suspend method will realize what has happened when it sees that the
> > device is no longer suspended.  So the parent's suspend method will
> > return -EBUSY and the sleep will be aborted.
> 
> I think that we may have to disable runtime PM as soon even earlier, just prior
> to starting a transition to a sleep state.  There are systems which may crash
> if we don't do that.

There must some strange interactions going on.  For instance, what if a
device sends a remote wakeup request before the system is fully asleep?

> > Right now USB does not disable runtime PM during a system sleep.  It
> > hasn't been necessary, thanks to the freezer.  But when we stop
> > freezing user tasks it will become necessary.  When that time arrives I
> > intend to put user threads doing runtime resume into the "icebox"  
> > (remember that?).  Khubd and other kernel threads could go into the
> > icebox also, instead of the freezer; in this way the freezer could be
> > removed completely.
> 
> Yes.
> 
> In fact I'd like to work out some generic guidance for all device drivers
> describing how to do such things.

When the USB design is complete it could be used as a model.  But I 
have to admit that it is rather intricate.

> > That is indeed the difference, and it's an important difference.  The
> > driver knows what other threads may be carrying out registrations, and
> > it knows which ones should be waited for and which can safely be
> > blocked or disabled.  The PM core doesn't know any of these things; all
> > it can do is blindly block everything.  That is dangerous and can lead
> > to deadlocks.
> 
> OTOH, the core should be allowed to protect it's data structures ...

Oh, I agree.  But this protection simply isn't possible without either 
allowing devices to resume after the target sleep state has been set or 
else modifying an unknown number of drivers.

Alan Stern

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-03-01 15:30                                 ` Alan Stern
@ 2008-03-02 13:37                                   ` Rafael J. Wysocki
  2008-03-02 16:22                                     ` Alan Stern
  0 siblings, 1 reply; 38+ messages in thread
From: Rafael J. Wysocki @ 2008-03-02 13:37 UTC (permalink / raw)
  To: Alan Stern; +Cc: Linux-pm mailing list, Kernel development list

On Saturday, 1 of March 2008, Alan Stern wrote:
> On Sat, 1 Mar 2008, Rafael J. Wysocki wrote:
> 
> > On Friday, 29 of February 2008, Alan Stern wrote:
> > > On Fri, 29 Feb 2008, Rafael J. Wysocki wrote:
> > > 
> > > > I'm still not sure if this particular race would happen if only the registering
> > > > of children of already suspended partents were blocked.
> > > 
> > > That's different.  Before you were talking about acquiring
> > > dev->power.lock _before_ calling the suspend method.  Now you're
> > > talking about blocking child registration _after_ the parent is already
> > > suspended.
> > 
> > Hm, perhaps I don't understand the issue correctly, so please let me restate it.
> > 
> > We have the design issue that it's possible to register a child of a device
> > that was taken for suspend or even suspended which, in turn, leads to a wrong
> > ordering of devices on dpm_active.  Namely, suppose that device dev is taken
> > from the end of dpm_active and suspend_device(dev, state) is going to be called
> > If at the same time a child of dev is being registered and
> > suspend_device(dev, state) returns 0, dev will be taken out of dpm_active, but
> > the new device (its child) will be added to the end of dpm_active and
> > subsequently we'll attempt to suspend it.  That will be wrong.
> 
> That's right.
> 
> > Also, if a dev's child is registered after suspending all devices, it will go
> > to the end of dpm_active, so after the subsequent resume dev will end up
> > closer to the end of the list than the new child.  Thus, during the next
> > suspend it will be taken for suspending before this child and that will be
> > wrong either.  Note that this situation need not look dangerously from the
> > driver's perspective, since it doesn't know of the dpm_active ordering issue.
> 
> Yes, but it's still wrong.  It's also wrong to register a new device 
> (even one that has no parent) during the suspend_late stage, because 
> this device wouldn't get suspended before the system went to sleep.

That's correct.

> > One of the possible solutions is to require suspend_device(dev, state) to
> > return an error if it detects a concurrent child registration.   This, however
> > is not sufficient, because the registration of a child may go unseen, right
> > after suspend_device(dev, state) returns and before dpm_list_mtx is reacquired.
> 
> It's worse than you describe, because suspend_device() is unable to
> tell whether a concurrent child registration is valid or invalid.  By
> "valid", I mean that it took place during the window before the suspend
> method was called or before the method was able to prevent new child
> registrations.
> 
> Valid child registrations must be allowed to proceed. There's no 
> reason to fail them, because the driver has done nothing wrong -- and
> it wouldn't be safe since drivers don't check for failure.

Yes, that's bad, but I don't see any other solution at the moment (see below).

> Likewise, blocking them is likely to cause the suspend method to deadlock;
> see below.
> 
> But since suspend_device() can't tell which concurrent child
> registrations are valid, it has no choice but to allow all of them.  
> This means our only option for recovering from a concurrent child
> registration is to resume the parent (don't move it to dpm_off) and 
> continue.  That way the list ordering will remain correct.

If new children get registered when the parent is suspended, that's already
wrong, because the children should have been suspended before.  Think of a case
when the parent is a bus and the children are devices on it.  In that case, by
suspending the parent before the children we can make the children
unresponsive etc. (that would break the PCI PM rules, for one example).

> Once the parent is fully suspended, suspend_device() has returned, and
> the parent has been moved to dpm_off, the situation is different.

In fact we don't.  We only know that suspend_device() has won the race with
the device registration.

> Then we _know_ that child registrations are invalid, so either blocking
> or failing them would be okay.
> 
> > For this reason, we need an additional mechanism to detect such situations
> > and work around them.  [Hence, the question arises whether it's really
> > necessary to require suspend_device(dev, state) to detect concurrent
> > registrations of children (we need to detect them independently anyway) etc.]
> > There also would have to be a mechanism for detecting registrations when
> > all devices have been suspended (we discussed that approach previously).
> > 
> > The other possible solution, and that's the one I'm considering, is to use
> > locking to prevent the registration of children of suspended or suspending
> > devices from happening.  [This is to protect _our_ data structure, which is
> > the dpm_active list, from corruption.]
> 
> In view of my comments above, we _must not_ prevent registration of
> children of suspending devices.  It's okay to prevent registration of
> suspended devices.

I don't agree with that (not only because the last sentence is oversimplified ;-)).

I think that the rule "the driver must not register new children after
->suspend() has run" is not a good one, because in fact we don't want
->suspend() to be called while new children are being registered.  IMO, we
should make the rule that "device registration may fail if it's carried out
concurrently with the parent's ->suspend() method".  At least, that will tell
the drivers what to do or avoid.

> There's nothing wrong with using a lock for this 
> purpose, but the lock should not be acquired until after
> suspend_device() has returned and we have verified that no children
> were added while suspend_device() was running.

I don't agree with that too.

> > Now, to this end, we'd need an additional lock for each device, because using
> > dev->sem for this purpose will be prone to deadlocks.  My idea is to take this
> > lock (call it dev->power.lock) as soon as dev is selected for suspending
> > and require that it be acquired for registering any new children of dev.  In
> 
> Consider a situation where a kernel thread is used by a driver for 
> several purposes, including registering new children.  The suspend 
> method will have to synchronize with the thread for obvious reasons: 
> you don't want the thread to be using the device at the same time as 
> the device is being suspended.
> 
> A typical synchronization approach is for the suspend method to wait
> until the other thread is idle (the sort of thing that
> flush_scheduled_work() does).  But if the other thread is blocked on
> dev->power.lock, it will never become idle and the suspend will
> deadlock.

I agree it's not a good idea to hold the locks throughout the entire cycle,
but that can be overcome if we use an additional variable under
dev_pm_info (see patch below).

> A better approach IMO is for the other thread to always acquire 
> dev->sem before doing anything.  (That's how khubd works.)  Then it is 
> automatically mutually exclusive with the suspend method, with no need 
> to add dev->power.lock at all.  In fact, I have long believed that any 
> thread adding a child device should hold the parent's semaphore.
> 
> But until all drivers are carefully designed in accordance with these
> ideas, we have to assume it is dangerous to block child registrations
> before the parent is fully suspended.

Still, I think we can fail them.

In fact, drivers _should_ check for device_add() failures and if they don't,
it's a plain bug.
 
> > that case, the only open window is when a new child of dev is registered right
> > after we've found dev at the end of dpm_active and right prior to taking
> > dev->power.lock.  However, we may avoid this race by (1) selecting dev
> > for suspending under dpm_list_mtx and (2) checking if it's still at the end of
> > dpm_active under dev->power.lock and dpm_list_mtx (the selection of a device
> > for suspending is repeated if this check fails).
> > 
> > The patch I sent previously implemented this idea, but it released
> > dev->power.lock after calling resume_device(dev), which is not really necessary
> > (dev->power.lock may be released as soon as dev is put back on dpm_active).
> > Below is another version of this patch that releases dev->power.lock earlier
> > and uses semaphores instead of mutexes.  [Note that it's trivial to rework it
> > so that the registrations of children considered as invalid will fail instead
> > of being blocked.]
> 
> > The appended patch may be reworked to release dev->power.lock for all devices
> > once they all have been suspended, but since nobody should ever try to register
> > a child below a suspended parent, that shouldn't be necessary.
> 
> That's not the issue.  The only problem I have with your approach is
> that it blocks child registrations before the parent is fully
> suspended.

Well, I think it's not correct to allow the parent to suspend with active (not
suspended) children.

> > > As for the ordering of the lock and moving the device to dpm_off -- 
> > > it's less of a problem if you don't acquire the lock until after the 
> > > suspend method returns.  You can lock it just before reacquiring 
> > > dpm_list_mtx, while the device is still on dpm_active.
> > 
> > The problem is that the new children should really be suspended _before_ the
> > parent.  One solution is to resume the parent in case we've detected new
> > children registered, the other one is not to suspend the parent at all in case
> > any new children appear, and that's what I'm attempting to achieve.
> 
> I believe it isn't achievable without modifying the drivers.  That was 
> the case with the MMC subsystem, for instance.

I agree.  What I was trying to say is that the core should protect itself, to
the maximum reasonable extent possible, from suspending a parent before the
children which may lead to some subtle problems.

> > > This is an interesting matter.  My view is that runtime PM should be
> > > almost completely disabled when the PM core calls the device's suspend
> > > method.  The only exception is that remote wakeup may be enabled.  If a
> > > remote wakeup event occurs and the device resumes, then its parent's
> > > suspend method will realize what has happened when it sees that the
> > > device is no longer suspended.  So the parent's suspend method will
> > > return -EBUSY and the sleep will be aborted.
> > 
> > I think that we may have to disable runtime PM as soon even earlier, just prior
> > to starting a transition to a sleep state.  There are systems which may crash
> > if we don't do that.
> 
> There must some strange interactions going on.  For instance, what if a
> device sends a remote wakeup request before the system is fully asleep?

On the systems I'm talking about there are some devices referred to by the
ACPI _PTS method, so they must be on line whey this method is being executed.
Since we don't know a priori what devices they are, we must put all devices on
line (into the full power state) before executing _PTS.

> > > Right now USB does not disable runtime PM during a system sleep.  It
> > > hasn't been necessary, thanks to the freezer.  But when we stop
> > > freezing user tasks it will become necessary.  When that time arrives I
> > > intend to put user threads doing runtime resume into the "icebox"  
> > > (remember that?).  Khubd and other kernel threads could go into the
> > > icebox also, instead of the freezer; in this way the freezer could be
> > > removed completely.
> > 
> > Yes.
> > 
> > In fact I'd like to work out some generic guidance for all device drivers
> > describing how to do such things.
> 
> When the USB design is complete it could be used as a model.  But I 
> have to admit that it is rather intricate.
> 
> > > That is indeed the difference, and it's an important difference.  The
> > > driver knows what other threads may be carrying out registrations, and
> > > it knows which ones should be waited for and which can safely be
> > > blocked or disabled.  The PM core doesn't know any of these things; all
> > > it can do is blindly block everything.  That is dangerous and can lead
> > > to deadlocks.
> > 
> > OTOH, the core should be allowed to protect it's data structures ...
> 
> Oh, I agree.  But this protection simply isn't possible without either 
> allowing devices to resume after the target sleep state has been set or 
> else modifying an unknown number of drivers.

Well, that's correct.  However, if we make the rule that device_add() may fail
if it's run concurrently with the parent's ->suspend(), the changes of the
drivers need not be substantial (they should check for the failures of
device_add() anyway).

[BTW, there is a bug in the error path of device_add() for which I'm going to
send a fix later today.  Namely, in case of a BusError, dpm_sysfs_remove()
is executed after device_pm_remove() which calls dpm_sysfs_remove().]

Below is the current version of the patch for handling device registrations
in the PM core.  In this version, the new registrations are failed if done too
late (or too early) and the locks are not held during the entire suspend/resume
cycle.

The patch is against the mainline with
http://marc.info/?l=linux-acpi&m=120389632114090&w=2
and your fix applied and has been (slightly) tested on x86-64.

Thanks,
Rafael

---
 drivers/base/core.c        |    7 ++-
 drivers/base/power/main.c  |   86 ++++++++++++++++++++++++++++-----------------
 drivers/base/power/power.h |   23 +-----------
 include/linux/pm.h         |    3 +
 4 files changed, 65 insertions(+), 54 deletions(-)

Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -24,6 +24,7 @@
 #ifdef __KERNEL__
 
 #include <linux/list.h>
+#include <linux/mutex.h>
 #include <asm/atomic.h>
 #include <asm/errno.h>
 
@@ -186,6 +187,8 @@ struct dev_pm_info {
 #ifdef	CONFIG_PM_SLEEP
 	unsigned		should_wakeup:1;
 	struct list_head	entry;
+	struct mutex		lock;
+	bool			sleeping;	/* Protected by 'lock' */
 #endif
 };
 
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -54,15 +54,13 @@ static LIST_HEAD(dpm_destroy);
 
 static DEFINE_MUTEX(dpm_list_mtx);
 
-static DECLARE_RWSEM(pm_sleep_rwsem);
-
 int (*platform_enable_wakeup)(struct device *dev, int is_on);
 
 /**
- *	device_pm_add - add a device to the list of active devices
+ *	__device_pm_add - add a device to the list of active devices
  *	@dev:	Device to be added to the list
  */
-void device_pm_add(struct device *dev)
+static void __device_pm_add(struct device *dev)
 {
 	pr_debug("PM: Adding info for %s:%s\n",
 		 dev->bus ? dev->bus->name : "No Bus",
@@ -73,6 +71,28 @@ void device_pm_add(struct device *dev)
 }
 
 /**
+ *	device_pm_add - check if given device can be safely added to the list of
+ *			active devices and use __device_pm_add() to do that
+ *	@dev:	Device to be added to the list
+ */
+int device_pm_add(struct device *dev)
+{
+	mutex_init(&dev->power.lock);
+	if (dev->parent) {
+		mutex_lock(&dev->parent->power.lock);
+		if (dev->parent->power.sleeping) {
+			mutex_unlock(&dev->parent->power.lock);
+			return -EBUSY;
+		}
+		__device_pm_add(dev);
+		mutex_unlock(&dev->parent->power.lock);
+	} else {
+		__device_pm_add(dev);
+	}
+	return 0;
+}
+
+/**
  *	device_pm_remove - remove a device from the list of active devices
  *	@dev:	Device to be removed from the list
  *
@@ -107,32 +127,6 @@ void device_pm_schedule_removal(struct d
 }
 EXPORT_SYMBOL_GPL(device_pm_schedule_removal);
 
-/**
- *	pm_sleep_lock - mutual exclusion for registration and suspend
- *
- *	Returns 0 if no suspend is underway and device registration
- *	may proceed, otherwise -EBUSY.
- */
-int pm_sleep_lock(void)
-{
-	if (down_read_trylock(&pm_sleep_rwsem))
-		return 0;
-
-	return -EBUSY;
-}
-
-/**
- *	pm_sleep_unlock - mutual exclusion for registration and suspend
- *
- *	This routine undoes the effect of device_pm_add_lock
- *	when a device's registration is complete.
- */
-void pm_sleep_unlock(void)
-{
-	up_read(&pm_sleep_rwsem);
-}
-
-
 /*------------------------- Resume routines -------------------------*/
 
 /**
@@ -248,7 +242,10 @@ static void dpm_resume(void)
 
 		list_move_tail(entry, &dpm_active);
 		mutex_unlock(&dpm_list_mtx);
+		mutex_lock(&dev->power.lock);
 		resume_device(dev);
+		dev->power.sleeping = false;
+		mutex_unlock(&dev->power.lock);
 		mutex_lock(&dpm_list_mtx);
 	}
 	mutex_unlock(&dpm_list_mtx);
@@ -285,7 +282,6 @@ void device_resume(void)
 	might_sleep();
 	dpm_resume();
 	unregister_dropped_devices();
-	up_write(&pm_sleep_rwsem);
 }
 EXPORT_SYMBOL_GPL(device_resume);
 
@@ -427,6 +423,30 @@ static int dpm_suspend(pm_message_t stat
 		struct device *dev = to_device(entry);
 
 		mutex_unlock(&dpm_list_mtx);
+		if (dev->parent) {
+			mutex_lock(&dev->parent->power.lock);
+			if (dev->parent->power.sleeping)
+				error = -EAGAIN;
+			mutex_unlock(&dev->parent->power.lock);
+			if (error)
+				break;
+		}
+		/*
+		 * We can't take dev->power.lock under dpm_list_mtx, because
+		 * we want to release dpm_list_mtx for the execution of
+		 * suspend_device() and recquire it afterwards, all with
+		 * dev->power.lock held.  However, since we've released
+		 * dpm_list_mtx above, dpm_active might have changed after
+		 * dev was initialized, so we must check if dev is still valid
+		 * after dpm_list_mtx have been reacquired.
+		 */
+		mutex_lock(&dev->power.lock);
+		mutex_lock(&dpm_list_mtx);
+		if (dev != to_device(dpm_active.prev)) {
+			mutex_unlock(&dev->power.lock);
+			continue;
+		}
+		mutex_unlock(&dpm_list_mtx);
 		error = suspend_device(dev, state);
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
@@ -437,10 +457,13 @@ static int dpm_suspend(pm_message_t stat
 					(error == -EAGAIN ?
 					" (please convert to suspend_late)" :
 					""));
+			mutex_unlock(&dev->power.lock);
 			break;
 		}
 		if (!list_empty(&dev->power.entry))
 			list_move(&dev->power.entry, &dpm_off);
+		dev->power.sleeping = true;
+		mutex_unlock(&dev->power.lock);
 	}
 	mutex_unlock(&dpm_list_mtx);
 
@@ -459,7 +482,6 @@ int device_suspend(pm_message_t state)
 	int error;
 
 	might_sleep();
-	down_write(&pm_sleep_rwsem);
 	error = dpm_suspend(state);
 	if (error)
 		device_resume();
Index: linux-2.6/drivers/base/power/power.h
===================================================================
--- linux-2.6.orig/drivers/base/power/power.h
+++ linux-2.6/drivers/base/power/power.h
@@ -11,30 +11,13 @@ static inline struct device *to_device(s
 	return container_of(entry, struct device, power.entry);
 }
 
-extern void device_pm_add(struct device *);
+extern int device_pm_add(struct device *);
 extern void device_pm_remove(struct device *);
-extern int pm_sleep_lock(void);
-extern void pm_sleep_unlock(void);
 
 #else /* CONFIG_PM_SLEEP */
 
-
-static inline void device_pm_add(struct device *dev)
-{
-}
-
-static inline void device_pm_remove(struct device *dev)
-{
-}
-
-static inline int pm_sleep_lock(void)
-{
-	return 0;
-}
-
-static inline void pm_sleep_unlock(void)
-{
-}
+static inline int device_pm_add(struct device *dev) { return 0; }
+static inline void device_pm_remove(struct device *dev) {}
 
 #endif
 
Index: linux-2.6/drivers/base/core.c
===================================================================
--- linux-2.6.orig/drivers/base/core.c
+++ linux-2.6/drivers/base/core.c
@@ -814,7 +814,11 @@ int device_add(struct device *dev)
 	error = dpm_sysfs_add(dev);
 	if (error)
 		goto PMError;
-	device_pm_add(dev);
+	error = device_pm_add(dev);
+	if (error) {
+		dpm_sysfs_remove(dev);
+		goto PMError;
+	}
 	error = bus_add_device(dev);
 	if (error)
 		goto BusError;
@@ -839,7 +843,6 @@ int device_add(struct device *dev)
 	return error;
  BusError:
 	device_pm_remove(dev);
-	dpm_sysfs_remove(dev);
  PMError:
 	if (dev->bus)
 		blocking_notifier_call_chain(&dev->bus->p->bus_notifier,

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-03-02 13:37                                   ` Rafael J. Wysocki
@ 2008-03-02 16:22                                     ` Alan Stern
  2008-03-02 19:11                                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 38+ messages in thread
From: Alan Stern @ 2008-03-02 16:22 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Linux-pm mailing list, Kernel development list

On Sun, 2 Mar 2008, Rafael J. Wysocki wrote:

> If new children get registered when the parent is suspended, that's already
> wrong, because the children should have been suspended before.  Think of a case
> when the parent is a bus and the children are devices on it.  In that case, by
> suspending the parent before the children we can make the children
> unresponsive etc. (that would break the PCI PM rules, for one example).

Agreed.  That's why I've been saying all along that once the parent has
been moved to dpm_off, we should either block or fail child
registrations.  Drivers trying to register a child at such times are 
clearly buggy anyway.

What we are really trying to agree on is how the PM core should handle 
child registrations just before and while the parent is suspending.  
Drivers trying to register a child at such times need not be buggy at 
all; they may simply have lost a race.

I agree that the core needs to protect itself, but I also think that
drivers should need minimal changes -- preferably none.  If the core
becomes moderately more complicated as a tradeoff for keeping drivers a
little simpler, then IMO it's a win since there are lots and lots of
drivers but only one PM core.

> I think that the rule "the driver must not register new children after
> ->suspend() has run" is not a good one, because in fact we don't want
> ->suspend() to be called while new children are being registered.

But you do agree that "drivers must not register new children after 
->suspend() has run" is correct, right?  You just don't think it goes 
far enough.

>  IMO, we
> should make the rule that "device registration may fail if it's carried out
> concurrently with the parent's ->suspend() method".  At least, that will tell
> the drivers what to do or avoid.

In doing this you are putting a tremendous extra burden on drivers.  
You force _them_ to handle registration failure caused by an impending
suspend, something the driver has no way to know about beforehand!

In effect, you are trying to take the extra complication my patch adds
to the PM core and instead add that complication to _every_
hotpluggable driver.  This is not a good approach.

> I agree it's not a good idea to hold the locks throughout the entire cycle,
> but that can be overcome if we use an additional variable under
> dev_pm_info (see patch below).

The new patch is no better.  If a driver tries to create a new child
just before its suspend method is called, the registration will fail
with -EBUSY for no reason the driver is aware of.  So now the driver
has to detect the failure (which many drivers don't do!) and figure out
how to retry it later on.

> In fact, drivers _should_ check for device_add() failures and if they don't,
> it's a plain bug.

Of course they should check.  But they shouldn't have to deal with 
failures that need to be retried for no good reason.

In the long run, I still believe the best approach is to tell drivers
beforehand they should stop registering children.  That will make both
of us happy: The PM core can fail all later registration attempts with
no qualms, and drivers won't have to worry about unexpected failures.  

How many subsystems register new children at arbitrary times?  The
earlier we can fix them up, the better.

> Well, I think it's not correct to allow the parent to suspend with active (not
> suspended) children.

And I think it's a bad idea to make drivers responsible for recovering 
from these failures.  Especially since the effort writing that recovery 
code would be better spent in writing "prevent_new_children" and 
"allow_new_children" methods.

>  However, if we make the rule that device_add() may fail
> if it's run concurrently with the parent's ->suspend(), the changes of the
> drivers need not be substantial (they should check for the failures of
> device_add() anyway).

But this is different.  The failures you want to add are things which 
_should_ succeed -- in fact they would succeed if they were delayed 
until after the system wakes back up.

I admit these failures will be very rare, since they depend on a race
with a small window.  Here's a compromise: I'll agree to let these
registrations fail if you'll agree to add "prevent_new_children" and
"allow_new_children" methods along with the new pm_ops patch you and
Alex have prepared.  Then drivers and subsystems can implement the
methods later on, after which they won't have to worry about spurious
registration failures.

The prevent_new_children methods should be called in a separate initial
forward pass through the dpm_active list -- rather like the reverted
lock_all_devices() routine.  Similarly for the allow_new_children 
methods (a final backward pass like unlock_all_devices()).

The PM notifier messages can serve as a "prevent new children" 
announcement to prevent registration of devices with no parent.

> > There must some strange interactions going on.  For instance, what if a
> > device sends a remote wakeup request before the system is fully asleep?
> 
> On the systems I'm talking about there are some devices referred to by the
> ACPI _PTS method, so they must be on line whey this method is being executed.
> Since we don't know a priori what devices they are, we must put all devices on
> line (into the full power state) before executing _PTS.

Um.  Nobody has mentioned this before.  Are you saying that a disk 
drive (for example) which has been spun down and put in a low-power 
runtime-PM state must be spun up again before the system can suspend?

> Below is the current version of the patch for handling device registrations
> in the PM core.  In this version, the new registrations are failed if done too
> late (or too early) and the locks are not held during the entire suspend/resume
> cycle.

Actually you can simplify the whole thing by getting rid of 
dev->power.lock entirely.  Protect the "sleeping" flag by dpm_list_mtx.

You could even change the flag from bool to an enum, with a special
"GONE" state for devices that were unregistered during a system sleep
transition.  Once you do that, the whole thing starts to look a lot
like a simplified version of my patch.

How does this sound?

Alan Stern

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-03-02 16:22                                     ` Alan Stern
@ 2008-03-02 19:11                                       ` Rafael J. Wysocki
  2008-03-03  3:54                                         ` Alan Stern
  0 siblings, 1 reply; 38+ messages in thread
From: Rafael J. Wysocki @ 2008-03-02 19:11 UTC (permalink / raw)
  To: Alan Stern
  Cc: Linux-pm mailing list, Kernel development list, Alexey Starikovskiy

On Sunday, 2 of March 2008, Alan Stern wrote:
> On Sun, 2 Mar 2008, Rafael J. Wysocki wrote:
> 
> > If new children get registered when the parent is suspended, that's already
> > wrong, because the children should have been suspended before.  Think of a case
> > when the parent is a bus and the children are devices on it.  In that case, by
> > suspending the parent before the children we can make the children
> > unresponsive etc. (that would break the PCI PM rules, for one example).
> 
> Agreed.  That's why I've been saying all along that once the parent has
> been moved to dpm_off, we should either block or fail child
> registrations.

That's not enough, though.  As soon as the state of the parent has changed,
it's incorrect to register any new children.  Moving the parent to dpm_off is
only a confirmation of the change of the parent's state.

> Drivers trying to register a child at such times are clearly buggy anyway.

Yes, they are.

> What we are really trying to agree on is how the PM core should handle 
> child registrations just before and while the parent is suspending.  
> Drivers trying to register a child at such times need not be buggy at 
> all; they may simply have lost a race.
> 
> I agree that the core needs to protect itself, but I also think that
> drivers should need minimal changes -- preferably none.  If the core
> becomes moderately more complicated as a tradeoff for keeping drivers a
> little simpler, then IMO it's a win since there are lots and lots of
> drivers but only one PM core.

Would you agree, however, that the driver should be prepared for its
->resume() being called right after ->suspend() and the ->suspend() repeated
immediately after that?  This is not a trivial change too ...

> > I think that the rule "the driver must not register new children after
> > ->suspend() has run" is not a good one, because in fact we don't want
> > ->suspend() to be called while new children are being registered.
> 
> But you do agree that "drivers must not register new children after 
> ->suspend() has run" is correct, right?  You just don't think it goes 
> far enough.

Yes.

> >  IMO, we
> > should make the rule that "device registration may fail if it's carried out
> > concurrently with the parent's ->suspend() method".  At least, that will tell
> > the drivers what to do or avoid.
> 
> In doing this you are putting a tremendous extra burden on drivers.  
> You force _them_ to handle registration failure caused by an impending
> suspend, something the driver has no way to know about beforehand!
> 
> In effect, you are trying to take the extra complication my patch adds
> to the PM core and instead add that complication to _every_
> hotpluggable driver.  This is not a good approach.

As I said above, I think that resuming the parent in case of a concurrent
child registration is not a trivial modification.  It may seem trivial, but
it's not.  In effect, my approach is not much worse than that.

> > I agree it's not a good idea to hold the locks throughout the entire cycle,
> > but that can be overcome if we use an additional variable under
> > dev_pm_info (see patch below).
> 
> The new patch is no better.  If a driver tries to create a new child
> just before its suspend method is called, the registration will fail
> with -EBUSY for no reason the driver is aware of.  So now the driver
> has to detect the failure (which many drivers don't do!) and figure out
> how to retry it later on.

That's correct.  However, if such a child registration happens with the code
we currently have, there will be a problem, so the driver doing that may be
considered as buggy _right_ _now_.  Thus, in fact, we need not worry about
any existing drivers and we may require future drivers to follow the additional
rule.

> > In fact, drivers _should_ check for device_add() failures and if they don't,
> > it's a plain bug.
> 
> Of course they should check.  But they shouldn't have to deal with 
> failures that need to be retried for no good reason.

Arguably, they can avoid that, for example by using notifiers.

> In the long run, I still believe the best approach is to tell drivers
> beforehand they should stop registering children.  That will make both
> of us happy: The PM core can fail all later registration attempts with
> no qualms, and drivers won't have to worry about unexpected failures.  
> 
> How many subsystems register new children at arbitrary times?  The
> earlier we can fix them up, the better.

Agreed.

> > Well, I think it's not correct to allow the parent to suspend with active (not
> > suspended) children.
> 
> And I think it's a bad idea to make drivers responsible for recovering 
> from these failures.  Especially since the effort writing that recovery 
> code would be better spent in writing "prevent_new_children" and 
> "allow_new_children" methods.
> 
> >  However, if we make the rule that device_add() may fail
> > if it's run concurrently with the parent's ->suspend(), the changes of the
> > drivers need not be substantial (they should check for the failures of
> > device_add() anyway).
> 
> But this is different.  The failures you want to add are things which 
> _should_ succeed -- in fact they would succeed if they were delayed 
> until after the system wakes back up.

IMO they are things done at a wrong time.  What we're talking about is a driver
trying to ignore the fact that it may be suspended and do all things as though
that's never going to happen.  In fact, if you use separate threads for the
registration of children etc., you should have implement a notification
mechanism that will let you know when a suspend is going to occur.  Otherwise,
you do things in a racy way and expecting that they'll never fail is just
overoptimistic. :-)

> I admit these failures will be very rare, since they depend on a race
> with a small window.  Here's a compromise: I'll agree to let these
> registrations fail if you'll agree to add "prevent_new_children" and
> "allow_new_children" methods along with the new pm_ops patch you and
> Alex have prepared.  Then drivers and subsystems can implement the
> methods later on, after which they won't have to worry about spurious
> registration failures.

I'm fine with that, although I'd call the new methods ->begin() and ->end() or
something like this, since they may generally be used for other purposes
too.

> The prevent_new_children methods should be called in a separate initial
> forward pass through the dpm_active list -- rather like the reverted
> lock_all_devices() routine.  Similarly for the allow_new_children 
> methods (a final backward pass like unlock_all_devices()).

Agreed.
 
> The PM notifier messages can serve as a "prevent new children" 
> announcement to prevent registration of devices with no parent.
> 
> > > There must some strange interactions going on.  For instance, what if a
> > > device sends a remote wakeup request before the system is fully asleep?
> > 
> > On the systems I'm talking about there are some devices referred to by the
> > ACPI _PTS method, so they must be on line whey this method is being executed.
> > Since we don't know a priori what devices they are, we must put all devices on
> > line (into the full power state) before executing _PTS.

Well, "must" is too strong here ...

> Um.  Nobody has mentioned this before.  Are you saying that a disk 
> drive (for example) which has been spun down and put in a low-power 
> runtime-PM state must be spun up again before the system can suspend?

On the systems in question (some NVidia chipsets for K8) USB controllers have
to be in the full power state before executing _PTS.

IMO these systems are just badly designed and we can blacklist or something
 like this.  It also is against the ACPI 2.0 spec, although it's compliant
with ACPI 1.0 .

> > Below is the current version of the patch for handling device registrations
> > in the PM core.  In this version, the new registrations are failed if done too
> > late (or too early) and the locks are not held during the entire suspend/resume
> > cycle.
> 
> Actually you can simplify the whole thing by getting rid of 
> dev->power.lock entirely.  Protect the "sleeping" flag by dpm_list_mtx.

Then, if a child registration occurs while suspend_device(parent) is being
run, dpm_active will be in a wrong order, unless suspend_device(parent) returns
an error (which, BTW, also imposes a new rule on drivers).
 
> You could even change the flag from bool to an enum, with a special
> "GONE" state for devices that were unregistered during a system sleep
> transition.

I'm not sure the "GONE" state willl really be necessary.

> Once you do that, the whole thing starts to look a lot like a simplified
> version of my patch. 
> 
> How does this sound?

Well, we can add new callbacks for notifying drivers of an impending suspend.
In that case, say we add a ->begin() callback for this purpose (in fact that
would be two callbacks, ->suspend_begin() and ->hibernate_begin(), but let's
simplify things a bit for now), so there are the following questions:
* Is it going to return a result?
* If it is, should we fail the suspend if an error is returned?
* In that case, should the ->suspend() callback return a result?
* Perhaps we can require ->suspend() to always succeed after ->begin() has
  succeeded?

OTOH, IMO requiring ->suspend() to return an error if there's been a concurrent
child registration and resuming the device if that happens is not a trivial
change and it may require as many driver modifications as failing the
registration of the child right away.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-03-02 19:11                                       ` Rafael J. Wysocki
@ 2008-03-03  3:54                                         ` Alan Stern
  2008-03-03 16:32                                           ` Rafael J. Wysocki
  0 siblings, 1 reply; 38+ messages in thread
From: Alan Stern @ 2008-03-03  3:54 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux-pm mailing list, Kernel development list, Alexey Starikovskiy

On Sun, 2 Mar 2008, Rafael J. Wysocki wrote:

> Would you agree, however, that the driver should be prepared for its
> ->resume() being called right after ->suspend() and the ->suspend() repeated
> immediately after that?  This is not a trivial change too ...

This is now a moot point, but I'll answer it anyway...

Drivers should not depend on any particular time interval between their 
suspend and resume methods being called.  It should be okay to call one 
and then the other a microsecond later, or 100 days later.  The driver 
shouldn't know or care.  I doubt that many drivers do.

> > I admit these failures will be very rare, since they depend on a race
> > with a small window.  Here's a compromise: I'll agree to let these
> > registrations fail if you'll agree to add "prevent_new_children" and
> > "allow_new_children" methods along with the new pm_ops patch you and
> > Alex have prepared.  Then drivers and subsystems can implement the
> > methods later on, after which they won't have to worry about spurious
> > registration failures.
> 
> I'm fine with that, although I'd call the new methods ->begin() and ->end() or
> something like this, since they may generally be used for other purposes
> too.

"begin_sleep" and "end_sleep" are less ambiguous.

I was just thinking the same thing.  For instance, plenty of drivers
register children as part of their probe method, so the begin_sleep
method should prevent subsystems from probing the device.  (Not to
mention that it's kind of awkward for a driver to probe a suspended
device.)

> > The prevent_new_children methods should be called in a separate initial
> > forward pass through the dpm_active list -- rather like the reverted
> > lock_all_devices() routine.  Similarly for the allow_new_children 
> > methods (a final backward pass like unlock_all_devices()).
> 
> Agreed.

After more thought, I'm not so sure about this.  It might be a good
idea to call the begin_sleep method just before the suspend method (or
any of its variants: freeze, hibernate, prethaw, etc.) and call the
end_sleep method just after the resume method.  This minimizes the time
drivers will spend in a peculiar non-hotplug-aware state, although it 
means that begin_sleep would have to be idempotent.

It also allows sophisticated drivers to do all their processing in the
begin_sleep (and end_sleep) method: both preventing new child
registrations and powering down the device.  At the moment I'm not sure
whether this would turn out to be a good strategy, but it might.

Alternatively, some subsystems might want to stop all child
registrations right at the beginning of the sleep transition.  This
would amount to blocking the kernel thread responsible for those
registrations when the sleep begins -- in other words, making that
thread freezable!

> > The PM notifier messages can serve as a "prevent new children" 
> > announcement to prevent registration of devices with no parent.

This isn't quite so simple either.  A notifier isn't good enough
because it doesn't provide any synchronization.  On the other hand, how
many devices ever get registered without a parent?  Does this happen at
all after system startup?  Maybe when a loadable module for a new bus
type initializes...  We could block module loading at the start of a
sleep transition, if necessary.

> > > > There must some strange interactions going on.  For instance, what if a
> > > > device sends a remote wakeup request before the system is fully asleep?
> > > 
> > > On the systems I'm talking about there are some devices referred to by the
> > > ACPI _PTS method, so they must be on line whey this method is being executed.
> > > Since we don't know a priori what devices they are, we must put all devices on
> > > line (into the full power state) before executing _PTS.
> 
> Well, "must" is too strong here ...
> 
> > Um.  Nobody has mentioned this before.  Are you saying that a disk 
> > drive (for example) which has been spun down and put in a low-power 
> > runtime-PM state must be spun up again before the system can suspend?
> 
> On the systems in question (some NVidia chipsets for K8) USB controllers have
> to be in the full power state before executing _PTS.

Fortunately that won't cause any problems for some time to come.  
There are no current plans for doing runtime PM of PCI-based USB
controllers.  This may change eventually, but not for a while.

> IMO these systems are just badly designed and we can blacklist or something
>  like this.  It also is against the ACPI 2.0 spec, although it's compliant
> with ACPI 1.0 .

Or have a "no runtime PM" quirk for the devices in question.

> > > Below is the current version of the patch for handling device registrations
> > > in the PM core.  In this version, the new registrations are failed if done too
> > > late (or too early) and the locks are not held during the entire suspend/resume
> > > cycle.
> > 
> > Actually you can simplify the whole thing by getting rid of 
> > dev->power.lock entirely.  Protect the "sleeping" flag by dpm_list_mtx.
> 
> Then, if a child registration occurs while suspend_device(parent) is being
> run, dpm_active will be in a wrong order, unless suspend_device(parent) returns
> an error (which, BTW, also imposes a new rule on drivers).

Sorry, I don't follow you.

I meant doing something approximately like this:

	mutex_lock(&dpm_list_mtx);
	while (!list_is_empty(&dpm_active)) {
		dev = dpm_active.prev;
		dev->power.sleeping = true;
		mutex_unlock(&dpm_list_mtx);
		error = suspend_device(dev);
		mutex_lock(&dpm_list_mtx);
		...
	}

and make device_pm_add() fail if dev->parent->power.sleeping is true.
This will do what you want, right?

With the begin_sleep method call added it gets a little more
complicated, since you have to reacquire dpm_list_mtx after calling the
begin_sleep method and before setting dev->power.sleeping to true, in
order to make sure that dev is still the last entry on dpm_active.  
But it's quite doable.

(BTW, I wonder if it's a good idea for device_add() to call 
device_pm_add() before calling bus_add_device().  If a suspend occurs 
in between, we could end up in a strange situation with a driver being 
asked to suspend a device before that device has been fully registered 
-- in fact the registration might still fail.)

> I'm not sure the "GONE" state willl really be necessary.

I'm not sure either.  You can get the same effect by checking 
list_empty(&dev->power.entry).

> Well, we can add new callbacks for notifying drivers of an impending suspend.
> In that case, say we add a ->begin() callback for this purpose (in fact that
> would be two callbacks, ->suspend_begin() and ->hibernate_begin(), but let's
> simplify things a bit for now), so there are the following questions:

In theory you could even expand it to freeze_begin and prethaw_begin.

> * Is it going to return a result?
> * If it is, should we fail the suspend if an error is returned?
> * In that case, should the ->suspend() callback return a result?

To be safe, everything should return a result and we should abort the 
sleep if anything returns an error.

It's easier to ignore a return code now than to change a method
signature later.  :-)

> * Perhaps we can require ->suspend() to always succeed after ->begin() has
>   succeeded?

No.  Some drivers might implement just one and some drivers just the 
other.

> OTOH, IMO requiring ->suspend() to return an error if there's been a concurrent
> child registration and resuming the device if that happens is not a trivial
> change and it may require as many driver modifications as failing the
> registration of the child right away.

That's true.  If the driver's author wants to do things that way, he
can.  For instance, there are several subsystems which will probe for
new children as part of their resume processing.  So if a child was
detected just before the system went to sleep and failed to get
registered, then it will be detected again when the system wakes up and
now the registration will succeed.

Here's something else to think about.  We might want to allow some 
devices to be "power-irrelevant".  That is, the device exists in the 
kernel only as a representation of some software state, not as a 
physical device.  It doesn't consume power, it doesn't have any state 
to lose during a sleep, and its driver doesn't implement suspend or 
resume methods.  For these sorts of devices, we might allow 
device_add() to skip calling device_pm_add() altogether.  USB 
interfaces are a little like this, as are SCSI hosts and MMC hosts.

Alan Stern

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-03-03  3:54                                         ` Alan Stern
@ 2008-03-03 16:32                                           ` Rafael J. Wysocki
  2008-03-03 17:43                                             ` Alan Stern
  0 siblings, 1 reply; 38+ messages in thread
From: Rafael J. Wysocki @ 2008-03-03 16:32 UTC (permalink / raw)
  To: Alan Stern
  Cc: Linux-pm mailing list, Kernel development list, Alexey Starikovskiy

On Monday, 3 of March 2008, Alan Stern wrote:
> On Sun, 2 Mar 2008, Rafael J. Wysocki wrote:
[--snip--]
> 
> > > The prevent_new_children methods should be called in a separate initial
> > > forward pass through the dpm_active list -- rather like the reverted
> > > lock_all_devices() routine.  Similarly for the allow_new_children 
> > > methods (a final backward pass like unlock_all_devices()).
> > 
> > Agreed.
> 
> After more thought, I'm not so sure about this.  It might be a good
> idea to call the begin_sleep method just before the suspend method (or
> any of its variants: freeze, hibernate, prethaw, etc.) and call the
> end_sleep method just after the resume method.  This minimizes the time
> drivers will spend in a peculiar non-hotplug-aware state, although it 
> means that begin_sleep would have to be idempotent.
> 
> It also allows sophisticated drivers to do all their processing in the
> begin_sleep (and end_sleep) method: both preventing new child
> registrations and powering down the device.  At the moment I'm not sure
> whether this would turn out to be a good strategy, but it might.

Well, I think there should be a window between ->suspend_begin()
and ->suspend() allowing the core to cancel the suspending of given
device and to select another one.  For example, if there's a child
registration concurrent wrt ->suspend_begin() which completes after
->suspend_begin() has been called, but before ->suspend_begin() has a chance
to block it, the core should not call ->suspend() for the device, but select
another one (the child).

Of course, it won't be necessary if the ->suspend_begin() methods are called
in an initial forward pass through dpm_active.

> Alternatively, some subsystems might want to stop all child
> registrations right at the beginning of the sleep transition.  This
> would amount to blocking the kernel thread responsible for those
> registrations when the sleep begins -- in other words, making that
> thread freezable!

Well, I think the possibility to freeze kernel threads on demand may be
considered as one of suspend synchronization mechanisms available to drivers.

[--snip--] 
> > > Actually you can simplify the whole thing by getting rid of 
> > > dev->power.lock entirely.  Protect the "sleeping" flag by dpm_list_mtx.
> > 
> > Then, if a child registration occurs while suspend_device(parent) is being
> > run, dpm_active will be in a wrong order, unless suspend_device(parent) returns
> > an error (which, BTW, also imposes a new rule on drivers).
> 
> Sorry, I don't follow you.
> 
> I meant doing something approximately like this:
> 
> 	mutex_lock(&dpm_list_mtx);
> 	while (!list_is_empty(&dpm_active)) {
> 		dev = dpm_active.prev;
> 		dev->power.sleeping = true;
> 		mutex_unlock(&dpm_list_mtx);
> 		error = suspend_device(dev);
> 		mutex_lock(&dpm_list_mtx);
> 		...
> 	}
> 
> and make device_pm_add() fail if dev->parent->power.sleeping is true.
> This will do what you want, right?

Yes, that should work.  I'll rework the patch along these lines (it starts to
look really simple now :-); I'll post the new version in a separate thread).

> With the begin_sleep method call added it gets a little more
> complicated, since you have to reacquire dpm_list_mtx after calling the
> begin_sleep method and before setting dev->power.sleeping to true, in
> order to make sure that dev is still the last entry on dpm_active.  
> But it's quite doable.

Yes, it is.

> (BTW, I wonder if it's a good idea for device_add() to call 
> device_pm_add() before calling bus_add_device().  If a suspend occurs 
> in between, we could end up in a strange situation with a driver being 
> asked to suspend a device before that device has been fully registered 
> -- in fact the registration might still fail.)

That's correct.  Perhaps we should change device_add().

> > I'm not sure the "GONE" state willl really be necessary.
> 
> I'm not sure either.  You can get the same effect by checking 
> list_empty(&dev->power.entry).
> 
> > Well, we can add new callbacks for notifying drivers of an impending suspend.
> > In that case, say we add a ->begin() callback for this purpose (in fact that
> > would be two callbacks, ->suspend_begin() and ->hibernate_begin(), but let's
> > simplify things a bit for now), so there are the following questions:
> 
> In theory you could even expand it to freeze_begin and prethaw_begin.

I meant ->hibernate_begin() as ->freeze_begin() and I don't see any reason why
->prethaw_begin() would be different from it (the difference between ->freeze()
and ->prethaw() -- now called ->quiesce(), btw -- is that the former saves
device settings and the latter doesn't).
 
> > * Is it going to return a result?
> > * If it is, should we fail the suspend if an error is returned?
> > * In that case, should the ->suspend() callback return a result?
> 
> To be safe, everything should return a result and we should abort the 
> sleep if anything returns an error.
> 
> It's easier to ignore a return code now than to change a method
> signature later.  :-)
> 
> > * Perhaps we can require ->suspend() to always succeed after ->begin() has
> >   succeeded?
> 
> No.  Some drivers might implement just one and some drivers just the 
> other.

I thought ->suspend() would be mandatory, even if it's to be empty.

> > OTOH, IMO requiring ->suspend() to return an error if there's been a concurrent
> > child registration and resuming the device if that happens is not a trivial
> > change and it may require as many driver modifications as failing the
> > registration of the child right away.
> 
> That's true.  If the driver's author wants to do things that way, he
> can.  For instance, there are several subsystems which will probe for
> new children as part of their resume processing.  So if a child was
> detected just before the system went to sleep and failed to get
> registered, then it will be detected again when the system wakes up and
> now the registration will succeed.
> 
> Here's something else to think about.  We might want to allow some 
> devices to be "power-irrelevant".  That is, the device exists in the 
> kernel only as a representation of some software state, not as a 
> physical device.  It doesn't consume power, it doesn't have any state 
> to lose during a sleep, and its driver doesn't implement suspend or 
> resume methods.  For these sorts of devices, we might allow 
> device_add() to skip calling device_pm_add() altogether.  USB 
> interfaces are a little like this, as are SCSI hosts and MMC hosts.

If such devices serve as logical parents of some "real" ones, we should
at least mark them as "sleeping" during suspend to protect the dpm_active
ordering.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-03-03 16:32                                           ` Rafael J. Wysocki
@ 2008-03-03 17:43                                             ` Alan Stern
  2008-03-03 20:47                                               ` Rafael J. Wysocki
  0 siblings, 1 reply; 38+ messages in thread
From: Alan Stern @ 2008-03-03 17:43 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux-pm mailing list, Kernel development list, Alexey Starikovskiy

On Mon, 3 Mar 2008, Rafael J. Wysocki wrote:

> > After more thought, I'm not so sure about this.  It might be a good
> > idea to call the begin_sleep method just before the suspend method (or
> > any of its variants: freeze, hibernate, prethaw, etc.) and call the
> > end_sleep method just after the resume method.  This minimizes the time
> > drivers will spend in a peculiar non-hotplug-aware state, although it 
> > means that begin_sleep would have to be idempotent.
> > 
> > It also allows sophisticated drivers to do all their processing in the
> > begin_sleep (and end_sleep) method: both preventing new child
> > registrations and powering down the device.  At the moment I'm not sure
> > whether this would turn out to be a good strategy, but it might.
> 
> Well, I think there should be a window between ->suspend_begin()
> and ->suspend() allowing the core to cancel the suspending of given
> device and to select another one.  For example, if there's a child
> registration concurrent wrt ->suspend_begin() which completes after
> ->suspend_begin() has been called, but before ->suspend_begin() has a chance
> to block it, the core should not call ->suspend() for the device, but select
> another one (the child).

With a sophisticated driver that would never happen, because after
blocking new child registrations, the driver would check that the
power.sleeping flag is set in all the children before powering down the 
device.  But like I said, I'm not sure if this would be a good 
strategy.

(This partly has to do with the requirements for runtime PM.  During a
runtime suspend the driver does have to check the children's status; it
can't rely on the PM core.  So if the check has to be done anyway, why
not also check during a system sleep?)

With non-sophisticated drivers, it definitely could happen that a new
child is registered concurrently with begin_sleep.  Then the core would
go back and suspend the child first, as you say.  Eventually the core
would return to the parent device, at which time it would call the
parent's begin_sleep method again -- unless we add another flag to
indicate that it had already been called.

> Of course, it won't be necessary if the ->suspend_begin() methods are called
> in an initial forward pass through dpm_active.

Right.  That would be simpler.

> > (BTW, I wonder if it's a good idea for device_add() to call 
> > device_pm_add() before calling bus_add_device().  If a suspend occurs 
> > in between, we could end up in a strange situation with a driver being 
> > asked to suspend a device before that device has been fully registered 
> > -- in fact the registration might still fail.)
> 
> That's correct.  Perhaps we should change device_add().

I had a change like that in my version of the patch.  It's excerpted 
below.

> > > Well, we can add new callbacks for notifying drivers of an impending suspend.
> > > In that case, say we add a ->begin() callback for this purpose (in fact that
> > > would be two callbacks, ->suspend_begin() and ->hibernate_begin(), but let's
> > > simplify things a bit for now), so there are the following questions:
> > 
> > In theory you could even expand it to freeze_begin and prethaw_begin.
> 
> I meant ->hibernate_begin() as ->freeze_begin() and I don't see any reason why
> ->prethaw_begin() would be different from it (the difference between ->freeze()
> and ->prethaw() -- now called ->quiesce(), btw -- is that the former saves
> device settings and the latter doesn't).

AFAICS there needs to be only one begin_sleep method.  It should apply 
equally well to suspend, freeze, quiesce, and hibernate.  (But not to 
suspend_late.)

> > > * Perhaps we can require ->suspend() to always succeed after ->begin() has
> > >   succeeded?
> > 
> > No.  Some drivers might implement just one and some drivers just the 
> > other.
> 
> I thought ->suspend() would be mandatory, even if it's to be empty.

There's no need for that.  If it isn't implemented, treat it as though 
it was successful.

But if a driver implements suspend() and leaves begin_sleep() as just a 
stub (which many drivers might reasonably do, if they never register 
any children), then the core shouldn't require suspend() to succeed 
merely because begin_sleep() did.

And especially not if begin_sleep() is called in a separate pass.

> > Here's something else to think about.  We might want to allow some 
> > devices to be "power-irrelevant".  That is, the device exists in the 
> > kernel only as a representation of some software state, not as a 
> > physical device.  It doesn't consume power, it doesn't have any state 
> > to lose during a sleep, and its driver doesn't implement suspend or 
> > resume methods.  For these sorts of devices, we might allow 
> > device_add() to skip calling device_pm_add() altogether.  USB 
> > interfaces are a little like this, as are SCSI hosts and MMC hosts.
> 
> If such devices serve as logical parents of some "real" ones, we should
> at least mark them as "sleeping" during suspend to protect the dpm_active
> ordering.

They wouldn't have to be marked at all.  They would never get on any of 
the PM lists, because they would never be passed to device_pm_add().

The status of such devices is a little peculiar.  For example, consider 
the MMC subsystem.  There's an MMC controller, generally a platform 
device.  Its child is an mmc_host device, but the mmc_host methods are 
called by the platform device's methods -- there is no driver 
associated with an mmc_host.  At one time the mmc_host was a 
class_device; that's may explain in part why it works this way.

Alan Stern

Index: usb-2.6/drivers/base/core.c
===================================================================
--- usb-2.6.orig/drivers/base/core.c
+++ usb-2.6/drivers/base/core.c
@@ -815,10 +815,12 @@ int device_add(struct device *dev)
 	error = dpm_sysfs_add(dev);
 	if (error)
 		goto PMError;
-	device_pm_add(dev);
 	error = bus_add_device(dev);
 	if (error)
 		goto BusError;
+	error = device_pm_add(dev);
+	if (error)
+		goto PMAddError;
 	kobject_uevent(&dev->kobj, KOBJ_ADD);
 	bus_attach_device(dev);
 	if (parent)
@@ -838,8 +840,9 @@ int device_add(struct device *dev)
  Done:
 	put_device(dev);
 	return error;
+ PMAddError:
+	bus_remove_device(dev);
  BusError:
-	device_pm_remove(dev);
 	dpm_sysfs_remove(dev);
  PMError:
 	if (dev->bus)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-03-03 17:43                                             ` Alan Stern
@ 2008-03-03 20:47                                               ` Rafael J. Wysocki
  2008-03-03 22:48                                                 ` Alan Stern
  0 siblings, 1 reply; 38+ messages in thread
From: Rafael J. Wysocki @ 2008-03-03 20:47 UTC (permalink / raw)
  To: Alan Stern
  Cc: Linux-pm mailing list, Kernel development list, Alexey Starikovskiy

On Monday, 3 of March 2008, Alan Stern wrote:
> On Mon, 3 Mar 2008, Rafael J. Wysocki wrote:
> 
> > > After more thought, I'm not so sure about this.  It might be a good
> > > idea to call the begin_sleep method just before the suspend method (or
> > > any of its variants: freeze, hibernate, prethaw, etc.) and call the
> > > end_sleep method just after the resume method.  This minimizes the time
> > > drivers will spend in a peculiar non-hotplug-aware state, although it 
> > > means that begin_sleep would have to be idempotent.
> > > 
> > > It also allows sophisticated drivers to do all their processing in the
> > > begin_sleep (and end_sleep) method: both preventing new child
> > > registrations and powering down the device.  At the moment I'm not sure
> > > whether this would turn out to be a good strategy, but it might.
> > 
> > Well, I think there should be a window between ->suspend_begin()
> > and ->suspend() allowing the core to cancel the suspending of given
> > device and to select another one.  For example, if there's a child
> > registration concurrent wrt ->suspend_begin() which completes after
> > ->suspend_begin() has been called, but before ->suspend_begin() has a chance
> > to block it, the core should not call ->suspend() for the device, but select
> > another one (the child).
> 
> With a sophisticated driver that would never happen, because after
> blocking new child registrations, the driver would check that the
> power.sleeping flag is set in all the children before powering down the 
> device.

The _core_ needs the window, not the driver.  Even if the driver discovers that
there are active children, it can only fail ->suspend(), in which the core will
faill the entire power state transition (unless we reserve an error code for
signaling such situations by ->suspend(), which I'd prefer to avoid).

> But like I said, I'm not sure if this would be a good strategy.
> 
> (This partly has to do with the requirements for runtime PM.  During a
> runtime suspend the driver does have to check the children's status; it
> can't rely on the PM core.  So if the check has to be done anyway, why
> not also check during a system sleep?)
> 
> With non-sophisticated drivers, it definitely could happen that a new
> child is registered concurrently with begin_sleep.  Then the core would
> go back and suspend the child first, as you say.  Eventually the core
> would return to the parent device, at which time it would call the
> parent's begin_sleep method again -- unless we add another flag to
> indicate that it had already been called.
> 
> > Of course, it won't be necessary if the ->suspend_begin() methods are called
> > in an initial forward pass through dpm_active.
> 
> Right.  That would be simpler.
> 
> > > (BTW, I wonder if it's a good idea for device_add() to call 
> > > device_pm_add() before calling bus_add_device().  If a suspend occurs 
> > > in between, we could end up in a strange situation with a driver being 
> > > asked to suspend a device before that device has been fully registered 
> > > -- in fact the registration might still fail.)
> > 
> > That's correct.  Perhaps we should change device_add().
> 
> I had a change like that in my version of the patch.  It's excerpted 
> below.

Hm, I wonder why you didn't move dpm_sysfs_add() along with device_pm_add()?

Perhaps it's better to include dpm_sysfs_add() into device_pm_add(), since we
are going the make the return a result anyway?

> > > > Well, we can add new callbacks for notifying drivers of an impending suspend.
> > > > In that case, say we add a ->begin() callback for this purpose (in fact that
> > > > would be two callbacks, ->suspend_begin() and ->hibernate_begin(), but let's
> > > > simplify things a bit for now), so there are the following questions:
> > > 
> > > In theory you could even expand it to freeze_begin and prethaw_begin.
> > 
> > I meant ->hibernate_begin() as ->freeze_begin() and I don't see any reason why
> > ->prethaw_begin() would be different from it (the difference between ->freeze()
> > and ->prethaw() -- now called ->quiesce(), btw -- is that the former saves
> > device settings and the latter doesn't).
> 
> AFAICS there needs to be only one begin_sleep method.  It should apply 
> equally well to suspend, freeze, quiesce, and hibernate.  (But not to 
> suspend_late.)

It definitely would be simpler to assume so and introduce just one common
begin_sleep().

> > > > * Perhaps we can require ->suspend() to always succeed after ->begin() has
> > > >   succeeded?
> > > 
> > > No.  Some drivers might implement just one and some drivers just the 
> > > other.
> > 
> > I thought ->suspend() would be mandatory, even if it's to be empty.
> 
> There's no need for that.  If it isn't implemented, treat it as though 
> it was successful.

Well, I'm not sure.  Right now we have a problem with distinguishing drivers
that don't implement ->suspend() purposefully from the ones that just don't
support suspend/hibernation ...

OTOH, since we are going to have a pointer to 'struct pm_ops', we can safely
assume that if it's not NULL, the driver writer knows what he's doing.
 
> But if a driver implements suspend() and leaves begin_sleep() as just a 
> stub (which many drivers might reasonably do, if they never register 
> any children), then the core shouldn't require suspend() to succeed 
> merely because begin_sleep() did.
> 
> And especially not if begin_sleep() is called in a separate pass.

Agreed.

> > > Here's something else to think about.  We might want to allow some 
> > > devices to be "power-irrelevant".  That is, the device exists in the 
> > > kernel only as a representation of some software state, not as a 
> > > physical device.  It doesn't consume power, it doesn't have any state 
> > > to lose during a sleep, and its driver doesn't implement suspend or 
> > > resume methods.  For these sorts of devices, we might allow 
> > > device_add() to skip calling device_pm_add() altogether.  USB 
> > > interfaces are a little like this, as are SCSI hosts and MMC hosts.
> > 
> > If such devices serve as logical parents of some "real" ones, we should
> > at least mark them as "sleeping" during suspend to protect the dpm_active
> > ordering.
> 
> They wouldn't have to be marked at all.  They would never get on any of 
> the PM lists, because they would never be passed to device_pm_add().

But dev->parent will be not NULL for their children being on dpm_active ...

IMO it's simpler to just add those devices without any suspend callbacks
defined to dpm_active and handle them normally than to introduce a special
case.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-03-03 20:47                                               ` Rafael J. Wysocki
@ 2008-03-03 22:48                                                 ` Alan Stern
  2008-03-03 22:56                                                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 38+ messages in thread
From: Alan Stern @ 2008-03-03 22:48 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux-pm mailing list, Kernel development list, Alexey Starikovskiy

On Mon, 3 Mar 2008, Rafael J. Wysocki wrote:

> > > Of course, it won't be necessary if the ->suspend_begin() methods are called
> > > in an initial forward pass through dpm_active.
> > 
> > Right.  That would be simpler.

Let's stick with that for now.

What about on the resume side?  Do you want to prevent child
registrations until after end_sleep has run?  Or would you be okay with
allowing the resume method to register new children?  It should be safe
to assume that drivers are smart enough to bring the device back to
full power before looking for new children.

In fact, maybe we don't need an end_sleep method at all.  There isn't
any synchronization issue to worry about -- drivers aren't going to
register their new children in a suspended state!

> > > That's correct.  Perhaps we should change device_add().
> > 
> > I had a change like that in my version of the patch.  It's excerpted 
> > below.
> 
> Hm, I wonder why you didn't move dpm_sysfs_add() along with device_pm_add()?

Because I didn't think of it.  :-)

> Perhaps it's better to include dpm_sysfs_add() into device_pm_add(), since we
> are going the make the return a result anyway?

Yes.

> > > I thought ->suspend() would be mandatory, even if it's to be empty.
> > 
> > There's no need for that.  If it isn't implemented, treat it as though 
> > it was successful.
> 
> Well, I'm not sure.  Right now we have a problem with distinguishing drivers
> that don't implement ->suspend() purposefully from the ones that just don't
> support suspend/hibernation ...
> 
> OTOH, since we are going to have a pointer to 'struct pm_ops', we can safely
> assume that if it's not NULL, the driver writer knows what he's doing.

That's reasonable.  If a driver doesn't support PM then it won't have 
a pm_ops pointer.

> > > > Here's something else to think about.  We might want to allow some 
> > > > devices to be "power-irrelevant".  That is, the device exists in the 
> > > > kernel only as a representation of some software state, not as a 
> > > > physical device.  It doesn't consume power, it doesn't have any state 
> > > > to lose during a sleep, and its driver doesn't implement suspend or 
> > > > resume methods.  For these sorts of devices, we might allow 
> > > > device_add() to skip calling device_pm_add() altogether.  USB 
> > > > interfaces are a little like this, as are SCSI hosts and MMC hosts.
> > > 
> > > If such devices serve as logical parents of some "real" ones, we should
> > > at least mark them as "sleeping" during suspend to protect the dpm_active
> > > ordering.
> > 
> > They wouldn't have to be marked at all.  They would never get on any of 
> > the PM lists, because they would never be passed to device_pm_add().
> 
> But dev->parent will be not NULL for their children being on dpm_active ...
> 
> IMO it's simpler to just add those devices without any suspend callbacks
> defined to dpm_active and handle them normally than to introduce a special
> case.

Yeah, I'm just trying to think too far ahead.

Alan Stern


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-03-03 22:48                                                 ` Alan Stern
@ 2008-03-03 22:56                                                   ` Rafael J. Wysocki
  2008-03-03 23:12                                                     ` Alan Stern
  0 siblings, 1 reply; 38+ messages in thread
From: Rafael J. Wysocki @ 2008-03-03 22:56 UTC (permalink / raw)
  To: Alan Stern
  Cc: Linux-pm mailing list, Kernel development list, Alexey Starikovskiy

On Monday, 3 of March 2008, Alan Stern wrote:
> On Mon, 3 Mar 2008, Rafael J. Wysocki wrote:
> 
> > > > Of course, it won't be necessary if the ->suspend_begin() methods are called
> > > > in an initial forward pass through dpm_active.
> > > 
> > > Right.  That would be simpler.
> 
> Let's stick with that for now.
> 
> What about on the resume side?  Do you want to prevent child
> registrations until after end_sleep has run?  Or would you be okay with
> allowing the resume method to register new children?

I think ->resume() can register new children.  There's nothing wrong with
that from the core's standpoint.

> It should be safe to assume that drivers are smart enough to bring the device
> back to full power before looking for new children.

Well, it would be a bug not to do so.

> In fact, maybe we don't need an end_sleep method at all.  There isn't
> any synchronization issue to worry about -- drivers aren't going to
> register their new children in a suspended state!

Agreed.

> > > > That's correct.  Perhaps we should change device_add().
> > > 
> > > I had a change like that in my version of the patch.  It's excerpted 
> > > below.
> > 
> > Hm, I wonder why you didn't move dpm_sysfs_add() along with device_pm_add()?
> 
> Because I didn't think of it.  :-)
> 
> > Perhaps it's better to include dpm_sysfs_add() into device_pm_add(), since we
> > are going the make the return a result anyway?
> 
> Yes.

Okay, I'll prepare a patch for that, on top of the one introducing the
'sleeping' field into 'struct dev_pm_info' (posting in a while).
 
> > > > I thought ->suspend() would be mandatory, even if it's to be empty.
> > > 
> > > There's no need for that.  If it isn't implemented, treat it as though 
> > > it was successful.
> > 
> > Well, I'm not sure.  Right now we have a problem with distinguishing drivers
> > that don't implement ->suspend() purposefully from the ones that just don't
> > support suspend/hibernation ...
> > 
> > OTOH, since we are going to have a pointer to 'struct pm_ops', we can safely
> > assume that if it's not NULL, the driver writer knows what he's doing.
> 
> That's reasonable.  If a driver doesn't support PM then it won't have 
> a pm_ops pointer.

Exactly.

The question remains what we're going to do with the drivers without pm_ops
pointers in the long run (in the short run we will use the legacy callbacks in
that cases, if defined).

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-03-03 22:56                                                   ` Rafael J. Wysocki
@ 2008-03-03 23:12                                                     ` Alan Stern
  2008-03-03 23:18                                                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 38+ messages in thread
From: Alan Stern @ 2008-03-03 23:12 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux-pm mailing list, Kernel development list, Alexey Starikovskiy

On Mon, 3 Mar 2008, Rafael J. Wysocki wrote:

> > > Perhaps it's better to include dpm_sysfs_add() into device_pm_add(), since we
> > > are going the make the return a result anyway?
> > 
> > Yes.
> 
> Okay, I'll prepare a patch for that, on top of the one introducing the
> 'sleeping' field into 'struct dev_pm_info' (posting in a while).

While you're at it, could you add a field to indicate whether 
begin_sleep() has been called?  It would help prevent multiple calls to 
that method when a race does occur, and it could be useful for drivers 
as well.

> The question remains what we're going to do with the drivers without pm_ops
> pointers in the long run (in the short run we will use the legacy callbacks in
> that cases, if defined).

One possibility is to unbind those drivers at the start of a sleep
transition and reprobe them at the end.  Another possibility is to
ignore the lack of PM support and hope it doesn't cause any problems.

Alan Stern


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
  2008-03-03 23:12                                                     ` Alan Stern
@ 2008-03-03 23:18                                                       ` Rafael J. Wysocki
  0 siblings, 0 replies; 38+ messages in thread
From: Rafael J. Wysocki @ 2008-03-03 23:18 UTC (permalink / raw)
  To: Alan Stern
  Cc: Linux-pm mailing list, Kernel development list, Alexey Starikovskiy

On Tuesday, 4 of March 2008, Alan Stern wrote:
> On Mon, 3 Mar 2008, Rafael J. Wysocki wrote:
> 
> > > > Perhaps it's better to include dpm_sysfs_add() into device_pm_add(), since we
> > > > are going the make the return a result anyway?
> > > 
> > > Yes.
> > 
> > Okay, I'll prepare a patch for that, on top of the one introducing the
> > 'sleeping' field into 'struct dev_pm_info' (posting in a while).
> 
> While you're at it, could you add a field to indicate whether 
> begin_sleep() has been called?  It would help prevent multiple calls to 
> that method when a race does occur, and it could be useful for drivers 
> as well.

That will be added along with the new callbacks.

I'd prefer to do all that in separate patches, so that it's clear what issue is
addressed and how, by each of them.

> > The question remains what we're going to do with the drivers without pm_ops
> > pointers in the long run (in the short run we will use the legacy callbacks in
> > that cases, if defined).
> 
> One possibility is to unbind those drivers at the start of a sleep
> transition and reprobe them at the end.

That sounds promising.

> Another possibility is to ignore the lack of PM support and hope it doesn't
> cause any problems. 

That would be easy, but that's what we do now and there are problems with it.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2008-03-03 23:19 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-02-25 15:39 Fundamental flaw in system suspend, exposed by freezer removal Alan Stern
2008-02-25 19:46 ` [linux-pm] " Alan Stern
2008-02-25 22:25   ` Rafael J. Wysocki
2008-02-25 23:37     ` Alan Stern
2008-02-26  0:07       ` Rafael J. Wysocki
2008-02-26 15:49         ` Alan Stern
2008-02-26 23:17           ` Rafael J. Wysocki
2008-02-27 16:03             ` Alan Stern
2008-02-27 19:50               ` Rafael J. Wysocki
2008-02-27 20:15                 ` Alan Stern
2008-02-28 22:49                 ` Alan Stern
2008-02-29  0:01                   ` Rafael J. Wysocki
2008-02-29 14:26                   ` Rafael J. Wysocki
2008-02-29 15:53                     ` Alan Stern
2008-02-29 17:02                       ` Rafael J. Wysocki
2008-02-29 18:42                         ` Alan Stern
2008-02-29 21:57                           ` Rafael J. Wysocki
2008-02-29 22:46                             ` Alan Stern
2008-03-01  0:13                               ` Rafael J. Wysocki
2008-03-01 15:30                                 ` Alan Stern
2008-03-02 13:37                                   ` Rafael J. Wysocki
2008-03-02 16:22                                     ` Alan Stern
2008-03-02 19:11                                       ` Rafael J. Wysocki
2008-03-03  3:54                                         ` Alan Stern
2008-03-03 16:32                                           ` Rafael J. Wysocki
2008-03-03 17:43                                             ` Alan Stern
2008-03-03 20:47                                               ` Rafael J. Wysocki
2008-03-03 22:48                                                 ` Alan Stern
2008-03-03 22:56                                                   ` Rafael J. Wysocki
2008-03-03 23:12                                                     ` Alan Stern
2008-03-03 23:18                                                       ` Rafael J. Wysocki
2008-02-26  7:13     ` David Brownell
2008-02-26  8:25       ` David Newall
2008-02-26  9:16         ` David Brownell
2008-02-26 13:36           ` David Newall
2008-02-26 15:58             ` Alan Stern
2008-02-25 22:24 ` Rafael J. Wysocki
2008-02-27 20:36 ` Benjamin Herrenschmidt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).