PM: sleep: core: Fix the handling of pending runtime resume requests
diff mbox series

Message ID 7969920.MVx1BpXlEM@kreacher
State Superseded
Headers show
Series
  • PM: sleep: core: Fix the handling of pending runtime resume requests
Related show

Commit Message

Rafael J. Wysocki Aug. 21, 2020, 5:41 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

It has been reported that system-wide suspend may be aborted in the
absence of any wakeup events due to unforseen interactions of it with
the runtume PM framework.

One failing scenario is when there are multiple devices sharing an
ACPI power resource and runtime-resume needs to be carried out for
one of them during system-wide suspend (for example, because it needs
to be reconfigured before the whole system goes to sleep).  In that
case, the runtime-resume of that device involves turning the ACPI
power resource "on" which in turn causes runtime resume requests
to be queued up for all of the other devices sharing it.  Those
requests go to the runtime PM workqueue which is frozen during
system-wide suspend, so they are not actually taken care of until
the resume of the whole system, but the pm_runtime_barrier()
call in __device_suspend() sees them and triggers system wakeup
events for them which then cause the system-wide suspend to be
aborted if wakeup source objects are in active use.

Of course, the logic that leads to triggering those wakeup events is
questionable in the first place, because clearly there are cases in
which a pending runtime resume request for a device is not connected
to any real wakeup events in any way (like the one above).  Moreover,
if there is a pending runtime resume request for a device while
__device_suspend() is running for it, the physical state of the
device may not be in agreement with the "suspended" runtime PM status
of it (which may be the very reason for queuing up the runtime resume
request for it).

For these reasons, rework __device_suspend() to carry out synchronous
runtime-resume for devices with pending runtime resume requests before
attempting to invoke system-wide suspend callbacks for them with the
expectation that their drivers will trigger system-wide wakeup events
in the process of handling the runtime resume, if necessary.

Fixes: 1e2ef05bb8cf8 ("PM: Limit race conditions between runtime PM and system sleep (v2)")
Reported-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/base/power/main.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Comments

Alan Stern Aug. 21, 2020, 7:34 p.m. UTC | #1
On Fri, Aug 21, 2020 at 07:41:02PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> It has been reported that system-wide suspend may be aborted in the
> absence of any wakeup events due to unforseen interactions of it with
> the runtume PM framework.
> 
> One failing scenario is when there are multiple devices sharing an
> ACPI power resource and runtime-resume needs to be carried out for
> one of them during system-wide suspend (for example, because it needs
> to be reconfigured before the whole system goes to sleep).  In that
> case, the runtime-resume of that device involves turning the ACPI
> power resource "on" which in turn causes runtime resume requests
> to be queued up for all of the other devices sharing it.  Those
> requests go to the runtime PM workqueue which is frozen during
> system-wide suspend, so they are not actually taken care of until
> the resume of the whole system, but the pm_runtime_barrier()
> call in __device_suspend() sees them and triggers system wakeup
> events for them which then cause the system-wide suspend to be
> aborted if wakeup source objects are in active use.
> 
> Of course, the logic that leads to triggering those wakeup events is
> questionable in the first place, because clearly there are cases in
> which a pending runtime resume request for a device is not connected
> to any real wakeup events in any way (like the one above).  Moreover,
> if there is a pending runtime resume request for a device while
> __device_suspend() is running for it, the physical state of the
> device may not be in agreement with the "suspended" runtime PM status
> of it (which may be the very reason for queuing up the runtime resume
> request for it).
> 
> For these reasons, rework __device_suspend() to carry out synchronous
> runtime-resume for devices with pending runtime resume requests before
> attempting to invoke system-wide suspend callbacks for them with the
> expectation that their drivers will trigger system-wide wakeup events
> in the process of handling the runtime resume, if necessary.
> 
> Fixes: 1e2ef05bb8cf8 ("PM: Limit race conditions between runtime PM and system sleep (v2)")
> Reported-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/base/power/main.c |   12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> Index: linux-pm/drivers/base/power/main.c
> ===================================================================
> --- linux-pm.orig/drivers/base/power/main.c
> +++ linux-pm/drivers/base/power/main.c
> @@ -1606,13 +1606,13 @@ static int __device_suspend(struct devic
>  	}
>  
>  	/*
> -	 * If a device configured to wake up the system from sleep states
> -	 * has been suspended at run time and there's a resume request pending
> -	 * for it, this is equivalent to the device signaling wakeup, so the
> -	 * system suspend operation should be aborted.
> +	 * If there's a runtime resume request pending for the device, resume
> +	 * it before proceeding with invoking the system-wide suspend callbacks
> +	 * for it, because the physical state of the device may not reflect the
> +	 * "suspended" runtime PM status already in that case.
>  	 */
> -	if (pm_runtime_barrier(dev) && device_may_wakeup(dev))
> -		pm_wakeup_event(dev, 0);
> +	if (pm_runtime_barrier(dev))
> +		pm_runtime_resume(dev);

Is this really right?  Note that whenever pm_runtime_barrier() returns a 
nonzero value, it already calls rpm_resume(dev, 0).  So the 
pm_runtime_resume() call added here is redundant.

Furthermore, by the logic used in this patch, the call to 
pm_wakeup_event() in the original code is also redundant: Any required 
wakeup event should have been generated when the runtime resume inside 
pm_runtime_barrer() was carried out.  Removing a redundant function call 
can't fix a bug!

This means that the code could be simplified to just:

	pm_runtime_barrier(dev);

Will this fix the reported bug?  It seems likely to me that the actual 
problem with the failure scenario in the patch description was that 
turning on an ACPI power resource causes runtime-resume requests to be 
queued for all devices sharing that resource.  Wouldn't it make more 
sense to resume only the device that requested it and leave the others 
in runtime suspend?

Alan Stern
Mika Westerberg Aug. 24, 2020, 8:34 a.m. UTC | #2
Hi,

On Fri, Aug 21, 2020 at 03:34:42PM -0400, Alan Stern wrote:
> This means that the code could be simplified to just:
> 
> 	pm_runtime_barrier(dev);
> 
> Will this fix the reported bug?  It seems likely to me that the actual 
> problem with the failure scenario in the patch description was that 
> turning on an ACPI power resource causes runtime-resume requests to be 
> queued for all devices sharing that resource.  Wouldn't it make more 
> sense to resume only the device that requested it and leave the others 
> in runtime suspend?

The problem with at least PCIe devices that share ACPI power resources
is that once you turn on the power resource all the devices that shared
it will go into D0uninitialized power state and that means they lose all
wake configuration etc. so they need to be re-initialized by their
driver before they can go back to D3(cold) in order for their wakes to
still work.
Rafael J. Wysocki Aug. 24, 2020, 1:36 p.m. UTC | #3
On Friday, August 21, 2020 9:34:42 PM CEST Alan Stern wrote:
> On Fri, Aug 21, 2020 at 07:41:02PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > 
> > It has been reported that system-wide suspend may be aborted in the
> > absence of any wakeup events due to unforseen interactions of it with
> > the runtume PM framework.
> > 
> > One failing scenario is when there are multiple devices sharing an
> > ACPI power resource and runtime-resume needs to be carried out for
> > one of them during system-wide suspend (for example, because it needs
> > to be reconfigured before the whole system goes to sleep).  In that
> > case, the runtime-resume of that device involves turning the ACPI
> > power resource "on" which in turn causes runtime resume requests
> > to be queued up for all of the other devices sharing it.  Those
> > requests go to the runtime PM workqueue which is frozen during
> > system-wide suspend, so they are not actually taken care of until
> > the resume of the whole system, but the pm_runtime_barrier()
> > call in __device_suspend() sees them and triggers system wakeup
> > events for them which then cause the system-wide suspend to be
> > aborted if wakeup source objects are in active use.
> > 
> > Of course, the logic that leads to triggering those wakeup events is
> > questionable in the first place, because clearly there are cases in
> > which a pending runtime resume request for a device is not connected
> > to any real wakeup events in any way (like the one above).  Moreover,
> > if there is a pending runtime resume request for a device while
> > __device_suspend() is running for it, the physical state of the
> > device may not be in agreement with the "suspended" runtime PM status
> > of it (which may be the very reason for queuing up the runtime resume
> > request for it).
> > 
> > For these reasons, rework __device_suspend() to carry out synchronous
> > runtime-resume for devices with pending runtime resume requests before
> > attempting to invoke system-wide suspend callbacks for them with the
> > expectation that their drivers will trigger system-wide wakeup events
> > in the process of handling the runtime resume, if necessary.
> > 
> > Fixes: 1e2ef05bb8cf8 ("PM: Limit race conditions between runtime PM and system sleep (v2)")
> > Reported-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >  drivers/base/power/main.c |   12 ++++++------
> >  1 file changed, 6 insertions(+), 6 deletions(-)
> > 
> > Index: linux-pm/drivers/base/power/main.c
> > ===================================================================
> > --- linux-pm.orig/drivers/base/power/main.c
> > +++ linux-pm/drivers/base/power/main.c
> > @@ -1606,13 +1606,13 @@ static int __device_suspend(struct devic
> >  	}
> >  
> >  	/*
> > -	 * If a device configured to wake up the system from sleep states
> > -	 * has been suspended at run time and there's a resume request pending
> > -	 * for it, this is equivalent to the device signaling wakeup, so the
> > -	 * system suspend operation should be aborted.
> > +	 * If there's a runtime resume request pending for the device, resume
> > +	 * it before proceeding with invoking the system-wide suspend callbacks
> > +	 * for it, because the physical state of the device may not reflect the
> > +	 * "suspended" runtime PM status already in that case.
> >  	 */
> > -	if (pm_runtime_barrier(dev) && device_may_wakeup(dev))
> > -		pm_wakeup_event(dev, 0);
> > +	if (pm_runtime_barrier(dev))
> > +		pm_runtime_resume(dev);
> 
> Is this really right?  Note that whenever pm_runtime_barrier() returns a 
> nonzero value, it already calls rpm_resume(dev, 0).  So the 
> pm_runtime_resume() call added here is redundant.

Yes, it is (it seems that I've looked at __pm_runtime_barrier() alone, not sure
why).

> Furthermore, by the logic used in this patch, the call to 
> pm_wakeup_event() in the original code is also redundant: Any required 
> wakeup event should have been generated when the runtime resume inside 
> pm_runtime_barrer() was carried out.

It should be redundant in the real wakeup event cases, but it may cause
spurious suspend aborts to occur when there are no real system wakeup
events.

Actually, the original code is racy with respect to system wakeup events,
because it depends on the exact time when the runtime-resume starts.  Namely,
if it manages to start before the freezing of pm_wq, the wakeup will be lost
unless the driver takes care of reporting it, which means that drivers really
need to do that anyway.  And if they do that (which hopefully is the case), the
pm_wakeup_event() call in the core may be dropped.

> Removing a redundant function call can't fix a bug!

That's true, but this particular call is redundant only if there is a real
wakeup event.  Otherwise it may produce a spurious wakeup.

> This means that the code could be simplified to just:
> 
> 	pm_runtime_barrier(dev);

Yes, it could, so I'm going to re-spin the patch with this code simplification
and updated changelog.

> Will this fix the reported bug?

I think so.

> It seems likely to me that the actual 
> problem with the failure scenario in the patch description was that 
> turning on an ACPI power resource causes runtime-resume requests to be 
> queued for all devices sharing that resource.  Wouldn't it make more 
> sense to resume only the device that requested it and leave the others 
> in runtime suspend?

Please see the Mika's reply for a good explanation of this.

Cheers!
Rafael J. Wysocki Aug. 24, 2020, 1:38 p.m. UTC | #4
On Monday, August 24, 2020 10:34:52 AM CEST Mika Westerberg wrote:
> Hi,
> 
> On Fri, Aug 21, 2020 at 03:34:42PM -0400, Alan Stern wrote:
> > This means that the code could be simplified to just:
> > 
> > 	pm_runtime_barrier(dev);
> > 
> > Will this fix the reported bug?  It seems likely to me that the actual 
> > problem with the failure scenario in the patch description was that 
> > turning on an ACPI power resource causes runtime-resume requests to be 
> > queued for all devices sharing that resource.  Wouldn't it make more 
> > sense to resume only the device that requested it and leave the others 
> > in runtime suspend?
> 
> The problem with at least PCIe devices that share ACPI power resources
> is that once you turn on the power resource all the devices that shared
> it will go into D0uninitialized power state and that means they lose all
> wake configuration etc. so they need to be re-initialized by their
> driver before they can go back to D3(cold) in order for their wakes to
> still work.

Plus devices in D0uninitialized may prevent the SoC from allowing package
C-states to be used for the processor AFAICS.

BTW, does the patch make the issue at hand go away?
Mika Westerberg Aug. 24, 2020, 1:59 p.m. UTC | #5
On Mon, Aug 24, 2020 at 03:38:39PM +0200, Rafael J. Wysocki wrote:
> BTW, does the patch make the issue at hand go away?

I asked the folks who have this particular hardware to try it out as I
don't have access to this one. Hopefully we get the results back soon
and once we do, I'll let you know.
Alan Stern Aug. 24, 2020, 3:04 p.m. UTC | #6
On Mon, Aug 24, 2020 at 03:36:36PM +0200, Rafael J. Wysocki wrote:
> > Furthermore, by the logic used in this patch, the call to 
> > pm_wakeup_event() in the original code is also redundant: Any required 
> > wakeup event should have been generated when the runtime resume inside 
> > pm_runtime_barrer() was carried out.
> 
> It should be redundant in the real wakeup event cases, but it may cause
> spurious suspend aborts to occur when there are no real system wakeup
> events.
> 
> Actually, the original code is racy with respect to system wakeup events,
> because it depends on the exact time when the runtime-resume starts.  Namely,
> if it manages to start before the freezing of pm_wq, the wakeup will be lost
> unless the driver takes care of reporting it, which means that drivers really
> need to do that anyway.  And if they do that (which hopefully is the case), the
> pm_wakeup_event() call in the core may be dropped.

In other words, wakeup events are supposed to be reported at the time 
the wakeup request is first noticed, right?  We don't want to wait until 
a resume or runtime_resume callback runs; thanks to this race the 
callback might not run at all if the event isn't reported first.

Therefore the reasoning behind the original code appears to have been 
highly suspect.  If there already was a queued runtime-resume request 
for the device and the device was wakeup-enabled, the wakeup event 
should _already_ have been reported at the time the request was queued.  
And we shouldn't rely on it being reported by the runtime-resume 
callback routine.

> > This means that the code could be simplified to just:
> > 
> > 	pm_runtime_barrier(dev);
> 
> Yes, it could, so I'm going to re-spin the patch with this code simplification
> and updated changelog.
> 
> > Will this fix the reported bug?
> 
> I think so.

Okay, we'll see!

Alan Stern
Rafael J. Wysocki Aug. 24, 2020, 5:31 p.m. UTC | #7
On Monday, August 24, 2020 5:04:21 PM CEST Alan Stern wrote:
> On Mon, Aug 24, 2020 at 03:36:36PM +0200, Rafael J. Wysocki wrote:
> > > Furthermore, by the logic used in this patch, the call to 
> > > pm_wakeup_event() in the original code is also redundant: Any required 
> > > wakeup event should have been generated when the runtime resume inside 
> > > pm_runtime_barrer() was carried out.
> > 
> > It should be redundant in the real wakeup event cases, but it may cause
> > spurious suspend aborts to occur when there are no real system wakeup
> > events.
> > 
> > Actually, the original code is racy with respect to system wakeup events,
> > because it depends on the exact time when the runtime-resume starts.  Namely,
> > if it manages to start before the freezing of pm_wq, the wakeup will be lost
> > unless the driver takes care of reporting it, which means that drivers really
> > need to do that anyway.  And if they do that (which hopefully is the case), the
> > pm_wakeup_event() call in the core may be dropped.
> 
> In other words, wakeup events are supposed to be reported at the time 
> the wakeup request is first noticed, right?

That's correct.

> We don't want to wait until 
> a resume or runtime_resume callback runs; thanks to this race the 
> callback might not run at all if the event isn't reported first.

The callback will run, either through the wq or by the pm_runtime_barrier(),
but if it runs through the wq, pm_runtime_barrier() will return 0 and
pm_wakeup_event() will not called by the core, so it must be called from
elsewhere anyway.

> Therefore the reasoning behind the original code appears to have been 
> highly suspect.

Indeed.

> If there already was a queued runtime-resume request 
> for the device and the device was wakeup-enabled, the wakeup event 
> should _already_ have been reported at the time the request was queued.  
> And we shouldn't rely on it being reported by the runtime-resume 
> callback routine.

Right.

> > > This means that the code could be simplified to just:
> > > 
> > > 	pm_runtime_barrier(dev);
> > 
> > Yes, it could, so I'm going to re-spin the patch with this code simplification
> > and updated changelog.
> > 
> > > Will this fix the reported bug?
> > 
> > I think so.
> 
> Okay, we'll see!

Fair enough!

Patch
diff mbox series

Index: linux-pm/drivers/base/power/main.c
===================================================================
--- linux-pm.orig/drivers/base/power/main.c
+++ linux-pm/drivers/base/power/main.c
@@ -1606,13 +1606,13 @@  static int __device_suspend(struct devic
 	}
 
 	/*
-	 * If a device configured to wake up the system from sleep states
-	 * has been suspended at run time and there's a resume request pending
-	 * for it, this is equivalent to the device signaling wakeup, so the
-	 * system suspend operation should be aborted.
+	 * If there's a runtime resume request pending for the device, resume
+	 * it before proceeding with invoking the system-wide suspend callbacks
+	 * for it, because the physical state of the device may not reflect the
+	 * "suspended" runtime PM status already in that case.
 	 */
-	if (pm_runtime_barrier(dev) && device_may_wakeup(dev))
-		pm_wakeup_event(dev, 0);
+	if (pm_runtime_barrier(dev))
+		pm_runtime_resume(dev);
 
 	if (pm_wakeup_pending()) {
 		dev->power.direct_complete = false;