All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] driver core: Fixes related to device links
@ 2019-02-12 12:01 Rafael J. Wysocki
  2019-02-12 12:04 ` [PATCH 1/2] PM-runtime: Fix __pm_runtime_set_status() race with runtime resume Rafael J. Wysocki
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Rafael J. Wysocki @ 2019-02-12 12:01 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: LKML, Linux PM, Ulf Hansson, Daniel Vetter, Lukas Wunner,
	Andrzej Hajda, Russell King - ARM Linux, Lucas Stach,
	Linus Walleij, Thierry Reding, Laurent Pinchart,
	Marek Szyprowski

Hi Greg at al,

These fix two issues on top of the recent device links material in
driver-core/driver-core-next.  

The first one fixes a race condition that may trigger when
__pm_runtime_set_status() is used incorrectly (that is, when it is
called with PM-runtime enabled for the target device and working).

The second one fixes a supplier PM-runtime usage counter imbalance
resulting from adding and removing (e.g. in the error code path) a
stateless device link to it from within the consumer driver's probe
callback.

Please refer to the patch changelogs for details.

Cheers,
Rafael




^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/2] PM-runtime: Fix __pm_runtime_set_status() race with runtime resume
  2019-02-12 12:01 [PATCH 0/2] driver core: Fixes related to device links Rafael J. Wysocki
@ 2019-02-12 12:04 ` Rafael J. Wysocki
  2019-02-12 16:17   ` Ulf Hansson
  2019-02-12 20:14   ` Ulf Hansson
  2019-02-12 12:08 ` [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance Rafael J. Wysocki
  2019-02-12 14:09 ` [PATCH 0/2] driver core: Fixes related to device links Greg Kroah-Hartman
  2 siblings, 2 replies; 28+ messages in thread
From: Rafael J. Wysocki @ 2019-02-12 12:04 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: LKML, Linux PM, Ulf Hansson, Daniel Vetter, Lukas Wunner,
	Andrzej Hajda, Russell King - ARM Linux, Lucas Stach,
	Linus Walleij, Thierry Reding, Laurent Pinchart,
	Marek Szyprowski

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Commit 4080ab083000 ("PM-runtime: Take suppliers into account in
__pm_runtime_set_status()") introduced a race condition that may
trigger if __pm_runtime_set_status() is used incorrectly (that is,
if it is called when PM-runtime is enabled for the target device
and working).

In that case, if the original PM-runtime status of the device is
RPM_SUSPENDED, a runtime resume of the device may occur after
__pm_runtime_set_status() has dropped its power.lock spinlock
and before deactivating its suppliers, so the suppliers may be
deactivated while the device is PM-runtime-active which may lead
to functional issues.

To avoid that, modify __pm_runtime_set_status() to check whether
or not PM-runtime is enabled for the device before activating its
suppliers (if the new status is RPM_ACTIVE) and either return an
error if that's the case or increment the device's disable_depth
counter to prevent PM-runtime from being enabled for it while
the remaining part of the function is running (disable_depth is
then decremented on the way out).

Fixes: 4080ab083000 ("PM-runtime: Take suppliers into account in __pm_runtime_set_status()")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/base/power/runtime.c |   24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

Index: linux-pm/drivers/base/power/runtime.c
===================================================================
--- linux-pm.orig/drivers/base/power/runtime.c
+++ linux-pm/drivers/base/power/runtime.c
@@ -1129,6 +1129,22 @@ int __pm_runtime_set_status(struct devic
 	if (status != RPM_ACTIVE && status != RPM_SUSPENDED)
 		return -EINVAL;
 
+	spin_lock_irq(&dev->power.lock);
+
+	/*
+	 * Prevent PM-runtime from being enabled for the device or return an
+	 * error if it is enabled already and working.
+	 */
+	if (dev->power.runtime_error || dev->power.disable_depth)
+		dev->power.disable_depth++;
+	else
+		error = -EAGAIN;
+
+	spin_unlock_irq(&dev->power.lock);
+
+	if (error)
+		return error;
+
 	/*
 	 * If the new status is RPM_ACTIVE, the suppliers can be activated
 	 * upfront regardless of the current status, because next time
@@ -1147,12 +1163,6 @@ int __pm_runtime_set_status(struct devic
 
 	spin_lock_irq(&dev->power.lock);
 
-	if (!dev->power.runtime_error && !dev->power.disable_depth) {
-		status = dev->power.runtime_status;
-		error = -EAGAIN;
-		goto out;
-	}
-
 	if (dev->power.runtime_status == status || !parent)
 		goto out_set;
 
@@ -1205,6 +1215,8 @@ int __pm_runtime_set_status(struct devic
 		device_links_read_unlock(idx);
 	}
 
+	pm_runtime_enable(dev);
+
 	return error;
 }
 EXPORT_SYMBOL_GPL(__pm_runtime_set_status);


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance
  2019-02-12 12:01 [PATCH 0/2] driver core: Fixes related to device links Rafael J. Wysocki
  2019-02-12 12:04 ` [PATCH 1/2] PM-runtime: Fix __pm_runtime_set_status() race with runtime resume Rafael J. Wysocki
@ 2019-02-12 12:08 ` Rafael J. Wysocki
  2019-02-12 21:02   ` Ulf Hansson
  2019-02-15 11:00   ` Jon Hunter
  2019-02-12 14:09 ` [PATCH 0/2] driver core: Fixes related to device links Greg Kroah-Hartman
  2 siblings, 2 replies; 28+ messages in thread
From: Rafael J. Wysocki @ 2019-02-12 12:08 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: LKML, Linux PM, Ulf Hansson, Daniel Vetter, Lukas Wunner,
	Andrzej Hajda, Russell King - ARM Linux, Lucas Stach,
	Linus Walleij, Thierry Reding, Laurent Pinchart,
	Marek Szyprowski

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

If a stateless device link to a certain supplier with
DL_FLAG_PM_RUNTIME set in the flags is added and then removed by the
consumer driver's probe callback, the supplier's PM-runtime usage
counter will be nonzero after that which effectively causes the
supplier to remain "always on" going forward.

Namely, device_link_add() called to add the link invokes
device_link_rpm_prepare() which notices that the consumer driver is
probing, so it increments the supplier's PM-runtime usage counter
with the assumption that the link will stay around until
pm_runtime_put_suppliers() is called by driver_probe_device(),
but if the link goes away before that point, the supplier's
PM-runtime usage counter will remain nonzero.

To prevent that from happening, first rework pm_runtime_get_suppliers()
and pm_runtime_put_suppliers() to use the rpm_active refounts of device
links and make the latter only drop rpm_active and the supplier's
PM-runtime usage counter for each link by one, unless rpm_active is
one already for it.  Next, modify device_link_add() to bump up the
new link's rpm_active refcount and the suppliers PM-runtime usage
counter by two, to prevent pm_runtime_put_suppliers(), if it is
called subsequently, from suspending the supplier prematurely (in
case its PM-runtime usage counter goes down to 0 in there).

Due to the way rpm_put_suppliers() works, this change does not
affect runtime suspend of the consumer ends of new device links (or,
generally, device links for which DL_FLAG_PM_RUNTIME has just been
set).

Fixes: e2f3cd831a28 ("driver core: Fix handling of runtime PM flags in device_link_add()")
Reported-by: Ulf Hansson <ulf.hansson@linaro.org> 
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

Note that the issue had been there before commit e2f3cd831a28, but it was
overlooked by that commit and this change is a fix on top of it, so make
the Fixes: tag point to commit e2f3cd831a28 (instead of an earlier one
that the patch will not be applicable to).

---
 drivers/base/core.c          |   21 ++++-----------------
 drivers/base/power/runtime.c |   27 +++++++++++++++++++++++++--
 include/linux/pm_runtime.h   |    4 ++++
 3 files changed, 33 insertions(+), 19 deletions(-)

Index: linux-pm/drivers/base/power/runtime.c
===================================================================
--- linux-pm.orig/drivers/base/power/runtime.c
+++ linux-pm/drivers/base/power/runtime.c
@@ -1655,8 +1655,10 @@ void pm_runtime_get_suppliers(struct dev
 	idx = device_links_read_lock();
 
 	list_for_each_entry_rcu(link, &dev->links.suppliers, c_node)
-		if (link->flags & DL_FLAG_PM_RUNTIME)
+		if (link->flags & DL_FLAG_PM_RUNTIME) {
+			refcount_inc(&link->rpm_active);
 			pm_runtime_get_sync(link->supplier);
+		}
 
 	device_links_read_unlock(idx);
 }
@@ -1673,7 +1675,8 @@ void pm_runtime_put_suppliers(struct dev
 	idx = device_links_read_lock();
 
 	list_for_each_entry_rcu(link, &dev->links.suppliers, c_node)
-		if (link->flags & DL_FLAG_PM_RUNTIME)
+		if (link->flags & DL_FLAG_PM_RUNTIME &&
+		    refcount_dec_not_one(&link->rpm_active))
 			pm_runtime_put(link->supplier);
 
 	device_links_read_unlock(idx);
@@ -1686,6 +1689,26 @@ void pm_runtime_new_link(struct device *
 	spin_unlock_irq(&dev->power.lock);
 }
 
+/**
+ * pm_runtime_active_link - Set up new device link as active for PM-runtime.
+ * @link: Device link to be set up as active.
+ * @supplier: Supplier end of the link.
+ *
+ * Add 2 to the rpm_active refcount of @link and increment the PM-runtime
+ * usage counter of @supplier once more in case the link is being added while
+ * the consumer driver is probing and pm_runtime_put_suppliers() will be called
+ * subsequently.
+ *
+ * Note that this doesn't prevent rpm_put_suppliers() from decreasing the link's
+ * rpm_active refcount down to one, so runtime suspend of the consumer end of
+ * @link is not affected.
+ */
+void pm_runtime_active_link(struct device_link *link, struct device *supplier)
+{
+	refcount_add(2, &link->rpm_active);
+	pm_runtime_get_noresume(supplier);
+}
+
 void pm_runtime_drop_link(struct device *dev)
 {
 	spin_lock_irq(&dev->power.lock);
Index: linux-pm/drivers/base/core.c
===================================================================
--- linux-pm.orig/drivers/base/core.c
+++ linux-pm/drivers/base/core.c
@@ -165,19 +165,6 @@ void device_pm_move_to_tail(struct devic
 	device_links_read_unlock(idx);
 }
 
-static void device_link_rpm_prepare(struct device *consumer,
-				    struct device *supplier)
-{
-	pm_runtime_new_link(consumer);
-	/*
-	 * If the link is being added by the consumer driver at probe time,
-	 * balance the decrementation of the supplier's runtime PM usage counter
-	 * after consumer probe in driver_probe_device().
-	 */
-	if (consumer->links.status == DL_DEV_PROBING)
-		pm_runtime_get_noresume(supplier);
-}
-
 /**
  * device_link_add - Create a link between two devices.
  * @consumer: Consumer end of the link.
@@ -286,11 +273,11 @@ struct device_link *device_link_add(stru
 
 		if (flags & DL_FLAG_PM_RUNTIME) {
 			if (!(link->flags & DL_FLAG_PM_RUNTIME)) {
-				device_link_rpm_prepare(consumer, supplier);
+				pm_runtime_new_link(consumer);
 				link->flags |= DL_FLAG_PM_RUNTIME;
 			}
 			if (flags & DL_FLAG_RPM_ACTIVE)
-				refcount_inc(&link->rpm_active);
+				pm_runtime_active_link(link, supplier);
 		}
 
 		if (flags & DL_FLAG_STATELESS) {
@@ -323,9 +310,9 @@ struct device_link *device_link_add(stru
 
 	if (flags & DL_FLAG_PM_RUNTIME) {
 		if (flags & DL_FLAG_RPM_ACTIVE)
-			refcount_inc(&link->rpm_active);
+			pm_runtime_active_link(link, supplier);
 
-		device_link_rpm_prepare(consumer, supplier);
+		pm_runtime_new_link(consumer);
 	}
 
 	get_device(supplier);
Index: linux-pm/include/linux/pm_runtime.h
===================================================================
--- linux-pm.orig/include/linux/pm_runtime.h
+++ linux-pm/include/linux/pm_runtime.h
@@ -59,6 +59,8 @@ extern void pm_runtime_clean_up_links(st
 extern void pm_runtime_get_suppliers(struct device *dev);
 extern void pm_runtime_put_suppliers(struct device *dev);
 extern void pm_runtime_new_link(struct device *dev);
+extern void pm_runtime_active_link(struct device_link *link,
+				   struct device *supplier);
 extern void pm_runtime_drop_link(struct device *dev);
 
 static inline void pm_suspend_ignore_children(struct device *dev, bool enable)
@@ -178,6 +180,8 @@ static inline void pm_runtime_clean_up_l
 static inline void pm_runtime_get_suppliers(struct device *dev) {}
 static inline void pm_runtime_put_suppliers(struct device *dev) {}
 static inline void pm_runtime_new_link(struct device *dev) {}
+static inline void pm_runtime_active_link(struct device_link *link,
+					  struct device *supplier) {}
 static inline void pm_runtime_drop_link(struct device *dev) {}
 
 #endif /* !CONFIG_PM */


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/2] driver core: Fixes related to device links
  2019-02-12 12:01 [PATCH 0/2] driver core: Fixes related to device links Rafael J. Wysocki
  2019-02-12 12:04 ` [PATCH 1/2] PM-runtime: Fix __pm_runtime_set_status() race with runtime resume Rafael J. Wysocki
  2019-02-12 12:08 ` [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance Rafael J. Wysocki
@ 2019-02-12 14:09 ` Greg Kroah-Hartman
  2019-02-12 14:52   ` Ulf Hansson
  2 siblings, 1 reply; 28+ messages in thread
From: Greg Kroah-Hartman @ 2019-02-12 14:09 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: LKML, Linux PM, Ulf Hansson, Daniel Vetter, Lukas Wunner,
	Andrzej Hajda, Russell King - ARM Linux, Lucas Stach,
	Linus Walleij, Thierry Reding, Laurent Pinchart,
	Marek Szyprowski

On Tue, Feb 12, 2019 at 01:01:13PM +0100, Rafael J. Wysocki wrote:
> Hi Greg at al,
> 
> These fix two issues on top of the recent device links material in
> driver-core/driver-core-next.  
> 
> The first one fixes a race condition that may trigger when
> __pm_runtime_set_status() is used incorrectly (that is, when it is
> called with PM-runtime enabled for the target device and working).
> 
> The second one fixes a supplier PM-runtime usage counter imbalance
> resulting from adding and removing (e.g. in the error code path) a
> stateless device link to it from within the consumer driver's probe
> callback.
> 
> Please refer to the patch changelogs for details.

Looks good, all now queued up, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/2] driver core: Fixes related to device links
  2019-02-12 14:09 ` [PATCH 0/2] driver core: Fixes related to device links Greg Kroah-Hartman
@ 2019-02-12 14:52   ` Ulf Hansson
  2019-02-12 15:04     ` Rafael J. Wysocki
  2019-02-12 15:06     ` Greg Kroah-Hartman
  0 siblings, 2 replies; 28+ messages in thread
From: Ulf Hansson @ 2019-02-12 14:52 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Rafael J. Wysocki, LKML, Linux PM, Daniel Vetter, Lukas Wunner,
	Andrzej Hajda, Russell King - ARM Linux, Lucas Stach,
	Linus Walleij, Thierry Reding, Laurent Pinchart,
	Marek Szyprowski

On Tue, 12 Feb 2019 at 15:09, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Feb 12, 2019 at 01:01:13PM +0100, Rafael J. Wysocki wrote:
> > Hi Greg at al,
> >
> > These fix two issues on top of the recent device links material in
> > driver-core/driver-core-next.
> >
> > The first one fixes a race condition that may trigger when
> > __pm_runtime_set_status() is used incorrectly (that is, when it is
> > called with PM-runtime enabled for the target device and working).
> >
> > The second one fixes a supplier PM-runtime usage counter imbalance
> > resulting from adding and removing (e.g. in the error code path) a
> > stateless device link to it from within the consumer driver's probe
> > callback.
> >
> > Please refer to the patch changelogs for details.
>
> Looks good, all now queued up, thanks.

Greg, please don't get me wrong, but ~1.5 hours isn't sufficient for
me to review/test submitted patches.

I have been trying to collaborate (review/test) device links related
code with Rafael, but what's the point if you queue up the patches,
before I even got the change to look at them. Shall I interpret it as
you don't care about me reviewing this, then just tell me so I don't
have to waste my time.

Kind regards
Uffe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/2] driver core: Fixes related to device links
  2019-02-12 14:52   ` Ulf Hansson
@ 2019-02-12 15:04     ` Rafael J. Wysocki
  2019-02-12 15:06     ` Greg Kroah-Hartman
  1 sibling, 0 replies; 28+ messages in thread
From: Rafael J. Wysocki @ 2019-02-12 15:04 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Greg Kroah-Hartman, Rafael J. Wysocki, LKML, Linux PM,
	Daniel Vetter, Lukas Wunner, Andrzej Hajda,
	Russell King - ARM Linux, Lucas Stach, Linus Walleij,
	Thierry Reding, Laurent Pinchart, Marek Szyprowski

On Tue, Feb 12, 2019 at 3:53 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Tue, 12 Feb 2019 at 15:09, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Tue, Feb 12, 2019 at 01:01:13PM +0100, Rafael J. Wysocki wrote:
> > > Hi Greg at al,
> > >
> > > These fix two issues on top of the recent device links material in
> > > driver-core/driver-core-next.
> > >
> > > The first one fixes a race condition that may trigger when
> > > __pm_runtime_set_status() is used incorrectly (that is, when it is
> > > called with PM-runtime enabled for the target device and working).
> > >
> > > The second one fixes a supplier PM-runtime usage counter imbalance
> > > resulting from adding and removing (e.g. in the error code path) a
> > > stateless device link to it from within the consumer driver's probe
> > > callback.
> > >
> > > Please refer to the patch changelogs for details.
> >
> > Looks good, all now queued up, thanks.
>
> Greg, please don't get me wrong, but ~1.5 hours isn't sufficient for
> me to review/test submitted patches.
>
> I have been trying to collaborate (review/test) device links related
> code with Rafael, but what's the point if you queue up the patches,
> before I even got the change to look at them. Shall I interpret it as
> you don't care about me reviewing this, then just tell me so I don't
> have to waste my time.

I certainly do care about that.

Moreover, if you find any issues in the patches, they still can be
dropped or incremental fixes on top of them can be done.

Your work and feedback here is much appreciated, please don't drop the ball. :-)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/2] driver core: Fixes related to device links
  2019-02-12 14:52   ` Ulf Hansson
  2019-02-12 15:04     ` Rafael J. Wysocki
@ 2019-02-12 15:06     ` Greg Kroah-Hartman
  2019-02-12 16:20         ` Rafael J. Wysocki
  1 sibling, 1 reply; 28+ messages in thread
From: Greg Kroah-Hartman @ 2019-02-12 15:06 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Rafael J. Wysocki, LKML, Linux PM, Daniel Vetter, Lukas Wunner,
	Andrzej Hajda, Russell King - ARM Linux, Lucas Stach,
	Linus Walleij, Thierry Reding, Laurent Pinchart,
	Marek Szyprowski

On Tue, Feb 12, 2019 at 03:52:53PM +0100, Ulf Hansson wrote:
> On Tue, 12 Feb 2019 at 15:09, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Tue, Feb 12, 2019 at 01:01:13PM +0100, Rafael J. Wysocki wrote:
> > > Hi Greg at al,
> > >
> > > These fix two issues on top of the recent device links material in
> > > driver-core/driver-core-next.
> > >
> > > The first one fixes a race condition that may trigger when
> > > __pm_runtime_set_status() is used incorrectly (that is, when it is
> > > called with PM-runtime enabled for the target device and working).
> > >
> > > The second one fixes a supplier PM-runtime usage counter imbalance
> > > resulting from adding and removing (e.g. in the error code path) a
> > > stateless device link to it from within the consumer driver's probe
> > > callback.
> > >
> > > Please refer to the patch changelogs for details.
> >
> > Looks good, all now queued up, thanks.
> 
> Greg, please don't get me wrong, but ~1.5 hours isn't sufficient for
> me to review/test submitted patches.
> 
> I have been trying to collaborate (review/test) device links related
> code with Rafael, but what's the point if you queue up the patches,
> before I even got the change to look at them. Shall I interpret it as
> you don't care about me reviewing this, then just tell me so I don't
> have to waste my time.

As they are just in my -testing branch, I can easily drop them now if
you find problems.  I didn't realize that Rafael was wanting you to
review this as they were marked as "fixes:" for previous patches.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] PM-runtime: Fix __pm_runtime_set_status() race with runtime resume
  2019-02-12 12:04 ` [PATCH 1/2] PM-runtime: Fix __pm_runtime_set_status() race with runtime resume Rafael J. Wysocki
@ 2019-02-12 16:17   ` Ulf Hansson
  2019-02-12 16:28     ` Rafael J. Wysocki
  2019-02-12 20:14   ` Ulf Hansson
  1 sibling, 1 reply; 28+ messages in thread
From: Ulf Hansson @ 2019-02-12 16:17 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Greg Kroah-Hartman, LKML, Linux PM, Daniel Vetter, Lukas Wunner,
	Andrzej Hajda, Russell King - ARM Linux, Lucas Stach,
	Linus Walleij, Thierry Reding, Laurent Pinchart,
	Marek Szyprowski

On Tue, 12 Feb 2019 at 13:10, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Commit 4080ab083000 ("PM-runtime: Take suppliers into account in
> __pm_runtime_set_status()") introduced a race condition that may
> trigger if __pm_runtime_set_status() is used incorrectly (that is,
> if it is called when PM-runtime is enabled for the target device
> and working).
>
> In that case, if the original PM-runtime status of the device is
> RPM_SUSPENDED, a runtime resume of the device may occur after
> __pm_runtime_set_status() has dropped its power.lock spinlock
> and before deactivating its suppliers, so the suppliers may be
> deactivated while the device is PM-runtime-active which may lead
> to functional issues.
>
> To avoid that, modify __pm_runtime_set_status() to check whether
> or not PM-runtime is enabled for the device before activating its
> suppliers (if the new status is RPM_ACTIVE) and either return an
> error if that's the case or increment the device's disable_depth
> counter to prevent PM-runtime from being enabled for it while
> the remaining part of the function is running (disable_depth is
> then decremented on the way out).
>
> Fixes: 4080ab083000 ("PM-runtime: Take suppliers into account in __pm_runtime_set_status()")
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/base/power/runtime.c |   24 ++++++++++++++++++------
>  1 file changed, 18 insertions(+), 6 deletions(-)
>
> Index: linux-pm/drivers/base/power/runtime.c
> ===================================================================
> --- linux-pm.orig/drivers/base/power/runtime.c
> +++ linux-pm/drivers/base/power/runtime.c
> @@ -1129,6 +1129,22 @@ int __pm_runtime_set_status(struct devic
>         if (status != RPM_ACTIVE && status != RPM_SUSPENDED)
>                 return -EINVAL;
>
> +       spin_lock_irq(&dev->power.lock);
> +
> +       /*
> +        * Prevent PM-runtime from being enabled for the device or return an
> +        * error if it is enabled already and working.
> +        */
> +       if (dev->power.runtime_error || dev->power.disable_depth)
> +               dev->power.disable_depth++;
> +       else
> +               error = -EAGAIN;
> +
> +       spin_unlock_irq(&dev->power.lock);
> +
> +       if (error)
> +               return error;
> +
>         /*
>          * If the new status is RPM_ACTIVE, the suppliers can be activated
>          * upfront regardless of the current status, because next time
> @@ -1147,12 +1163,6 @@ int __pm_runtime_set_status(struct devic
>
>         spin_lock_irq(&dev->power.lock);
>
> -       if (!dev->power.runtime_error && !dev->power.disable_depth) {
> -               status = dev->power.runtime_status;
> -               error = -EAGAIN;
> -               goto out;
> -       }
> -
>         if (dev->power.runtime_status == status || !parent)
>                 goto out_set;
>
> @@ -1205,6 +1215,8 @@ int __pm_runtime_set_status(struct devic
>                 device_links_read_unlock(idx);
>         }
>
> +       pm_runtime_enable(dev);

pm_runtime_enable() uses spin_lock_irqsave(), rather than
spin_lock_irq() - is there a reason to why you want to allow that
here, but not earlier in the function?

> +
>         return error;
>  }
>  EXPORT_SYMBOL_GPL(__pm_runtime_set_status);
>

Other than the above comment, this looks good to me.

Kind regards
Uffe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/2] driver core: Fixes related to device links
  2019-02-12 15:06     ` Greg Kroah-Hartman
@ 2019-02-12 16:20         ` Rafael J. Wysocki
  0 siblings, 0 replies; 28+ messages in thread
From: Rafael J. Wysocki @ 2019-02-12 16:20 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Ulf Hansson, Rafael J. Wysocki, LKML, Linux PM, Daniel Vetter,
	Lukas Wunner, Andrzej Hajda, Russell King - ARM Linux,
	Lucas Stach, Linus Walleij, Thierry Reding, Laurent Pinchart,
	Marek Szyprowski

On Tue, Feb 12, 2019 at 4:06 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Feb 12, 2019 at 03:52:53PM +0100, Ulf Hansson wrote:
> > On Tue, 12 Feb 2019 at 15:09, Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Tue, Feb 12, 2019 at 01:01:13PM +0100, Rafael J. Wysocki wrote:
> > > > Hi Greg at al,
> > > >
> > > > These fix two issues on top of the recent device links material in
> > > > driver-core/driver-core-next.
> > > >
> > > > The first one fixes a race condition that may trigger when
> > > > __pm_runtime_set_status() is used incorrectly (that is, when it is
> > > > called with PM-runtime enabled for the target device and working).
> > > >
> > > > The second one fixes a supplier PM-runtime usage counter imbalance
> > > > resulting from adding and removing (e.g. in the error code path) a
> > > > stateless device link to it from within the consumer driver's probe
> > > > callback.
> > > >
> > > > Please refer to the patch changelogs for details.
> > >
> > > Looks good, all now queued up, thanks.
> >
> > Greg, please don't get me wrong, but ~1.5 hours isn't sufficient for
> > me to review/test submitted patches.
> >
> > I have been trying to collaborate (review/test) device links related
> > code with Rafael, but what's the point if you queue up the patches,
> > before I even got the change to look at them. Shall I interpret it as
> > you don't care about me reviewing this, then just tell me so I don't
> > have to waste my time.
>
> As they are just in my -testing branch, I can easily drop them now if
> you find problems.  I didn't realize that Rafael was wanting you to
> review this as they were marked as "fixes:" for previous patches.

Sorry for the confusion.

From my perspective they fix issues on top of the previous commits as
indicated by the Fixes: tags and they work on my systems, but even
though *I* think that they are all good, there still may be problems
with them that I don't see.  As usual. :-)

I guess I should have sent them as RFC this time.

Cheers,
Rafael

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/2] driver core: Fixes related to device links
@ 2019-02-12 16:20         ` Rafael J. Wysocki
  0 siblings, 0 replies; 28+ messages in thread
From: Rafael J. Wysocki @ 2019-02-12 16:20 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Ulf Hansson, Rafael J. Wysocki, LKML, Linux PM, Daniel Vetter,
	Lukas Wunner, Andrzej Hajda, Russell King - ARM Linux,
	Lucas Stach, Linus Walleij, Thierry Reding, Laurent Pinchart,
	Marek Szyprowski

On Tue, Feb 12, 2019 at 4:06 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Feb 12, 2019 at 03:52:53PM +0100, Ulf Hansson wrote:
> > On Tue, 12 Feb 2019 at 15:09, Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Tue, Feb 12, 2019 at 01:01:13PM +0100, Rafael J. Wysocki wrote:
> > > > Hi Greg at al,
> > > >
> > > > These fix two issues on top of the recent device links material in
> > > > driver-core/driver-core-next.
> > > >
> > > > The first one fixes a race condition that may trigger when
> > > > __pm_runtime_set_status() is used incorrectly (that is, when it is
> > > > called with PM-runtime enabled for the target device and working).
> > > >
> > > > The second one fixes a supplier PM-runtime usage counter imbalance
> > > > resulting from adding and removing (e.g. in the error code path) a
> > > > stateless device link to it from within the consumer driver's probe
> > > > callback.
> > > >
> > > > Please refer to the patch changelogs for details.
> > >
> > > Looks good, all now queued up, thanks.
> >
> > Greg, please don't get me wrong, but ~1.5 hours isn't sufficient for
> > me to review/test submitted patches.
> >
> > I have been trying to collaborate (review/test) device links related
> > code with Rafael, but what's the point if you queue up the patches,
> > before I even got the change to look at them. Shall I interpret it as
> > you don't care about me reviewing this, then just tell me so I don't
> > have to waste my time.
>
> As they are just in my -testing branch, I can easily drop them now if
> you find problems.  I didn't realize that Rafael was wanting you to
> review this as they were marked as "fixes:" for previous patches.

Sorry for the confusion.

>From my perspective they fix issues on top of the previous commits as
indicated by the Fixes: tags and they work on my systems, but even
though *I* think that they are all good, there still may be problems
with them that I don't see.  As usual. :-)

I guess I should have sent them as RFC this time.

Cheers,
Rafael

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] PM-runtime: Fix __pm_runtime_set_status() race with runtime resume
  2019-02-12 16:17   ` Ulf Hansson
@ 2019-02-12 16:28     ` Rafael J. Wysocki
  2019-02-12 18:27       ` Ulf Hansson
  0 siblings, 1 reply; 28+ messages in thread
From: Rafael J. Wysocki @ 2019-02-12 16:28 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Rafael J. Wysocki, Greg Kroah-Hartman, LKML, Linux PM,
	Daniel Vetter, Lukas Wunner, Andrzej Hajda,
	Russell King - ARM Linux, Lucas Stach, Linus Walleij,
	Thierry Reding, Laurent Pinchart, Marek Szyprowski

On Tue, Feb 12, 2019 at 5:18 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Tue, 12 Feb 2019 at 13:10, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Commit 4080ab083000 ("PM-runtime: Take suppliers into account in
> > __pm_runtime_set_status()") introduced a race condition that may
> > trigger if __pm_runtime_set_status() is used incorrectly (that is,
> > if it is called when PM-runtime is enabled for the target device
> > and working).
> >
> > In that case, if the original PM-runtime status of the device is
> > RPM_SUSPENDED, a runtime resume of the device may occur after
> > __pm_runtime_set_status() has dropped its power.lock spinlock
> > and before deactivating its suppliers, so the suppliers may be
> > deactivated while the device is PM-runtime-active which may lead
> > to functional issues.
> >
> > To avoid that, modify __pm_runtime_set_status() to check whether
> > or not PM-runtime is enabled for the device before activating its
> > suppliers (if the new status is RPM_ACTIVE) and either return an
> > error if that's the case or increment the device's disable_depth
> > counter to prevent PM-runtime from being enabled for it while
> > the remaining part of the function is running (disable_depth is
> > then decremented on the way out).
> >
> > Fixes: 4080ab083000 ("PM-runtime: Take suppliers into account in __pm_runtime_set_status()")
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >  drivers/base/power/runtime.c |   24 ++++++++++++++++++------
> >  1 file changed, 18 insertions(+), 6 deletions(-)
> >
> > Index: linux-pm/drivers/base/power/runtime.c
> > ===================================================================
> > --- linux-pm.orig/drivers/base/power/runtime.c
> > +++ linux-pm/drivers/base/power/runtime.c
> > @@ -1129,6 +1129,22 @@ int __pm_runtime_set_status(struct devic
> >         if (status != RPM_ACTIVE && status != RPM_SUSPENDED)
> >                 return -EINVAL;
> >
> > +       spin_lock_irq(&dev->power.lock);
> > +
> > +       /*
> > +        * Prevent PM-runtime from being enabled for the device or return an
> > +        * error if it is enabled already and working.
> > +        */
> > +       if (dev->power.runtime_error || dev->power.disable_depth)
> > +               dev->power.disable_depth++;
> > +       else
> > +               error = -EAGAIN;
> > +
> > +       spin_unlock_irq(&dev->power.lock);
> > +
> > +       if (error)
> > +               return error;
> > +
> >         /*
> >          * If the new status is RPM_ACTIVE, the suppliers can be activated
> >          * upfront regardless of the current status, because next time
> > @@ -1147,12 +1163,6 @@ int __pm_runtime_set_status(struct devic
> >
> >         spin_lock_irq(&dev->power.lock);
> >
> > -       if (!dev->power.runtime_error && !dev->power.disable_depth) {
> > -               status = dev->power.runtime_status;
> > -               error = -EAGAIN;
> > -               goto out;
> > -       }
> > -
> >         if (dev->power.runtime_status == status || !parent)
> >                 goto out_set;
> >
> > @@ -1205,6 +1215,8 @@ int __pm_runtime_set_status(struct devic
> >                 device_links_read_unlock(idx);
> >         }
> >
> > +       pm_runtime_enable(dev);
>
> pm_runtime_enable() uses spin_lock_irqsave(), rather than
> spin_lock_irq() - is there a reason to why you want to allow that
> here, but not earlier in the function?

Device links locking cannot be used in interrupt context.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] PM-runtime: Fix __pm_runtime_set_status() race with runtime resume
  2019-02-12 16:28     ` Rafael J. Wysocki
@ 2019-02-12 18:27       ` Ulf Hansson
  2019-02-12 19:05         ` Rafael J. Wysocki
  0 siblings, 1 reply; 28+ messages in thread
From: Ulf Hansson @ 2019-02-12 18:27 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Greg Kroah-Hartman, LKML, Linux PM,
	Daniel Vetter, Lukas Wunner, Andrzej Hajda,
	Russell King - ARM Linux, Lucas Stach, Linus Walleij,
	Thierry Reding, Laurent Pinchart, Marek Szyprowski

On Tue, 12 Feb 2019 at 17:28, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Tue, Feb 12, 2019 at 5:18 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> >
> > On Tue, 12 Feb 2019 at 13:10, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > >
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >
> > > Commit 4080ab083000 ("PM-runtime: Take suppliers into account in
> > > __pm_runtime_set_status()") introduced a race condition that may
> > > trigger if __pm_runtime_set_status() is used incorrectly (that is,
> > > if it is called when PM-runtime is enabled for the target device
> > > and working).
> > >
> > > In that case, if the original PM-runtime status of the device is
> > > RPM_SUSPENDED, a runtime resume of the device may occur after
> > > __pm_runtime_set_status() has dropped its power.lock spinlock
> > > and before deactivating its suppliers, so the suppliers may be
> > > deactivated while the device is PM-runtime-active which may lead
> > > to functional issues.
> > >
> > > To avoid that, modify __pm_runtime_set_status() to check whether
> > > or not PM-runtime is enabled for the device before activating its
> > > suppliers (if the new status is RPM_ACTIVE) and either return an
> > > error if that's the case or increment the device's disable_depth
> > > counter to prevent PM-runtime from being enabled for it while
> > > the remaining part of the function is running (disable_depth is
> > > then decremented on the way out).
> > >
> > > Fixes: 4080ab083000 ("PM-runtime: Take suppliers into account in __pm_runtime_set_status()")
> > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > ---
> > >  drivers/base/power/runtime.c |   24 ++++++++++++++++++------
> > >  1 file changed, 18 insertions(+), 6 deletions(-)
> > >
> > > Index: linux-pm/drivers/base/power/runtime.c
> > > ===================================================================
> > > --- linux-pm.orig/drivers/base/power/runtime.c
> > > +++ linux-pm/drivers/base/power/runtime.c
> > > @@ -1129,6 +1129,22 @@ int __pm_runtime_set_status(struct devic
> > >         if (status != RPM_ACTIVE && status != RPM_SUSPENDED)
> > >                 return -EINVAL;
> > >
> > > +       spin_lock_irq(&dev->power.lock);
> > > +
> > > +       /*
> > > +        * Prevent PM-runtime from being enabled for the device or return an
> > > +        * error if it is enabled already and working.
> > > +        */
> > > +       if (dev->power.runtime_error || dev->power.disable_depth)
> > > +               dev->power.disable_depth++;
> > > +       else
> > > +               error = -EAGAIN;
> > > +
> > > +       spin_unlock_irq(&dev->power.lock);
> > > +
> > > +       if (error)
> > > +               return error;
> > > +
> > >         /*
> > >          * If the new status is RPM_ACTIVE, the suppliers can be activated
> > >          * upfront regardless of the current status, because next time
> > > @@ -1147,12 +1163,6 @@ int __pm_runtime_set_status(struct devic
> > >
> > >         spin_lock_irq(&dev->power.lock);
> > >
> > > -       if (!dev->power.runtime_error && !dev->power.disable_depth) {
> > > -               status = dev->power.runtime_status;
> > > -               error = -EAGAIN;
> > > -               goto out;
> > > -       }
> > > -
> > >         if (dev->power.runtime_status == status || !parent)
> > >                 goto out_set;
> > >
> > > @@ -1205,6 +1215,8 @@ int __pm_runtime_set_status(struct devic
> > >                 device_links_read_unlock(idx);
> > >         }
> > >
> > > +       pm_runtime_enable(dev);
> >
> > pm_runtime_enable() uses spin_lock_irqsave(), rather than
> > spin_lock_irq() - is there a reason to why you want to allow that
> > here, but not earlier in the function?
>
> Device links locking cannot be used in interrupt context.

I get that, but thanks for clarifying.

However, then why did you convert __pm_runtime_set_status() from using
spin_lock_irqsave() to spin_lock_irq() in commit "PM-runtime: Take
suppliers into account in __pm_runtime_set_status()"?

As far as I can tell, the spin_lock is not being held while we operate
on the device links anyway. Or is this just to give the caller an
indication what kind of context the function can be called from?

Kind regards
Uffe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] PM-runtime: Fix __pm_runtime_set_status() race with runtime resume
  2019-02-12 18:27       ` Ulf Hansson
@ 2019-02-12 19:05         ` Rafael J. Wysocki
  0 siblings, 0 replies; 28+ messages in thread
From: Rafael J. Wysocki @ 2019-02-12 19:05 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Greg Kroah-Hartman, LKML,
	Linux PM, Daniel Vetter, Lukas Wunner, Andrzej Hajda,
	Russell King - ARM Linux, Lucas Stach, Linus Walleij,
	Thierry Reding, Laurent Pinchart, Marek Szyprowski

On Tue, Feb 12, 2019 at 7:28 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Tue, 12 Feb 2019 at 17:28, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Tue, Feb 12, 2019 at 5:18 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > >
> > > On Tue, 12 Feb 2019 at 13:10, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > > >
> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > >
> > > > Commit 4080ab083000 ("PM-runtime: Take suppliers into account in
> > > > __pm_runtime_set_status()") introduced a race condition that may
> > > > trigger if __pm_runtime_set_status() is used incorrectly (that is,
> > > > if it is called when PM-runtime is enabled for the target device
> > > > and working).
> > > >
> > > > In that case, if the original PM-runtime status of the device is
> > > > RPM_SUSPENDED, a runtime resume of the device may occur after
> > > > __pm_runtime_set_status() has dropped its power.lock spinlock
> > > > and before deactivating its suppliers, so the suppliers may be
> > > > deactivated while the device is PM-runtime-active which may lead
> > > > to functional issues.
> > > >
> > > > To avoid that, modify __pm_runtime_set_status() to check whether
> > > > or not PM-runtime is enabled for the device before activating its
> > > > suppliers (if the new status is RPM_ACTIVE) and either return an
> > > > error if that's the case or increment the device's disable_depth
> > > > counter to prevent PM-runtime from being enabled for it while
> > > > the remaining part of the function is running (disable_depth is
> > > > then decremented on the way out).
> > > >
> > > > Fixes: 4080ab083000 ("PM-runtime: Take suppliers into account in __pm_runtime_set_status()")
> > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > ---
> > > >  drivers/base/power/runtime.c |   24 ++++++++++++++++++------
> > > >  1 file changed, 18 insertions(+), 6 deletions(-)
> > > >
> > > > Index: linux-pm/drivers/base/power/runtime.c
> > > > ===================================================================
> > > > --- linux-pm.orig/drivers/base/power/runtime.c
> > > > +++ linux-pm/drivers/base/power/runtime.c
> > > > @@ -1129,6 +1129,22 @@ int __pm_runtime_set_status(struct devic
> > > >         if (status != RPM_ACTIVE && status != RPM_SUSPENDED)
> > > >                 return -EINVAL;
> > > >
> > > > +       spin_lock_irq(&dev->power.lock);
> > > > +
> > > > +       /*
> > > > +        * Prevent PM-runtime from being enabled for the device or return an
> > > > +        * error if it is enabled already and working.
> > > > +        */
> > > > +       if (dev->power.runtime_error || dev->power.disable_depth)
> > > > +               dev->power.disable_depth++;
> > > > +       else
> > > > +               error = -EAGAIN;
> > > > +
> > > > +       spin_unlock_irq(&dev->power.lock);
> > > > +
> > > > +       if (error)
> > > > +               return error;
> > > > +
> > > >         /*
> > > >          * If the new status is RPM_ACTIVE, the suppliers can be activated
> > > >          * upfront regardless of the current status, because next time
> > > > @@ -1147,12 +1163,6 @@ int __pm_runtime_set_status(struct devic
> > > >
> > > >         spin_lock_irq(&dev->power.lock);
> > > >
> > > > -       if (!dev->power.runtime_error && !dev->power.disable_depth) {
> > > > -               status = dev->power.runtime_status;
> > > > -               error = -EAGAIN;
> > > > -               goto out;
> > > > -       }
> > > > -
> > > >         if (dev->power.runtime_status == status || !parent)
> > > >                 goto out_set;
> > > >
> > > > @@ -1205,6 +1215,8 @@ int __pm_runtime_set_status(struct devic
> > > >                 device_links_read_unlock(idx);
> > > >         }
> > > >
> > > > +       pm_runtime_enable(dev);
> > >
> > > pm_runtime_enable() uses spin_lock_irqsave(), rather than
> > > spin_lock_irq() - is there a reason to why you want to allow that
> > > here, but not earlier in the function?
> >
> > Device links locking cannot be used in interrupt context.
>
> I get that, but thanks for clarifying.
>
> However, then why did you convert __pm_runtime_set_status() from using
> spin_lock_irqsave() to spin_lock_irq() in commit "PM-runtime: Take
> suppliers into account in __pm_runtime_set_status()"?
>
> As far as I can tell, the spin_lock is not being held while we operate
> on the device links anyway. Or is this just to give the caller an
> indication what kind of context the function can be called from?

The latter plus getting rid of saving/restoring the flags which isn't necessary.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] PM-runtime: Fix __pm_runtime_set_status() race with runtime resume
  2019-02-12 12:04 ` [PATCH 1/2] PM-runtime: Fix __pm_runtime_set_status() race with runtime resume Rafael J. Wysocki
  2019-02-12 16:17   ` Ulf Hansson
@ 2019-02-12 20:14   ` Ulf Hansson
  1 sibling, 0 replies; 28+ messages in thread
From: Ulf Hansson @ 2019-02-12 20:14 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Greg Kroah-Hartman, LKML, Linux PM, Daniel Vetter, Lukas Wunner,
	Andrzej Hajda, Russell King - ARM Linux, Lucas Stach,
	Linus Walleij, Thierry Reding, Laurent Pinchart,
	Marek Szyprowski

On Tue, 12 Feb 2019 at 13:10, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Commit 4080ab083000 ("PM-runtime: Take suppliers into account in
> __pm_runtime_set_status()") introduced a race condition that may
> trigger if __pm_runtime_set_status() is used incorrectly (that is,
> if it is called when PM-runtime is enabled for the target device
> and working).
>
> In that case, if the original PM-runtime status of the device is
> RPM_SUSPENDED, a runtime resume of the device may occur after
> __pm_runtime_set_status() has dropped its power.lock spinlock
> and before deactivating its suppliers, so the suppliers may be
> deactivated while the device is PM-runtime-active which may lead
> to functional issues.
>
> To avoid that, modify __pm_runtime_set_status() to check whether
> or not PM-runtime is enabled for the device before activating its
> suppliers (if the new status is RPM_ACTIVE) and either return an
> error if that's the case or increment the device's disable_depth
> counter to prevent PM-runtime from being enabled for it while
> the remaining part of the function is running (disable_depth is
> then decremented on the way out).
>
> Fixes: 4080ab083000 ("PM-runtime: Take suppliers into account in __pm_runtime_set_status()")
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>

Kind regards
Uffe

> ---
>  drivers/base/power/runtime.c |   24 ++++++++++++++++++------
>  1 file changed, 18 insertions(+), 6 deletions(-)
>
> Index: linux-pm/drivers/base/power/runtime.c
> ===================================================================
> --- linux-pm.orig/drivers/base/power/runtime.c
> +++ linux-pm/drivers/base/power/runtime.c
> @@ -1129,6 +1129,22 @@ int __pm_runtime_set_status(struct devic
>         if (status != RPM_ACTIVE && status != RPM_SUSPENDED)
>                 return -EINVAL;
>
> +       spin_lock_irq(&dev->power.lock);
> +
> +       /*
> +        * Prevent PM-runtime from being enabled for the device or return an
> +        * error if it is enabled already and working.
> +        */
> +       if (dev->power.runtime_error || dev->power.disable_depth)
> +               dev->power.disable_depth++;
> +       else
> +               error = -EAGAIN;
> +
> +       spin_unlock_irq(&dev->power.lock);
> +
> +       if (error)
> +               return error;
> +
>         /*
>          * If the new status is RPM_ACTIVE, the suppliers can be activated
>          * upfront regardless of the current status, because next time
> @@ -1147,12 +1163,6 @@ int __pm_runtime_set_status(struct devic
>
>         spin_lock_irq(&dev->power.lock);
>
> -       if (!dev->power.runtime_error && !dev->power.disable_depth) {
> -               status = dev->power.runtime_status;
> -               error = -EAGAIN;
> -               goto out;
> -       }
> -
>         if (dev->power.runtime_status == status || !parent)
>                 goto out_set;
>
> @@ -1205,6 +1215,8 @@ int __pm_runtime_set_status(struct devic
>                 device_links_read_unlock(idx);
>         }
>
> +       pm_runtime_enable(dev);
> +
>         return error;
>  }
>  EXPORT_SYMBOL_GPL(__pm_runtime_set_status);
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance
  2019-02-12 12:08 ` [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance Rafael J. Wysocki
@ 2019-02-12 21:02   ` Ulf Hansson
  2019-02-12 22:08     ` Rafael J. Wysocki
  2019-02-15 11:00   ` Jon Hunter
  1 sibling, 1 reply; 28+ messages in thread
From: Ulf Hansson @ 2019-02-12 21:02 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Greg Kroah-Hartman, LKML, Linux PM, Daniel Vetter, Lukas Wunner,
	Andrzej Hajda, Russell King - ARM Linux, Lucas Stach,
	Linus Walleij, Thierry Reding, Laurent Pinchart,
	Marek Szyprowski

On Tue, 12 Feb 2019 at 13:10, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> If a stateless device link to a certain supplier with
> DL_FLAG_PM_RUNTIME set in the flags is added and then removed by the
> consumer driver's probe callback, the supplier's PM-runtime usage
> counter will be nonzero after that which effectively causes the
> supplier to remain "always on" going forward.
>
> Namely, device_link_add() called to add the link invokes
> device_link_rpm_prepare() which notices that the consumer driver is
> probing, so it increments the supplier's PM-runtime usage counter
> with the assumption that the link will stay around until
> pm_runtime_put_suppliers() is called by driver_probe_device(),
> but if the link goes away before that point, the supplier's
> PM-runtime usage counter will remain nonzero.
>
> To prevent that from happening, first rework pm_runtime_get_suppliers()
> and pm_runtime_put_suppliers() to use the rpm_active refounts of device
> links and make the latter only drop rpm_active and the supplier's
> PM-runtime usage counter for each link by one, unless rpm_active is
> one already for it.  Next, modify device_link_add() to bump up the
> new link's rpm_active refcount and the suppliers PM-runtime usage
> counter by two, to prevent pm_runtime_put_suppliers(), if it is
> called subsequently, from suspending the supplier prematurely (in
> case its PM-runtime usage counter goes down to 0 in there).
>
> Due to the way rpm_put_suppliers() works, this change does not
> affect runtime suspend of the consumer ends of new device links (or,
> generally, device links for which DL_FLAG_PM_RUNTIME has just been
> set).
>
> Fixes: e2f3cd831a28 ("driver core: Fix handling of runtime PM flags in device_link_add()")
> Reported-by: Ulf Hansson <ulf.hansson@linaro.org>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

I have tested a several various common probe sequences (including
error paths) by using my RPM test driver, no issues found. Of course
it also fixes the problem that occurred while deleting the device link
during ->probe(). Thanks a lot!

Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>

Kind regards
Uffe

> ---
>
> Note that the issue had been there before commit e2f3cd831a28, but it was
> overlooked by that commit and this change is a fix on top of it, so make
> the Fixes: tag point to commit e2f3cd831a28 (instead of an earlier one
> that the patch will not be applicable to).
>
> ---
>  drivers/base/core.c          |   21 ++++-----------------
>  drivers/base/power/runtime.c |   27 +++++++++++++++++++++++++--
>  include/linux/pm_runtime.h   |    4 ++++
>  3 files changed, 33 insertions(+), 19 deletions(-)
>
> Index: linux-pm/drivers/base/power/runtime.c
> ===================================================================
> --- linux-pm.orig/drivers/base/power/runtime.c
> +++ linux-pm/drivers/base/power/runtime.c
> @@ -1655,8 +1655,10 @@ void pm_runtime_get_suppliers(struct dev
>         idx = device_links_read_lock();
>
>         list_for_each_entry_rcu(link, &dev->links.suppliers, c_node)
> -               if (link->flags & DL_FLAG_PM_RUNTIME)
> +               if (link->flags & DL_FLAG_PM_RUNTIME) {
> +                       refcount_inc(&link->rpm_active);
>                         pm_runtime_get_sync(link->supplier);
> +               }
>
>         device_links_read_unlock(idx);
>  }
> @@ -1673,7 +1675,8 @@ void pm_runtime_put_suppliers(struct dev
>         idx = device_links_read_lock();
>
>         list_for_each_entry_rcu(link, &dev->links.suppliers, c_node)
> -               if (link->flags & DL_FLAG_PM_RUNTIME)
> +               if (link->flags & DL_FLAG_PM_RUNTIME &&
> +                   refcount_dec_not_one(&link->rpm_active))
>                         pm_runtime_put(link->supplier);
>
>         device_links_read_unlock(idx);
> @@ -1686,6 +1689,26 @@ void pm_runtime_new_link(struct device *
>         spin_unlock_irq(&dev->power.lock);
>  }
>
> +/**
> + * pm_runtime_active_link - Set up new device link as active for PM-runtime.
> + * @link: Device link to be set up as active.
> + * @supplier: Supplier end of the link.
> + *
> + * Add 2 to the rpm_active refcount of @link and increment the PM-runtime
> + * usage counter of @supplier once more in case the link is being added while
> + * the consumer driver is probing and pm_runtime_put_suppliers() will be called
> + * subsequently.
> + *
> + * Note that this doesn't prevent rpm_put_suppliers() from decreasing the link's
> + * rpm_active refcount down to one, so runtime suspend of the consumer end of
> + * @link is not affected.
> + */
> +void pm_runtime_active_link(struct device_link *link, struct device *supplier)
> +{
> +       refcount_add(2, &link->rpm_active);
> +       pm_runtime_get_noresume(supplier);
> +}
> +
>  void pm_runtime_drop_link(struct device *dev)
>  {
>         spin_lock_irq(&dev->power.lock);
> Index: linux-pm/drivers/base/core.c
> ===================================================================
> --- linux-pm.orig/drivers/base/core.c
> +++ linux-pm/drivers/base/core.c
> @@ -165,19 +165,6 @@ void device_pm_move_to_tail(struct devic
>         device_links_read_unlock(idx);
>  }
>
> -static void device_link_rpm_prepare(struct device *consumer,
> -                                   struct device *supplier)
> -{
> -       pm_runtime_new_link(consumer);
> -       /*
> -        * If the link is being added by the consumer driver at probe time,
> -        * balance the decrementation of the supplier's runtime PM usage counter
> -        * after consumer probe in driver_probe_device().
> -        */
> -       if (consumer->links.status == DL_DEV_PROBING)
> -               pm_runtime_get_noresume(supplier);
> -}
> -
>  /**
>   * device_link_add - Create a link between two devices.
>   * @consumer: Consumer end of the link.
> @@ -286,11 +273,11 @@ struct device_link *device_link_add(stru
>
>                 if (flags & DL_FLAG_PM_RUNTIME) {
>                         if (!(link->flags & DL_FLAG_PM_RUNTIME)) {
> -                               device_link_rpm_prepare(consumer, supplier);
> +                               pm_runtime_new_link(consumer);
>                                 link->flags |= DL_FLAG_PM_RUNTIME;
>                         }
>                         if (flags & DL_FLAG_RPM_ACTIVE)
> -                               refcount_inc(&link->rpm_active);
> +                               pm_runtime_active_link(link, supplier);
>                 }
>
>                 if (flags & DL_FLAG_STATELESS) {
> @@ -323,9 +310,9 @@ struct device_link *device_link_add(stru
>
>         if (flags & DL_FLAG_PM_RUNTIME) {
>                 if (flags & DL_FLAG_RPM_ACTIVE)
> -                       refcount_inc(&link->rpm_active);
> +                       pm_runtime_active_link(link, supplier);
>
> -               device_link_rpm_prepare(consumer, supplier);
> +               pm_runtime_new_link(consumer);
>         }
>
>         get_device(supplier);
> Index: linux-pm/include/linux/pm_runtime.h
> ===================================================================
> --- linux-pm.orig/include/linux/pm_runtime.h
> +++ linux-pm/include/linux/pm_runtime.h
> @@ -59,6 +59,8 @@ extern void pm_runtime_clean_up_links(st
>  extern void pm_runtime_get_suppliers(struct device *dev);
>  extern void pm_runtime_put_suppliers(struct device *dev);
>  extern void pm_runtime_new_link(struct device *dev);
> +extern void pm_runtime_active_link(struct device_link *link,
> +                                  struct device *supplier);
>  extern void pm_runtime_drop_link(struct device *dev);
>
>  static inline void pm_suspend_ignore_children(struct device *dev, bool enable)
> @@ -178,6 +180,8 @@ static inline void pm_runtime_clean_up_l
>  static inline void pm_runtime_get_suppliers(struct device *dev) {}
>  static inline void pm_runtime_put_suppliers(struct device *dev) {}
>  static inline void pm_runtime_new_link(struct device *dev) {}
> +static inline void pm_runtime_active_link(struct device_link *link,
> +                                         struct device *supplier) {}
>  static inline void pm_runtime_drop_link(struct device *dev) {}
>
>  #endif /* !CONFIG_PM */
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance
  2019-02-12 21:02   ` Ulf Hansson
@ 2019-02-12 22:08     ` Rafael J. Wysocki
  0 siblings, 0 replies; 28+ messages in thread
From: Rafael J. Wysocki @ 2019-02-12 22:08 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Rafael J. Wysocki, Greg Kroah-Hartman, LKML, Linux PM,
	Daniel Vetter, Lukas Wunner, Andrzej Hajda,
	Russell King - ARM Linux, Lucas Stach, Linus Walleij,
	Thierry Reding, Laurent Pinchart, Marek Szyprowski

On Tue, Feb 12, 2019 at 10:03 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Tue, 12 Feb 2019 at 13:10, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > If a stateless device link to a certain supplier with
> > DL_FLAG_PM_RUNTIME set in the flags is added and then removed by the
> > consumer driver's probe callback, the supplier's PM-runtime usage
> > counter will be nonzero after that which effectively causes the
> > supplier to remain "always on" going forward.
> >
> > Namely, device_link_add() called to add the link invokes
> > device_link_rpm_prepare() which notices that the consumer driver is
> > probing, so it increments the supplier's PM-runtime usage counter
> > with the assumption that the link will stay around until
> > pm_runtime_put_suppliers() is called by driver_probe_device(),
> > but if the link goes away before that point, the supplier's
> > PM-runtime usage counter will remain nonzero.
> >
> > To prevent that from happening, first rework pm_runtime_get_suppliers()
> > and pm_runtime_put_suppliers() to use the rpm_active refounts of device
> > links and make the latter only drop rpm_active and the supplier's
> > PM-runtime usage counter for each link by one, unless rpm_active is
> > one already for it.  Next, modify device_link_add() to bump up the
> > new link's rpm_active refcount and the suppliers PM-runtime usage
> > counter by two, to prevent pm_runtime_put_suppliers(), if it is
> > called subsequently, from suspending the supplier prematurely (in
> > case its PM-runtime usage counter goes down to 0 in there).
> >
> > Due to the way rpm_put_suppliers() works, this change does not
> > affect runtime suspend of the consumer ends of new device links (or,
> > generally, device links for which DL_FLAG_PM_RUNTIME has just been
> > set).
> >
> > Fixes: e2f3cd831a28 ("driver core: Fix handling of runtime PM flags in device_link_add()")
> > Reported-by: Ulf Hansson <ulf.hansson@linaro.org>
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> I have tested a several various common probe sequences (including
> error paths) by using my RPM test driver, no issues found. Of course
> it also fixes the problem that occurred while deleting the device link
> during ->probe(). Thanks a lot!
>
> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
> Tested-by: Ulf Hansson <ulf.hansson@linaro.org>

Thank you!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance
  2019-02-12 12:08 ` [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance Rafael J. Wysocki
  2019-02-12 21:02   ` Ulf Hansson
@ 2019-02-15 11:00   ` Jon Hunter
  2019-02-15 11:57     ` Rafael J. Wysocki
                       ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Jon Hunter @ 2019-02-15 11:00 UTC (permalink / raw)
  To: Rafael J. Wysocki, Greg Kroah-Hartman
  Cc: LKML, Linux PM, Ulf Hansson, Daniel Vetter, Lukas Wunner,
	Andrzej Hajda, Russell King - ARM Linux, Lucas Stach,
	Linus Walleij, Thierry Reding, Laurent Pinchart,
	Marek Szyprowski, linux-tegra

Hi Rafael,

On 12/02/2019 12:08, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> If a stateless device link to a certain supplier with
> DL_FLAG_PM_RUNTIME set in the flags is added and then removed by the
> consumer driver's probe callback, the supplier's PM-runtime usage
> counter will be nonzero after that which effectively causes the
> supplier to remain "always on" going forward.
> 
> Namely, device_link_add() called to add the link invokes
> device_link_rpm_prepare() which notices that the consumer driver is
> probing, so it increments the supplier's PM-runtime usage counter
> with the assumption that the link will stay around until
> pm_runtime_put_suppliers() is called by driver_probe_device(),
> but if the link goes away before that point, the supplier's
> PM-runtime usage counter will remain nonzero.
> 
> To prevent that from happening, first rework pm_runtime_get_suppliers()
> and pm_runtime_put_suppliers() to use the rpm_active refounts of device
> links and make the latter only drop rpm_active and the supplier's
> PM-runtime usage counter for each link by one, unless rpm_active is
> one already for it.  Next, modify device_link_add() to bump up the
> new link's rpm_active refcount and the suppliers PM-runtime usage
> counter by two, to prevent pm_runtime_put_suppliers(), if it is
> called subsequently, from suspending the supplier prematurely (in
> case its PM-runtime usage counter goes down to 0 in there).
> 
> Due to the way rpm_put_suppliers() works, this change does not
> affect runtime suspend of the consumer ends of new device links (or,
> generally, device links for which DL_FLAG_PM_RUNTIME has just been
> set).
> 
> Fixes: e2f3cd831a28 ("driver core: Fix handling of runtime PM flags in device_link_add()")
> Reported-by: Ulf Hansson <ulf.hansson@linaro.org> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> Note that the issue had been there before commit e2f3cd831a28, but it was
> overlooked by that commit and this change is a fix on top of it, so make
> the Fixes: tag point to commit e2f3cd831a28 (instead of an earlier one
> that the patch will not be applicable to).
I noticed that yesterday's and today's -next were no longer booting on
one of our Tegra boards (Tegra210 Jetson TX2) because networking is
failing. The ethernet chip is a USB device and looking at the bootlogs I
can see that the Tegra XHCI driver is failing ...

 tegra-xusb 70090000.usb: xHCI host controller not responding, assume dead
 tegra-xusb 70090000.usb: HC died; cleaning up

The Tegra XHCI driver uses multiple power-domains and uses
device_link_add() to attach them. So now I am wondering if there is
something that we have got wrong in our implementation. However, I don't
see the device being probed deferred on boot or anything like that.

The driver in question is drivers/usb/host/xhci-tegra.c and we add the
links in the function tegra_xusb_powerdomain_init() which is before RPM
is enabled. Let me know if you have any thoughts.

Cheers
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance
  2019-02-15 11:00   ` Jon Hunter
@ 2019-02-15 11:57     ` Rafael J. Wysocki
  2019-02-15 12:06     ` Rafael J. Wysocki
  2019-02-15 14:37     ` Ulf Hansson
  2 siblings, 0 replies; 28+ messages in thread
From: Rafael J. Wysocki @ 2019-02-15 11:57 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Rafael J. Wysocki, Greg Kroah-Hartman, LKML, Linux PM,
	Ulf Hansson, Daniel Vetter, Lukas Wunner, Andrzej Hajda,
	Russell King - ARM Linux, Lucas Stach, Linus Walleij,
	Thierry Reding, Laurent Pinchart, Marek Szyprowski, linux-tegra

On Fri, Feb 15, 2019 at 12:00 PM Jon Hunter <jonathanh@nvidia.com> wrote:
>
> Hi Rafael,
>
> On 12/02/2019 12:08, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > If a stateless device link to a certain supplier with
> > DL_FLAG_PM_RUNTIME set in the flags is added and then removed by the
> > consumer driver's probe callback, the supplier's PM-runtime usage
> > counter will be nonzero after that which effectively causes the
> > supplier to remain "always on" going forward.
> >
> > Namely, device_link_add() called to add the link invokes
> > device_link_rpm_prepare() which notices that the consumer driver is
> > probing, so it increments the supplier's PM-runtime usage counter
> > with the assumption that the link will stay around until
> > pm_runtime_put_suppliers() is called by driver_probe_device(),
> > but if the link goes away before that point, the supplier's
> > PM-runtime usage counter will remain nonzero.
> >
> > To prevent that from happening, first rework pm_runtime_get_suppliers()
> > and pm_runtime_put_suppliers() to use the rpm_active refounts of device
> > links and make the latter only drop rpm_active and the supplier's
> > PM-runtime usage counter for each link by one, unless rpm_active is
> > one already for it.  Next, modify device_link_add() to bump up the
> > new link's rpm_active refcount and the suppliers PM-runtime usage
> > counter by two, to prevent pm_runtime_put_suppliers(), if it is
> > called subsequently, from suspending the supplier prematurely (in
> > case its PM-runtime usage counter goes down to 0 in there).
> >
> > Due to the way rpm_put_suppliers() works, this change does not
> > affect runtime suspend of the consumer ends of new device links (or,
> > generally, device links for which DL_FLAG_PM_RUNTIME has just been
> > set).
> >
> > Fixes: e2f3cd831a28 ("driver core: Fix handling of runtime PM flags in device_link_add()")
> > Reported-by: Ulf Hansson <ulf.hansson@linaro.org>
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >
> > Note that the issue had been there before commit e2f3cd831a28, but it was
> > overlooked by that commit and this change is a fix on top of it, so make
> > the Fixes: tag point to commit e2f3cd831a28 (instead of an earlier one
> > that the patch will not be applicable to).
>
> I noticed that yesterday's and today's -next were no longer booting on
> one of our Tegra boards (Tegra210 Jetson TX2) because networking is
> failing. The ethernet chip is a USB device and looking at the bootlogs I
> can see that the Tegra XHCI driver is failing ...

Is it failing because of this particular commit?  That is, does
reverting the entire commit help?

>  tegra-xusb 70090000.usb: xHCI host controller not responding, assume dead
>  tegra-xusb 70090000.usb: HC died; cleaning up
>
> The Tegra XHCI driver uses multiple power-domains and uses
> device_link_add() to attach them. So now I am wondering if there is
> something that we have got wrong in our implementation. However, I don't
> see the device being probed deferred on boot or anything like that.

It won't be, because you use stateless links.

> The driver in question is drivers/usb/host/xhci-tegra.c and we add the
> links in the function tegra_xusb_powerdomain_init() which is before RPM
> is enabled. Let me know if you have any thoughts.

Well, if it breaks, then there is a bug somewhere.  I'm not seeing it
now, but let's dig into this.

Since you don't pass DL_FLAG_RPM_ACTIVE to device_link_add(), the
changes related to that don't matter.

The links are not there before your probe function runs.  It adds the
links and then pm_runtime_put_suppliers() sees them, but since
link->rpm_active is one for the new links, it won't do anything with
them.

Well, there is a difference, but if it matters, then something fishy
is going on IMO.  Before this change pm_runtime_put_suppliers() would
do pm_runtime_put() on the new links' suppliers and (because their
PM-runtime usage counters are both one at that point) it will actually
try to suspend the suppliers.  It should be easy enough to verify if
this really matters, stay tuned.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance
  2019-02-15 11:00   ` Jon Hunter
  2019-02-15 11:57     ` Rafael J. Wysocki
@ 2019-02-15 12:06     ` Rafael J. Wysocki
  2019-02-15 13:21       ` Jon Hunter
  2019-02-15 14:37     ` Ulf Hansson
  2 siblings, 1 reply; 28+ messages in thread
From: Rafael J. Wysocki @ 2019-02-15 12:06 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Greg Kroah-Hartman, LKML, Linux PM, Ulf Hansson, Daniel Vetter,
	Lukas Wunner, Andrzej Hajda, Russell King - ARM Linux,
	Lucas Stach, Linus Walleij, Thierry Reding, Laurent Pinchart,
	Marek Szyprowski, linux-tegra

On Friday, February 15, 2019 12:00:27 PM CET Jon Hunter wrote:
> Hi Rafael,
> 
> On 12/02/2019 12:08, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > 
> > If a stateless device link to a certain supplier with
> > DL_FLAG_PM_RUNTIME set in the flags is added and then removed by the
> > consumer driver's probe callback, the supplier's PM-runtime usage
> > counter will be nonzero after that which effectively causes the
> > supplier to remain "always on" going forward.
> > 
> > Namely, device_link_add() called to add the link invokes
> > device_link_rpm_prepare() which notices that the consumer driver is
> > probing, so it increments the supplier's PM-runtime usage counter
> > with the assumption that the link will stay around until
> > pm_runtime_put_suppliers() is called by driver_probe_device(),
> > but if the link goes away before that point, the supplier's
> > PM-runtime usage counter will remain nonzero.
> > 
> > To prevent that from happening, first rework pm_runtime_get_suppliers()
> > and pm_runtime_put_suppliers() to use the rpm_active refounts of device
> > links and make the latter only drop rpm_active and the supplier's
> > PM-runtime usage counter for each link by one, unless rpm_active is
> > one already for it.  Next, modify device_link_add() to bump up the
> > new link's rpm_active refcount and the suppliers PM-runtime usage
> > counter by two, to prevent pm_runtime_put_suppliers(), if it is
> > called subsequently, from suspending the supplier prematurely (in
> > case its PM-runtime usage counter goes down to 0 in there).
> > 
> > Due to the way rpm_put_suppliers() works, this change does not
> > affect runtime suspend of the consumer ends of new device links (or,
> > generally, device links for which DL_FLAG_PM_RUNTIME has just been
> > set).
> > 
> > Fixes: e2f3cd831a28 ("driver core: Fix handling of runtime PM flags in device_link_add()")
> > Reported-by: Ulf Hansson <ulf.hansson@linaro.org> 
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> > 
> > Note that the issue had been there before commit e2f3cd831a28, but it was
> > overlooked by that commit and this change is a fix on top of it, so make
> > the Fixes: tag point to commit e2f3cd831a28 (instead of an earlier one
> > that the patch will not be applicable to).
> I noticed that yesterday's and today's -next were no longer booting on
> one of our Tegra boards (Tegra210 Jetson TX2) because networking is
> failing. The ethernet chip is a USB device and looking at the bootlogs I
> can see that the Tegra XHCI driver is failing ...
> 
>  tegra-xusb 70090000.usb: xHCI host controller not responding, assume dead
>  tegra-xusb 70090000.usb: HC died; cleaning up
> 
> The Tegra XHCI driver uses multiple power-domains and uses
> device_link_add() to attach them. So now I am wondering if there is
> something that we have got wrong in our implementation. However, I don't
> see the device being probed deferred on boot or anything like that.
> 
> The driver in question is drivers/usb/host/xhci-tegra.c and we add the
> links in the function tegra_xusb_powerdomain_init() which is before RPM
> is enabled. Let me know if you have any thoughts.

Please try the appended patch on top of the $subject one (provided that
reverting the $subject patch makes the problem go away).

---
 drivers/base/power/runtime.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Index: linux-pm/drivers/base/power/runtime.c
===================================================================
--- linux-pm.orig/drivers/base/power/runtime.c
+++ linux-pm/drivers/base/power/runtime.c
@@ -1675,9 +1675,12 @@ void pm_runtime_put_suppliers(struct dev
 	idx = device_links_read_lock();
 
 	list_for_each_entry_rcu(link, &dev->links.suppliers, c_node)
-		if (link->flags & DL_FLAG_PM_RUNTIME &&
-		    refcount_dec_not_one(&link->rpm_active))
-			pm_runtime_put(link->supplier);
+		if (link->flags & DL_FLAG_PM_RUNTIME) {
+			if (refcount_dec_not_one(&link->rpm_active))
+				pm_runtime_put(link->supplier);
+			else
+				pm_request_idle(link->supplier);
+		}
 
 	device_links_read_unlock(idx);
 }

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance
  2019-02-15 12:06     ` Rafael J. Wysocki
@ 2019-02-15 13:21       ` Jon Hunter
  2019-02-15 14:14         ` Jon Hunter
  0 siblings, 1 reply; 28+ messages in thread
From: Jon Hunter @ 2019-02-15 13:21 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Greg Kroah-Hartman, LKML, Linux PM, Ulf Hansson, Daniel Vetter,
	Lukas Wunner, Andrzej Hajda, Russell King - ARM Linux,
	Lucas Stach, Linus Walleij, Thierry Reding, Laurent Pinchart,
	Marek Szyprowski, linux-tegra


On 15/02/2019 12:06, Rafael J. Wysocki wrote:
> On Friday, February 15, 2019 12:00:27 PM CET Jon Hunter wrote:
>> Hi Rafael,
>>
>> On 12/02/2019 12:08, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> If a stateless device link to a certain supplier with
>>> DL_FLAG_PM_RUNTIME set in the flags is added and then removed by the
>>> consumer driver's probe callback, the supplier's PM-runtime usage
>>> counter will be nonzero after that which effectively causes the
>>> supplier to remain "always on" going forward.
>>>
>>> Namely, device_link_add() called to add the link invokes
>>> device_link_rpm_prepare() which notices that the consumer driver is
>>> probing, so it increments the supplier's PM-runtime usage counter
>>> with the assumption that the link will stay around until
>>> pm_runtime_put_suppliers() is called by driver_probe_device(),
>>> but if the link goes away before that point, the supplier's
>>> PM-runtime usage counter will remain nonzero.
>>>
>>> To prevent that from happening, first rework pm_runtime_get_suppliers()
>>> and pm_runtime_put_suppliers() to use the rpm_active refounts of device
>>> links and make the latter only drop rpm_active and the supplier's
>>> PM-runtime usage counter for each link by one, unless rpm_active is
>>> one already for it.  Next, modify device_link_add() to bump up the
>>> new link's rpm_active refcount and the suppliers PM-runtime usage
>>> counter by two, to prevent pm_runtime_put_suppliers(), if it is
>>> called subsequently, from suspending the supplier prematurely (in
>>> case its PM-runtime usage counter goes down to 0 in there).
>>>
>>> Due to the way rpm_put_suppliers() works, this change does not
>>> affect runtime suspend of the consumer ends of new device links (or,
>>> generally, device links for which DL_FLAG_PM_RUNTIME has just been
>>> set).
>>>
>>> Fixes: e2f3cd831a28 ("driver core: Fix handling of runtime PM flags in device_link_add()")
>>> Reported-by: Ulf Hansson <ulf.hansson@linaro.org> 
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> ---
>>>
>>> Note that the issue had been there before commit e2f3cd831a28, but it was
>>> overlooked by that commit and this change is a fix on top of it, so make
>>> the Fixes: tag point to commit e2f3cd831a28 (instead of an earlier one
>>> that the patch will not be applicable to).
>> I noticed that yesterday's and today's -next were no longer booting on
>> one of our Tegra boards (Tegra210 Jetson TX2) because networking is
>> failing. The ethernet chip is a USB device and looking at the bootlogs I
>> can see that the Tegra XHCI driver is failing ...
>>
>>  tegra-xusb 70090000.usb: xHCI host controller not responding, assume dead
>>  tegra-xusb 70090000.usb: HC died; cleaning up
>>
>> The Tegra XHCI driver uses multiple power-domains and uses
>> device_link_add() to attach them. So now I am wondering if there is
>> something that we have got wrong in our implementation. However, I don't
>> see the device being probed deferred on boot or anything like that.
>>
>> The driver in question is drivers/usb/host/xhci-tegra.c and we add the
>> links in the function tegra_xusb_powerdomain_init() which is before RPM
>> is enabled. Let me know if you have any thoughts.
> 
> Please try the appended patch on top of the $subject one (provided that
> reverting the $subject patch makes the problem go away).

Thanks and yes to confirm, reverting the $subject patch on top of next
does make the issue go away.

> ---
>  drivers/base/power/runtime.c |    9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> Index: linux-pm/drivers/base/power/runtime.c
> ===================================================================
> --- linux-pm.orig/drivers/base/power/runtime.c
> +++ linux-pm/drivers/base/power/runtime.c
> @@ -1675,9 +1675,12 @@ void pm_runtime_put_suppliers(struct dev
>  	idx = device_links_read_lock();
>  
>  	list_for_each_entry_rcu(link, &dev->links.suppliers, c_node)
> -		if (link->flags & DL_FLAG_PM_RUNTIME &&
> -		    refcount_dec_not_one(&link->rpm_active))
> -			pm_runtime_put(link->supplier);
> +		if (link->flags & DL_FLAG_PM_RUNTIME) {
> +			if (refcount_dec_not_one(&link->rpm_active))
> +				pm_runtime_put(link->supplier);
> +			else
> +				pm_request_idle(link->supplier);
> +		}
>  
>  	device_links_read_unlock(idx);
>  }

I will try this now and report back in a bit.

Cheers
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance
  2019-02-15 13:21       ` Jon Hunter
@ 2019-02-15 14:14         ` Jon Hunter
  0 siblings, 0 replies; 28+ messages in thread
From: Jon Hunter @ 2019-02-15 14:14 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Greg Kroah-Hartman, LKML, Linux PM, Ulf Hansson, Daniel Vetter,
	Lukas Wunner, Andrzej Hajda, Russell King - ARM Linux,
	Lucas Stach, Linus Walleij, Thierry Reding, Laurent Pinchart,
	Marek Szyprowski, linux-tegra


On 15/02/2019 13:21, Jon Hunter wrote:
> 
> On 15/02/2019 12:06, Rafael J. Wysocki wrote:
>> On Friday, February 15, 2019 12:00:27 PM CET Jon Hunter wrote:
>>> Hi Rafael,
>>>
>>> On 12/02/2019 12:08, Rafael J. Wysocki wrote:
>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>>
>>>> If a stateless device link to a certain supplier with
>>>> DL_FLAG_PM_RUNTIME set in the flags is added and then removed by the
>>>> consumer driver's probe callback, the supplier's PM-runtime usage
>>>> counter will be nonzero after that which effectively causes the
>>>> supplier to remain "always on" going forward.
>>>>
>>>> Namely, device_link_add() called to add the link invokes
>>>> device_link_rpm_prepare() which notices that the consumer driver is
>>>> probing, so it increments the supplier's PM-runtime usage counter
>>>> with the assumption that the link will stay around until
>>>> pm_runtime_put_suppliers() is called by driver_probe_device(),
>>>> but if the link goes away before that point, the supplier's
>>>> PM-runtime usage counter will remain nonzero.
>>>>
>>>> To prevent that from happening, first rework pm_runtime_get_suppliers()
>>>> and pm_runtime_put_suppliers() to use the rpm_active refounts of device
>>>> links and make the latter only drop rpm_active and the supplier's
>>>> PM-runtime usage counter for each link by one, unless rpm_active is
>>>> one already for it.  Next, modify device_link_add() to bump up the
>>>> new link's rpm_active refcount and the suppliers PM-runtime usage
>>>> counter by two, to prevent pm_runtime_put_suppliers(), if it is
>>>> called subsequently, from suspending the supplier prematurely (in
>>>> case its PM-runtime usage counter goes down to 0 in there).
>>>>
>>>> Due to the way rpm_put_suppliers() works, this change does not
>>>> affect runtime suspend of the consumer ends of new device links (or,
>>>> generally, device links for which DL_FLAG_PM_RUNTIME has just been
>>>> set).
>>>>
>>>> Fixes: e2f3cd831a28 ("driver core: Fix handling of runtime PM flags in device_link_add()")
>>>> Reported-by: Ulf Hansson <ulf.hansson@linaro.org> 
>>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>> ---
>>>>
>>>> Note that the issue had been there before commit e2f3cd831a28, but it was
>>>> overlooked by that commit and this change is a fix on top of it, so make
>>>> the Fixes: tag point to commit e2f3cd831a28 (instead of an earlier one
>>>> that the patch will not be applicable to).
>>> I noticed that yesterday's and today's -next were no longer booting on
>>> one of our Tegra boards (Tegra210 Jetson TX2) because networking is
>>> failing. The ethernet chip is a USB device and looking at the bootlogs I
>>> can see that the Tegra XHCI driver is failing ...
>>>
>>>  tegra-xusb 70090000.usb: xHCI host controller not responding, assume dead
>>>  tegra-xusb 70090000.usb: HC died; cleaning up
>>>
>>> The Tegra XHCI driver uses multiple power-domains and uses
>>> device_link_add() to attach them. So now I am wondering if there is
>>> something that we have got wrong in our implementation. However, I don't
>>> see the device being probed deferred on boot or anything like that.
>>>
>>> The driver in question is drivers/usb/host/xhci-tegra.c and we add the
>>> links in the function tegra_xusb_powerdomain_init() which is before RPM
>>> is enabled. Let me know if you have any thoughts.
>>
>> Please try the appended patch on top of the $subject one (provided that
>> reverting the $subject patch makes the problem go away).
> 
> Thanks and yes to confirm, reverting the $subject patch on top of next
> does make the issue go away.
> 
>> ---
>>  drivers/base/power/runtime.c |    9 ++++++---
>>  1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> Index: linux-pm/drivers/base/power/runtime.c
>> ===================================================================
>> --- linux-pm.orig/drivers/base/power/runtime.c
>> +++ linux-pm/drivers/base/power/runtime.c
>> @@ -1675,9 +1675,12 @@ void pm_runtime_put_suppliers(struct dev
>>  	idx = device_links_read_lock();
>>  
>>  	list_for_each_entry_rcu(link, &dev->links.suppliers, c_node)
>> -		if (link->flags & DL_FLAG_PM_RUNTIME &&
>> -		    refcount_dec_not_one(&link->rpm_active))
>> -			pm_runtime_put(link->supplier);
>> +		if (link->flags & DL_FLAG_PM_RUNTIME) {
>> +			if (refcount_dec_not_one(&link->rpm_active))
>> +				pm_runtime_put(link->supplier);
>> +			else
>> +				pm_request_idle(link->supplier);
>> +		}
>>  
>>  	device_links_read_unlock(idx);
>>  }
> 
> I will try this now and report back in a bit.

I tried this on top of next, but unfortunately the same issue still
persists and so this did not fix it. Let me know if there is any debug I
can add/enable.

Cheers
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance
  2019-02-15 11:00   ` Jon Hunter
  2019-02-15 11:57     ` Rafael J. Wysocki
  2019-02-15 12:06     ` Rafael J. Wysocki
@ 2019-02-15 14:37     ` Ulf Hansson
  2019-02-15 16:44         ` Jon Hunter
  2 siblings, 1 reply; 28+ messages in thread
From: Ulf Hansson @ 2019-02-15 14:37 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Rafael J. Wysocki, Greg Kroah-Hartman, LKML, Linux PM,
	Daniel Vetter, Lukas Wunner, Andrzej Hajda,
	Russell King - ARM Linux, Lucas Stach, Linus Walleij,
	Thierry Reding, Laurent Pinchart, Marek Szyprowski, linux-tegra

On Fri, 15 Feb 2019 at 12:00, Jon Hunter <jonathanh@nvidia.com> wrote:
>
> Hi Rafael,
>
> On 12/02/2019 12:08, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > If a stateless device link to a certain supplier with
> > DL_FLAG_PM_RUNTIME set in the flags is added and then removed by the
> > consumer driver's probe callback, the supplier's PM-runtime usage
> > counter will be nonzero after that which effectively causes the
> > supplier to remain "always on" going forward.
> >
> > Namely, device_link_add() called to add the link invokes
> > device_link_rpm_prepare() which notices that the consumer driver is
> > probing, so it increments the supplier's PM-runtime usage counter
> > with the assumption that the link will stay around until
> > pm_runtime_put_suppliers() is called by driver_probe_device(),
> > but if the link goes away before that point, the supplier's
> > PM-runtime usage counter will remain nonzero.
> >
> > To prevent that from happening, first rework pm_runtime_get_suppliers()
> > and pm_runtime_put_suppliers() to use the rpm_active refounts of device
> > links and make the latter only drop rpm_active and the supplier's
> > PM-runtime usage counter for each link by one, unless rpm_active is
> > one already for it.  Next, modify device_link_add() to bump up the
> > new link's rpm_active refcount and the suppliers PM-runtime usage
> > counter by two, to prevent pm_runtime_put_suppliers(), if it is
> > called subsequently, from suspending the supplier prematurely (in
> > case its PM-runtime usage counter goes down to 0 in there).
> >
> > Due to the way rpm_put_suppliers() works, this change does not
> > affect runtime suspend of the consumer ends of new device links (or,
> > generally, device links for which DL_FLAG_PM_RUNTIME has just been
> > set).
> >
> > Fixes: e2f3cd831a28 ("driver core: Fix handling of runtime PM flags in device_link_add()")
> > Reported-by: Ulf Hansson <ulf.hansson@linaro.org>
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >
> > Note that the issue had been there before commit e2f3cd831a28, but it was
> > overlooked by that commit and this change is a fix on top of it, so make
> > the Fixes: tag point to commit e2f3cd831a28 (instead of an earlier one
> > that the patch will not be applicable to).
> I noticed that yesterday's and today's -next were no longer booting on
> one of our Tegra boards (Tegra210 Jetson TX2) because networking is
> failing. The ethernet chip is a USB device and looking at the bootlogs I
> can see that the Tegra XHCI driver is failing ...
>
>  tegra-xusb 70090000.usb: xHCI host controller not responding, assume dead
>  tegra-xusb 70090000.usb: HC died; cleaning up
>
> The Tegra XHCI driver uses multiple power-domains and uses
> device_link_add() to attach them. So now I am wondering if there is
> something that we have got wrong in our implementation. However, I don't
> see the device being probed deferred on boot or anything like that.
>
> The driver in question is drivers/usb/host/xhci-tegra.c and we add the
> links in the function tegra_xusb_powerdomain_init() which is before RPM
> is enabled. Let me know if you have any thoughts.

If you are willing to help debugging then I am offering my assistance.

I would start by enabling CONFIG_PM_ADVANCED_DEBUG, which gives you
some more information about the runtime PM state of the device, like
the usage count for example.
I would also add a couple of prints in
tegra_xusb_runtime_suspend|resume() and in the ->power_on|off()
callbacks for the corresponding genpds, to see when those gets called.

While I was testing $subject patch I also used a local debug patch,
which adds a sysfs node that can be used to get the state of linked
suppliers for a consumer device. Feel free to use it, attached below.

Of course, the interesting part is the comparison of what happens with
and without $subject patch.

From: Ulf Hansson <ulf.hansson@linaro.org>
Date: Mon, 11 Feb 2019 15:37:44 +0100
Subject: [PATCH] PM / Runtime: Add sysfs for runtime counting of suppliers

Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
---
 drivers/base/power/sysfs.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/base/power/sysfs.c b/drivers/base/power/sysfs.c
index d713738ce796..ce5c188cdf54 100644
--- a/drivers/base/power/sysfs.c
+++ b/drivers/base/power/sysfs.c
@@ -537,6 +537,25 @@ static ssize_t runtime_enabled_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(runtime_enabled);

+static ssize_t runtime_suppliers_show(struct device *dev,
+                                 struct device_attribute *attr, char *buf)
+{
+       struct device_link *link;
+       int chars = 0;
+
+       list_for_each_entry_rcu(link, &dev->links.suppliers, c_node) {
+
+               if (!(link->flags & DL_FLAG_PM_RUNTIME))
+                       continue;
+
+               chars += sprintf(buf + chars, "%s %d\n",
+                               dev_name(link->supplier),
+                               refcount_read(&link->rpm_active));
+       }
+       return chars;
+}
+static DEVICE_ATTR_RO(runtime_suppliers);
+
 #ifdef CONFIG_PM_SLEEP
 static ssize_t async_show(struct device *dev, struct device_attribute *attr,
                          char *buf)
@@ -572,6 +591,7 @@ static struct attribute *power_attrs[] = {
        &dev_attr_runtime_usage.attr,
        &dev_attr_runtime_active_kids.attr,
        &dev_attr_runtime_enabled.attr,
+       &dev_attr_runtime_suppliers.attr,
 #endif /* CONFIG_PM_ADVANCED_DEBUG */
        NULL,
 };
-- 
2.17.1

Kind regards
Uffe

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance
  2019-02-15 14:37     ` Ulf Hansson
@ 2019-02-15 16:44         ` Jon Hunter
  0 siblings, 0 replies; 28+ messages in thread
From: Jon Hunter @ 2019-02-15 16:44 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Rafael J. Wysocki, Greg Kroah-Hartman, LKML, Linux PM,
	Daniel Vetter, Lukas Wunner, Andrzej Hajda,
	Russell King - ARM Linux, Lucas Stach, Linus Walleij,
	Thierry Reding, Laurent Pinchart, Marek Szyprowski, linux-tegra


On 15/02/2019 14:37, Ulf Hansson wrote:
> On Fri, 15 Feb 2019 at 12:00, Jon Hunter <jonathanh@nvidia.com> wrote:
>>
>> Hi Rafael,
>>
>> On 12/02/2019 12:08, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> If a stateless device link to a certain supplier with
>>> DL_FLAG_PM_RUNTIME set in the flags is added and then removed by the
>>> consumer driver's probe callback, the supplier's PM-runtime usage
>>> counter will be nonzero after that which effectively causes the
>>> supplier to remain "always on" going forward.
>>>
>>> Namely, device_link_add() called to add the link invokes
>>> device_link_rpm_prepare() which notices that the consumer driver is
>>> probing, so it increments the supplier's PM-runtime usage counter
>>> with the assumption that the link will stay around until
>>> pm_runtime_put_suppliers() is called by driver_probe_device(),
>>> but if the link goes away before that point, the supplier's
>>> PM-runtime usage counter will remain nonzero.
>>>
>>> To prevent that from happening, first rework pm_runtime_get_suppliers()
>>> and pm_runtime_put_suppliers() to use the rpm_active refounts of device
>>> links and make the latter only drop rpm_active and the supplier's
>>> PM-runtime usage counter for each link by one, unless rpm_active is
>>> one already for it.  Next, modify device_link_add() to bump up the
>>> new link's rpm_active refcount and the suppliers PM-runtime usage
>>> counter by two, to prevent pm_runtime_put_suppliers(), if it is
>>> called subsequently, from suspending the supplier prematurely (in
>>> case its PM-runtime usage counter goes down to 0 in there).
>>>
>>> Due to the way rpm_put_suppliers() works, this change does not
>>> affect runtime suspend of the consumer ends of new device links (or,
>>> generally, device links for which DL_FLAG_PM_RUNTIME has just been
>>> set).
>>>
>>> Fixes: e2f3cd831a28 ("driver core: Fix handling of runtime PM flags in device_link_add()")
>>> Reported-by: Ulf Hansson <ulf.hansson@linaro.org>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> ---
>>>
>>> Note that the issue had been there before commit e2f3cd831a28, but it was
>>> overlooked by that commit and this change is a fix on top of it, so make
>>> the Fixes: tag point to commit e2f3cd831a28 (instead of an earlier one
>>> that the patch will not be applicable to).
>> I noticed that yesterday's and today's -next were no longer booting on
>> one of our Tegra boards (Tegra210 Jetson TX2) because networking is
>> failing. The ethernet chip is a USB device and looking at the bootlogs I
>> can see that the Tegra XHCI driver is failing ...
>>
>>  tegra-xusb 70090000.usb: xHCI host controller not responding, assume dead
>>  tegra-xusb 70090000.usb: HC died; cleaning up
>>
>> The Tegra XHCI driver uses multiple power-domains and uses
>> device_link_add() to attach them. So now I am wondering if there is
>> something that we have got wrong in our implementation. However, I don't
>> see the device being probed deferred on boot or anything like that.
>>
>> The driver in question is drivers/usb/host/xhci-tegra.c and we add the
>> links in the function tegra_xusb_powerdomain_init() which is before RPM
>> is enabled. Let me know if you have any thoughts.
> 
> If you are willing to help debugging then I am offering my assistance.
> 
> I would start by enabling CONFIG_PM_ADVANCED_DEBUG, which gives you
> some more information about the runtime PM state of the device, like
> the usage count for example.
> I would also add a couple of prints in
> tegra_xusb_runtime_suspend|resume() and in the ->power_on|off()
> callbacks for the corresponding genpds, to see when those gets called.

From the bootlog I see ...

[    4.445827] tegra_xusb_runtime_resume-788
[    4.508799] tegra-xusb 70090000.usb: Firmware timestamp: 2015-08-10 09:47:54 UTC
[    4.516223] tegra-xusb 70090000.usb: xHCI Host Controller
[    4.521622] tegra-xusb 70090000.usb: new USB bus registered, assigned bus number 1
[    4.530087] tegra-xusb 70090000.usb: hcc params 0x0184f525 hci version 0x100 quirks 0x0000000000010010
[    4.539398] tegra-xusb 70090000.usb: irq 69, io mem 0x70090000
[    4.553671] tegra-xusb 70090000.usb: xHCI Host Controller
[    4.559064] tegra-xusb 70090000.usb: new USB bus registered, assigned bus number 2
[    4.566622] tegra-xusb 70090000.usb: Host supports USB 3.0  SuperSpeed
[    4.595393] tegra-pmc: tegra_genpd_power_off-673: xusbc
[    4.600672] tegra-pmc: tegra_genpd_power_off-673: xusba
[    4.657346] tegra-xusb 70090000.usb: xHCI host controller not responding, assume dead
[    4.665157] tegra-xusb 70090000.usb: HC died; cleaning up

This shows the xusb controller is runtime resumed during probe but
then after probe the pm-domains, xusba and xusbc, are turned off
without it being runtime suspended. We never suspend it until it
is removed currently. 

Following boot the xusb device appears to still be active ... 

$ cat /sys/devices/platform/70090000.usb/power/runtime_status 
active

... but the status reported by the pm_genpd summary disagrees ...

$ sudo cat /sys/kernel/debug/pm_genpd/pm_genpd_summary
domain                          status          slaves
    /device                                             runtime status
----------------------------------------------------------------------
vic                             off-0           
    /devices/platform/50000000.host1x/54340000.vic      suspended
xusbc                           off-0           
    /devices/genpd:0:70090000.usb                       suspended
xusbb                           off-0           
xusba                           off-0           
    /devices/genpd:1:70090000.usb                       suspended
sor                             on              
    /devices/platform/700e3000.mipi                     unsupported
    /devices/platform/50000000.host1x/54300000.dsi      active
    /devices/platform/50000000.host1x/54040000.dpaux    active
    /devices/platform/50000000.host1x/54580000.sor      suspended
aud                             off-0           

... and ...

$ cat /sys/devices/platform/70090000.usb/power/runtime_suppliers 
genpd:0:70090000.usb 1
genpd:1:70090000.usb 1

Let me know if you have any thoughts.

Cheers
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance
@ 2019-02-15 16:44         ` Jon Hunter
  0 siblings, 0 replies; 28+ messages in thread
From: Jon Hunter @ 2019-02-15 16:44 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Rafael J. Wysocki, Greg Kroah-Hartman, LKML, Linux PM,
	Daniel Vetter, Lukas Wunner, Andrzej Hajda,
	Russell King - ARM Linux, Lucas Stach, Linus Walleij,
	Thierry Reding, Laurent Pinchart, Marek Szyprowski, linux-tegra


On 15/02/2019 14:37, Ulf Hansson wrote:
> On Fri, 15 Feb 2019 at 12:00, Jon Hunter <jonathanh@nvidia.com> wrote:
>>
>> Hi Rafael,
>>
>> On 12/02/2019 12:08, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> If a stateless device link to a certain supplier with
>>> DL_FLAG_PM_RUNTIME set in the flags is added and then removed by the
>>> consumer driver's probe callback, the supplier's PM-runtime usage
>>> counter will be nonzero after that which effectively causes the
>>> supplier to remain "always on" going forward.
>>>
>>> Namely, device_link_add() called to add the link invokes
>>> device_link_rpm_prepare() which notices that the consumer driver is
>>> probing, so it increments the supplier's PM-runtime usage counter
>>> with the assumption that the link will stay around until
>>> pm_runtime_put_suppliers() is called by driver_probe_device(),
>>> but if the link goes away before that point, the supplier's
>>> PM-runtime usage counter will remain nonzero.
>>>
>>> To prevent that from happening, first rework pm_runtime_get_suppliers()
>>> and pm_runtime_put_suppliers() to use the rpm_active refounts of device
>>> links and make the latter only drop rpm_active and the supplier's
>>> PM-runtime usage counter for each link by one, unless rpm_active is
>>> one already for it.  Next, modify device_link_add() to bump up the
>>> new link's rpm_active refcount and the suppliers PM-runtime usage
>>> counter by two, to prevent pm_runtime_put_suppliers(), if it is
>>> called subsequently, from suspending the supplier prematurely (in
>>> case its PM-runtime usage counter goes down to 0 in there).
>>>
>>> Due to the way rpm_put_suppliers() works, this change does not
>>> affect runtime suspend of the consumer ends of new device links (or,
>>> generally, device links for which DL_FLAG_PM_RUNTIME has just been
>>> set).
>>>
>>> Fixes: e2f3cd831a28 ("driver core: Fix handling of runtime PM flags in device_link_add()")
>>> Reported-by: Ulf Hansson <ulf.hansson@linaro.org>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> ---
>>>
>>> Note that the issue had been there before commit e2f3cd831a28, but it was
>>> overlooked by that commit and this change is a fix on top of it, so make
>>> the Fixes: tag point to commit e2f3cd831a28 (instead of an earlier one
>>> that the patch will not be applicable to).
>> I noticed that yesterday's and today's -next were no longer booting on
>> one of our Tegra boards (Tegra210 Jetson TX2) because networking is
>> failing. The ethernet chip is a USB device and looking at the bootlogs I
>> can see that the Tegra XHCI driver is failing ...
>>
>>  tegra-xusb 70090000.usb: xHCI host controller not responding, assume dead
>>  tegra-xusb 70090000.usb: HC died; cleaning up
>>
>> The Tegra XHCI driver uses multiple power-domains and uses
>> device_link_add() to attach them. So now I am wondering if there is
>> something that we have got wrong in our implementation. However, I don't
>> see the device being probed deferred on boot or anything like that.
>>
>> The driver in question is drivers/usb/host/xhci-tegra.c and we add the
>> links in the function tegra_xusb_powerdomain_init() which is before RPM
>> is enabled. Let me know if you have any thoughts.
> 
> If you are willing to help debugging then I am offering my assistance.
> 
> I would start by enabling CONFIG_PM_ADVANCED_DEBUG, which gives you
> some more information about the runtime PM state of the device, like
> the usage count for example.
> I would also add a couple of prints in
> tegra_xusb_runtime_suspend|resume() and in the ->power_on|off()
> callbacks for the corresponding genpds, to see when those gets called.

>From the bootlog I see ...

[    4.445827] tegra_xusb_runtime_resume-788
[    4.508799] tegra-xusb 70090000.usb: Firmware timestamp: 2015-08-10 09:47:54 UTC
[    4.516223] tegra-xusb 70090000.usb: xHCI Host Controller
[    4.521622] tegra-xusb 70090000.usb: new USB bus registered, assigned bus number 1
[    4.530087] tegra-xusb 70090000.usb: hcc params 0x0184f525 hci version 0x100 quirks 0x0000000000010010
[    4.539398] tegra-xusb 70090000.usb: irq 69, io mem 0x70090000
[    4.553671] tegra-xusb 70090000.usb: xHCI Host Controller
[    4.559064] tegra-xusb 70090000.usb: new USB bus registered, assigned bus number 2
[    4.566622] tegra-xusb 70090000.usb: Host supports USB 3.0  SuperSpeed
[    4.595393] tegra-pmc: tegra_genpd_power_off-673: xusbc
[    4.600672] tegra-pmc: tegra_genpd_power_off-673: xusba
[    4.657346] tegra-xusb 70090000.usb: xHCI host controller not responding, assume dead
[    4.665157] tegra-xusb 70090000.usb: HC died; cleaning up

This shows the xusb controller is runtime resumed during probe but
then after probe the pm-domains, xusba and xusbc, are turned off
without it being runtime suspended. We never suspend it until it
is removed currently. 

Following boot the xusb device appears to still be active ... 

$ cat /sys/devices/platform/70090000.usb/power/runtime_status 
active

... but the status reported by the pm_genpd summary disagrees ...

$ sudo cat /sys/kernel/debug/pm_genpd/pm_genpd_summary
domain                          status          slaves
    /device                                             runtime status
----------------------------------------------------------------------
vic                             off-0           
    /devices/platform/50000000.host1x/54340000.vic      suspended
xusbc                           off-0           
    /devices/genpd:0:70090000.usb                       suspended
xusbb                           off-0           
xusba                           off-0           
    /devices/genpd:1:70090000.usb                       suspended
sor                             on              
    /devices/platform/700e3000.mipi                     unsupported
    /devices/platform/50000000.host1x/54300000.dsi      active
    /devices/platform/50000000.host1x/54040000.dpaux    active
    /devices/platform/50000000.host1x/54580000.sor      suspended
aud                             off-0           

... and ...

$ cat /sys/devices/platform/70090000.usb/power/runtime_suppliers 
genpd:0:70090000.usb 1
genpd:1:70090000.usb 1

Let me know if you have any thoughts.

Cheers
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance
  2019-02-15 16:44         ` Jon Hunter
  (?)
@ 2019-02-17 21:33         ` Rafael J. Wysocki
  -1 siblings, 0 replies; 28+ messages in thread
From: Rafael J. Wysocki @ 2019-02-17 21:33 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Ulf Hansson, Rafael J. Wysocki, Greg Kroah-Hartman, LKML,
	Linux PM, Daniel Vetter, Lukas Wunner, Andrzej Hajda,
	Russell King - ARM Linux, Lucas Stach, Linus Walleij,
	Thierry Reding, Laurent Pinchart, Marek Szyprowski, linux-tegra

On Fri, Feb 15, 2019 at 5:44 PM Jon Hunter <jonathanh@nvidia.com> wrote:
>
>
> On 15/02/2019 14:37, Ulf Hansson wrote:
> > On Fri, 15 Feb 2019 at 12:00, Jon Hunter <jonathanh@nvidia.com> wrote:
> >>
> >> Hi Rafael,
> >>
> >> On 12/02/2019 12:08, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>
> >>> If a stateless device link to a certain supplier with
> >>> DL_FLAG_PM_RUNTIME set in the flags is added and then removed by the
> >>> consumer driver's probe callback, the supplier's PM-runtime usage
> >>> counter will be nonzero after that which effectively causes the
> >>> supplier to remain "always on" going forward.
> >>>
> >>> Namely, device_link_add() called to add the link invokes
> >>> device_link_rpm_prepare() which notices that the consumer driver is
> >>> probing, so it increments the supplier's PM-runtime usage counter
> >>> with the assumption that the link will stay around until
> >>> pm_runtime_put_suppliers() is called by driver_probe_device(),
> >>> but if the link goes away before that point, the supplier's
> >>> PM-runtime usage counter will remain nonzero.
> >>>
> >>> To prevent that from happening, first rework pm_runtime_get_suppliers()
> >>> and pm_runtime_put_suppliers() to use the rpm_active refounts of device
> >>> links and make the latter only drop rpm_active and the supplier's
> >>> PM-runtime usage counter for each link by one, unless rpm_active is
> >>> one already for it.  Next, modify device_link_add() to bump up the
> >>> new link's rpm_active refcount and the suppliers PM-runtime usage
> >>> counter by two, to prevent pm_runtime_put_suppliers(), if it is
> >>> called subsequently, from suspending the supplier prematurely (in
> >>> case its PM-runtime usage counter goes down to 0 in there).
> >>>
> >>> Due to the way rpm_put_suppliers() works, this change does not
> >>> affect runtime suspend of the consumer ends of new device links (or,
> >>> generally, device links for which DL_FLAG_PM_RUNTIME has just been
> >>> set).
> >>>
> >>> Fixes: e2f3cd831a28 ("driver core: Fix handling of runtime PM flags in device_link_add()")
> >>> Reported-by: Ulf Hansson <ulf.hansson@linaro.org>
> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>> ---
> >>>
> >>> Note that the issue had been there before commit e2f3cd831a28, but it was
> >>> overlooked by that commit and this change is a fix on top of it, so make
> >>> the Fixes: tag point to commit e2f3cd831a28 (instead of an earlier one
> >>> that the patch will not be applicable to).
> >> I noticed that yesterday's and today's -next were no longer booting on
> >> one of our Tegra boards (Tegra210 Jetson TX2) because networking is
> >> failing. The ethernet chip is a USB device and looking at the bootlogs I
> >> can see that the Tegra XHCI driver is failing ...
> >>
> >>  tegra-xusb 70090000.usb: xHCI host controller not responding, assume dead
> >>  tegra-xusb 70090000.usb: HC died; cleaning up
> >>
> >> The Tegra XHCI driver uses multiple power-domains and uses
> >> device_link_add() to attach them. So now I am wondering if there is
> >> something that we have got wrong in our implementation. However, I don't
> >> see the device being probed deferred on boot or anything like that.
> >>
> >> The driver in question is drivers/usb/host/xhci-tegra.c and we add the
> >> links in the function tegra_xusb_powerdomain_init() which is before RPM
> >> is enabled. Let me know if you have any thoughts.
> >
> > If you are willing to help debugging then I am offering my assistance.
> >
> > I would start by enabling CONFIG_PM_ADVANCED_DEBUG, which gives you
> > some more information about the runtime PM state of the device, like
> > the usage count for example.
> > I would also add a couple of prints in
> > tegra_xusb_runtime_suspend|resume() and in the ->power_on|off()
> > callbacks for the corresponding genpds, to see when those gets called.
>
> From the bootlog I see ...
>
> [    4.445827] tegra_xusb_runtime_resume-788
> [    4.508799] tegra-xusb 70090000.usb: Firmware timestamp: 2015-08-10 09:47:54 UTC
> [    4.516223] tegra-xusb 70090000.usb: xHCI Host Controller
> [    4.521622] tegra-xusb 70090000.usb: new USB bus registered, assigned bus number 1
> [    4.530087] tegra-xusb 70090000.usb: hcc params 0x0184f525 hci version 0x100 quirks 0x0000000000010010
> [    4.539398] tegra-xusb 70090000.usb: irq 69, io mem 0x70090000
> [    4.553671] tegra-xusb 70090000.usb: xHCI Host Controller
> [    4.559064] tegra-xusb 70090000.usb: new USB bus registered, assigned bus number 2
> [    4.566622] tegra-xusb 70090000.usb: Host supports USB 3.0  SuperSpeed
> [    4.595393] tegra-pmc: tegra_genpd_power_off-673: xusbc
> [    4.600672] tegra-pmc: tegra_genpd_power_off-673: xusba
> [    4.657346] tegra-xusb 70090000.usb: xHCI host controller not responding, assume dead
> [    4.665157] tegra-xusb 70090000.usb: HC died; cleaning up
>
> This shows the xusb controller is runtime resumed during probe but
> then after probe the pm-domains, xusba and xusbc, are turned off
> without it being runtime suspended. We never suspend it until it
> is removed currently.

As I said offline on Sat, I suspected that the suppliers (that would
be xusba and xusbc if my understanding of the system is correct) were
suspended prematurely during or after the consumer (xusb) probe, which
previously had been prevented from occurring by the
pm_runtime_get_noresume(supplier) removed by the $subject patch.

What you said above seems to be in agreement with that theory, so it
looks like you simply need to pass DL_FLAG_RPM_ACTICE to
device_link_add() in both cases, as that would prevent the suppliers
from being suspended until the consumer suspends (or its PM-runtime
status is changed to "suspended").  And if the consumer never
suspends, the suppliers will never suspend too then.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance
  2019-02-15 16:44         ` Jon Hunter
  (?)
  (?)
@ 2019-02-18 12:12         ` Rafael J. Wysocki
  2019-02-18 13:02           ` Jon Hunter
  -1 siblings, 1 reply; 28+ messages in thread
From: Rafael J. Wysocki @ 2019-02-18 12:12 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Ulf Hansson, Rafael J. Wysocki, Greg Kroah-Hartman, LKML,
	Linux PM, Daniel Vetter, Lukas Wunner, Andrzej Hajda,
	Russell King - ARM Linux, Lucas Stach, Linus Walleij,
	Thierry Reding, Laurent Pinchart, Marek Szyprowski, linux-tegra

 On Fri, Feb 15, 2019 at 5:44 PM Jon Hunter <jonathanh@nvidia.com> wrote:
>
>
> On 15/02/2019 14:37, Ulf Hansson wrote:
> > On Fri, 15 Feb 2019 at 12:00, Jon Hunter <jonathanh@nvidia.com> wrote:
> >>
> >> Hi Rafael,
> >>
> >> On 12/02/2019 12:08, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>
> >>> If a stateless device link to a certain supplier with
> >>> DL_FLAG_PM_RUNTIME set in the flags is added and then removed by the
> >>> consumer driver's probe callback, the supplier's PM-runtime usage
> >>> counter will be nonzero after that which effectively causes the
> >>> supplier to remain "always on" going forward.
> >>>
> >>> Namely, device_link_add() called to add the link invokes
> >>> device_link_rpm_prepare() which notices that the consumer driver is
> >>> probing, so it increments the supplier's PM-runtime usage counter
> >>> with the assumption that the link will stay around until
> >>> pm_runtime_put_suppliers() is called by driver_probe_device(),
> >>> but if the link goes away before that point, the supplier's
> >>> PM-runtime usage counter will remain nonzero.
> >>>
> >>> To prevent that from happening, first rework pm_runtime_get_suppliers()
> >>> and pm_runtime_put_suppliers() to use the rpm_active refounts of device
> >>> links and make the latter only drop rpm_active and the supplier's
> >>> PM-runtime usage counter for each link by one, unless rpm_active is
> >>> one already for it.  Next, modify device_link_add() to bump up the
> >>> new link's rpm_active refcount and the suppliers PM-runtime usage
> >>> counter by two, to prevent pm_runtime_put_suppliers(), if it is
> >>> called subsequently, from suspending the supplier prematurely (in
> >>> case its PM-runtime usage counter goes down to 0 in there).
> >>>
> >>> Due to the way rpm_put_suppliers() works, this change does not
> >>> affect runtime suspend of the consumer ends of new device links (or,
> >>> generally, device links for which DL_FLAG_PM_RUNTIME has just been
> >>> set).
> >>>
> >>> Fixes: e2f3cd831a28 ("driver core: Fix handling of runtime PM flags in device_link_add()")
> >>> Reported-by: Ulf Hansson <ulf.hansson@linaro.org>
> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>> ---
> >>>
> >>> Note that the issue had been there before commit e2f3cd831a28, but it was
> >>> overlooked by that commit and this change is a fix on top of it, so make
> >>> the Fixes: tag point to commit e2f3cd831a28 (instead of an earlier one
> >>> that the patch will not be applicable to).
> >> I noticed that yesterday's and today's -next were no longer booting on
> >> one of our Tegra boards (Tegra210 Jetson TX2) because networking is
> >> failing. The ethernet chip is a USB device and looking at the bootlogs I
> >> can see that the Tegra XHCI driver is failing ...
> >>
> >>  tegra-xusb 70090000.usb: xHCI host controller not responding, assume dead
> >>  tegra-xusb 70090000.usb: HC died; cleaning up
> >>
> >> The Tegra XHCI driver uses multiple power-domains and uses
> >> device_link_add() to attach them. So now I am wondering if there is
> >> something that we have got wrong in our implementation. However, I don't
> >> see the device being probed deferred on boot or anything like that.
> >>
> >> The driver in question is drivers/usb/host/xhci-tegra.c and we add the
> >> links in the function tegra_xusb_powerdomain_init() which is before RPM
> >> is enabled. Let me know if you have any thoughts.
> >
> > If you are willing to help debugging then I am offering my assistance.
> >
> > I would start by enabling CONFIG_PM_ADVANCED_DEBUG, which gives you
> > some more information about the runtime PM state of the device, like
> > the usage count for example.
> > I would also add a couple of prints in
> > tegra_xusb_runtime_suspend|resume() and in the ->power_on|off()
> > callbacks for the corresponding genpds, to see when those gets called.
>
> From the bootlog I see ...
>
> [    4.445827] tegra_xusb_runtime_resume-788
> [    4.508799] tegra-xusb 70090000.usb: Firmware timestamp: 2015-08-10 09:47:54 UTC

This message comes from tegra_xusb_load_firmware() in
tegra_xusb_probe() which is after the pm_runtime_get_sync().

If the device was PM-runtime-suspended before, the
pm_runtime_get_sync() will runtime-resume and reference-count the
suppliers in addition to resuming the device.  In that case
pm_runtime_put_suppliers() will suspend the suppliers, so there is a
bug in there.

What happens is that the links are new when pm_runtime_get_sync() runs
and so their rpm_active refcounts are one.  After the
pm_runtime_get_sync() they are two and pm_runtime_put_suppliers() will
drop them by one and drop the PM-runtime usage counter of each of them
by one, so they will become zero and the suppliers will suspend.

Passing DL_FLAG_RPM_ACTIVE to device_link_add() should help, but IMO
things should also work without that.

> [    4.516223] tegra-xusb 70090000.usb: xHCI Host Controller
> [    4.521622] tegra-xusb 70090000.usb: new USB bus registered, assigned bus number 1

This comes from usb_add_hcd()

> [    4.530087] tegra-xusb 70090000.usb: hcc params 0x0184f525 hci version 0x100 quirks 0x0000000000010010
> [    4.539398] tegra-xusb 70090000.usb: irq 69, io mem 0x70090000
> [    4.553671] tegra-xusb 70090000.usb: xHCI Host Controller
> [    4.559064] tegra-xusb 70090000.usb: new USB bus registered, assigned bus number 2

Like this.

> [    4.566622] tegra-xusb 70090000.usb: Host supports USB 3.0  SuperSpeed

And this if from xhci_gen_setup(), so probe returns around this point.

> [    4.595393] tegra-pmc: tegra_genpd_power_off-673: xusbc
> [    4.600672] tegra-pmc: tegra_genpd_power_off-673: xusba

And this appears to be done by pm_runtime_put_suppliers().

Hmm, I need to think how to fix this.  Maybe we'll need to revert
$subject patch and do something else, we'll see (later today).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance
  2019-02-18 12:12         ` Rafael J. Wysocki
@ 2019-02-18 13:02           ` Jon Hunter
  2019-02-18 22:14             ` Rafael J. Wysocki
  0 siblings, 1 reply; 28+ messages in thread
From: Jon Hunter @ 2019-02-18 13:02 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Ulf Hansson, Rafael J. Wysocki, Greg Kroah-Hartman, LKML,
	Linux PM, Daniel Vetter, Lukas Wunner, Andrzej Hajda,
	Russell King - ARM Linux, Lucas Stach, Linus Walleij,
	Thierry Reding, Laurent Pinchart, Marek Szyprowski, linux-tegra


On 18/02/2019 12:12, Rafael J. Wysocki wrote:
>  On Fri, Feb 15, 2019 at 5:44 PM Jon Hunter <jonathanh@nvidia.com> wrote:
>>
>>
>> On 15/02/2019 14:37, Ulf Hansson wrote:
>>> On Fri, 15 Feb 2019 at 12:00, Jon Hunter <jonathanh@nvidia.com> wrote:
>>>>
>>>> Hi Rafael,
>>>>
>>>> On 12/02/2019 12:08, Rafael J. Wysocki wrote:
>>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>>>
>>>>> If a stateless device link to a certain supplier with
>>>>> DL_FLAG_PM_RUNTIME set in the flags is added and then removed by the
>>>>> consumer driver's probe callback, the supplier's PM-runtime usage
>>>>> counter will be nonzero after that which effectively causes the
>>>>> supplier to remain "always on" going forward.
>>>>>
>>>>> Namely, device_link_add() called to add the link invokes
>>>>> device_link_rpm_prepare() which notices that the consumer driver is
>>>>> probing, so it increments the supplier's PM-runtime usage counter
>>>>> with the assumption that the link will stay around until
>>>>> pm_runtime_put_suppliers() is called by driver_probe_device(),
>>>>> but if the link goes away before that point, the supplier's
>>>>> PM-runtime usage counter will remain nonzero.
>>>>>
>>>>> To prevent that from happening, first rework pm_runtime_get_suppliers()
>>>>> and pm_runtime_put_suppliers() to use the rpm_active refounts of device
>>>>> links and make the latter only drop rpm_active and the supplier's
>>>>> PM-runtime usage counter for each link by one, unless rpm_active is
>>>>> one already for it.  Next, modify device_link_add() to bump up the
>>>>> new link's rpm_active refcount and the suppliers PM-runtime usage
>>>>> counter by two, to prevent pm_runtime_put_suppliers(), if it is
>>>>> called subsequently, from suspending the supplier prematurely (in
>>>>> case its PM-runtime usage counter goes down to 0 in there).
>>>>>
>>>>> Due to the way rpm_put_suppliers() works, this change does not
>>>>> affect runtime suspend of the consumer ends of new device links (or,
>>>>> generally, device links for which DL_FLAG_PM_RUNTIME has just been
>>>>> set).
>>>>>
>>>>> Fixes: e2f3cd831a28 ("driver core: Fix handling of runtime PM flags in device_link_add()")
>>>>> Reported-by: Ulf Hansson <ulf.hansson@linaro.org>
>>>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>>> ---
>>>>>
>>>>> Note that the issue had been there before commit e2f3cd831a28, but it was
>>>>> overlooked by that commit and this change is a fix on top of it, so make
>>>>> the Fixes: tag point to commit e2f3cd831a28 (instead of an earlier one
>>>>> that the patch will not be applicable to).
>>>> I noticed that yesterday's and today's -next were no longer booting on
>>>> one of our Tegra boards (Tegra210 Jetson TX2) because networking is
>>>> failing. The ethernet chip is a USB device and looking at the bootlogs I
>>>> can see that the Tegra XHCI driver is failing ...
>>>>
>>>>  tegra-xusb 70090000.usb: xHCI host controller not responding, assume dead
>>>>  tegra-xusb 70090000.usb: HC died; cleaning up
>>>>
>>>> The Tegra XHCI driver uses multiple power-domains and uses
>>>> device_link_add() to attach them. So now I am wondering if there is
>>>> something that we have got wrong in our implementation. However, I don't
>>>> see the device being probed deferred on boot or anything like that.
>>>>
>>>> The driver in question is drivers/usb/host/xhci-tegra.c and we add the
>>>> links in the function tegra_xusb_powerdomain_init() which is before RPM
>>>> is enabled. Let me know if you have any thoughts.
>>>
>>> If you are willing to help debugging then I am offering my assistance.
>>>
>>> I would start by enabling CONFIG_PM_ADVANCED_DEBUG, which gives you
>>> some more information about the runtime PM state of the device, like
>>> the usage count for example.
>>> I would also add a couple of prints in
>>> tegra_xusb_runtime_suspend|resume() and in the ->power_on|off()
>>> callbacks for the corresponding genpds, to see when those gets called.
>>
>> From the bootlog I see ...
>>
>> [    4.445827] tegra_xusb_runtime_resume-788
>> [    4.508799] tegra-xusb 70090000.usb: Firmware timestamp: 2015-08-10 09:47:54 UTC
> 
> This message comes from tegra_xusb_load_firmware() in
> tegra_xusb_probe() which is after the pm_runtime_get_sync().
> 
> If the device was PM-runtime-suspended before, the
> pm_runtime_get_sync() will runtime-resume and reference-count the
> suppliers in addition to resuming the device.  In that case
> pm_runtime_put_suppliers() will suspend the suppliers, so there is a
> bug in there.
> 
> What happens is that the links are new when pm_runtime_get_sync() runs
> and so their rpm_active refcounts are one.  After the
> pm_runtime_get_sync() they are two and pm_runtime_put_suppliers() will
> drop them by one and drop the PM-runtime usage counter of each of them
> by one, so they will become zero and the suppliers will suspend.
> 
> Passing DL_FLAG_RPM_ACTIVE to device_link_add() should help, but IMO
> things should also work without that.

I can confirm that DL_FLAG_RPM_ACTIVE does indeed work. I assume though
this would prevent the suppliers from ever being suspended, which maybe
we will want to do eventually.

>> [    4.516223] tegra-xusb 70090000.usb: xHCI Host Controller
>> [    4.521622] tegra-xusb 70090000.usb: new USB bus registered, assigned bus number 1
> 
> This comes from usb_add_hcd()
> 
>> [    4.530087] tegra-xusb 70090000.usb: hcc params 0x0184f525 hci version 0x100 quirks 0x0000000000010010
>> [    4.539398] tegra-xusb 70090000.usb: irq 69, io mem 0x70090000
>> [    4.553671] tegra-xusb 70090000.usb: xHCI Host Controller
>> [    4.559064] tegra-xusb 70090000.usb: new USB bus registered, assigned bus number 2
> 
> Like this.
> 
>> [    4.566622] tegra-xusb 70090000.usb: Host supports USB 3.0  SuperSpeed
> 
> And this if from xhci_gen_setup(), so probe returns around this point.
> 
>> [    4.595393] tegra-pmc: tegra_genpd_power_off-673: xusbc
>> [    4.600672] tegra-pmc: tegra_genpd_power_off-673: xusba
> 
> And this appears to be done by pm_runtime_put_suppliers().
> 
> Hmm, I need to think how to fix this.  Maybe we'll need to revert
> $subject patch and do something else, we'll see (later today).

OK, thanks. Let me know if there is anything else I can test.

Cheers
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance
  2019-02-18 13:02           ` Jon Hunter
@ 2019-02-18 22:14             ` Rafael J. Wysocki
  0 siblings, 0 replies; 28+ messages in thread
From: Rafael J. Wysocki @ 2019-02-18 22:14 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Rafael J. Wysocki, Ulf Hansson, Greg Kroah-Hartman, LKML,
	Linux PM, Daniel Vetter, Lukas Wunner, Andrzej Hajda,
	Russell King - ARM Linux, Lucas Stach, Linus Walleij,
	Thierry Reding, Laurent Pinchart, Marek Szyprowski, linux-tegra

On Monday, February 18, 2019 2:02:50 PM CET Jon Hunter wrote:
> 
> On 18/02/2019 12:12, Rafael J. Wysocki wrote:
> >  On Fri, Feb 15, 2019 at 5:44 PM Jon Hunter <jonathanh@nvidia.com> wrote:
> >>
> >>
> >> On 15/02/2019 14:37, Ulf Hansson wrote:
> >>> On Fri, 15 Feb 2019 at 12:00, Jon Hunter <jonathanh@nvidia.com> wrote:
> >>>>
> >>>> Hi Rafael,
> >>>>
> >>>> On 12/02/2019 12:08, Rafael J. Wysocki wrote:
> >>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>>>
> >>>>> If a stateless device link to a certain supplier with
> >>>>> DL_FLAG_PM_RUNTIME set in the flags is added and then removed by the
> >>>>> consumer driver's probe callback, the supplier's PM-runtime usage
> >>>>> counter will be nonzero after that which effectively causes the
> >>>>> supplier to remain "always on" going forward.
> >>>>>
> >>>>> Namely, device_link_add() called to add the link invokes
> >>>>> device_link_rpm_prepare() which notices that the consumer driver is
> >>>>> probing, so it increments the supplier's PM-runtime usage counter
> >>>>> with the assumption that the link will stay around until
> >>>>> pm_runtime_put_suppliers() is called by driver_probe_device(),
> >>>>> but if the link goes away before that point, the supplier's
> >>>>> PM-runtime usage counter will remain nonzero.
> >>>>>
> >>>>> To prevent that from happening, first rework pm_runtime_get_suppliers()
> >>>>> and pm_runtime_put_suppliers() to use the rpm_active refounts of device
> >>>>> links and make the latter only drop rpm_active and the supplier's
> >>>>> PM-runtime usage counter for each link by one, unless rpm_active is
> >>>>> one already for it.  Next, modify device_link_add() to bump up the
> >>>>> new link's rpm_active refcount and the suppliers PM-runtime usage
> >>>>> counter by two, to prevent pm_runtime_put_suppliers(), if it is
> >>>>> called subsequently, from suspending the supplier prematurely (in
> >>>>> case its PM-runtime usage counter goes down to 0 in there).
> >>>>>
> >>>>> Due to the way rpm_put_suppliers() works, this change does not
> >>>>> affect runtime suspend of the consumer ends of new device links (or,
> >>>>> generally, device links for which DL_FLAG_PM_RUNTIME has just been
> >>>>> set).
> >>>>>
> >>>>> Fixes: e2f3cd831a28 ("driver core: Fix handling of runtime PM flags in device_link_add()")
> >>>>> Reported-by: Ulf Hansson <ulf.hansson@linaro.org>
> >>>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>>> ---
> >>>>>
> >>>>> Note that the issue had been there before commit e2f3cd831a28, but it was
> >>>>> overlooked by that commit and this change is a fix on top of it, so make
> >>>>> the Fixes: tag point to commit e2f3cd831a28 (instead of an earlier one
> >>>>> that the patch will not be applicable to).
> >>>> I noticed that yesterday's and today's -next were no longer booting on
> >>>> one of our Tegra boards (Tegra210 Jetson TX2) because networking is
> >>>> failing. The ethernet chip is a USB device and looking at the bootlogs I
> >>>> can see that the Tegra XHCI driver is failing ...
> >>>>
> >>>>  tegra-xusb 70090000.usb: xHCI host controller not responding, assume dead
> >>>>  tegra-xusb 70090000.usb: HC died; cleaning up
> >>>>
> >>>> The Tegra XHCI driver uses multiple power-domains and uses
> >>>> device_link_add() to attach them. So now I am wondering if there is
> >>>> something that we have got wrong in our implementation. However, I don't
> >>>> see the device being probed deferred on boot or anything like that.
> >>>>
> >>>> The driver in question is drivers/usb/host/xhci-tegra.c and we add the
> >>>> links in the function tegra_xusb_powerdomain_init() which is before RPM
> >>>> is enabled. Let me know if you have any thoughts.
> >>>
> >>> If you are willing to help debugging then I am offering my assistance.
> >>>
> >>> I would start by enabling CONFIG_PM_ADVANCED_DEBUG, which gives you
> >>> some more information about the runtime PM state of the device, like
> >>> the usage count for example.
> >>> I would also add a couple of prints in
> >>> tegra_xusb_runtime_suspend|resume() and in the ->power_on|off()
> >>> callbacks for the corresponding genpds, to see when those gets called.
> >>
> >> From the bootlog I see ...
> >>
> >> [    4.445827] tegra_xusb_runtime_resume-788
> >> [    4.508799] tegra-xusb 70090000.usb: Firmware timestamp: 2015-08-10 09:47:54 UTC
> > 
> > This message comes from tegra_xusb_load_firmware() in
> > tegra_xusb_probe() which is after the pm_runtime_get_sync().
> > 
> > If the device was PM-runtime-suspended before, the
> > pm_runtime_get_sync() will runtime-resume and reference-count the
> > suppliers in addition to resuming the device.  In that case
> > pm_runtime_put_suppliers() will suspend the suppliers, so there is a
> > bug in there.
> > 
> > What happens is that the links are new when pm_runtime_get_sync() runs
> > and so their rpm_active refcounts are one.  After the
> > pm_runtime_get_sync() they are two and pm_runtime_put_suppliers() will
> > drop them by one and drop the PM-runtime usage counter of each of them
> > by one, so they will become zero and the suppliers will suspend.
> > 
> > Passing DL_FLAG_RPM_ACTIVE to device_link_add() should help, but IMO
> > things should also work without that.
> 
> I can confirm that DL_FLAG_RPM_ACTIVE does indeed work. I assume though
> this would prevent the suppliers from ever being suspended,

No, it wouldn't.  It only prevents the supplier from suspending as long as
the consumer remains active.

In your case it shouldn't matter (it only matters because of the bug in
the $subject patch), because the consumer is suspended to start with.

Had it been active initially, though, passing DL_FLAG_RPM_ACTIVE to
device_link_add() would have been the only way to to force the supplier to
remain active accross the return from consumer probe until the consumer is
suspended.

> which maybe
> we will want to do eventually.
> 
> >> [    4.516223] tegra-xusb 70090000.usb: xHCI Host Controller
> >> [    4.521622] tegra-xusb 70090000.usb: new USB bus registered, assigned bus number 1
> > 
> > This comes from usb_add_hcd()
> > 
> >> [    4.530087] tegra-xusb 70090000.usb: hcc params 0x0184f525 hci version 0x100 quirks 0x0000000000010010
> >> [    4.539398] tegra-xusb 70090000.usb: irq 69, io mem 0x70090000
> >> [    4.553671] tegra-xusb 70090000.usb: xHCI Host Controller
> >> [    4.559064] tegra-xusb 70090000.usb: new USB bus registered, assigned bus number 2
> > 
> > Like this.
> > 
> >> [    4.566622] tegra-xusb 70090000.usb: Host supports USB 3.0  SuperSpeed
> > 
> > And this if from xhci_gen_setup(), so probe returns around this point.
> > 
> >> [    4.595393] tegra-pmc: tegra_genpd_power_off-673: xusbc
> >> [    4.600672] tegra-pmc: tegra_genpd_power_off-673: xusba
> > 
> > And this appears to be done by pm_runtime_put_suppliers().
> > 
> > Hmm, I need to think how to fix this.  Maybe we'll need to revert
> > $subject patch and do something else, we'll see (later today).
> 
> OK, thanks. Let me know if there is anything else I can test.

Please test https://patchwork.kernel.org/patch/10818923/

Cheers,
Rafael

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2019-02-18 22:14 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-12 12:01 [PATCH 0/2] driver core: Fixes related to device links Rafael J. Wysocki
2019-02-12 12:04 ` [PATCH 1/2] PM-runtime: Fix __pm_runtime_set_status() race with runtime resume Rafael J. Wysocki
2019-02-12 16:17   ` Ulf Hansson
2019-02-12 16:28     ` Rafael J. Wysocki
2019-02-12 18:27       ` Ulf Hansson
2019-02-12 19:05         ` Rafael J. Wysocki
2019-02-12 20:14   ` Ulf Hansson
2019-02-12 12:08 ` [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance Rafael J. Wysocki
2019-02-12 21:02   ` Ulf Hansson
2019-02-12 22:08     ` Rafael J. Wysocki
2019-02-15 11:00   ` Jon Hunter
2019-02-15 11:57     ` Rafael J. Wysocki
2019-02-15 12:06     ` Rafael J. Wysocki
2019-02-15 13:21       ` Jon Hunter
2019-02-15 14:14         ` Jon Hunter
2019-02-15 14:37     ` Ulf Hansson
2019-02-15 16:44       ` Jon Hunter
2019-02-15 16:44         ` Jon Hunter
2019-02-17 21:33         ` Rafael J. Wysocki
2019-02-18 12:12         ` Rafael J. Wysocki
2019-02-18 13:02           ` Jon Hunter
2019-02-18 22:14             ` Rafael J. Wysocki
2019-02-12 14:09 ` [PATCH 0/2] driver core: Fixes related to device links Greg Kroah-Hartman
2019-02-12 14:52   ` Ulf Hansson
2019-02-12 15:04     ` Rafael J. Wysocki
2019-02-12 15:06     ` Greg Kroah-Hartman
2019-02-12 16:20       ` Rafael J. Wysocki
2019-02-12 16:20         ` Rafael J. Wysocki

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.