[PATCH v1] driver core: check for dead devices before onlining/offlining

* [PATCH v1] driver core: check for dead devices before onlining/offlining
@ 2020-01-20 10:49 David Hildenbrand
  2020-01-24  9:00 ` Greg Kroah-Hartman
  2020-01-24 17:14 ` David Hildenbrand
  0 siblings, 2 replies; 13+ messages in thread
From: David Hildenbrand @ 2020-01-20 10:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Greg Kroah-Hartman, Rafael J. Wysocki,
	Suzuki K Poulose, Saravana Kannan, Heikki Krogerus, Dan Williams,
	Michal Hocko

We can have rare cases where the removal of a device races with
somebody trying to online it (esp. via sysfs). We can simply check
if the device is already removed or getting removed under the dev->lock.

E.g., right now, if memory block devices are removed (remove_memory()),
we do a:

remove_memory() -> lock_device_hotplug() -> mem_hotplug_begin() ->
lock_device() -> dev->dead = true

Somebody coming via sysfs (/sys/devices/system/memory/memoryX/online)
triggers a:

lock_device_hotplug_sysfs() -> device_online() -> lock_device() ...

So if we made it just before the lock_device_hotplug_sysfs() but get
delayed until remove_memory() released all locks, we will continue
taking locks and trying to online the device - which is then a zombie
device.

Note that at least the memory onlining path seems to be protected by
checking if all memory sections are still present (something we can then
get rid of). We do have other sysfs attributes
(e.g., /sys/devices/system/memory/memoryX/valid_zones) that don't do any
such locking yet and might race with memory removal in a similar way. For
these users, we can then do a

device_lock(dev);
if (!device_is_dead(dev)) {
	/* magic /*
}
device_unlock(dev);

Introduce and use device_is_dead() right away.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---

Am I missing any obvious mechanism in the device core that handles
something like this already? (especially also for other sysfs attributes?)

---
 drivers/base/core.c    | 13 +++++++++++--
 drivers/base/dd.c      |  6 +++---
 include/linux/device.h |  1 +
 3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 42a672456432..01cd06eeb513 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2952,7 +2952,9 @@ int device_offline(struct device *dev)
 		return ret;
 
 	device_lock(dev);
-	if (device_supports_offline(dev)) {
+	if (device_is_dead(dev))
+		ret = -ENXIO;
+	else if (device_supports_offline(dev)) {
 		if (dev->offline) {
 			ret = 1;
 		} else {
@@ -2983,7 +2985,9 @@ int device_online(struct device *dev)
 	int ret = 0;
 
 	device_lock(dev);
-	if (device_supports_offline(dev)) {
+	if (device_is_dead(dev))
+		ret = -ENXIO;
+	else if (device_supports_offline(dev)) {
 		if (dev->offline) {
 			ret = dev->bus->online(dev);
 			if (!ret) {
@@ -2999,6 +3003,11 @@ int device_online(struct device *dev)
 	return ret;
 }
 
+bool device_is_dead(struct device *dev)
+{
+	return dev->p->dead;
+}
+
 struct root_device {
 	struct device dev;
 	struct module *owner;
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index b25bcab2a26b..b1d7361720af 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -848,7 +848,7 @@ static void __device_attach_async_helper(void *_dev, async_cookie_t cookie)
 	 * and deferred probe processing happens all at once with
 	 * multiple threads.
 	 */
-	if (dev->p->dead || dev->driver)
+	if (device_is_dead(dev) || dev->driver)
 		goto out_unlock;
 
 	if (dev->parent)
@@ -994,7 +994,7 @@ int device_driver_attach(struct device_driver *drv, struct device *dev)
 	 * If device has been removed or someone has already successfully
 	 * bound a driver before us just skip the driver probe call.
 	 */
-	if (!dev->p->dead && !dev->driver)
+	if (!device_is_dead(dev) && !dev->driver)
 		ret = driver_probe_device(drv, dev);
 
 	__device_driver_unlock(dev, dev->parent);
@@ -1016,7 +1016,7 @@ static void __driver_attach_async_helper(void *_dev, async_cookie_t cookie)
 	 * If device has been removed or someone has already successfully
 	 * bound a driver before us just skip the driver probe call.
 	 */
-	if (!dev->p->dead && !dev->driver)
+	if (!device_is_dead(dev) && !dev->driver)
 		ret = driver_probe_device(drv, dev);
 
 	__device_driver_unlock(dev, dev->parent);
diff --git a/include/linux/device.h b/include/linux/device.h
index ce6db68c3f29..acea6a14daa1 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -832,6 +832,7 @@ extern void unlock_device_hotplug(void);
 extern int lock_device_hotplug_sysfs(void);
 extern int device_offline(struct device *dev);
 extern int device_online(struct device *dev);
+extern bool device_is_dead(struct device *dev);
 extern void set_primary_fwnode(struct device *dev, struct fwnode_handle *fwnode);
 extern void set_secondary_fwnode(struct device *dev, struct fwnode_handle *fwnode);
 void device_set_of_node_from_dev(struct device *dev, const struct device *dev2);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread