linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/3] driver core: bus: introduce can_remove()
@ 2024-02-02 22:25 Hamza Mahfooz
  2024-02-02 22:25 ` [PATCH 2/3] PCI: " Hamza Mahfooz
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Hamza Mahfooz @ 2024-02-02 22:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Hamza Mahfooz, stable, Greg Kroah-Hartman, Rafael J. Wysocki,
	Alex Deucher, Christian König, Pan, Xinhui, David Airlie,
	Daniel Vetter, Bjorn Helgaas, Mario Limonciello, Lijo Lazar,
	Srinivasan Shanmugam, Le Ma, André Almeida, James Zhu,
	Aurabindo Pillai, Joerg Roedel, Iwona Winiarska, Robin Murphy,
	amd-gfx, dri-devel, linux-pci

Currently, drivers have no mechanism to block requests to unbind
devices. However, this can cause resource leaks and leave the device in
an inconsistent state, such that rebinding the device may cause a hang
or otherwise prevent the device from being rebound. So, introduce
the can_remove() callback to allow drivers to indicate if it isn't
appropriate to remove a device at the given time.

Cc: stable@vger.kernel.org
Signed-off-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
---
 drivers/base/bus.c         | 4 ++++
 include/linux/device/bus.h | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index daee55c9b2d9..7c259b01ea99 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -239,6 +239,10 @@ static ssize_t unbind_store(struct device_driver *drv, const char *buf,
 
 	dev = bus_find_device_by_name(bus, NULL, buf);
 	if (dev && dev->driver == drv) {
+		if (dev->bus && dev->bus->can_remove &&
+		    !dev->bus->can_remove(dev))
+			return -EBUSY;
+
 		device_driver_detach(dev);
 		err = count;
 	}
diff --git a/include/linux/device/bus.h b/include/linux/device/bus.h
index 5ef4ec1c36c3..c9d4af0ed3b8 100644
--- a/include/linux/device/bus.h
+++ b/include/linux/device/bus.h
@@ -46,6 +46,7 @@ struct fwnode_handle;
  *		be called at late_initcall_sync level. If the device has
  *		consumers that are never bound to a driver, this function
  *		will never get called until they do.
+ * @can_remove: Called before attempting to remove a device from this bus.
  * @remove:	Called when a device removed from this bus.
  * @shutdown:	Called at shut-down time to quiesce the device.
  *
@@ -85,6 +86,7 @@ struct bus_type {
 	int (*uevent)(const struct device *dev, struct kobj_uevent_env *env);
 	int (*probe)(struct device *dev);
 	void (*sync_state)(struct device *dev);
+	bool (*can_remove)(struct device *dev);
 	void (*remove)(struct device *dev);
 	void (*shutdown)(struct device *dev);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/3] PCI: introduce can_remove()
  2024-02-02 22:25 [PATCH 1/3] driver core: bus: introduce can_remove() Hamza Mahfooz
@ 2024-02-02 22:25 ` Hamza Mahfooz
  2024-02-02 23:38   ` Greg Kroah-Hartman
  2024-02-02 22:25 ` [PATCH 3/3] drm/amdgpu: wire up the can_remove() callback Hamza Mahfooz
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Hamza Mahfooz @ 2024-02-02 22:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Hamza Mahfooz, stable, Greg Kroah-Hartman, Rafael J. Wysocki,
	Alex Deucher, Christian König, Pan, Xinhui, David Airlie,
	Daniel Vetter, Bjorn Helgaas, Mario Limonciello, Lijo Lazar,
	Srinivasan Shanmugam, Le Ma, André Almeida, James Zhu,
	Aurabindo Pillai, Jason Gunthorpe, Wei Liu, Robin Murphy,
	amd-gfx, dri-devel, linux-pci

Wire up the can_remove() callback, such that pci drivers can implement
their own version of it.

Cc: stable@vger.kernel.org
Signed-off-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
---
 drivers/pci/pci-driver.c | 12 ++++++++++++
 include/linux/pci.h      |  5 +++++
 2 files changed, 17 insertions(+)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 51ec9e7e784f..8aae484c5494 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -466,6 +466,17 @@ static int pci_device_probe(struct device *dev)
 	return error;
 }
 
+static bool pci_device_can_remove(struct device *dev)
+{
+	struct pci_dev *pci_dev = to_pci_dev(dev);
+	struct pci_driver *drv = pci_dev->driver;
+
+	if (drv->can_remove)
+		return drv->can_remove(pci_dev);
+
+	return true;
+}
+
 static void pci_device_remove(struct device *dev)
 {
 	struct pci_dev *pci_dev = to_pci_dev(dev);
@@ -1680,6 +1691,7 @@ struct bus_type pci_bus_type = {
 	.match		= pci_bus_match,
 	.uevent		= pci_uevent,
 	.probe		= pci_device_probe,
+	.can_remove	= pci_device_can_remove,
 	.remove		= pci_device_remove,
 	.shutdown	= pci_device_shutdown,
 	.dev_groups	= pci_dev_groups,
diff --git a/include/linux/pci.h b/include/linux/pci.h
index add9368e6314..95276f44b23b 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -902,6 +902,10 @@ struct module;
  *		(negative number) otherwise.
  *		The probe function always gets called from process
  *		context, so it can sleep.
+ * @can_remove:	The can_remove() function gets called during driver
+ *		deregistration to determine if remove() can be called.
+ *		The probe function always gets called from process
+ *		context, so it can sleep.
  * @remove:	The remove() function gets called whenever a device
  *		being handled by this driver is removed (either during
  *		deregistration of the driver or when it's manually
@@ -943,6 +947,7 @@ struct pci_driver {
 	const char		*name;
 	const struct pci_device_id *id_table;	/* Must be non-NULL for probe to be called */
 	int  (*probe)(struct pci_dev *dev, const struct pci_device_id *id);	/* New device inserted */
+	bool (*can_remove)(struct pci_dev *dev);
 	void (*remove)(struct pci_dev *dev);	/* Device removed (NULL if not a hot-plug capable driver) */
 	int  (*suspend)(struct pci_dev *dev, pm_message_t state);	/* Device suspended */
 	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 3/3] drm/amdgpu: wire up the can_remove() callback
  2024-02-02 22:25 [PATCH 1/3] driver core: bus: introduce can_remove() Hamza Mahfooz
  2024-02-02 22:25 ` [PATCH 2/3] PCI: " Hamza Mahfooz
@ 2024-02-02 22:25 ` Hamza Mahfooz
  2024-02-02 22:41   ` Bjorn Helgaas
                     ` (2 more replies)
  2024-02-02 23:38 ` [PATCH 1/3] driver core: bus: introduce can_remove() Greg Kroah-Hartman
  2024-02-05  8:48 ` Christian König
  3 siblings, 3 replies; 12+ messages in thread
From: Hamza Mahfooz @ 2024-02-02 22:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Hamza Mahfooz, stable, Greg Kroah-Hartman, Rafael J. Wysocki,
	Alex Deucher, Christian König, Pan, Xinhui, David Airlie,
	Daniel Vetter, Bjorn Helgaas, Mario Limonciello, Lijo Lazar,
	Srinivasan Shanmugam, Le Ma, André Almeida, James Zhu,
	Aurabindo Pillai, Alex Shi, Jerry Snitselaar, Wei Liu,
	Robin Murphy, amd-gfx, dri-devel, linux-pci

Removing an amdgpu device that still has user space references allocated
to it causes undefined behaviour. So, implement amdgpu_pci_can_remove()
and disallow devices that still have files allocated to them from being
unbound.

Cc: stable@vger.kernel.org
Signed-off-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index cc69005f5b46..cfa64f3c5be5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2323,6 +2323,22 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
 	return ret;
 }
 
+static bool amdgpu_pci_can_remove(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+
+	mutex_lock(&dev->filelist_mutex);
+
+	if (!list_empty(&dev->filelist)) {
+		mutex_unlock(&dev->filelist_mutex);
+		return false;
+	}
+
+	mutex_unlock(&dev->filelist_mutex);
+
+	return true;
+}
+
 static void
 amdgpu_pci_remove(struct pci_dev *pdev)
 {
@@ -2929,6 +2945,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
 	.name = DRIVER_NAME,
 	.id_table = pciidlist,
 	.probe = amdgpu_pci_probe,
+	.can_remove = amdgpu_pci_can_remove,
 	.remove = amdgpu_pci_remove,
 	.shutdown = amdgpu_pci_shutdown,
 	.driver.pm = &amdgpu_pm_ops,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: wire up the can_remove() callback
  2024-02-02 22:25 ` [PATCH 3/3] drm/amdgpu: wire up the can_remove() callback Hamza Mahfooz
@ 2024-02-02 22:41   ` Bjorn Helgaas
  2024-02-02 23:40   ` Greg Kroah-Hartman
  2024-02-02 23:41   ` Greg Kroah-Hartman
  2 siblings, 0 replies; 12+ messages in thread
From: Bjorn Helgaas @ 2024-02-02 22:41 UTC (permalink / raw)
  To: Hamza Mahfooz
  Cc: linux-kernel, stable, Greg Kroah-Hartman, Rafael J. Wysocki,
	Alex Deucher, Christian König, Pan, Xinhui, David Airlie,
	Daniel Vetter, Bjorn Helgaas, Mario Limonciello, Lijo Lazar,
	Srinivasan Shanmugam, Le Ma, André Almeida, James Zhu,
	Aurabindo Pillai, Alex Shi, Jerry Snitselaar, Wei Liu,
	Robin Murphy, amd-gfx, dri-devel, linux-pci, Bartosz Golaszewski

[+cc Bartosz]

On Fri, Feb 02, 2024 at 05:25:56PM -0500, Hamza Mahfooz wrote:
> Removing an amdgpu device that still has user space references allocated
> to it causes undefined behaviour. So, implement amdgpu_pci_can_remove()
> and disallow devices that still have files allocated to them from being
> unbound.

Maybe this would help for things that are completely built-in or
soldered down, but nothing can prevent a user from physically pulling
a card or cable, so I don't think this is a generic solution to the
problem of dangling user space references.

Maybe Bartosz's recent LPC talk is relevant:
https://lpc.events/event/17/contributions/1627/

> Cc: stable@vger.kernel.org
> Signed-off-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index cc69005f5b46..cfa64f3c5be5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2323,6 +2323,22 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
>  	return ret;
>  }
>  
> +static bool amdgpu_pci_can_remove(struct pci_dev *pdev)
> +{
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +
> +	mutex_lock(&dev->filelist_mutex);
> +
> +	if (!list_empty(&dev->filelist)) {
> +		mutex_unlock(&dev->filelist_mutex);
> +		return false;
> +	}
> +
> +	mutex_unlock(&dev->filelist_mutex);
> +
> +	return true;
> +}
> +
>  static void
>  amdgpu_pci_remove(struct pci_dev *pdev)
>  {
> @@ -2929,6 +2945,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
>  	.name = DRIVER_NAME,
>  	.id_table = pciidlist,
>  	.probe = amdgpu_pci_probe,
> +	.can_remove = amdgpu_pci_can_remove,
>  	.remove = amdgpu_pci_remove,
>  	.shutdown = amdgpu_pci_shutdown,
>  	.driver.pm = &amdgpu_pm_ops,
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] driver core: bus: introduce can_remove()
  2024-02-02 22:25 [PATCH 1/3] driver core: bus: introduce can_remove() Hamza Mahfooz
  2024-02-02 22:25 ` [PATCH 2/3] PCI: " Hamza Mahfooz
  2024-02-02 22:25 ` [PATCH 3/3] drm/amdgpu: wire up the can_remove() callback Hamza Mahfooz
@ 2024-02-02 23:38 ` Greg Kroah-Hartman
  2024-02-05  8:48 ` Christian König
  3 siblings, 0 replies; 12+ messages in thread
From: Greg Kroah-Hartman @ 2024-02-02 23:38 UTC (permalink / raw)
  To: Hamza Mahfooz
  Cc: linux-kernel, stable, Rafael J. Wysocki, Alex Deucher,
	Christian König, Pan, Xinhui, David Airlie, Daniel Vetter,
	Bjorn Helgaas, Mario Limonciello, Lijo Lazar,
	Srinivasan Shanmugam, Le Ma, André Almeida, James Zhu,
	Aurabindo Pillai, Joerg Roedel, Iwona Winiarska, Robin Murphy,
	amd-gfx, dri-devel, linux-pci

On Fri, Feb 02, 2024 at 05:25:54PM -0500, Hamza Mahfooz wrote:
> Currently, drivers have no mechanism to block requests to unbind
> devices.

And that is by design.

> However, this can cause resource leaks and leave the device in
> an inconsistent state, such that rebinding the device may cause a hang
> or otherwise prevent the device from being rebound.

That is a driver bug, please fix your driver.

> So, introduce the can_remove() callback to allow drivers to indicate
> if it isn't appropriate to remove a device at the given time.

Nope, sorry, the driver needs to be fixed.

What broken driver are you needing this for?

Also realize, only root can unbind drivers (and it can also unload
modules), so if the root user really wants to do this, it can, and
should be possible to.

sorry,

greg k-h

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/3] PCI: introduce can_remove()
  2024-02-02 22:25 ` [PATCH 2/3] PCI: " Hamza Mahfooz
@ 2024-02-02 23:38   ` Greg Kroah-Hartman
  0 siblings, 0 replies; 12+ messages in thread
From: Greg Kroah-Hartman @ 2024-02-02 23:38 UTC (permalink / raw)
  To: Hamza Mahfooz
  Cc: linux-kernel, stable, Rafael J. Wysocki, Alex Deucher,
	Christian König, Pan, Xinhui, David Airlie, Daniel Vetter,
	Bjorn Helgaas, Mario Limonciello, Lijo Lazar,
	Srinivasan Shanmugam, Le Ma, André Almeida, James Zhu,
	Aurabindo Pillai, Jason Gunthorpe, Wei Liu, Robin Murphy,
	amd-gfx, dri-devel, linux-pci

On Fri, Feb 02, 2024 at 05:25:55PM -0500, Hamza Mahfooz wrote:
> Wire up the can_remove() callback, such that pci drivers can implement
> their own version of it.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
> ---

Again, sorry, nope, not allowed.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: wire up the can_remove() callback
  2024-02-02 22:25 ` [PATCH 3/3] drm/amdgpu: wire up the can_remove() callback Hamza Mahfooz
  2024-02-02 22:41   ` Bjorn Helgaas
@ 2024-02-02 23:40   ` Greg Kroah-Hartman
  2024-02-06 14:29     ` Daniel Vetter
  2024-02-02 23:41   ` Greg Kroah-Hartman
  2 siblings, 1 reply; 12+ messages in thread
From: Greg Kroah-Hartman @ 2024-02-02 23:40 UTC (permalink / raw)
  To: Hamza Mahfooz
  Cc: linux-kernel, stable, Rafael J. Wysocki, Alex Deucher,
	Christian König, Pan, Xinhui, David Airlie, Daniel Vetter,
	Bjorn Helgaas, Mario Limonciello, Lijo Lazar,
	Srinivasan Shanmugam, Le Ma, André Almeida, James Zhu,
	Aurabindo Pillai, Alex Shi, Jerry Snitselaar, Wei Liu,
	Robin Murphy, amd-gfx, dri-devel, linux-pci

On Fri, Feb 02, 2024 at 05:25:56PM -0500, Hamza Mahfooz wrote:
> Removing an amdgpu device that still has user space references allocated
> to it causes undefined behaviour.

Then fix that please.  There should not be anything special about your
hardware that all of the tens of thousands of other devices can't handle
today.

What happens when I yank your device out of a system with a pci hotplug
bus?  You can't prevent that either, so this should not be any different
at all.

sorry, but please, just fix your driver.

greg k-h

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: wire up the can_remove() callback
  2024-02-02 22:25 ` [PATCH 3/3] drm/amdgpu: wire up the can_remove() callback Hamza Mahfooz
  2024-02-02 22:41   ` Bjorn Helgaas
  2024-02-02 23:40   ` Greg Kroah-Hartman
@ 2024-02-02 23:41   ` Greg Kroah-Hartman
  2 siblings, 0 replies; 12+ messages in thread
From: Greg Kroah-Hartman @ 2024-02-02 23:41 UTC (permalink / raw)
  To: Hamza Mahfooz
  Cc: linux-kernel, stable, Rafael J. Wysocki, Alex Deucher,
	Christian König, Pan, Xinhui, David Airlie, Daniel Vetter,
	Bjorn Helgaas, Mario Limonciello, Lijo Lazar,
	Srinivasan Shanmugam, Le Ma, André Almeida, James Zhu,
	Aurabindo Pillai, Alex Shi, Jerry Snitselaar, Wei Liu,
	Robin Murphy, amd-gfx, dri-devel, linux-pci

On Fri, Feb 02, 2024 at 05:25:56PM -0500, Hamza Mahfooz wrote:
> Removing an amdgpu device that still has user space references allocated
> to it causes undefined behaviour. So, implement amdgpu_pci_can_remove()
> and disallow devices that still have files allocated to them from being
> unbound.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index cc69005f5b46..cfa64f3c5be5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2323,6 +2323,22 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
>  	return ret;
>  }
>  
> +static bool amdgpu_pci_can_remove(struct pci_dev *pdev)
> +{
> +	struct drm_device *dev = pci_get_drvdata(pdev);
> +
> +	mutex_lock(&dev->filelist_mutex);
> +
> +	if (!list_empty(&dev->filelist)) {
> +		mutex_unlock(&dev->filelist_mutex);
> +		return false;
> +	}
> +
> +	mutex_unlock(&dev->filelist_mutex);
> +
> +	return true;

Also, to be pedantic, this will not work as right after you returned
"true" here, userspace could open a file, causing the same issue you are
trying to prevent to have happen, happen.

So even if we wanted to do this, which again, we do not, this isn't even
a solution for it because it will still cause you problems.

greg k-h

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] driver core: bus: introduce can_remove()
  2024-02-02 22:25 [PATCH 1/3] driver core: bus: introduce can_remove() Hamza Mahfooz
                   ` (2 preceding siblings ...)
  2024-02-02 23:38 ` [PATCH 1/3] driver core: bus: introduce can_remove() Greg Kroah-Hartman
@ 2024-02-05  8:48 ` Christian König
  3 siblings, 0 replies; 12+ messages in thread
From: Christian König @ 2024-02-05  8:48 UTC (permalink / raw)
  To: Hamza Mahfooz, linux-kernel
  Cc: stable, Greg Kroah-Hartman, Rafael J. Wysocki, Alex Deucher,
	Christian König, Pan, Xinhui, David Airlie, Daniel Vetter,
	Bjorn Helgaas, Mario Limonciello, Lijo Lazar,
	Srinivasan Shanmugam, Le Ma, André Almeida, James Zhu,
	Aurabindo Pillai, Joerg Roedel, Iwona Winiarska, Robin Murphy,
	amd-gfx, dri-devel, linux-pci

Am 02.02.24 um 23:25 schrieb Hamza Mahfooz:
> Currently, drivers have no mechanism to block requests to unbind
> devices. However, this can cause resource leaks and leave the device in
> an inconsistent state, such that rebinding the device may cause a hang
> or otherwise prevent the device from being rebound. So, introduce
> the can_remove() callback to allow drivers to indicate if it isn't
> appropriate to remove a device at the given time.

Well that is nonsense. When you physically remove a device (e.g. unplug 
it) then there is nothing in software you can do to prevent that.

Regards,
Christian.

>
> Cc: stable@vger.kernel.org
> Signed-off-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
> ---
>   drivers/base/bus.c         | 4 ++++
>   include/linux/device/bus.h | 2 ++
>   2 files changed, 6 insertions(+)
>
> diff --git a/drivers/base/bus.c b/drivers/base/bus.c
> index daee55c9b2d9..7c259b01ea99 100644
> --- a/drivers/base/bus.c
> +++ b/drivers/base/bus.c
> @@ -239,6 +239,10 @@ static ssize_t unbind_store(struct device_driver *drv, const char *buf,
>   
>   	dev = bus_find_device_by_name(bus, NULL, buf);
>   	if (dev && dev->driver == drv) {
> +		if (dev->bus && dev->bus->can_remove &&
> +		    !dev->bus->can_remove(dev))
> +			return -EBUSY;
> +
>   		device_driver_detach(dev);
>   		err = count;
>   	}
> diff --git a/include/linux/device/bus.h b/include/linux/device/bus.h
> index 5ef4ec1c36c3..c9d4af0ed3b8 100644
> --- a/include/linux/device/bus.h
> +++ b/include/linux/device/bus.h
> @@ -46,6 +46,7 @@ struct fwnode_handle;
>    *		be called at late_initcall_sync level. If the device has
>    *		consumers that are never bound to a driver, this function
>    *		will never get called until they do.
> + * @can_remove: Called before attempting to remove a device from this bus.
>    * @remove:	Called when a device removed from this bus.
>    * @shutdown:	Called at shut-down time to quiesce the device.
>    *
> @@ -85,6 +86,7 @@ struct bus_type {
>   	int (*uevent)(const struct device *dev, struct kobj_uevent_env *env);
>   	int (*probe)(struct device *dev);
>   	void (*sync_state)(struct device *dev);
> +	bool (*can_remove)(struct device *dev);
>   	void (*remove)(struct device *dev);
>   	void (*shutdown)(struct device *dev);
>   


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: wire up the can_remove() callback
  2024-02-02 23:40   ` Greg Kroah-Hartman
@ 2024-02-06 14:29     ` Daniel Vetter
  2024-02-06 18:42       ` Christian König
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel Vetter @ 2024-02-06 14:29 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Hamza Mahfooz, linux-kernel, stable, Rafael J. Wysocki,
	Alex Deucher, Christian König, Pan, Xinhui, David Airlie,
	Daniel Vetter, Bjorn Helgaas, Mario Limonciello, Lijo Lazar,
	Srinivasan Shanmugam, Le Ma, André Almeida, James Zhu,
	Aurabindo Pillai, Alex Shi, Jerry Snitselaar, Wei Liu,
	Robin Murphy, amd-gfx, dri-devel, linux-pci

On Fri, Feb 02, 2024 at 03:40:03PM -0800, Greg Kroah-Hartman wrote:
> On Fri, Feb 02, 2024 at 05:25:56PM -0500, Hamza Mahfooz wrote:
> > Removing an amdgpu device that still has user space references allocated
> > to it causes undefined behaviour.
> 
> Then fix that please.  There should not be anything special about your
> hardware that all of the tens of thousands of other devices can't handle
> today.
> 
> What happens when I yank your device out of a system with a pci hotplug
> bus?  You can't prevent that either, so this should not be any different
> at all.
> 
> sorry, but please, just fix your driver.

fwiw Christian König from amd already rejected this too, I have no idea
why this was submitted since the very elaborate plan I developed with a
bunch of amd folks was to fix the various lifetime lolz we still have in
drm. We unfortunately export the world of internal objects to userspace as
uabi objects with dma_buf, dma_fence and everything else, but it's all
fixable and we have the plan even documented:

https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#device-hot-unplug

So yeah anything that isn't that plan of record is very much no-go for drm
drivers. Unless we change that plan of course, but that needs a
documentation patch first and a big discussion.

Aside from an absolute massive pile of kernel-internal refcounting bugs
the really big one we agreed on after a lot of discussion is that SIGBUS
on dma-buf mmaps is no-go for drm drivers, because it would break way too
much userspace in ways which are simply not fixable (since sig handlers
are shared in a process, which means the gl/vk driver cannot use it).

Otherwise it's bog standard "fix the kernel bugs" work, just a lot of it.

Cheers, Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: wire up the can_remove() callback
  2024-02-06 14:29     ` Daniel Vetter
@ 2024-02-06 18:42       ` Christian König
  2024-02-09 11:00         ` Daniel Vetter
  0 siblings, 1 reply; 12+ messages in thread
From: Christian König @ 2024-02-06 18:42 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Hamza Mahfooz, linux-kernel, stable,
	Rafael J. Wysocki, Alex Deucher, Christian König, Pan,
	Xinhui, David Airlie, Bjorn Helgaas, Mario Limonciello,
	Lijo Lazar, Srinivasan Shanmugam, Le Ma, André Almeida,
	James Zhu, Aurabindo Pillai, Alex Shi, Jerry Snitselaar, Wei Liu,
	Robin Murphy, amd-gfx, dri-devel, linux-pci

Am 06.02.24 um 15:29 schrieb Daniel Vetter:
> On Fri, Feb 02, 2024 at 03:40:03PM -0800, Greg Kroah-Hartman wrote:
>> On Fri, Feb 02, 2024 at 05:25:56PM -0500, Hamza Mahfooz wrote:
>>> Removing an amdgpu device that still has user space references allocated
>>> to it causes undefined behaviour.
>> Then fix that please.  There should not be anything special about your
>> hardware that all of the tens of thousands of other devices can't handle
>> today.
>>
>> What happens when I yank your device out of a system with a pci hotplug
>> bus?  You can't prevent that either, so this should not be any different
>> at all.
>>
>> sorry, but please, just fix your driver.
> fwiw Christian König from amd already rejected this too, I have no idea
> why this was submitted

Well that was my fault.

I commented on an internal bug tracker that when sysfs bind/undbind is a 
different code path from PCI remove/re-scan we could try to reject it.

Turned out it isn't a different code path.

>   since the very elaborate plan I developed with a
> bunch of amd folks was to fix the various lifetime lolz we still have in
> drm. We unfortunately export the world of internal objects to userspace as
> uabi objects with dma_buf, dma_fence and everything else, but it's all
> fixable and we have the plan even documented:
>
> https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#device-hot-unplug
>
> So yeah anything that isn't that plan of record is very much no-go for drm
> drivers. Unless we change that plan of course, but that needs a
> documentation patch first and a big discussion.
>
> Aside from an absolute massive pile of kernel-internal refcounting bugs
> the really big one we agreed on after a lot of discussion is that SIGBUS
> on dma-buf mmaps is no-go for drm drivers, because it would break way too
> much userspace in ways which are simply not fixable (since sig handlers
> are shared in a process, which means the gl/vk driver cannot use it).
>
> Otherwise it's bog standard "fix the kernel bugs" work, just a lot of it.

Ignoring a few memory leaks because of messed up refcounting we actually 
got that working quite nicely.

At least hot unplug / hot add seems to be working rather reliable in our 
internal testing.

So it can't be that messed up.

Regards,
Christian.

>
> Cheers, Sima


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: wire up the can_remove() callback
  2024-02-06 18:42       ` Christian König
@ 2024-02-09 11:00         ` Daniel Vetter
  0 siblings, 0 replies; 12+ messages in thread
From: Daniel Vetter @ 2024-02-09 11:00 UTC (permalink / raw)
  To: Christian König
  Cc: Greg Kroah-Hartman, Hamza Mahfooz, linux-kernel, stable,
	Rafael J. Wysocki, Alex Deucher, Christian König, Pan,
	Xinhui, David Airlie, Bjorn Helgaas, Mario Limonciello,
	Lijo Lazar, Srinivasan Shanmugam, Le Ma, André Almeida,
	James Zhu, Aurabindo Pillai, Alex Shi, Jerry Snitselaar, Wei Liu,
	Robin Murphy, amd-gfx, dri-devel, linux-pci

On Tue, Feb 06, 2024 at 07:42:49PM +0100, Christian König wrote:
> Am 06.02.24 um 15:29 schrieb Daniel Vetter:
> > On Fri, Feb 02, 2024 at 03:40:03PM -0800, Greg Kroah-Hartman wrote:
> > > On Fri, Feb 02, 2024 at 05:25:56PM -0500, Hamza Mahfooz wrote:
> > > > Removing an amdgpu device that still has user space references allocated
> > > > to it causes undefined behaviour.
> > > Then fix that please.  There should not be anything special about your
> > > hardware that all of the tens of thousands of other devices can't handle
> > > today.
> > > 
> > > What happens when I yank your device out of a system with a pci hotplug
> > > bus?  You can't prevent that either, so this should not be any different
> > > at all.
> > > 
> > > sorry, but please, just fix your driver.
> > fwiw Christian König from amd already rejected this too, I have no idea
> > why this was submitted
> 
> Well that was my fault.
> 
> I commented on an internal bug tracker that when sysfs bind/undbind is a
> different code path from PCI remove/re-scan we could try to reject it.
> 
> Turned out it isn't a different code path.

Yeah it's exactly the same code, and removing the sysfs stuff means we
cant test hotunplug without physical hotunplugging stuff anymore. So
really not great - if one is buggy so is the other, and sysfs allows us to
control the timing a lot better to hit specific issues.
-Sima

> >   since the very elaborate plan I developed with a
> > bunch of amd folks was to fix the various lifetime lolz we still have in
> > drm. We unfortunately export the world of internal objects to userspace as
> > uabi objects with dma_buf, dma_fence and everything else, but it's all
> > fixable and we have the plan even documented:
> > 
> > https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#device-hot-unplug
> > 
> > So yeah anything that isn't that plan of record is very much no-go for drm
> > drivers. Unless we change that plan of course, but that needs a
> > documentation patch first and a big discussion.
> > 
> > Aside from an absolute massive pile of kernel-internal refcounting bugs
> > the really big one we agreed on after a lot of discussion is that SIGBUS
> > on dma-buf mmaps is no-go for drm drivers, because it would break way too
> > much userspace in ways which are simply not fixable (since sig handlers
> > are shared in a process, which means the gl/vk driver cannot use it).
> > 
> > Otherwise it's bog standard "fix the kernel bugs" work, just a lot of it.
> 
> Ignoring a few memory leaks because of messed up refcounting we actually got
> that working quite nicely.
> 
> At least hot unplug / hot add seems to be working rather reliable in our
> internal testing.
> 
> So it can't be that messed up.
> 
> Regards,
> Christian.
> 
> > 
> > Cheers, Sima
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-02-09 11:00 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-02 22:25 [PATCH 1/3] driver core: bus: introduce can_remove() Hamza Mahfooz
2024-02-02 22:25 ` [PATCH 2/3] PCI: " Hamza Mahfooz
2024-02-02 23:38   ` Greg Kroah-Hartman
2024-02-02 22:25 ` [PATCH 3/3] drm/amdgpu: wire up the can_remove() callback Hamza Mahfooz
2024-02-02 22:41   ` Bjorn Helgaas
2024-02-02 23:40   ` Greg Kroah-Hartman
2024-02-06 14:29     ` Daniel Vetter
2024-02-06 18:42       ` Christian König
2024-02-09 11:00         ` Daniel Vetter
2024-02-02 23:41   ` Greg Kroah-Hartman
2024-02-02 23:38 ` [PATCH 1/3] driver core: bus: introduce can_remove() Greg Kroah-Hartman
2024-02-05  8:48 ` Christian König

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).