amd-gfx.lists.freedesktop.org archive mirror
* [PATCH 1/2] drm/amdkfd: Cleanup IO links during KFD device removal
@ 2022-04-08  8:45 Shuotao Xu
  2022-04-08  8:45 ` [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD Shuotao Xu
  2022-04-12  0:07 ` [PATCH 1/2] drm/amdkfd: Cleanup IO links during KFD device removal Felix Kuehling
  0 siblings, 2 replies; 31+ messages in thread
From: Shuotao Xu @ 2022-04-08  8:45 UTC (permalink / raw)
  To: amd-gfx
  Cc: Mukul Joshi, Andrey.Grodzovsky, Felix.Kuehling, pengc, Lei.Qu,
	Shuotao Xu, Ran.Shu, Ziyue.Yang

Currently, the IO links to a device being removed from the topology
are not cleared. As a result, dangling links are left in the KFD
topology. This patch aims to fix the following:
1. Cleanup all IO links to the device being removed.
2. Ensure that node numbering in sysfs and nodes proximity domain
   values are consistent after the device is removed:
   a. Adding a device and removing a GPU device are made mutually
      exclusive.
   b. The global proximity domain counter is no longer required to be
      an atomic counter. A normal 32-bit counter can be used instead.
3. Update generation_count to let user-mode know that topology has
   changed due to device removal.
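The renumbering described above can be sketched as a small userspace
model of the loop the patch adds in kfd_topology_update_io_links(). This
is an illustrative sketch only; the model_* names are hypothetical
stand-ins, not kernel types, and the fixed-size link array replaces the
kernel's linked lists:

```c
#include <assert.h>

struct model_link {
	int node_from;
	int node_to;
	int valid;		/* 0 once the link is removed */
};

struct model_node {
	int proximity_domain;
	struct model_link links[4];
	int link_count;
};

/*
 * Drop links to the removed proximity domain and renumber the rest,
 * mirroring the per-device loop in the patch (including its else-if
 * chain, which adjusts at most one endpoint per link).
 */
static void model_update_io_links(struct model_node *nodes, int n,
				  int removed_pd)
{
	for (int i = 0; i < n; i++) {
		struct model_node *dev = &nodes[i];

		if (dev->proximity_domain > removed_pd)
			dev->proximity_domain--;

		for (int j = 0; j < dev->link_count; j++) {
			struct model_link *l = &dev->links[j];

			if (!l->valid)
				continue;
			if (l->node_to == removed_pd)
				l->valid = 0;	/* link to deleted node */
			else if (l->node_from > removed_pd)
				l->node_from--;
			else if (l->node_to > removed_pd)
				l->node_to--;
		}
	}
}
```

With nodes 0..2 and node 1 removed, node 2 becomes node 1 and any link
pointing at the old node 1 disappears, which keeps sysfs numbering and
proximity domain values consistent.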

Reviewed-by: Shuotao Xu <shuotaoxu@microsoft.com>
CC: Shuotao Xu <shuotaoxu@microsoft.com>
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_crat.c     |  4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h     |  2 +
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 79 ++++++++++++++++++++---
 3 files changed, 74 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
index 1eaabd2cb41b..afc8a7fcdad8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
@@ -1056,7 +1056,7 @@ static int kfd_parse_subtype_iolink(struct crat_subtype_iolink *iolink,
 	 * table, add corresponded reversed direction link now.
 	 */
 	if (props && (iolink->flags & CRAT_IOLINK_FLAGS_BI_DIRECTIONAL)) {
-		to_dev = kfd_topology_device_by_proximity_domain(id_to);
+		to_dev = kfd_topology_device_by_proximity_domain_no_lock(id_to);
 		if (!to_dev)
 			return -ENODEV;
 		/* same everything but the other direction */
@@ -2225,7 +2225,7 @@ static int kfd_create_vcrat_image_gpu(void *pcrat_image,
 	 */
 	if (kdev->hive_id) {
 		for (nid = 0; nid < proximity_domain; ++nid) {
-			peer_dev = kfd_topology_device_by_proximity_domain(nid);
+			peer_dev = kfd_topology_device_by_proximity_domain_no_lock(nid);
 			if (!peer_dev->gpu)
 				continue;
 			if (peer_dev->gpu->hive_id != kdev->hive_id)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index e1b7e6afa920..8a43def1f638 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1016,6 +1016,8 @@ int kfd_topology_add_device(struct kfd_dev *gpu);
 int kfd_topology_remove_device(struct kfd_dev *gpu);
 struct kfd_topology_device *kfd_topology_device_by_proximity_domain(
 						uint32_t proximity_domain);
+struct kfd_topology_device *kfd_topology_device_by_proximity_domain_no_lock(
+						uint32_t proximity_domain);
 struct kfd_topology_device *kfd_topology_device_by_id(uint32_t gpu_id);
 struct kfd_dev *kfd_device_by_id(uint32_t gpu_id);
 struct kfd_dev *kfd_device_by_pci_dev(const struct pci_dev *pdev);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index 3bdcae239bc0..874a273b81f7 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -46,27 +46,38 @@ static struct list_head topology_device_list;
 static struct kfd_system_properties sys_props;
 
 static DECLARE_RWSEM(topology_lock);
-static atomic_t topology_crat_proximity_domain;
+static uint32_t topology_crat_proximity_domain;
 
-struct kfd_topology_device *kfd_topology_device_by_proximity_domain(
+struct kfd_topology_device *kfd_topology_device_by_proximity_domain_no_lock(
 						uint32_t proximity_domain)
 {
 	struct kfd_topology_device *top_dev;
 	struct kfd_topology_device *device = NULL;
 
-	down_read(&topology_lock);
-
 	list_for_each_entry(top_dev, &topology_device_list, list)
 		if (top_dev->proximity_domain == proximity_domain) {
 			device = top_dev;
 			break;
 		}
 
+	return device;
+}
+
+struct kfd_topology_device *kfd_topology_device_by_proximity_domain(
+						uint32_t proximity_domain)
+{
+	struct kfd_topology_device *device = NULL;
+
+	down_read(&topology_lock);
+
+	device = kfd_topology_device_by_proximity_domain_no_lock(
+							proximity_domain);
 	up_read(&topology_lock);
 
 	return device;
 }
 
+
 struct kfd_topology_device *kfd_topology_device_by_id(uint32_t gpu_id)
 {
 	struct kfd_topology_device *top_dev = NULL;
@@ -1060,7 +1071,7 @@ int kfd_topology_init(void)
 	down_write(&topology_lock);
 	kfd_topology_update_device_list(&temp_topology_device_list,
 					&topology_device_list);
-	atomic_set(&topology_crat_proximity_domain, sys_props.num_devices-1);
+	topology_crat_proximity_domain = sys_props.num_devices-1;
 	ret = kfd_topology_update_sysfs();
 	up_write(&topology_lock);
 
@@ -1295,8 +1306,6 @@ int kfd_topology_add_device(struct kfd_dev *gpu)
 
 	pr_debug("Adding new GPU (ID: 0x%x) to topology\n", gpu_id);
 
-	proximity_domain = atomic_inc_return(&topology_crat_proximity_domain);
-
 	/* Include the CPU in xGMI hive if xGMI connected by assigning it the hive ID. */
 	if (gpu->hive_id && gpu->adev->gmc.xgmi.connected_to_cpu) {
 		struct kfd_topology_device *top_dev;
@@ -1321,12 +1330,16 @@ int kfd_topology_add_device(struct kfd_dev *gpu)
 	 */
 	dev = kfd_assign_gpu(gpu);
 	if (!dev) {
+		down_write(&topology_lock);
+		proximity_domain = ++topology_crat_proximity_domain;
+
 		res = kfd_create_crat_image_virtual(&crat_image, &image_size,
 						    COMPUTE_UNIT_GPU, gpu,
 						    proximity_domain);
 		if (res) {
 			pr_err("Error creating VCRAT for GPU (ID: 0x%x)\n",
 			       gpu_id);
+			topology_crat_proximity_domain--;
 			return res;
 		}
 		res = kfd_parse_crat_table(crat_image,
@@ -1335,10 +1348,10 @@ int kfd_topology_add_device(struct kfd_dev *gpu)
 		if (res) {
 			pr_err("Error parsing VCRAT for GPU (ID: 0x%x)\n",
 			       gpu_id);
+			topology_crat_proximity_domain--;
 			goto err;
 		}
 
-		down_write(&topology_lock);
 		kfd_topology_update_device_list(&temp_topology_device_list,
 			&topology_device_list);
 
@@ -1485,25 +1498,73 @@ int kfd_topology_add_device(struct kfd_dev *gpu)
 	return res;
 }
 
+static void kfd_topology_update_io_links(int proximity_domain)
+{
+	struct kfd_topology_device *dev;
+	struct kfd_iolink_properties *iolink, *p2plink, *tmp;
+	/*
+	 * The topology list currently is arranged in increasing
+	 * order of proximity domain.
+	 *
+	 * Two things need to be done when a device is removed:
+	 * 1. All the IO links to this device need to be
+	 *    removed.
+	 * 2. All nodes after the current device node need to move
+	 *    up once this device node is removed from the topology
+	 *    list. As a result, the proximity domain values for
+	 *    all nodes after the node being deleted reduce by 1.
+	 *    This would also cause the proximity domain values for
+	 *    io links to be updated based on new proximity
+	 *    domain values.
+	 */
+	list_for_each_entry(dev, &topology_device_list, list) {
+		if (dev->proximity_domain > proximity_domain)
+			dev->proximity_domain--;
+
+		list_for_each_entry_safe(iolink, tmp, &dev->io_link_props, list) {
+			/*
+			 * If there is an io link to the dev being deleted
+			 * then remove that IO link also.
+			 */
+			if (iolink->node_to == proximity_domain) {
+				list_del(&iolink->list);
+				dev->io_link_count--;
+				dev->node_props.io_links_count--;
+			} else if (iolink->node_from > proximity_domain) {
+				iolink->node_from--;
+			} else if (iolink->node_to > proximity_domain) {
+				iolink->node_to--;
+			}
+		}
+
+	}
+}
+
 int kfd_topology_remove_device(struct kfd_dev *gpu)
 {
 	struct kfd_topology_device *dev, *tmp;
 	uint32_t gpu_id;
 	int res = -ENODEV;
+	int i = 0;
 
 	down_write(&topology_lock);
 
-	list_for_each_entry_safe(dev, tmp, &topology_device_list, list)
+	list_for_each_entry_safe(dev, tmp, &topology_device_list, list) {
 		if (dev->gpu == gpu) {
 			gpu_id = dev->gpu_id;
 			kfd_remove_sysfs_node_entry(dev);
 			kfd_release_topology_device(dev);
 			sys_props.num_devices--;
+			kfd_topology_update_io_links(i);
+			topology_crat_proximity_domain = sys_props.num_devices-1;
+			sys_props.generation_count++;
 			res = 0;
 			if (kfd_topology_update_sysfs() < 0)
 				kfd_topology_release_sysfs();
 			break;
 		}
+		i++;
+	}
 
 	up_write(&topology_lock);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-08  8:45 [PATCH 1/2] drm/amdkfd: Cleanup IO links during KFD device removal Shuotao Xu
@ 2022-04-08  8:45 ` Shuotao Xu
  2022-04-08 15:28   ` Andrey Grodzovsky
  2022-04-12  0:07 ` [PATCH 1/2] drm/amdkfd: Cleanup IO links during KFD device removal Felix Kuehling
  1 sibling, 1 reply; 31+ messages in thread
From: Shuotao Xu @ 2022-04-08  8:45 UTC (permalink / raw)
  To: amd-gfx
  Cc: Mukul.Joshi, Andrey.Grodzovsky, Felix.Kuehling, pengc, Lei.Qu,
	Shuotao Xu, Ran.Shu, Ziyue.Yang

Adding PCIe hotplug support for AMDKFD: support for hot-plugging GPU
devices can open doors for many advanced applications in the data
center in the next few years, such as GPU resource disaggregation.
The current AMDKFD does not support hot-plug out because of the
following reasons:

1. During PCIe removal, decrement KFD lock which was incremented at
   the beginning of hw fini; otherwise kfd_open later is going to
   fail.

2. Remove redundant p2p/io links in sysfs when the device is
   hotplugged out.

3. New kfd node_id is not properly assigned after a new device is
   added after a gpu is hotplugged out in a system. libhsakmt will
   find this anomaly, (i.e. node_from != <dev node id> in iolinks),
   when taking a topology_snapshot, thus returns fault to the rocm
   stack.

-- This patch fixes issue 1; another patch by Mukul fixes issues 2&3.
-- Tested on a 4-GPU MI100 node with kernel 5.13.0-kfd; kernel
   5.16.0-kfd is unstable out of the box for MI100.
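The refcount problem in issue 1 can be sketched as a small userspace
model of the kfd_locked counter: each device's hw fini increments it
(suspending all processes on the first increment), and the fix
decrements it again during PCI removal so that a later kfd_open can
succeed. This is a hedged sketch with hypothetical names, not the
kernel code:

```c
#include <assert.h>

static int kfd_locked;			/* stands in for the kernel atomic_t */
static int processes_running = 1;

/* models the non-runtime path of kgd2kfd_suspend() during hw fini */
static void model_suspend(void)
{
	if (++kfd_locked == 1)
		processes_running = 0;	/* kfd_suspend_all_processes() */
}

/* models the new kgd2kfd_resume_processes() called from PCI removal */
static int model_resume_processes(void)
{
	int count = --kfd_locked;

	if (count < 0)
		return -1;		/* stands in for the WARN_ONCE */
	if (count == 0)
		processes_running = 1;	/* kfd_resume_all_processes() */
	return 0;
}
```

Without the decrement on removal, kfd_locked would stay nonzero forever
after a hot-unplug and kfd_open would keep failing.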
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  5 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  7 +++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
 drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 13 +++++++++++++
 4 files changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index c18c4be1e4ac..d50011bdb5c4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
 	return r;
 }
 
+int amdgpu_amdkfd_resume_processes(void)
+{
+	return kgd2kfd_resume_processes();
+}
+
 int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
 {
 	int r = 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index f8b9f27adcf5..803306e011c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
 void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
 int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
 int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
+int amdgpu_amdkfd_resume_processes(void);
 void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
 			const void *ih_ring_entry);
 void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
@@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
 int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
 int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
+int kgd2kfd_resume_processes(void);
 int kgd2kfd_pre_reset(struct kfd_dev *kfd);
 int kgd2kfd_post_reset(struct kfd_dev *kfd);
 void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
@@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
 	return 0;
 }
 
+static inline int kgd2kfd_resume_processes(void)
+{
+	return 0;
+}
+
 static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
 {
 	return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index fa4a9f13c922..5827b65b7489 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 	if (drm_dev_is_unplugged(adev_to_drm(adev)))
 		amdgpu_device_unmap_mmio(adev);
 
+	amdgpu_amdkfd_resume_processes();
 }
 
 void amdgpu_device_fini_sw(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 62aa6c9d5123..ef05aae9255e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
 	return ret;
 }
 
+/* for non-runtime resume only */
+int kgd2kfd_resume_processes(void)
+{
+	int count;
+
+	count = atomic_dec_return(&kfd_locked);
+	WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
+	if (count == 0)
+		return kfd_resume_all_processes();
+
+	return 0;
+}
+
 int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
 {
 	int err = 0;
-- 
2.25.1



* Re: [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-08  8:45 ` [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD Shuotao Xu
@ 2022-04-08 15:28   ` Andrey Grodzovsky
  2022-04-09  1:28     ` [EXTERNAL] " Shuotao Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Andrey Grodzovsky @ 2022-04-08 15:28 UTC (permalink / raw)
  To: Shuotao Xu, amd-gfx
  Cc: Mukul.Joshi, Felix.Kuehling, pengc, Lei.Qu, Shuotao Xu, Ran.Shu,
	Ziyue.Yang



On 2022-04-08 04:45, Shuotao Xu wrote:
> Adding PCIe Hotplug Support for AMDKFD: the support of hot-plug of GPU
> devices can open doors for many advanced applications in data center
> in the next few years, such as for GPU resource
> disaggregation. Current AMDKFD does not support hotplug out b/o the
> following reasons:
> 
> 1. During PCIe removal, decrement KFD lock which was incremented at
>     the beginning of hw fini; otherwise kfd_open later is going to
>     fail.

I assumed you read my comment last time, yet you still take the same
approach. More details below.

> 
> 2. Remove redudant p2p/io links in sysfs when device is hotplugged
>     out.
> 
> 3. New kfd node_id is not properly assigned after a new device is
>     added after a gpu is hotplugged out in a system. libhsakmt will
>     find this anomaly, (i.e. node_from != <dev node id> in iolinks),
>     when taking a topology_snapshot, thus returns fault to the rocm
>     stack.
> 
> -- This patch fixes issue 1; another patch by Mukul fixes issues 2&3.
> -- Tested on a 4-GPU MI100 gpu nodes with kernel 5.13.0-kfd; kernel
>     5.16.0-kfd is unstable out of box for MI100.
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  5 +++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  7 +++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
>   drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 13 +++++++++++++
>   4 files changed, 26 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index c18c4be1e4ac..d50011bdb5c4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
>   	return r;
>   }
>   
> +int amdgpu_amdkfd_resume_processes(void)
> +{
> +	return kgd2kfd_resume_processes();
> +}
> +
>   int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
>   {
>   	int r = 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index f8b9f27adcf5..803306e011c3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
>   void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>   int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
>   int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
> +int amdgpu_amdkfd_resume_processes(void);
>   void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>   			const void *ih_ring_entry);
>   void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
> @@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
>   void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>   int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
>   int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
> +int kgd2kfd_resume_processes(void);
>   int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>   int kgd2kfd_post_reset(struct kfd_dev *kfd);
>   void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
> @@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>   	return 0;
>   }
>   
> +static inline int kgd2kfd_resume_processes(void)
> +{
> +	return 0;
> +}
> +
>   static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>   {
>   	return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index fa4a9f13c922..5827b65b7489 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>   	if (drm_dev_is_unplugged(adev_to_drm(adev)))
>   		amdgpu_device_unmap_mmio(adev);
>   
> +	amdgpu_amdkfd_resume_processes();
>   }
>   
>   void amdgpu_device_fini_sw(struct amdgpu_device *adev)
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 62aa6c9d5123..ef05aae9255e 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>   	return ret;
>   }
>   
> +/* for non-runtime resume only */
> +int kgd2kfd_resume_processes(void)
> +{
> +	int count;
> +
> +	count = atomic_dec_return(&kfd_locked);
> +	WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
> +	if (count == 0)
> +		return kfd_resume_all_processes();
> +
> +	return 0;
> +}


It doesn't make sense to me to increment kfd_locked in
kgd2kfd_suspend only to decrement it again a few functions down the
road.

I suggest this instead - you only increment if not during PCI remove:

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 1c2cf3a33c1f..7754f77248a4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -952,11 +952,12 @@ bool kfd_is_locked(void)

  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
  {
+
         if (!kfd->init_complete)
                 return;

         /* for runtime suspend, skip locking kfd */
-       if (!run_pm) {
+       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
                 /* For first KFD device suspend all the KFD processes */
                 if (atomic_inc_return(&kfd_locked) == 1)
                         kfd_suspend_all_processes();

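The difference between the two approaches can be sketched as a
userspace model of the suggested guard, assuming drm_dev_is_unplugged()
behaves as a simple flag here; the model_* name is hypothetical:

```c
#include <assert.h>

static int kfd_locked;

/*
 * Models the suggested guard in kgd2kfd_suspend(): for a device that
 * is already gone from the bus there is nothing to suspend, so the
 * lock is never taken and no later decrement is needed.
 */
static void model_suspend_guarded(int run_pm, int unplugged)
{
	if (!run_pm && !unplugged)
		kfd_locked++;
}
```

With the guard, the PCI-remove path leaves kfd_locked untouched, so the
separate resume/decrement step in the posted patch becomes unnecessary.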

Andrey



> +
>   int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
>   {
>   	int err = 0;


* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-08 15:28   ` Andrey Grodzovsky
@ 2022-04-09  1:28     ` Shuotao Xu
  2022-04-11 15:52       ` Andrey Grodzovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Shuotao Xu @ 2022-04-09  1:28 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang



> On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:
> 
> 
> On 2022-04-08 04:45, Shuotao Xu wrote:
>> Adding PCIe Hotplug Support for AMDKFD: the support of hot-plug of GPU
>> devices can open doors for many advanced applications in data center
>> in the next few years, such as for GPU resource
>> disaggregation. Current AMDKFD does not support hotplug out b/o the
>> following reasons:
>> 
>> 1. During PCIe removal, decrement KFD lock which was incremented at
>>    the beginning of hw fini; otherwise kfd_open later is going to
>>    fail.
> 
> I assumed you read my comment last time, still you do same approach.
> More in details bellow

Aha, I like your fix :) I was not familiar with the drm APIs, so I only
half understood your comment last time.

BTW, I tried hot-plugging out a GPU while a rocm application was still
running. From dmesg, the application is still trying to access the
removed kfd device and is met with some errors. The application hangs
and does not exit in this case.

Do you have any good suggestions on how to fix it down the line? (HIP runtime/libhsakmt or driver)

[64036.631333] amdgpu: amdgpu_vm_bo_update failed
[64036.631702] amdgpu: validate_invalid_user_pages: update PTE failed
[64036.640754] amdgpu: amdgpu_vm_bo_update failed
[64036.641120] amdgpu: validate_invalid_user_pages: update PTE failed
[64036.650394] amdgpu: amdgpu_vm_bo_update failed
[64036.650765] amdgpu: validate_invalid_user_pages: update PTE failed

Really appreciate your help!

Best,
Shuotao
 
> 
>> 
>> 2. Remove redudant p2p/io links in sysfs when device is hotplugged
>>    out.
>> 
>> 3. New kfd node_id is not properly assigned after a new device is
>>    added after a gpu is hotplugged out in a system. libhsakmt will
>>    find this anomaly, (i.e. node_from != <dev node id> in iolinks),
>>    when taking a topology_snapshot, thus returns fault to the rocm
>>    stack.
>> 
>> -- This patch fixes issue 1; another patch by Mukul fixes issues 2&3.
>> -- Tested on a 4-GPU MI100 gpu nodes with kernel 5.13.0-kfd; kernel
>>    5.16.0-kfd is unstable out of box for MI100.
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  5 +++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  7 +++++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
>>  drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 13 +++++++++++++
>>  4 files changed, 26 insertions(+)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> index c18c4be1e4ac..d50011bdb5c4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> @@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
>>      return r;
>>  }
>> 
>> +int amdgpu_amdkfd_resume_processes(void)
>> +{
>> +     return kgd2kfd_resume_processes();
>> +}
>> +
>>  int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
>>  {
>>      int r = 0;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> index f8b9f27adcf5..803306e011c3 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> @@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
>>  void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>>  int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
>>  int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>> +int amdgpu_amdkfd_resume_processes(void);
>>  void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>                      const void *ih_ring_entry);
>>  void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>> @@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
>>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>>  int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
>>  int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>> +int kgd2kfd_resume_processes(void);
>>  int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>  int kgd2kfd_post_reset(struct kfd_dev *kfd);
>>  void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
>> @@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>      return 0;
>>  }
>> 
>> +static inline int kgd2kfd_resume_processes(void)
>> +{
>> +     return 0;
>> +}
>> +
>>  static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>  {
>>      return 0;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index fa4a9f13c922..5827b65b7489 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>>      if (drm_dev_is_unplugged(adev_to_drm(adev)))
>>              amdgpu_device_unmap_mmio(adev);
>> 
>> +     amdgpu_amdkfd_resume_processes();
>>  }
>> 
>>  void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index 62aa6c9d5123..ef05aae9255e 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>      return ret;
>>  }
>> 
>> +/* for non-runtime resume only */
>> +int kgd2kfd_resume_processes(void)
>> +{
>> +     int count;
>> +
>> +     count = atomic_dec_return(&kfd_locked);
>> +     WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>> +     if (count == 0)
>> +             return kfd_resume_all_processes();
>> +
>> +     return 0;
>> +}
> 
> 
> It doesn't make sense to me to just increment kfd_locked in
> kgd2kfd_suspend to only decrement it again a few functions down the
> road.
> 
> I suggest this instead - you only incrmemnt if not during PCI remove
> 
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 1c2cf3a33c1f..7754f77248a4 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -952,11 +952,12 @@ bool kfd_is_locked(void)
> 
> void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
> {
> +
>        if (!kfd->init_complete)
>                return;
> 
>        /* for runtime suspend, skip locking kfd */
> -       if (!run_pm) {
> +       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>                /* For first KFD device suspend all the KFD processes */
>                if (atomic_inc_return(&kfd_locked) == 1)
>                        kfd_suspend_all_processes();
> 
> 
> Andrey
> 
> 
> 
>> +
>>  int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
>>  {
>>      int err = 0;



* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-09  1:28     ` [EXTERNAL] " Shuotao Xu
@ 2022-04-11 15:52       ` Andrey Grodzovsky
  2022-04-13 16:03         ` Shuotao Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Andrey Grodzovsky @ 2022-04-11 15:52 UTC (permalink / raw)
  To: Shuotao Xu
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang


On 2022-04-08 21:28, Shuotao Xu wrote:
>
>> On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:
>>
>>
>> On 2022-04-08 04:45, Shuotao Xu wrote:
>>> Adding PCIe Hotplug Support for AMDKFD: the support of hot-plug of GPU
>>> devices can open doors for many advanced applications in data center
>>> in the next few years, such as for GPU resource
>>> disaggregation. Current AMDKFD does not support hotplug out b/o the
>>> following reasons:
>>>
>>> 1. During PCIe removal, decrement KFD lock which was incremented at
>>>     the beginning of hw fini; otherwise kfd_open later is going to
>>>     fail.
>> I assumed you read my comment last time, still you do same approach.
>> More in details bellow
> Aha, I like your fix:) I was not familiar with drm APIs so just only half understood your comment last time.
>
> BTW, I tried hot-plugging out a GPU when rocm application is still running.
>  From dmesg, application is still trying to access the removed kfd device, and are met with some errors.


The application is supposed to keep running: it holds the drm_device
reference as long as it has an open FD to the device, and final cleanup
comes only after the app dies, thus releasing the FD and the last
drm_device reference.

> Application would hang and not exiting in this case.


For graphics apps what I usually see is a crash because of a SIGSEGV
when the app tries to access an unmapped MMIO region on the device. I
haven't tested the compute stack, so there might be something I haven't
covered. A hang could mean, for example, waiting on a fence which is
never signaled - please provide the full dmesg from this case.

>
> Do you have any good suggestions on how to fix it down the line? (HIP runtime/libhsakmt or driver)
>
> [64036.631333] amdgpu: amdgpu_vm_bo_update failed
> [64036.631702] amdgpu: validate_invalid_user_pages: update PTE failed
> [64036.640754] amdgpu: amdgpu_vm_bo_update failed
> [64036.641120] amdgpu: validate_invalid_user_pages: update PTE failed
> [64036.650394] amdgpu: amdgpu_vm_bo_update failed
> [64036.650765] amdgpu: validate_invalid_user_pages: update PTE failed


This probably just means we are trying to update PTEs after the
physical device is gone - we usually avoid this by doing all HW
shutdowns early, before PCI remove completes, or, when that is really
tricky, by protecting HW access sections with a drm_dev_enter/exit
scope.

For this particular error it would be best to flush
info->restore_userptr_work before the end of amdgpu_pci_remove
(rejecting new process creation and calling
cancel_delayed_work_sync(&process_info->restore_userptr_work) for all
running processes).

Andrey


>
> Really appreciate your help!
>
> Best,
> Shuotao
>   
>>> 2. Remove redudant p2p/io links in sysfs when device is hotplugged
>>>     out.
>>>
>>> 3. New kfd node_id is not properly assigned after a new device is
>>>     added after a gpu is hotplugged out in a system. libhsakmt will
>>>     find this anomaly, (i.e. node_from != <dev node id> in iolinks),
>>>     when taking a topology_snapshot, thus returns fault to the rocm
>>>     stack.
>>>
>>> -- This patch fixes issue 1; another patch by Mukul fixes issues 2&3.
>>> -- Tested on a 4-GPU MI100 gpu nodes with kernel 5.13.0-kfd; kernel
>>>     5.16.0-kfd is unstable out of box for MI100.
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  5 +++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  7 +++++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 13 +++++++++++++
>>>   4 files changed, 26 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> index c18c4be1e4ac..d50011bdb5c4 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> @@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
>>>       return r;
>>>   }
>>>
>>> +int amdgpu_amdkfd_resume_processes(void)
>>> +{
>>> +     return kgd2kfd_resume_processes();
>>> +}
>>> +
>>>   int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
>>>   {
>>>       int r = 0;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> index f8b9f27adcf5..803306e011c3 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> @@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
>>>   void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>>>   int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
>>>   int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>>> +int amdgpu_amdkfd_resume_processes(void);
>>>   void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>>                       const void *ih_ring_entry);
>>>   void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>>> @@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
>>>   void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>>>   int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
>>>   int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>>> +int kgd2kfd_resume_processes(void);
>>>   int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>>   int kgd2kfd_post_reset(struct kfd_dev *kfd);
>>>   void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
>>> @@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>       return 0;
>>>   }
>>>
>>> +static inline int kgd2kfd_resume_processes(void)
>>> +{
>>> +     return 0;
>>> +}
>>> +
>>>   static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>>   {
>>>       return 0;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index fa4a9f13c922..5827b65b7489 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>>>       if (drm_dev_is_unplugged(adev_to_drm(adev)))
>>>               amdgpu_device_unmap_mmio(adev);
>>>
>>> +     amdgpu_amdkfd_resume_processes();
>>>   }
>>>
>>>   void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index 62aa6c9d5123..ef05aae9255e 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>       return ret;
>>>   }
>>>
>>> +/* for non-runtime resume only */
>>> +int kgd2kfd_resume_processes(void)
>>> +{
>>> +     int count;
>>> +
>>> +     count = atomic_dec_return(&kfd_locked);
>>> +     WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>> +     if (count == 0)
>>> +             return kfd_resume_all_processes();
>>> +
>>> +     return 0;
>>> +}
>>
>> It doesn't make sense to me to just increment kfd_locked in
>> kgd2kfd_suspend to only decrement it again a few functions down the
>> road.
>>
>> I suggest this instead - you only increment if not during PCI remove
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index 1c2cf3a33c1f..7754f77248a4 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -952,11 +952,12 @@ bool kfd_is_locked(void)
>>
>> void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>> {
>> +
>>         if (!kfd->init_complete)
>>                 return;
>>
>>         /* for runtime suspend, skip locking kfd */
>> -       if (!run_pm) {
>> +       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>                 /* For first KFD device suspend all the KFD processes */
>>                 if (atomic_inc_return(&kfd_locked) == 1)
>>                         kfd_suspend_all_processes();
>>
>>
>> Andrey
>>
>>
>>
>>> +
>>>   int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
>>>   {
>>>       int err = 0;

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/2] drm/amdkfd: Cleanup IO links during KFD device removal
  2022-04-08  8:45 [PATCH 1/2] drm/amdkfd: Cleanup IO links during KFD device removal Shuotao Xu
  2022-04-08  8:45 ` [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD Shuotao Xu
@ 2022-04-12  0:07 ` Felix Kuehling
  2022-04-12  1:38   ` [EXTERNAL] " Shuotao Xu
  1 sibling, 1 reply; 31+ messages in thread
From: Felix Kuehling @ 2022-04-12  0:07 UTC (permalink / raw)
  To: Shuotao Xu, amd-gfx
  Cc: Mukul.Joshi, Andrey.Grodzovsky, pengc, Lei.Qu, Shuotao Xu,
	Ran.Shu, Ziyue.Yang

Am 2022-04-08 um 04:45 schrieb Shuotao Xu:
> Currently, the IO-links to the device being removed from topology,
> are not cleared. As a result, there would be dangling links left in
> the KFD topology. This patch aims to fix the following:
> 1. Cleanup all IO links to the device being removed.
> 2. Ensure that node numbering in sysfs and nodes proximity domain
>     values are consistent after the device is removed:
>     a. Adding a device and removing a GPU device are made mutually
>        exclusive.
>     b. The global proximity domain counter is no longer required to be
>        an atomic counter. A normal 32-bit counter can be used instead.
> 3. Update generation_count to let user-mode know that topology has
>     changed due to device removal.
>
> Reviewed-by: Shuotao Xu <shuotaoxu@microsoft.com>
> CC: Shuotao Xu <shuotaoxu@microsoft.com>
> Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>

This looks like Mukul's patch, but with you as the author (otherwise I 
would have expected a "From: Mukul ..." line at the start of the email). 
Did you make any changes to it?

Regards,
   Felix


> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_crat.c     |  4 +-
>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h     |  2 +
>   drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 79 ++++++++++++++++++++---
>   3 files changed, 74 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> index 1eaabd2cb41b..afc8a7fcdad8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> @@ -1056,7 +1056,7 @@ static int kfd_parse_subtype_iolink(struct crat_subtype_iolink *iolink,
>   	 * table, add corresponded reversed direction link now.
>   	 */
>   	if (props && (iolink->flags & CRAT_IOLINK_FLAGS_BI_DIRECTIONAL)) {
> -		to_dev = kfd_topology_device_by_proximity_domain(id_to);
> +		to_dev = kfd_topology_device_by_proximity_domain_no_lock(id_to);
>   		if (!to_dev)
>   			return -ENODEV;
>   		/* same everything but the other direction */
> @@ -2225,7 +2225,7 @@ static int kfd_create_vcrat_image_gpu(void *pcrat_image,
>   	 */
>   	if (kdev->hive_id) {
>   		for (nid = 0; nid < proximity_domain; ++nid) {
> -			peer_dev = kfd_topology_device_by_proximity_domain(nid);
> +			peer_dev = kfd_topology_device_by_proximity_domain_no_lock(nid);
>   			if (!peer_dev->gpu)
>   				continue;
>   			if (peer_dev->gpu->hive_id != kdev->hive_id)
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index e1b7e6afa920..8a43def1f638 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -1016,6 +1016,8 @@ int kfd_topology_add_device(struct kfd_dev *gpu);
>   int kfd_topology_remove_device(struct kfd_dev *gpu);
>   struct kfd_topology_device *kfd_topology_device_by_proximity_domain(
>   						uint32_t proximity_domain);
> +struct kfd_topology_device *kfd_topology_device_by_proximity_domain_no_lock(
> +						uint32_t proximity_domain);
>   struct kfd_topology_device *kfd_topology_device_by_id(uint32_t gpu_id);
>   struct kfd_dev *kfd_device_by_id(uint32_t gpu_id);
>   struct kfd_dev *kfd_device_by_pci_dev(const struct pci_dev *pdev);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> index 3bdcae239bc0..874a273b81f7 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> @@ -46,27 +46,38 @@ static struct list_head topology_device_list;
>   static struct kfd_system_properties sys_props;
>   
>   static DECLARE_RWSEM(topology_lock);
> -static atomic_t topology_crat_proximity_domain;
> +static uint32_t topology_crat_proximity_domain;
>   
> -struct kfd_topology_device *kfd_topology_device_by_proximity_domain(
> +struct kfd_topology_device *kfd_topology_device_by_proximity_domain_no_lock(
>   						uint32_t proximity_domain)
>   {
>   	struct kfd_topology_device *top_dev;
>   	struct kfd_topology_device *device = NULL;
>   
> -	down_read(&topology_lock);
> -
>   	list_for_each_entry(top_dev, &topology_device_list, list)
>   		if (top_dev->proximity_domain == proximity_domain) {
>   			device = top_dev;
>   			break;
>   		}
>   
> +	return device;
> +}
> +
> +struct kfd_topology_device *kfd_topology_device_by_proximity_domain(
> +						uint32_t proximity_domain)
> +{
> +	struct kfd_topology_device *device = NULL;
> +
> +	down_read(&topology_lock);
> +
> +	device = kfd_topology_device_by_proximity_domain_no_lock(
> +							proximity_domain);
>   	up_read(&topology_lock);
>   
>   	return device;
>   }
>   
> +
>   struct kfd_topology_device *kfd_topology_device_by_id(uint32_t gpu_id)
>   {
>   	struct kfd_topology_device *top_dev = NULL;
> @@ -1060,7 +1071,7 @@ int kfd_topology_init(void)
>   	down_write(&topology_lock);
>   	kfd_topology_update_device_list(&temp_topology_device_list,
>   					&topology_device_list);
> -	atomic_set(&topology_crat_proximity_domain, sys_props.num_devices-1);
> +	topology_crat_proximity_domain = sys_props.num_devices-1;
>   	ret = kfd_topology_update_sysfs();
>   	up_write(&topology_lock);
>   
> @@ -1295,8 +1306,6 @@ int kfd_topology_add_device(struct kfd_dev *gpu)
>   
>   	pr_debug("Adding new GPU (ID: 0x%x) to topology\n", gpu_id);
>   
> -	proximity_domain = atomic_inc_return(&topology_crat_proximity_domain);
> -
>   	/* Include the CPU in xGMI hive if xGMI connected by assigning it the hive ID. */
>   	if (gpu->hive_id && gpu->adev->gmc.xgmi.connected_to_cpu) {
>   		struct kfd_topology_device *top_dev;
> @@ -1321,12 +1330,16 @@ int kfd_topology_add_device(struct kfd_dev *gpu)
>   	 */
>   	dev = kfd_assign_gpu(gpu);
>   	if (!dev) {
> +		down_write(&topology_lock);
> +		proximity_domain = ++topology_crat_proximity_domain;
> +
>   		res = kfd_create_crat_image_virtual(&crat_image, &image_size,
>   						    COMPUTE_UNIT_GPU, gpu,
>   						    proximity_domain);
>   		if (res) {
>   			pr_err("Error creating VCRAT for GPU (ID: 0x%x)\n",
>   			       gpu_id);
> +			topology_crat_proximity_domain--;
>   			return res;
>   		}
>   		res = kfd_parse_crat_table(crat_image,
> @@ -1335,10 +1348,10 @@ int kfd_topology_add_device(struct kfd_dev *gpu)
>   		if (res) {
>   			pr_err("Error parsing VCRAT for GPU (ID: 0x%x)\n",
>   			       gpu_id);
> +			topology_crat_proximity_domain--;
>   			goto err;
>   		}
>   
> -		down_write(&topology_lock);
>   		kfd_topology_update_device_list(&temp_topology_device_list,
>   			&topology_device_list);
>   
> @@ -1485,25 +1498,73 @@ int kfd_topology_add_device(struct kfd_dev *gpu)
>   	return res;
>   }
>   
> +static void kfd_topology_update_io_links(int proximity_domain)
> +{
> +	struct kfd_topology_device *dev;
> +	struct kfd_iolink_properties *iolink, *p2plink, *tmp;
> +	/*
> +	 * The topology list currently is arranged in increasing
> +	 * order of proximity domain.
> +	 *
> +	 * Two things need to be done when a device is removed:
> +	 * 1. All the IO links to this device need to be
> +	 *    removed.
> +	 * 2. All nodes after the current device node need to move
> +	 *    up once this device node is removed from the topology
> +	 *    list. As a result, the proximity domain values for
> +	 *    all nodes after the node being deleted reduce by 1.
> +	 *    This would also cause the proximity domain values for
> +	 *    io links to be updated based on new proximity
> +	 *    domain values.
> +	 */
> +	list_for_each_entry(dev, &topology_device_list, list) {
> +		if (dev->proximity_domain > proximity_domain)
> +			dev->proximity_domain--;
> +
> +		list_for_each_entry_safe(iolink, tmp, &dev->io_link_props, list) {
> +			/*
> +			 * If there is an io link to the dev being deleted
> +			 * then remove that IO link also.
> +			 */
> +			if (iolink->node_to == proximity_domain) {
> +				list_del(&iolink->list);
> +				dev->io_link_count--;
> +				dev->node_props.io_links_count--;
> +			} else if (iolink->node_from > proximity_domain) {
> +				iolink->node_from--;
> +			} else if (iolink->node_to > proximity_domain) {
> +				iolink->node_to--;
> +			}
> +		}
> +
> +	}
> +}
> +
>   int kfd_topology_remove_device(struct kfd_dev *gpu)
>   {
>   	struct kfd_topology_device *dev, *tmp;
>   	uint32_t gpu_id;
>   	int res = -ENODEV;
> +	int i = 0;
>   
>   	down_write(&topology_lock);
>   
> -	list_for_each_entry_safe(dev, tmp, &topology_device_list, list)
> +	list_for_each_entry_safe(dev, tmp, &topology_device_list, list) {
>   		if (dev->gpu == gpu) {
>   			gpu_id = dev->gpu_id;
>   			kfd_remove_sysfs_node_entry(dev);
>   			kfd_release_topology_device(dev);
>   			sys_props.num_devices--;
> +			kfd_topology_update_io_links(i);
> +			topology_crat_proximity_domain = sys_props.num_devices-1;
> +			sys_props.generation_count++;
>   			res = 0;
>   			if (kfd_topology_update_sysfs() < 0)
>   				kfd_topology_release_sysfs();
>   			break;
>   		}
> +		i++;
> +	}
>   
>   	up_write(&topology_lock);
>   


* Re: [EXTERNAL] [PATCH 1/2] drm/amdkfd: Cleanup IO links during KFD device removal
  2022-04-12  0:07 ` [PATCH 1/2] drm/amdkfd: Cleanup IO links during KFD device removal Felix Kuehling
@ 2022-04-12  1:38   ` Shuotao Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Shuotao Xu @ 2022-04-12  1:38 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Mukul.Joshi, Andrey.Grodzovsky, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang



> On Apr 12, 2022, at 8:07 AM, Felix Kuehling <felix.kuehling@amd.com> wrote:
> 
> Am 2022-04-08 um 04:45 schrieb Shuotao Xu:
>> Currently, the IO-links to the device being removed from topology,
>> are not cleared. As a result, there would be dangling links left in
>> the KFD topology. This patch aims to fix the following:
>> 1. Cleanup all IO links to the device being removed.
>> 2. Ensure that node numbering in sysfs and nodes proximity domain
>>    values are consistent after the device is removed:
>>    a. Adding a device and removing a GPU device are made mutually
>>       exclusive.
>>    b. The global proximity domain counter is no longer required to be
>>       an atomic counter. A normal 32-bit counter can be used instead.
>> 3. Update generation_count to let user-mode know that topology has
>>    changed due to device removal.
>> 
>> Reviewed-by: Shuotao Xu <shuotaoxu@microsoft.com>
>> CC: Shuotao Xu <shuotaoxu@microsoft.com>
>> Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
> 
> This looks like Mukul's patch, but with you as the author (otherwise I would have expected a "From: Mukul ..." line at the start of the email). Did you make any changes to it?
> 
> Regards,
>   Felix

Yes, it was Mukul's patch. I was trying to form a patch series, but I may have done it wrong.

Regards,
Shuotao
> 
> 
>> ---
>>  drivers/gpu/drm/amd/amdkfd/kfd_crat.c     |  4 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h     |  2 +
>>  drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 79 ++++++++++++++++++++---
>>  3 files changed, 74 insertions(+), 11 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
>> index 1eaabd2cb41b..afc8a7fcdad8 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
>> @@ -1056,7 +1056,7 @@ static int kfd_parse_subtype_iolink(struct crat_subtype_iolink *iolink,
>>  	 * table, add corresponded reversed direction link now.
>>  	 */
>>  	if (props && (iolink->flags & CRAT_IOLINK_FLAGS_BI_DIRECTIONAL)) {
>> -		to_dev = kfd_topology_device_by_proximity_domain(id_to);
>> +		to_dev = kfd_topology_device_by_proximity_domain_no_lock(id_to);
>>  		if (!to_dev)
>>  			return -ENODEV;
>>  		/* same everything but the other direction */
>> @@ -2225,7 +2225,7 @@ static int kfd_create_vcrat_image_gpu(void *pcrat_image,
>>  	 */
>>  	if (kdev->hive_id) {
>>  		for (nid = 0; nid < proximity_domain; ++nid) {
>> -			peer_dev = kfd_topology_device_by_proximity_domain(nid);
>> +			peer_dev = kfd_topology_device_by_proximity_domain_no_lock(nid);
>>  			if (!peer_dev->gpu)
>>  				continue;
>>  			if (peer_dev->gpu->hive_id != kdev->hive_id)
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> index e1b7e6afa920..8a43def1f638 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> @@ -1016,6 +1016,8 @@ int kfd_topology_add_device(struct kfd_dev *gpu);
>>  int kfd_topology_remove_device(struct kfd_dev *gpu);
>>  struct kfd_topology_device *kfd_topology_device_by_proximity_domain(
>>  						uint32_t proximity_domain);
>> +struct kfd_topology_device *kfd_topology_device_by_proximity_domain_no_lock(
>> +						uint32_t proximity_domain);
>>  struct kfd_topology_device *kfd_topology_device_by_id(uint32_t gpu_id);
>>  struct kfd_dev *kfd_device_by_id(uint32_t gpu_id);
>>  struct kfd_dev *kfd_device_by_pci_dev(const struct pci_dev *pdev);
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>> index 3bdcae239bc0..874a273b81f7 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>> @@ -46,27 +46,38 @@ static struct list_head topology_device_list;
>>  static struct kfd_system_properties sys_props;
>>    static DECLARE_RWSEM(topology_lock);
>> -static atomic_t topology_crat_proximity_domain;
>> +static uint32_t topology_crat_proximity_domain;
>>  -struct kfd_topology_device *kfd_topology_device_by_proximity_domain(
>> +struct kfd_topology_device *kfd_topology_device_by_proximity_domain_no_lock(
>>  						uint32_t proximity_domain)
>>  {
>>  	struct kfd_topology_device *top_dev;
>>  	struct kfd_topology_device *device = NULL;
>>  -	down_read(&topology_lock);
>> -
>>  	list_for_each_entry(top_dev, &topology_device_list, list)
>>  		if (top_dev->proximity_domain == proximity_domain) {
>>  			device = top_dev;
>>  			break;
>>  		}
>>  +	return device;
>> +}
>> +
>> +struct kfd_topology_device *kfd_topology_device_by_proximity_domain(
>> +						uint32_t proximity_domain)
>> +{
>> +	struct kfd_topology_device *device = NULL;
>> +
>> +	down_read(&topology_lock);
>> +
>> +	device = kfd_topology_device_by_proximity_domain_no_lock(
>> +							proximity_domain);
>>  	up_read(&topology_lock);
>>    	return device;
>>  }
>>  +
>>  struct kfd_topology_device *kfd_topology_device_by_id(uint32_t gpu_id)
>>  {
>>  	struct kfd_topology_device *top_dev = NULL;
>> @@ -1060,7 +1071,7 @@ int kfd_topology_init(void)
>>  	down_write(&topology_lock);
>>  	kfd_topology_update_device_list(&temp_topology_device_list,
>>  					&topology_device_list);
>> -	atomic_set(&topology_crat_proximity_domain, sys_props.num_devices-1);
>> +	topology_crat_proximity_domain = sys_props.num_devices-1;
>>  	ret = kfd_topology_update_sysfs();
>>  	up_write(&topology_lock);
>>  @@ -1295,8 +1306,6 @@ int kfd_topology_add_device(struct kfd_dev *gpu)
>>    	pr_debug("Adding new GPU (ID: 0x%x) to topology\n", gpu_id);
>>  -	proximity_domain = atomic_inc_return(&topology_crat_proximity_domain);
>> -
>>  	/* Include the CPU in xGMI hive if xGMI connected by assigning it the hive ID. */
>>  	if (gpu->hive_id && gpu->adev->gmc.xgmi.connected_to_cpu) {
>>  		struct kfd_topology_device *top_dev;
>> @@ -1321,12 +1330,16 @@ int kfd_topology_add_device(struct kfd_dev *gpu)
>>  	 */
>>  	dev = kfd_assign_gpu(gpu);
>>  	if (!dev) {
>> +		down_write(&topology_lock);
>> +		proximity_domain = ++topology_crat_proximity_domain;
>> +
>>  		res = kfd_create_crat_image_virtual(&crat_image, &image_size,
>>  						    COMPUTE_UNIT_GPU, gpu,
>>  						    proximity_domain);
>>  		if (res) {
>>  			pr_err("Error creating VCRAT for GPU (ID: 0x%x)\n",
>>  			       gpu_id);
>> +			topology_crat_proximity_domain--;
>>  			return res;
>>  		}
>>  		res = kfd_parse_crat_table(crat_image,
>> @@ -1335,10 +1348,10 @@ int kfd_topology_add_device(struct kfd_dev *gpu)
>>  		if (res) {
>>  			pr_err("Error parsing VCRAT for GPU (ID: 0x%x)\n",
>>  			       gpu_id);
>> +			topology_crat_proximity_domain--;
>>  			goto err;
>>  		}
>>  -		down_write(&topology_lock);
>>  		kfd_topology_update_device_list(&temp_topology_device_list,
>>  			&topology_device_list);
>>  @@ -1485,25 +1498,73 @@ int kfd_topology_add_device(struct kfd_dev *gpu)
>>  	return res;
>>  }
>>  +static void kfd_topology_update_io_links(int proximity_domain)
>> +{
>> +	struct kfd_topology_device *dev;
>> +	struct kfd_iolink_properties *iolink, *p2plink, *tmp;
>> +	/*
>> +	 * The topology list currently is arranged in increasing
>> +	 * order of proximity domain.
>> +	 *
>> +	 * Two things need to be done when a device is removed:
>> +	 * 1. All the IO links to this device need to be
>> +	 *    removed.
>> +	 * 2. All nodes after the current device node need to move
>> +	 *    up once this device node is removed from the topology
>> +	 *    list. As a result, the proximity domain values for
>> +	 *    all nodes after the node being deleted reduce by 1.
>> +	 *    This would also cause the proximity domain values for
>> +	 *    io links to be updated based on new proximity
>> +	 *    domain values.
>> +	 */
>> +	list_for_each_entry(dev, &topology_device_list, list) {
>> +		if (dev->proximity_domain > proximity_domain)
>> +			dev->proximity_domain--;
>> +
>> +		list_for_each_entry_safe(iolink, tmp, &dev->io_link_props, list) {
>> +			/*
>> +			 * If there is an io link to the dev being deleted
>> +			 * then remove that IO link also.
>> +			 */
>> +			if (iolink->node_to == proximity_domain) {
>> +				list_del(&iolink->list);
>> +				dev->io_link_count--;
>> +				dev->node_props.io_links_count--;
>> +			} else if (iolink->node_from > proximity_domain) {
>> +				iolink->node_from--;
>> +			} else if (iolink->node_to > proximity_domain) {
>> +				iolink->node_to--;
>> +			}
>> +		}
>> +
>> +	}
>> +}
>> +
>>  int kfd_topology_remove_device(struct kfd_dev *gpu)
>>  {
>>  	struct kfd_topology_device *dev, *tmp;
>>  	uint32_t gpu_id;
>>  	int res = -ENODEV;
>> +	int i = 0;
>>    	down_write(&topology_lock);
>>  -	list_for_each_entry_safe(dev, tmp, &topology_device_list, list)
>> +	list_for_each_entry_safe(dev, tmp, &topology_device_list, list) {
>>  		if (dev->gpu == gpu) {
>>  			gpu_id = dev->gpu_id;
>>  			kfd_remove_sysfs_node_entry(dev);
>>  			kfd_release_topology_device(dev);
>>  			sys_props.num_devices--;
>> +			kfd_topology_update_io_links(i);
>> +			topology_crat_proximity_domain = sys_props.num_devices-1;
>> +			sys_props.generation_count++;
>>  			res = 0;
>>  			if (kfd_topology_update_sysfs() < 0)
>>  				kfd_topology_release_sysfs();
>>  			break;
>>  		}
>> +		i++;
>> +	}
>>    	up_write(&topology_lock);
>>  



* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-11 15:52       ` Andrey Grodzovsky
@ 2022-04-13 16:03         ` Shuotao Xu
  2022-04-13 17:31           ` Andrey Grodzovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Shuotao Xu @ 2022-04-13 16:03 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang




On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>> wrote:


On 2022-04-08 21:28, Shuotao Xu wrote:

On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>> wrote:


On 2022-04-08 04:45, Shuotao Xu wrote:
Adding PCIe Hotplug Support for AMDKFD: support for hot-plug of GPU
devices can open the door to many advanced applications in data centers
in the next few years, such as GPU resource disaggregation. Current
AMDKFD does not support hotplug-out because of the
following reasons:

1. During PCIe removal, decrement KFD lock which was incremented at
the beginning of hw fini; otherwise kfd_open later is going to
fail.
I assumed you read my comment last time; still, you take the same approach.
More details below.
Aha, I like your fix :) I was not familiar with the drm APIs, so I only half understood your comment last time.

BTW, I tried hot-plugging out a GPU while a ROCm application was still running.
From dmesg, the application was still trying to access the removed kfd device, and was met with some errors.


The application is supposed to keep running: it holds the drm_device
reference as long as it has an open
FD to the device, and final cleanup comes only after the app dies,
thus releasing the FD and the last
drm_device reference.

The application would hang and not exit in this case.


Actually, I tried kill -7 $pid, and the process exits. The dmesg has some warnings though.
[  711.769977] WARNING: CPU: 23 PID: 344 at .../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336 amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.770528] Modules linked in: amdgpu(OE) amdttm(OE) amd_sched(OE) amdkcl(OE) iommu_v2 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay binfmt_misc intel_rapl_msr i40iw intel_rapl_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel rpcrdma kvm sunrpc ipmi_ssif ib_umad ib_ipoib rdma_ucm irqbypass rapl joydev acpi_ipmi input_leds intel_cstate ipmi_si ipmi_devintf mei_me mei intel_pch_thermal ipmi_msghandler ioatdma mac_hid lpc_ich dca acpi_power_meter acpi_pad sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi pci_stub ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
[  711.779359]  raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core ast drm_vram_helper i2c_algo_bit drm_ttm_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sysfillrect uas hid_generic sysimgblt aesni_intel mlx5_core fb_sys_fops crypto_simd usbhid cryptd drm i40e pci_hyperv_intf usb_storage glue_helper mlxfw hid ahci libahci wmi
[  711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: G        W  OE     5.11.0+ #1
[  711.779755] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  711.779756] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[  711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.780141] Code: e8 b5 af 34 f4 e9 1f ff ff ff 48 39 c2 74 07 0f 0b e9 69 ff ff ff 4c 89 e7 e8 3c b4 16 00 e9 5c ff ff ff e8 a2 ce fd f3 eb cf <0f> 0b eb cb e8 d7 02 34 f4 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55
[  711.780143] RSP: 0018:ffffa8100dd67c30 EFLAGS: 00010282
[  711.780145] RAX: 00000000ffffffea RBX: ffff89980e792058 RCX: 0000000000000000
[  711.780147] RDX: 0000000000000000 RSI: ffff89a8f9ad8870 RDI: ffff89a8f9ad8870
[  711.780148] RBP: ffffa8100dd67c50 R08: 0000000000000000 R09: fffffffffff99b18
[  711.780149] R10: ffffa8100dd67bd0 R11: ffffa8100dd67908 R12: ffff89980e792000
[  711.780151] R13: ffff89980e792058 R14: ffff89980e7921bc R15: dead000000000100
[  711.780152] FS:  0000000000000000(0000) GS:ffff89a8f9ac0000(0000) knlGS:0000000000000000
[  711.780154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  711.780156] CR2: 00007ffddac6f71f CR3: 00000030bb80a003 CR4: 00000000007706e0
[  711.780157] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  711.780159] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  711.780160] PKRU: 55555554
[  711.780161] Call Trace:
[  711.780163]  ttm_bo_release+0x2ae/0x320 [amdttm]
[  711.780169]  amdttm_bo_put+0x30/0x40 [amdttm]
[  711.780357]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[  711.780543]  amdgpu_gem_object_free+0x8b/0x160 [amdgpu]
[  711.781119]  drm_gem_object_free+0x1d/0x30 [drm]
[  711.781489]  amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x34a/0x380 [amdgpu]
[  711.782044]  kfd_process_device_free_bos+0xe0/0x130 [amdgpu]
[  711.782611]  kfd_process_wq_release+0x286/0x380 [amdgpu]
[  711.783172]  process_one_work+0x236/0x420
[  711.783543]  worker_thread+0x34/0x400
[  711.783911]  ? process_one_work+0x420/0x420
[  711.784279]  kthread+0x126/0x140
[  711.784653]  ? kthread_park+0x90/0x90
[  711.785018]  ret_from_fork+0x22/0x30
[  711.785387] ---[ end trace d8f50f6594817c84 ]---
[  711.798716] [drm] amdgpu: ttm finalized


For graphics apps, what I usually see is a crash because of a SIGSEGV when
the app tries to access
an unmapped MMIO region on the device. I haven't tested the compute
stack, so there might
be something I haven't covered. A hang could mean, for example, waiting on a
fence which is never
signaled - please provide the full dmesg from this case.


Do you have any good suggestions on how to fix it down the line? (HIP runtime/libhsakmt or driver)

[64036.631333] amdgpu: amdgpu_vm_bo_update failed
[64036.631702] amdgpu: validate_invalid_user_pages: update PTE failed
[64036.640754] amdgpu: amdgpu_vm_bo_update failed
[64036.641120] amdgpu: validate_invalid_user_pages: update PTE failed
[64036.650394] amdgpu: amdgpu_vm_bo_update failed
[64036.650765] amdgpu: validate_invalid_user_pages: update PTE failed


The full dmesg is just a repetition of those two messages:
[186885.764079] amdgpu 0000:43:00.0: amdgpu: amdgpu: finishing device.
[186885.766916] [drm] free PSP TMR buffer
[186893.868173] amdgpu: amdgpu_vm_bo_update failed
[186893.868235] amdgpu: validate_invalid_user_pages: update PTE failed
[186893.876154] amdgpu: amdgpu_vm_bo_update failed
[186893.876190] amdgpu: validate_invalid_user_pages: update PTE failed
[186893.884150] amdgpu: amdgpu_vm_bo_update failed
[186893.884185] amdgpu: validate_invalid_user_pages: update PTE failed


This probably just means we are trying to update PTEs after the physical device
is gone - we usually avoid this by
first doing all HW shutdowns early, before PCI remove completes,
or, when that is really tricky, by
protecting HW access sections with a drm_dev_enter/exit scope.

For this particular error it would be best to flush
info->restore_userptr_work somewhere in amdgpu_pci_remove, before it
completes: reject new process creation and call
cancel_delayed_work_sync(&process_info->restore_userptr_work) for all
running processes.

I tried something like *kfd_process_ref_release* which I think did what you described, but it did not work.

Instead I tried to kill the process from the kernel, but amdgpu could **only** be hot-plugged back in successfully if there was no ROCm kernel running when it was plugged out. Otherwise, amdgpu_probe just hangs later. (Maybe because amdgpu was plugged out in a running state, it leaves a bad HW state which causes probe to hang.)

I don’t know if this is a viable solution worth pursuing, but I attached the diff anyway.

Another solution could be to let the compute stack's user mode detect a topology change via a generation_count change, and abort gracefully there.

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 4e7d9cb09a69..79b4c9b84cd0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -697,12 +697,15 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
                return;

        /* for runtime suspend, skip locking kfd */
-       if (!run_pm) {
+       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
                /* For first KFD device suspend all the KFD processes */
                if (atomic_inc_return(&kfd_locked) == 1)
                        kfd_suspend_all_processes(force);
        }

+       if (drm_dev_is_unplugged(kfd->ddev))
+               kfd_kill_all_user_processes();
+
        kfd->dqm->ops.stop(kfd->dqm);
        kfd_iommu_suspend(kfd);
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 55c9e1922714..84cbcd857856 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1065,6 +1065,7 @@ void kfd_unref_process(struct kfd_process *p);
 int kfd_process_evict_queues(struct kfd_process *p, bool force);
 int kfd_process_restore_queues(struct kfd_process *p);
 void kfd_suspend_all_processes(bool force);
+void kfd_kill_all_user_processes(void);
 /*
  * kfd_resume_all_processes:
  *     bool sync: If kfd_resume_all_processes() should wait for the
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 6cdc855abb6d..fb0c753b682c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -2206,6 +2206,24 @@ void kfd_suspend_all_processes(bool force)
        srcu_read_unlock(&kfd_processes_srcu, idx);
 }

+
+void kfd_kill_all_user_processes(void)
+{
+       struct kfd_process *p;
+       struct amdkfd_process_info *p_info;
+       unsigned int temp;
+       int idx = srcu_read_lock(&kfd_processes_srcu);
+
+       pr_info("Killing all processes\n");
+       hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+               p_info = p->kgd_process_info;
+               pr_info("Killing  processes, pid = %d", pid_nr(p_info->pid));
+               kill_pid(p_info->pid, SIGBUS, 1);
+       }
+       srcu_read_unlock(&kfd_processes_srcu, idx);
+}
+
+
 int kfd_resume_all_processes(bool sync)
 {
        struct kfd_process *p;


Andrey



Really appreciate your help!

Best,
Shuotao

2. Remove redundant p2p/io links in sysfs when device is hotplugged
out.

3. New kfd node_id is not properly assigned after a new device is
added after a gpu is hotplugged out in a system. libhsakmt will
find this anomaly, (i.e. node_from != <dev node id> in iolinks),
when taking a topology_snapshot, thus returns fault to the rocm
stack.

-- This patch fixes issue 1; another patch by Mukul fixes issues 2&3.
-- Tested on a 4-GPU MI100 node with kernel 5.13.0-kfd; kernel
5.16.0-kfd is unstable out of the box for MI100.
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 +++++
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 7 +++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 13 +++++++++++++
4 files changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index c18c4be1e4ac..d50011bdb5c4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
return r;
}

+int amdgpu_amdkfd_resume_processes(void)
+{
+ return kgd2kfd_resume_processes();
+}
+
int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
{
int r = 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index f8b9f27adcf5..803306e011c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
+int amdgpu_amdkfd_resume_processes(void);
void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
const void *ih_ring_entry);
void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
@@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
+int kgd2kfd_resume_processes(void);
int kgd2kfd_pre_reset(struct kfd_dev *kfd);
int kgd2kfd_post_reset(struct kfd_dev *kfd);
void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
@@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
return 0;
}

+static inline int kgd2kfd_resume_processes(void)
+{
+ return 0;
+}
+
static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
{
return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index fa4a9f13c922..5827b65b7489 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
if (drm_dev_is_unplugged(adev_to_drm(adev)))
amdgpu_device_unmap_mmio(adev);

+ amdgpu_amdkfd_resume_processes();
}

void amdgpu_device_fini_sw(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 62aa6c9d5123..ef05aae9255e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
return ret;
}

+/* for non-runtime resume only */
+int kgd2kfd_resume_processes(void)
+{
+ int count;
+
+ count = atomic_dec_return(&kfd_locked);
+ WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
+ if (count == 0)
+ return kfd_resume_all_processes();
+
+ return 0;
+}

It doesn't make sense to me to just increment kfd_locked in
kgd2kfd_suspend to only decrement it again a few functions down the
road.

I suggest this instead - you only increment if not during PCI remove:

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 1c2cf3a33c1f..7754f77248a4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -952,11 +952,12 @@ bool kfd_is_locked(void)

void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
{
+
if (!kfd->init_complete)
return;

/* for runtime suspend, skip locking kfd */
- if (!run_pm) {
+ if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
/* For first KFD device suspend all the KFD processes */
if (atomic_inc_return(&kfd_locked) == 1)
kfd_suspend_all_processes();


Andrey



+
int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
{
int err = 0;


[-- Attachment #2: Type: text/html, Size: 47507 bytes --]

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-13 16:03         ` Shuotao Xu
@ 2022-04-13 17:31           ` Andrey Grodzovsky
  2022-04-14 14:00             ` Shuotao Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Andrey Grodzovsky @ 2022-04-13 17:31 UTC (permalink / raw)
  To: Shuotao Xu
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 17907 bytes --]


On 2022-04-13 12:03, Shuotao Xu wrote:
>
>
>> On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky 
>> <andrey.grodzovsky@amd.com> wrote:
>>
>> [Some people who received this message don't often get email 
>> from andrey.grodzovsky@amd.com. Learn why this is important 
>> at http://aka.ms/LearnAboutSenderIdentification.]
>>
>> On 2022-04-08 21:28, Shuotao Xu wrote:
>>>
>>>> On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky 
>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>
>>>>
>>>> On 2022-04-08 04:45, Shuotao Xu wrote:
>>>>> Adding PCIe Hotplug Support for AMDKFD: the support of hot-plug of GPU
>>>>> devices can open doors for many advanced applications in data center
>>>>> in the next few years, such as for GPU resource
>>>>> disaggregation. Current AMDKFD does not support hotplug out b/o the
>>>>> following reasons:
>>>>>
>>>>> 1. During PCIe removal, decrement KFD lock which was incremented at
>>>>> the beginning of hw fini; otherwise kfd_open later is going to
>>>>> fail.
>>>> I assumed you read my comment last time, still you take the same approach.
>>>> More details below.
>>> Aha, I like your fix:) I was not familiar with drm APIs so just only 
>>> half understood your comment last time.
>>>
>>> BTW, I tried hot-plugging out a GPU when rocm application is still 
>>> running.
>>> From dmesg, the application is still trying to access the removed kfd 
>>> device, and is met with some errors.
>>
>>
>> The application is supposed to keep running; it holds the drm_device
>> reference as long as it has an open
>> FD to the device, and final cleanup will come only after the app dies,
>> thus releasing the FD and the last
>> drm_device reference.
>>
>>> The application would hang and not exit in this case.
>>
>
> Actually I tried kill -7 $pid, and the process exits. The dmesg has 
> some warnings though.
>
> [  711.769977] WARNING: CPU: 23 PID: 344 at 
> .../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336 
> amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
> [  711.770528] Modules linked in: amdgpu(OE) amdttm(OE) amd_sched(OE) 
> amdkcl(OE) iommu_v2 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo 
> xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE 
> iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 
> nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc 
> ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter 
> overlay binfmt_misc intel_rapl_msr i40iw intel_rapl_common skx_edac 
> nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel rpcrdma 
> kvm sunrpc ipmi_ssif ib_umad ib_ipoib rdma_ucm irqbypass rapl joydev 
> acpi_ipmi input_leds intel_cstate ipmi_si ipmi_devintf mei_me mei 
> intel_pch_thermal ipmi_msghandler ioatdma mac_hid lpc_ich dca 
> acpi_power_meter acpi_pad sch_fq_codel ib_iser rdma_cm iw_cm ib_cm 
> iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi pci_stub 
> ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 
> raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
> [  711.779359]  raid6_pq libcrc32c raid1 raid0 multipath linear 
> mlx5_ib ib_uverbs ib_core ast drm_vram_helper i2c_algo_bit 
> drm_ttm_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul 
> crc32_pclmul ghash_clmulni_intel sysfillrect uas hid_generic sysimgblt 
> aesni_intel mlx5_core fb_sys_fops crypto_simd usbhid cryptd drm i40e 
> pci_hyperv_intf usb_storage glue_helper mlxfw hid ahci libahci wmi
> [  711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: G        W 
>  OE     5.11.0+ #1
> [  711.779755] Hardware name: Supermicro 
> SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
> [  711.779756] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
> [  711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
> [  711.780141] Code: e8 b5 af 34 f4 e9 1f ff ff ff 48 39 c2 74 07 0f 
> 0b e9 69 ff ff ff 4c 89 e7 e8 3c b4 16 00 e9 5c ff ff ff e8 a2 ce fd 
> f3 eb cf <0f> 0b eb cb e8 d7 02 34 f4 0f 1f 80 00 00 00 00 0f 1f 44 00 
> 00 55
> [  711.780143] RSP: 0018:ffffa8100dd67c30 EFLAGS: 00010282
> [  711.780145] RAX: 00000000ffffffea RBX: ffff89980e792058 RCX: 
> 0000000000000000
> [  711.780147] RDX: 0000000000000000 RSI: ffff89a8f9ad8870 RDI: 
> ffff89a8f9ad8870
> [  711.780148] RBP: ffffa8100dd67c50 R08: 0000000000000000 R09: 
> fffffffffff99b18
> [  711.780149] R10: ffffa8100dd67bd0 R11: ffffa8100dd67908 R12: 
> ffff89980e792000
> [  711.780151] R13: ffff89980e792058 R14: ffff89980e7921bc R15: 
> dead000000000100
> [  711.780152] FS:  0000000000000000(0000) GS:ffff89a8f9ac0000(0000) 
> knlGS:0000000000000000
> [  711.780154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  711.780156] CR2: 00007ffddac6f71f CR3: 00000030bb80a003 CR4: 
> 00000000007706e0
> [  711.780157] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> 0000000000000000
> [  711.780159] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
> 0000000000000400
> [  711.780160] PKRU: 55555554
> [  711.780161] Call Trace:
> [  711.780163]  ttm_bo_release+0x2ae/0x320 [amdttm]
> [  711.780169]  amdttm_bo_put+0x30/0x40 [amdttm]
> [  711.780357]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
> [  711.780543]  amdgpu_gem_object_free+0x8b/0x160 [amdgpu]
> [  711.781119]  drm_gem_object_free+0x1d/0x30 [drm]
> [  711.781489]  amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x34a/0x380 
> [amdgpu]
> [  711.782044]  kfd_process_device_free_bos+0xe0/0x130 [amdgpu]
> [  711.782611]  kfd_process_wq_release+0x286/0x380 [amdgpu]
> [  711.783172]  process_one_work+0x236/0x420
> [  711.783543]  worker_thread+0x34/0x400
> [  711.783911]  ? process_one_work+0x420/0x420
> [  711.784279]  kthread+0x126/0x140
> [  711.784653]  ? kthread_park+0x90/0x90
> [  711.785018]  ret_from_fork+0x22/0x30
> [  711.785387] ---[ end trace d8f50f6594817c84 ]---
> [  711.798716] [drm] amdgpu: ttm finalized


So it means the process was stuck in some wait_event_killable (maybe 
here in drm_sched_entity_flush) - you can try 'cat /proc/$process_pid/stack' 
before you kill it to see where it was stuck, so we can go from there.
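For reference, that inspection can also be scripted; the helper below simply copies a /proc/&lt;pid&gt;/stack-style file to a stream (reading another task's stack needs root, and the function name here is invented for illustration):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy a procfs stack file (e.g. /proc/<pid>/stack) to an output
 * stream, so the blocking callchain of a task suspected to be stuck
 * in wait_event_killable() can be captured before killing it.
 * Illustrative helper; reading another task's stack requires root.
 */
static int dump_kernel_stack(const char *path, FILE *out)
{
	char line[256];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;	/* task gone, or insufficient privileges */
	while (fgets(line, sizeof(line), f))
		fputs(line, out);
	fclose(f);
	return 0;
}
```

The interesting frames are the ones between the syscall entry and the scheduler, e.g. a drm_sched_entity_flush frame would confirm the wait_event_killable theory above.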


>
>>
>> For graphic apps what i usually see is a crash because of sigsev when
>> the app tries to access
>> an unmapped MMIO region on the device. I haven't tested for compute
>> stack and so there might
>> be something I haven't covered. Hang could mean for example waiting on a
>> fence which is not being
>> signaled - please provide full dmesg from this case.
>>
>>>
>>> Do you have any good suggestions on how to fix it down the line? 
>>> (HIP runtime/libhsakmt or driver)
>>>
>>> [64036.631333] amdgpu: amdgpu_vm_bo_update failed
>>> [64036.631702] amdgpu: validate_invalid_user_pages: update PTE failed
>>> [64036.640754] amdgpu: amdgpu_vm_bo_update failed
>>> [64036.641120] amdgpu: validate_invalid_user_pages: update PTE failed
>>> [64036.650394] amdgpu: amdgpu_vm_bo_update failed
>>> [64036.650765] amdgpu: validate_invalid_user_pages: update PTE failed
>>
>
> The full dmesg will just the repetition of those two messages,
> [186885.764079] amdgpu 0000:43:00.0: amdgpu: amdgpu: finishing device.
> [186885.766916] [drm] free PSP TMR buffer
> [186893.868173] amdgpu: amdgpu_vm_bo_update failed
> [186893.868235] amdgpu: validate_invalid_user_pages: update PTE failed
> [186893.876154] amdgpu: amdgpu_vm_bo_update failed
> [186893.876190] amdgpu: validate_invalid_user_pages: update PTE failed
> [186893.884150] amdgpu: amdgpu_vm_bo_update failed
> [186893.884185] amdgpu: validate_invalid_user_pages: update PTE failed
>
>>
>> This just probably means trying to update PTEs after the physical device
>> is gone - we usually avoid this by
>> first trying to do all HW shutdowns early before PCI remove completion
>> but when it's really tricky by
>> protecting HW access sections with drm_dev_enter/exit scope.
>>
>> For this particular error it would be the best to flush
>> info->restore_userptr_work before the end of
>> amdgpu_pci_remove (rejecting new process creation and calling
>> cancel_delayed_work_sync(&process_info->restore_userptr_work) for all
>> running processes)
>> somewhere in amdgpu_pci_remove.
>>
> I tried something like *kfd_process_ref_release* which I think did 
> what you described, but it did not work.


I don't see how kfd_process_ref_release is the same as what I mentioned 
above; what I meant was calling the code above within kgd2kfd_suspend 
(where you tried to call kfd_kill_all_user_processes below).


>
> Instead I tried to kill the process from the kernel, but amdgpu 
> could be hot-plugged back in successfully **only** if there was 
> no ROCm kernel running when it was plugged out. If not, amdgpu_probe 
> will just hang later. (Maybe because amdgpu was plugged out while 
> in a running state, it left a bad HW state which causes probe to hang.)


We usually do an asic_reset during probe to reset all HW state (check if 
amdgpu_device_init->amdgpu_asic_reset is running when you plug it back in).


>
> I don’t know if this is a viable solution worth pursuing, but I 
> attached the diff anyway.
>
> Another solution could be to let the compute stack's user mode detect a 
> topology change via a generation_count change, and abort gracefully there.
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 4e7d9cb09a69..79b4c9b84cd0 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -697,12 +697,15 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool 
> run_pm, bool force)
>                 return;
>
>         /* for runtime suspend, skip locking kfd */
> -       if (!run_pm) {
> +       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>                 /* For first KFD device suspend all the KFD processes */
>                 if (atomic_inc_return(&kfd_locked) == 1)
>                         kfd_suspend_all_processes(force);
>         }
>
> +       if (drm_dev_is_unplugged(kfd->ddev))
> +               kfd_kill_all_user_processes();
> +
>         kfd->dqm->ops.stop(kfd->dqm);
>         kfd_iommu_suspend(kfd);
>  }
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index 55c9e1922714..84cbcd857856 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -1065,6 +1065,7 @@ void kfd_unref_process(struct kfd_process *p);
>  int kfd_process_evict_queues(struct kfd_process *p, bool force);
>  int kfd_process_restore_queues(struct kfd_process *p);
>  void kfd_suspend_all_processes(bool force);
> +void kfd_kill_all_user_processes(void);
>  /*
>   * kfd_resume_all_processes:
>   *     bool sync: If kfd_resume_all_processes() should wait for the
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index 6cdc855abb6d..fb0c753b682c 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -2206,6 +2206,24 @@ void kfd_suspend_all_processes(bool force)
>         srcu_read_unlock(&kfd_processes_srcu, idx);
>  }
>
> +
> +void kfd_kill_all_user_processes(void)
> +{
> +       struct kfd_process *p;
> +       struct amdkfd_process_info *p_info;
> +       unsigned int temp;
> +       int idx = srcu_read_lock(&kfd_processes_srcu);
> +
> +       pr_info("Killing all processes\n");
> +       hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
> +               p_info = p->kgd_process_info;
> +               pr_info("Killing  processes, pid = %d", 
> pid_nr(p_info->pid));
> +               kill_pid(p_info->pid, SIGBUS, 1);


From looking into kill_pid I see it only sends a signal but doesn't 
wait for completion; it would make sense to wait for completion here. In 
any case I would actually try to put

hash_for_each_rcu(p_info)
     cancel_delayed_work_sync(&p_info->restore_userptr_work)

here instead - at least that's what I meant in the previous mail.

Andrey

> +       }
> +       srcu_read_unlock(&kfd_processes_srcu, idx);
> +}
> +
> +
>  int kfd_resume_all_processes(bool sync)
>  {
>         struct kfd_process *p;
>
>
>> Andrey
>>
>>
>>>
>>> Really appreciate your help!
>>>
>>> Best,
>>> Shuotao
>>>
>>>>> 2. Remove redundant p2p/io links in sysfs when device is hotplugged
>>>>> out.
>>>>>
>>>>> 3. New kfd node_id is not properly assigned after a new device is
>>>>> added after a gpu is hotplugged out in a system. libhsakmt will
>>>>> find this anomaly, (i.e. node_from != <dev node id> in iolinks),
>>>>> when taking a topology_snapshot, thus returns fault to the rocm
>>>>> stack.
>>>>>
>>>>> -- This patch fixes issue 1; another patch by Mukul fixes issues 2&3.
>>>>> -- Tested on a 4-GPU MI100 node with kernel 5.13.0-kfd; kernel
>>>>> 5.16.0-kfd is unstable out of the box for MI100.
>>>>> ---
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 +++++
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 7 +++++++
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>>>>> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 13 +++++++++++++
>>>>> 4 files changed, 26 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> index c18c4be1e4ac..d50011bdb5c4 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> @@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device 
>>>>> *adev, bool run_pm)
>>>>> return r;
>>>>> }
>>>>>
>>>>> +int amdgpu_amdkfd_resume_processes(void)
>>>>> +{
>>>>> + return kgd2kfd_resume_processes();
>>>>> +}
>>>>> +
>>>>> int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
>>>>> {
>>>>> int r = 0;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> index f8b9f27adcf5..803306e011c3 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> @@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
>>>>> void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>>>>> int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
>>>>> int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>>>>> +int amdgpu_amdkfd_resume_processes(void);
>>>>> void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>>>> const void *ih_ring_entry);
>>>>> void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>>>>> @@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
>>>>> void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>>>>> int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
>>>>> int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>>>>> +int kgd2kfd_resume_processes(void);
>>>>> int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>>>> int kgd2kfd_post_reset(struct kfd_dev *kfd);
>>>>> void kgd2kfd_interrupt(struct kfd_dev *kfd, const void 
>>>>> *ih_ring_entry);
>>>>> @@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct 
>>>>> kfd_dev *kfd, bool run_pm)
>>>>> return 0;
>>>>> }
>>>>>
>>>>> +static inline int kgd2kfd_resume_processes(void)
>>>>> +{
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>>>> {
>>>>> return 0;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index fa4a9f13c922..5827b65b7489 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct 
>>>>> amdgpu_device *adev)
>>>>> if (drm_dev_is_unplugged(adev_to_drm(adev)))
>>>>> amdgpu_device_unmap_mmio(adev);
>>>>>
>>>>> + amdgpu_amdkfd_resume_processes();
>>>>> }
>>>>>
>>>>> void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> index 62aa6c9d5123..ef05aae9255e 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> @@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, bool 
>>>>> run_pm)
>>>>> return ret;
>>>>> }
>>>>>
>>>>> +/* for non-runtime resume only */
>>>>> +int kgd2kfd_resume_processes(void)
>>>>> +{
>>>>> + int count;
>>>>> +
>>>>> + count = atomic_dec_return(&kfd_locked);
>>>>> + WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>>>> + if (count == 0)
>>>>> + return kfd_resume_all_processes();
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>
>>>> It doesn't make sense to me to just increment kfd_locked in
>>>> kgd2kfd_suspend to only decrement it again a few functions down the
>>>> road.
>>>>
>>>> I suggest this instead - you only increment if not during PCI remove:
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> index 1c2cf3a33c1f..7754f77248a4 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> @@ -952,11 +952,12 @@ bool kfd_is_locked(void)
>>>>
>>>> void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>> {
>>>> +
>>>> if (!kfd->init_complete)
>>>> return;
>>>>
>>>> /* for runtime suspend, skip locking kfd */
>>>> - if (!run_pm) {
>>>> + if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>> /* For first KFD device suspend all the KFD processes */
>>>> if (atomic_inc_return(&kfd_locked) == 1)
>>>> kfd_suspend_all_processes();
>>>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>>
>>>>> +
>>>>> int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
>>>>> {
>>>>> int err = 0;
>

[-- Attachment #2: Type: text/html, Size: 65940 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-13 17:31           ` Andrey Grodzovsky
@ 2022-04-14 14:00             ` Shuotao Xu
  2022-04-14 14:24               ` Shuotao Xu
  2022-04-14 15:13               ` Andrey Grodzovsky
  0 siblings, 2 replies; 31+ messages in thread
From: Shuotao Xu @ 2022-04-14 14:00 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 24508 bytes --]



On Apr 14, 2022, at 1:31 AM, Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>> wrote:



On 2022-04-13 12:03, Shuotao Xu wrote:


On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>> wrote:


On 2022-04-08 21:28, Shuotao Xu wrote:

On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>> wrote:


On 2022-04-08 04:45, Shuotao Xu wrote:
Adding PCIe Hotplug Support for AMDKFD: the support of hot-plug of GPU
devices can open doors for many advanced applications in data center
in the next few years, such as for GPU resource
disaggregation. Current AMDKFD does not support hotplug out b/o the
following reasons:

1. During PCIe removal, decrement KFD lock which was incremented at
the beginning of hw fini; otherwise kfd_open later is going to
fail.
I assumed you read my comment last time, still you take the same approach.
More details below.
Aha, I like your fix:) I was not familiar with drm APIs so just only half understood your comment last time.

BTW, I tried hot-plugging out a GPU when rocm application is still running.
From dmesg, the application is still trying to access the removed kfd device, and is met with some errors.


The application is supposed to keep running; it holds the drm_device
reference as long as it has an open
FD to the device, and final cleanup will come only after the app dies,
thus releasing the FD and the last
drm_device reference.

The application would hang and not exit in this case.


Actually I tried kill -7 $pid, and the process exits. The dmesg has some warnings though.

[  711.769977] WARNING: CPU: 23 PID: 344 at .../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336 amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.770528] Modules linked in: amdgpu(OE) amdttm(OE) amd_sched(OE) amdkcl(OE) iommu_v2 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay binfmt_misc intel_rapl_msr i40iw intel_rapl_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel rpcrdma kvm sunrpc ipmi_ssif ib_umad ib_ipoib rdma_ucm irqbypass rapl joydev acpi_ipmi input_leds intel_cstate ipmi_si ipmi_devintf mei_me mei intel_pch_thermal ipmi_msghandler ioatdma mac_hid lpc_ich dca acpi_power_meter acpi_pad sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi pci_stub ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
[  711.779359]  raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core ast drm_vram_helper i2c_algo_bit drm_ttm_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sysfillrect uas hid_generic sysimgblt aesni_intel mlx5_core fb_sys_fops crypto_simd usbhid cryptd drm i40e pci_hyperv_intf usb_storage glue_helper mlxfw hid ahci libahci wmi
[  711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: G        W  OE     5.11.0+ #1
[  711.779755] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  711.779756] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[  711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.780141] Code: e8 b5 af 34 f4 e9 1f ff ff ff 48 39 c2 74 07 0f 0b e9 69 ff ff ff 4c 89 e7 e8 3c b4 16 00 e9 5c ff ff ff e8 a2 ce fd f3 eb cf <0f> 0b eb cb e8 d7 02 34 f4 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55
[  711.780143] RSP: 0018:ffffa8100dd67c30 EFLAGS: 00010282
[  711.780145] RAX: 00000000ffffffea RBX: ffff89980e792058 RCX: 0000000000000000
[  711.780147] RDX: 0000000000000000 RSI: ffff89a8f9ad8870 RDI: ffff89a8f9ad8870
[  711.780148] RBP: ffffa8100dd67c50 R08: 0000000000000000 R09: fffffffffff99b18
[  711.780149] R10: ffffa8100dd67bd0 R11: ffffa8100dd67908 R12: ffff89980e792000
[  711.780151] R13: ffff89980e792058 R14: ffff89980e7921bc R15: dead000000000100
[  711.780152] FS:  0000000000000000(0000) GS:ffff89a8f9ac0000(0000) knlGS:0000000000000000
[  711.780154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  711.780156] CR2: 00007ffddac6f71f CR3: 00000030bb80a003 CR4: 00000000007706e0
[  711.780157] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  711.780159] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  711.780160] PKRU: 55555554
[  711.780161] Call Trace:
[  711.780163]  ttm_bo_release+0x2ae/0x320 [amdttm]
[  711.780169]  amdttm_bo_put+0x30/0x40 [amdttm]
[  711.780357]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[  711.780543]  amdgpu_gem_object_free+0x8b/0x160 [amdgpu]
[  711.781119]  drm_gem_object_free+0x1d/0x30 [drm]
[  711.781489]  amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x34a/0x380 [amdgpu]
[  711.782044]  kfd_process_device_free_bos+0xe0/0x130 [amdgpu]
[  711.782611]  kfd_process_wq_release+0x286/0x380 [amdgpu]
[  711.783172]  process_one_work+0x236/0x420
[  711.783543]  worker_thread+0x34/0x400
[  711.783911]  ? process_one_work+0x420/0x420
[  711.784279]  kthread+0x126/0x140
[  711.784653]  ? kthread_park+0x90/0x90
[  711.785018]  ret_from_fork+0x22/0x30
[  711.785387] ---[ end trace d8f50f6594817c84 ]---
[  711.798716] [drm] amdgpu: ttm finalized


So it means the process was stuck in some wait_event_killable (maybe here in drm_sched_entity_flush) - you can try 'cat /proc/$process_pid/stack' before
you kill it to see where it was stuck, so we can go from there.



For graphics apps what I usually see is a crash because of a SIGSEGV
when the app tries to access
an unmapped MMIO region on the device. I haven't tested the compute
stack, so there might
be something I haven't covered. Hang could mean, for example, waiting on a
fence which is not being
signaled - please provide the full dmesg from this case.


Do you have any good suggestions on how to fix it down the line? (HIP runtime/libhsakmt or driver)

[64036.631333] amdgpu: amdgpu_vm_bo_update failed
[64036.631702] amdgpu: validate_invalid_user_pages: update PTE failed
[64036.640754] amdgpu: amdgpu_vm_bo_update failed
[64036.641120] amdgpu: validate_invalid_user_pages: update PTE failed
[64036.650394] amdgpu: amdgpu_vm_bo_update failed
[64036.650765] amdgpu: validate_invalid_user_pages: update PTE failed


The full dmesg is just the repetition of those two messages:
[186885.764079] amdgpu 0000:43:00.0: amdgpu: amdgpu: finishing device.
[186885.766916] [drm] free PSP TMR buffer
[186893.868173] amdgpu: amdgpu_vm_bo_update failed
[186893.868235] amdgpu: validate_invalid_user_pages: update PTE failed
[186893.876154] amdgpu: amdgpu_vm_bo_update failed
[186893.876190] amdgpu: validate_invalid_user_pages: update PTE failed
[186893.884150] amdgpu: amdgpu_vm_bo_update failed
[186893.884185] amdgpu: validate_invalid_user_pages: update PTE failed


This probably just means trying to update PTEs after the physical device
is gone - we usually avoid this by
doing all HW shutdowns early, before PCI remove completes, and, where
that is really tricky, by
protecting HW access sections with a drm_dev_enter/exit scope.
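The drm_dev_enter/exit idea can be illustrated with a small userspace model: an atomic "unplugged" flag guards every hardware-access section. This is only a sketch - the real DRM helpers also take an SRCU read lock so drm_dev_unplug() can wait for in-flight sections to drain, and none of the names below are the actual kernel API:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for the drm_dev_enter()/drm_dev_exit()
 * pattern: HW access is skipped once the device is marked gone. */
static atomic_bool dev_unplugged;

static bool dev_enter(void)
{
	/* The real helper also takes an SRCU read lock here. */
	return !atomic_load(&dev_unplugged);
}

static void dev_exit(void)
{
	/* The real helper drops the SRCU read lock here. */
}

/* A protected HW-access section: skipped once the device is gone. */
static int touch_hw_registers(void)
{
	if (!dev_enter())
		return -19;	/* -ENODEV: no MMIO after unplug */
	/* ... MMIO reads/writes would go here ... */
	dev_exit();
	return 0;
}
```

Once the unplug flag is set, every guarded section turns into an early -ENODEV return instead of a touch of dead MMIO.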

For this particular error it would be best to flush
info->restore_userptr_work before the end of
amdgpu_pci_remove (rejecting new process creation and calling
cancel_delayed_work_sync(&process_info->restore_userptr_work) for all
running processes).

I tried something like *kfd_process_ref_release* which I think did what you described, but it did not work.


I don't see how kfd_process_ref_release is the same as what I mentioned above; what I meant is calling the code above within kgd2kfd_suspend (where you tried to call kfd_kill_all_user_processes below).

Yes, you are right. It was not called by it.


Instead I tried to kill the process from the kernel, but the amdgpu could **only** be hot-plugged back in successfully if there was no ROCm kernel running when it was plugged out. If not, amdgpu_probe will just hang later. (Maybe because amdgpu was plugged out while in a running state, it left a bad HW state which causes probe to hang.)


We usually do an asic_reset during probe to reset all HW state (check if amdgpu_device_init->amdgpu_asic_reset is running when you plug it back in).

OK



I don’t know if this is a viable solution worth pursuing, but I attached the diff anyway.

Another solution could be to let the compute-stack user mode detect a topology change via a generation_count change, and abort gracefully there.
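That user-mode detection could look roughly like the sketch below: snapshot the KFD topology generation count, then treat any later change as "topology changed, abort gracefully". The sysfs node name and this polling scheme are assumptions for illustration, not behavior taken from the patch or from libhsakmt:

```c
#include <stdio.h>

/* Read a generation counter from a sysfs-style text file.
 * Returns the value, or -1 if the file is missing/unparsable. */
static long read_generation(const char *path)
{
	FILE *f = fopen(path, "r");
	long gen;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &gen) != 1)
		gen = -1;
	fclose(f);
	return gen;
}

/* Returns 1 if the topology changed (or became unreadable)
 * since the snapshot was taken, 0 otherwise. */
static int topology_changed(const char *path, long snapshot)
{
	long now = read_generation(path);

	return now < 0 || now != snapshot;
}
```

The runtime would call topology_changed() before dispatch and bail out with an error to the application instead of touching a device that is no longer in the topology.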

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 4e7d9cb09a69..79b4c9b84cd0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -697,12 +697,15 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
                return;

        /* for runtime suspend, skip locking kfd */
-       if (!run_pm) {
+       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
                /* For first KFD device suspend all the KFD processes */
                if (atomic_inc_return(&kfd_locked) == 1)
                        kfd_suspend_all_processes(force);
        }

+       if (drm_dev_is_unplugged(kfd->ddev))
+               kfd_kill_all_user_processes();
+
        kfd->dqm->ops.stop(kfd->dqm);
        kfd_iommu_suspend(kfd);
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 55c9e1922714..84cbcd857856 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1065,6 +1065,7 @@ void kfd_unref_process(struct kfd_process *p);
 int kfd_process_evict_queues(struct kfd_process *p, bool force);
 int kfd_process_restore_queues(struct kfd_process *p);
 void kfd_suspend_all_processes(bool force);
+void kfd_kill_all_user_processes(void);
 /*
  * kfd_resume_all_processes:
  *     bool sync: If kfd_resume_all_processes() should wait for the
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 6cdc855abb6d..fb0c753b682c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -2206,6 +2206,24 @@ void kfd_suspend_all_processes(bool force)
        srcu_read_unlock(&kfd_processes_srcu, idx);
 }

+
+void kfd_kill_all_user_processes(void)
+{
+       struct kfd_process *p;
+       struct amdkfd_process_info *p_info;
+       unsigned int temp;
+       int idx = srcu_read_lock(&kfd_processes_srcu);
+
+       pr_info("Killing all processes\n");
+       hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+               p_info = p->kgd_process_info;
+               pr_info("Killing  processes, pid = %d", pid_nr(p_info->pid));
+               kill_pid(p_info->pid, SIGBUS, 1);


From looking into kill_pid I see it only sends a signal but doesn't wait for completion; it would make sense to wait for completion here. In any case I would actually try to put here

I have made a version which does that with some atomic counters. Please read later in the diff.


hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes)
        cancel_delayed_work_sync(&p->kgd_process_info->restore_userptr_work);

instead, at least that is what I meant in the previous mail.

I actually tried that earlier, and it did not work. The application still keeps running, and you have to send a kill to the user process.

I have made the following version. It waits for processes to terminate synchronously after sending SIGBUS. After that it does the real work of amdgpu_pci_remove.
However, it hangs at amdgpu_device_ip_fini_early when it is trying to deinit ip_block 6 <sdma_v4_0> (https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L2818). I assume that there is still some in-flight DMA, which causes the fini of this IP block to hang?
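The atomic-counter handshake this version uses can be modeled in userspace: every "kill" bumps an in-flight counter, the release path decrements it, and the remover polls until it drains, mirroring kfd_inflight_kills and the msleep(10) loop in the diff below. A thread stands in for kfd_process_wq_release; all names are illustrative, not driver code:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

/* Count of teardowns that have been started but not yet finished. */
static atomic_int inflight;

static void sleep_ms(long ms)
{
	struct timespec ts = { 0, ms * 1000000L };

	nanosleep(&ts, NULL);
}

/* Stand-in for kfd_process_wq_release: finish teardown work, then
 * drop the in-flight count so the remover can make progress. */
static void *release_worker(void *arg)
{
	(void)arg;
	sleep_ms(10);
	atomic_fetch_sub(&inflight, 1);
	return NULL;
}

/* Stand-in for kfd_kill_all_user_processes: start n teardowns and
 * poll (like the msleep(10) loop) until all of them have completed. */
static int kill_and_drain(int n)
{
	pthread_t t[16];
	int i;

	if (n < 0 || n > 16)
		return -1;
	for (i = 0; i < n; i++) {
		atomic_fetch_add(&inflight, 1);
		if (pthread_create(&t[i], NULL, release_worker, NULL))
			return -1;
	}
	while (atomic_load(&inflight) > 0)
		sleep_ms(10);
	for (i = 0; i < n; i++)
		pthread_join(t[i], NULL);
	return 0;
}
```

The key property is that the counter is incremented *before* each teardown is kicked off, so the polling loop cannot observe zero until every release has actually run.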

The following is an excerpt of the dmesg; please excuse my own pr_info additions, but I hope they show where it hangs.

[  392.344735] amdgpu: all processes has been fully released
[  392.346557] amdgpu: amdgpu_acpi_fini done
[  392.346568] amdgpu 0000:b3:00.0: amdgpu: amdgpu: finishing device.
[  392.349238] amdgpu: amdgpu_device_ip_fini_early enter ip_blocks = 9
[  392.349248] amdgpu: Free mem_obj = 000000007bf54275, range_start = 14, range_end = 14
[  392.350299] amdgpu: Free mem_obj = 00000000a85bc878, range_start = 12, range_end = 12
[  392.350304] amdgpu: Free mem_obj = 00000000b8019e32, range_start = 13, range_end = 13
[  392.350308] amdgpu: Free mem_obj = 000000002d296168, range_start = 4, range_end = 11
[  392.350313] amdgpu: Free mem_obj = 000000001fc4f934, range_start = 0, range_end = 3
[  392.350322] amdgpu: amdgpu_amdkfd_suspend(adev, false) done
[  392.350672] amdgpu: hw_fini of IP block[8] <jpeg_v2_5> done 0
[  392.350679] amdgpu: hw_fini of IP block[7] <vcn_v2_5> done 0


diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 8fa9b86ac9d2..c0b27f722281 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -188,6 +188,12 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
  kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
 }

+void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev)
+{
+	if (adev->kfd.dev)
+		kgd2kfd_kill_all_user_processes(adev->kfd.dev);
+}
+
 void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
 {
  if (adev->kfd.dev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 27c74fcec455..f4e485d60442 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -141,6 +141,7 @@ struct amdkfd_process_info {
 int amdgpu_amdkfd_init(void);
 void amdgpu_amdkfd_fini(void);

+void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev);
 void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
 int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
 int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm, bool sync);
@@ -405,6 +406,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
  const struct kgd2kfd_shared_resources *gpu_resources);
 void kgd2kfd_device_exit(struct kfd_dev *kfd);
 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force);
+void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd);
 int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
 int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm, bool sync);
 int kgd2kfd_pre_reset(struct kfd_dev *kfd);
@@ -443,6 +445,9 @@ static inline void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
 {
 }

+static inline void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd)
+{
+}
+
 static int __maybe_unused kgd2kfd_resume_iommu(struct kfd_dev *kfd)
 {
  return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 3d5fc0751829..af6fe5080cfa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2101,6 +2101,9 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 {
  struct drm_device *dev = pci_get_drvdata(pdev);

+	/* kill all kfd processes before drm_dev_unplug */
+	amdgpu_amdkfd_kill_all_processes(drm_to_adev(dev));
+
 #ifdef HAVE_DRM_DEV_UNPLUG
  drm_dev_unplug(dev);
 #else
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 5504a18b5a45..480c23bef5e2 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -691,6 +691,12 @@ bool kfd_is_locked(void)
  return  (atomic_read(&kfd_locked) > 0);
 }

+void kgd2kfd_kill_all_user_processes(struct kfd_dev *dev)
+{
+	kfd_kill_all_user_processes();
+}
+
+
 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
 {
  if (!kfd->init_complete)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 55c9e1922714..a35a2cb5bb9f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1064,6 +1064,7 @@ static inline struct kfd_process_device *kfd_process_device_from_gpuidx(
 void kfd_unref_process(struct kfd_process *p);
 int kfd_process_evict_queues(struct kfd_process *p, bool force);
 int kfd_process_restore_queues(struct kfd_process *p);
+void kfd_kill_all_user_processes(void);
 void kfd_suspend_all_processes(bool force);
 /*
  * kfd_resume_all_processes:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 6cdc855abb6d..17e769e6951d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -46,6 +46,9 @@ struct mm_struct;
 #include "kfd_trace.h"
 #include "kfd_debug.h"

+static atomic_t kfd_process_locked = ATOMIC_INIT(0);
+static atomic_t kfd_inflight_kills = ATOMIC_INIT(0);
+
 /*
  * List of struct kfd_process (field kfd_process).
  * Unique/indexed by mm_struct*
@@ -802,6 +805,9 @@ struct kfd_process *kfd_create_process(struct task_struct *thread)
  struct kfd_process *process;
  int ret;

+	if (atomic_read(&kfd_process_locked) > 0)
+		return ERR_PTR(-EINVAL);
+
  if (!(thread->mm && mmget_not_zero(thread->mm)))
  return ERR_PTR(-EINVAL);

@@ -1126,6 +1132,10 @@ static void kfd_process_wq_release(struct work_struct *work)
  put_task_struct(p->lead_thread);

  kfree(p);
+
+	if (atomic_read(&kfd_process_locked) > 0)
+		atomic_dec(&kfd_inflight_kills);
 }

 static void kfd_process_ref_release(struct kref *ref)
@@ -2186,6 +2196,35 @@ static void restore_process_worker(struct work_struct *work)
  pr_err("Failed to restore queues of pasid 0x%x\n", p->pasid);
 }

+void kfd_kill_all_user_processes(void)
+{
+	struct kfd_process *p;
+	unsigned int temp;
+	int idx;
+
+	atomic_inc(&kfd_process_locked);
+
+	idx = srcu_read_lock(&kfd_processes_srcu);
+	pr_info("Killing all processes\n");
+	hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+		dev_warn(kfd_device,
+			 "Sending SIGBUS to process %d (pasid 0x%x)",
+			 p->lead_thread->pid, p->pasid);
+		send_sig(SIGBUS, p->lead_thread, 0);
+		atomic_inc(&kfd_inflight_kills);
+	}
+	srcu_read_unlock(&kfd_processes_srcu, idx);
+
+	while (atomic_read(&kfd_inflight_kills) > 0) {
+		dev_warn(kfd_device,
+			 "kfd_processes_table is not empty, going to sleep for 10ms\n");
+		msleep(10);
+	}
+
+	atomic_dec(&kfd_process_locked);
+	pr_info("all processes has been fully released");
+}
+
 void kfd_suspend_all_processes(bool force)
 {
  struct kfd_process *p;



Regards,
Shuotao


Andrey


+       }
+       srcu_read_unlock(&kfd_processes_srcu, idx);
+}
+
+
 int kfd_resume_all_processes(bool sync)
 {
        struct kfd_process *p;


Andrey



Really appreciate your help!

Best,
Shuotao

2. Remove redundant p2p/io links in sysfs when a device is hotplugged
out.

3. A new kfd node_id is not properly assigned after a new device is
added once a GPU has been hotplugged out of the system. libhsakmt
will detect this anomaly (i.e. node_from != <dev node id> in iolinks)
when taking a topology_snapshot, and thus returns a fault to the ROCm
stack.

-- This patch fixes issue 1; another patch by Mukul fixes issues 2&3.
-- Tested on a 4-GPU MI100 node with kernel 5.13.0-kfd; kernel
5.16.0-kfd is unstable out of the box for MI100.
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 +++++
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 7 +++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 13 +++++++++++++
4 files changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index c18c4be1e4ac..d50011bdb5c4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
return r;
}

+int amdgpu_amdkfd_resume_processes(void)
+{
+	return kgd2kfd_resume_processes();
+}
+
int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
{
int r = 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index f8b9f27adcf5..803306e011c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
+int amdgpu_amdkfd_resume_processes(void);
void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
const void *ih_ring_entry);
void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
@@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
+int kgd2kfd_resume_processes(void);
int kgd2kfd_pre_reset(struct kfd_dev *kfd);
int kgd2kfd_post_reset(struct kfd_dev *kfd);
void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
@@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
return 0;
}

+static inline int kgd2kfd_resume_processes(void)
+{
+	return 0;
+}
+
static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
{
return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index fa4a9f13c922..5827b65b7489 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
if (drm_dev_is_unplugged(adev_to_drm(adev)))
amdgpu_device_unmap_mmio(adev);

+	amdgpu_amdkfd_resume_processes();
}

void amdgpu_device_fini_sw(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 62aa6c9d5123..ef05aae9255e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
return ret;
}

+/* for non-runtime resume only */
+int kgd2kfd_resume_processes(void)
+{
+	int count;
+
+	count = atomic_dec_return(&kfd_locked);
+	WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
+	if (count == 0)
+		return kfd_resume_all_processes();
+
+	return 0;
+}

It doesn't make sense to me to just increment kfd_locked in
kgd2kfd_suspend only to decrement it again a few functions down the
road.

I suggest this instead - you only increment if not during PCI remove:

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 1c2cf3a33c1f..7754f77248a4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -952,11 +952,12 @@ bool kfd_is_locked(void)

void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
{
+
if (!kfd->init_complete)
return;

/* for runtime suspend, skip locking kfd */
- if (!run_pm) {
+ if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
/* For first KFD device suspend all the KFD processes */
if (atomic_inc_return(&kfd_locked) == 1)
kfd_suspend_all_processes();
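The pairing problem behind this suggestion can be seen in a small userspace model of the kfd_locked counter: suspend and resume must skip the counter under the same condition (runtime PM or unplug), otherwise resume drives the count negative and trips the WARN_ONCE. This is only an illustrative sketch of the counting logic, not driver code:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of the kfd_locked suspend/resume reference counter. */
static atomic_int kfd_locked_model;

/* Suspend side: only count "real" suspends, as in the diff above. */
static void suspend_model(bool run_pm, bool unplugged)
{
	if (!run_pm && !unplugged)
		atomic_fetch_add(&kfd_locked_model, 1);
}

/* Resume side: must skip under the SAME condition, or the counter
 * goes negative. Returns the counter value after this resume. */
static int resume_model(bool run_pm, bool unplugged)
{
	if (!run_pm && !unplugged)
		return atomic_fetch_sub(&kfd_locked_model, 1) - 1;
	return atomic_load(&kfd_locked_model);
}
```

With a matched condition on both sides, a suspend that skipped the increment is paired with a resume that skips the decrement, so the count stays balanced without needing a separate "resume processes" helper to undo it.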


Andrey



+
int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
{
int err = 0;



[-- Attachment #2: Type: text/html, Size: 77113 bytes --]

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-14 14:00             ` Shuotao Xu
@ 2022-04-14 14:24               ` Shuotao Xu
  2022-04-14 15:13               ` Andrey Grodzovsky
  1 sibling, 0 replies; 31+ messages in thread
From: Shuotao Xu @ 2022-04-14 14:24 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 24840 bytes --]



On Apr 14, 2022, at 10:00 PM, Shuotao Xu <shuotaoxu@microsoft.com<mailto:shuotaoxu@microsoft.com>> wrote:



On Apr 14, 2022, at 1:31 AM, Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>> wrote:



On 2022-04-13 12:03, Shuotao Xu wrote:


On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>> wrote:

On 2022-04-08 21:28, Shuotao Xu wrote:

On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>> wrote:

On 2022-04-08 04:45, Shuotao Xu wrote:
Adding PCIe Hotplug Support for AMDKFD: the support of hot-plug of GPU
devices can open doors for many advanced applications in the data
center in the next few years, such as GPU resource disaggregation.
Current AMDKFD does not support hotplug out because of the following
reasons:

1. During PCIe removal, decrement KFD lock which was incremented at
the beginning of hw fini; otherwise kfd_open later is going to
fail.
I assumed you read my comment last time; still, you take the same approach.
More details below.
Aha, I like your fix :) I was not familiar with the DRM APIs, so I only half understood your comment last time.

BTW, I tried hot-plugging out a GPU while a ROCm application was still running.
From dmesg, the application is still trying to access the removed kfd device, and is met with some errors.


The application is supposed to keep running; it holds the drm_device
reference as long as it has an open
FD to the device, and final cleanup will come only after the app dies,
thus releasing the FD and the last
drm_device reference.

The application would hang and not exit in this case.


Actually I tried kill -7 $pid, and the process exits. The dmesg has some warnings though.

[  711.769977] WARNING: CPU: 23 PID: 344 at .../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336 amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.770528] Modules linked in: amdgpu(OE) amdttm(OE) amd_sched(OE) amdkcl(OE) iommu_v2 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay binfmt_misc intel_rapl_msr i40iw intel_rapl_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel rpcrdma kvm sunrpc ipmi_ssif ib_umad ib_ipoib rdma_ucm irqbypass rapl joydev acpi_ipmi input_leds intel_cstate ipmi_si ipmi_devintf mei_me mei intel_pch_thermal ipmi_msghandler ioatdma mac_hid lpc_ich dca acpi_power_meter acpi_pad sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi pci_stub ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
[  711.779359]  raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core ast drm_vram_helper i2c_algo_bit drm_ttm_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sysfillrect uas hid_generic sysimgblt aesni_intel mlx5_core fb_sys_fops crypto_simd usbhid cryptd drm i40e pci_hyperv_intf usb_storage glue_helper mlxfw hid ahci libahci wmi
[  711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: G        W  OE     5.11.0+ #1
[  711.779755] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  711.779756] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[  711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
I have made the following version. It waits for processes to terminate synchronously after sending SIGBUS. After that it does the real work of amdgpu_pci_remove.
However, it hangs at amdgpu_device_ip_fini_early when it is trying to deinit ip_block 6 <sdma_v4_0> (https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L2818). I assume that there are still some inflight dma, therefore fini of this ip block thus hangs?

Hanging at ip_block 6<sdma_v4_0> seems to go away if I sleep for 5s after all user processes are killed. Not sure if this is a viable solution.
Please see below as to where I put the sleep.

The following is an excerpt of the dmesg: please excuse for putting my own pr_info, but I hope you get my point of where it hangs.

[  392.344735] amdgpu: all processes has been fully released
[  392.346557] amdgpu: amdgpu_acpi_fini done
[  392.346568] amdgpu 0000:b3:00.0: amdgpu: amdgpu: finishing device.
[  392.349238] amdgpu: amdgpu_device_ip_fini_early enter ip_blocks = 9
[  392.349248] amdgpu: Free mem_obj = 000000007bf54275, range_start = 14, range_end = 14
[  392.350299] amdgpu: Free mem_obj = 00000000a85bc878, range_start = 12, range_end = 12
[  392.350304] amdgpu: Free mem_obj = 00000000b8019e32, range_start = 13, range_end = 13
[  392.350308] amdgpu: Free mem_obj = 000000002d296168, range_start = 4, range_end = 11
[  392.350313] amdgpu: Free mem_obj = 000000001fc4f934, range_start = 0, range_end = 3
[  392.350322] amdgpu: amdgpu_amdkfd_suspend(adev, false) done
[  392.350672] amdgpu: hw_fini of IP block[8] <jpeg_v2_5> done 0
[  392.350679] amdgpu: hw_fini of IP block[7] <vcn_v2_5> done 0


diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 8fa9b86ac9d2..c0b27f722281 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -188,6 +188,12 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
  kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
 }

+void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev)
+{
+ if (adev->kfd.dev)
+ kgd2kfd_kill_all_user_processes(adev->kfd.dev);
+}
+
 void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
 {
  if (adev->kfd.dev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 27c74fcec455..f4e485d60442 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -141,6 +141,7 @@ struct amdkfd_process_info {
 int amdgpu_amdkfd_init(void);
 void amdgpu_amdkfd_fini(void);

+void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev);
 void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
 int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
 int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm, bool sync);
@@ -405,6 +406,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
   const struct kgd2kfd_shared_resources *gpu_resources);
 void kgd2kfd_device_exit(struct kfd_dev *kfd);
 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force);
+void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd);
 int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
 int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm, bool sync);
 int kgd2kfd_pre_reset(struct kfd_dev *kfd);
@@ -443,6 +445,9 @@ static inline void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
 {
 }

+static inline void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd){
+}
+
 static int __maybe_unused kgd2kfd_resume_iommu(struct kfd_dev *kfd)
 {
  return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 3d5fc0751829..af6fe5080cfa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2101,6 +2101,9 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 {
  struct drm_device *dev = pci_get_drvdata(pdev);

+ /* kill all kfd processes before drm_dev_unplug */
+ amdgpu_amdkfd_kill_all_processes(drm_to_adev(dev));
+ msleep(5000);
+
 #ifdef HAVE_DRM_DEV_UNPLUG
  drm_dev_unplug(dev);
 #else
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 5504a18b5a45..480c23bef5e2 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -691,6 +691,12 @@ bool kfd_is_locked(void)
  return  (atomic_read(&kfd_locked) > 0);
 }

+inline void kgd2kfd_kill_all_user_processes(struct kfd_dev* dev)
+{
+ kfd_kill_all_user_processes();
+}
+
+
 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
 {
  if (!kfd->init_complete)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 55c9e1922714..a35a2cb5bb9f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1064,6 +1064,7 @@ static inline struct kfd_process_device *kfd_process_device_from_gpuidx(
 void kfd_unref_process(struct kfd_process *p);
 int kfd_process_evict_queues(struct kfd_process *p, bool force);
 int kfd_process_restore_queues(struct kfd_process *p);
+void kfd_kill_all_user_processes(void);
 void kfd_suspend_all_processes(bool force);
 /*
  * kfd_resume_all_processes:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 6cdc855abb6d..17e769e6951d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -46,6 +46,9 @@ struct mm_struct;
 #include "kfd_trace.h"
 #include "kfd_debug.h"

+static atomic_t kfd_process_locked = ATOMIC_INIT(0);
+static atomic_t kfd_inflight_kills = ATOMIC_INIT(0);
+
 /*
  * List of struct kfd_process (field kfd_process).
  * Unique/indexed by mm_struct*
@@ -802,6 +805,9 @@ struct kfd_process *kfd_create_process(struct task_struct *thread)
  struct kfd_process *process;
  int ret;

+ if ( atomic_read(&kfd_process_locked) > 0 )
+ return ERR_PTR(-EINVAL);
+
  if (!(thread->mm && mmget_not_zero(thread->mm)))
  return ERR_PTR(-EINVAL);

@@ -1126,6 +1132,10 @@ static void kfd_process_wq_release(struct work_struct *work)
  put_task_struct(p->lead_thread);

  kfree(p);
+
+ if ( atomic_read(&kfd_process_locked) > 0 ){
+ atomic_dec(&kfd_inflight_kills);
+ }
 }

 static void kfd_process_ref_release(struct kref *ref)
@@ -2186,6 +2196,35 @@ static void restore_process_worker(struct work_struct *work)
  pr_err("Failed to restore queues of pasid 0x%x\n", p->pasid);
 }

+void kfd_kill_all_user_processes(void)
+{
+ struct kfd_process *p;
+ /* struct amdkfd_process_info *p_info; */
+ unsigned int temp;
+ int idx;
+ atomic_inc(&kfd_process_locked);
+
+ idx = srcu_read_lock(&kfd_processes_srcu);
+ pr_info("Killing all processes\n");
+ hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+ dev_warn(kfd_device,
+  "Sending SIGBUS to process %d (pasid 0x%x)",
+  p->lead_thread->pid, p->pasid);
+ send_sig(SIGBUS, p->lead_thread, 0);
+ atomic_inc(&kfd_inflight_kills);
+ }
+ srcu_read_unlock(&kfd_processes_srcu, idx);
+
+ while ( atomic_read(&kfd_inflight_kills) > 0 ){
+ dev_warn(kfd_device,
+  "kfd_processes_table is not empty, going to sleep for 10ms\n");
+ msleep(10);
+ }
+
+ atomic_dec(&kfd_process_locked);
+ pr_info("all processes has been fully released");
+}
+
 void kfd_suspend_all_processes(bool force)
 {
  struct kfd_process *p;
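
A note on the drain loop in kfd_kill_all_user_processes() above: polling an atomic counter with msleep(10) works, but the "signal everyone, then sleep until the last one drains" shape is usually written with a completion or wait queue so the waiter sleeps until woken instead of polling. A userspace pthreads analogy (hypothetical illustration, not kernel code; names like drain_all_workers are made up):

```c
#include <pthread.h>
#include <stddef.h>

/* Userspace analogy of the drain pattern: start N workers, then block
 * on a condition variable until the in-flight count reaches zero,
 * instead of polling an atomic counter with msleep(). */
#define NWORKERS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t all_done = PTHREAD_COND_INITIALIZER;
static int inflight;

static void *worker(void *arg)
{
	(void)arg;
	/* ... per-process cleanup would run here ... */
	pthread_mutex_lock(&lock);
	if (--inflight == 0)
		pthread_cond_signal(&all_done); /* last one wakes the waiter */
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Returns the final in-flight count (0 once every worker has drained). */
int drain_all_workers(void)
{
	pthread_t threads[NWORKERS];
	int i, remaining;

	inflight = NWORKERS;
	for (i = 0; i < NWORKERS; i++)
		pthread_create(&threads[i], NULL, worker, NULL);

	pthread_mutex_lock(&lock);
	while (inflight > 0)            /* no busy-wait, no msleep() */
		pthread_cond_wait(&all_done, &lock);
	remaining = inflight;
	pthread_mutex_unlock(&lock);

	for (i = 0; i < NWORKERS; i++)
		pthread_join(threads[i], NULL);
	return remaining;
}
```

In the kernel the waiter would be wait_for_completion() or wait_event(), woken from kfd_process_wq_release(); the point is only that the waiter is notified rather than waking up every 10 ms to re-check.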



Regards,
Shuotao



Andrey


+       }
+       srcu_read_unlock(&kfd_processes_srcu, idx);
+}
+
+
 int kfd_resume_all_processes(bool sync)
 {
        struct kfd_process *p;


Andrey



Really appreciate your help!

Best,
Shuotao

2. Remove redundant p2p/io links in sysfs when device is hotplugged
out.

3. A new kfd node_id is not properly assigned after a new device is
added following a GPU hot-plug-out. libhsakmt will detect this
anomaly (i.e. node_from != <dev node id> in iolinks) when taking a
topology_snapshot, and thus returns a fault to the ROCm stack.

-- This patch fixes issue 1; another patch by Mukul fixes issues 2 & 3.
-- Tested on a 4-GPU MI100 node with kernel 5.13.0-kfd; kernel
5.16.0-kfd is unstable out of the box for MI100.
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 +++++
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 7 +++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 13 +++++++++++++
4 files changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index c18c4be1e4ac..d50011bdb5c4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
return r;
}

+int amdgpu_amdkfd_resume_processes(void)
+{
+ return kgd2kfd_resume_processes();
+}
+
int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
{
int r = 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index f8b9f27adcf5..803306e011c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
+int amdgpu_amdkfd_resume_processes(void);
void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
const void *ih_ring_entry);
void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
@@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
+int kgd2kfd_resume_processes(void);
int kgd2kfd_pre_reset(struct kfd_dev *kfd);
int kgd2kfd_post_reset(struct kfd_dev *kfd);
void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
@@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
return 0;
}

+static inline int kgd2kfd_resume_processes(void)
+{
+ return 0;
+}
+
static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
{
return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index fa4a9f13c922..5827b65b7489 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
if (drm_dev_is_unplugged(adev_to_drm(adev)))
amdgpu_device_unmap_mmio(adev);

+ amdgpu_amdkfd_resume_processes();
}

void amdgpu_device_fini_sw(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 62aa6c9d5123..ef05aae9255e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
return ret;
}

+/* for non-runtime resume only */
+int kgd2kfd_resume_processes(void)
+{
+ int count;
+
+ count = atomic_dec_return(&kfd_locked);
+ WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
+ if (count == 0)
+ return kfd_resume_all_processes();
+
+ return 0;
+}

It doesn't make sense to me to just increment kfd_locked in
kgd2kfd_suspend to only decrement it again a few functions down the
road.

I suggest this instead - you only increment if not during PCI remove:

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 1c2cf3a33c1f..7754f77248a4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -952,11 +952,12 @@ bool kfd_is_locked(void)

void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
{
+
if (!kfd->init_complete)
return;

/* for runtime suspend, skip locking kfd */
- if (!run_pm) {
+ if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
/* For first KFD device suspend all the KFD processes */
if (atomic_inc_return(&kfd_locked) == 1)
kfd_suspend_all_processes();


Andrey



+
int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
{
int err = 0;


[-- Attachment #2: Type: text/html, Size: 82130 bytes --]

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-14 14:00             ` Shuotao Xu
  2022-04-14 14:24               ` Shuotao Xu
@ 2022-04-14 15:13               ` Andrey Grodzovsky
  2022-04-15 10:12                 ` Shuotao Xu
  1 sibling, 1 reply; 31+ messages in thread
From: Andrey Grodzovsky @ 2022-04-14 15:13 UTC (permalink / raw)
  To: Shuotao Xu
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 28335 bytes --]


On 2022-04-14 10:00, Shuotao Xu wrote:
>
>
>> On Apr 14, 2022, at 1:31 AM, Andrey Grodzovsky 
>> <andrey.grodzovsky@amd.com> wrote:
>>
>>
>> On 2022-04-13 12:03, Shuotao Xu wrote:
>>>
>>>
>>>> On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky 
>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>
>>>>
>>>> On 2022-04-08 21:28, Shuotao Xu wrote:
>>>>>
>>>>>> On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky 
>>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>>
>>>>>>
>>>>>> On 2022-04-08 04:45, Shuotao Xu wrote:
>>>>>>> Adding PCIe Hotplug Support for AMDKFD: the support of hot-plug 
>>>>>>> of GPU
>>>>>>> devices can open doors for many advanced applications in data center
>>>>>>> in the next few years, such as for GPU resource
>>>>>>> disaggregation. Current AMDKFD does not support hotplug out b/o the
>>>>>>> following reasons:
>>>>>>>
>>>>>>> 1. During PCIe removal, decrement KFD lock which was incremented at
>>>>>>> the beginning of hw fini; otherwise kfd_open later is going to
>>>>>>> fail.
>>>>>> I assumed you read my comment last time, yet you still take the same
>>>>>> approach. More details below.
>>>>> Aha, I like your fix:) I was not familiar with drm APIs so just 
>>>>> only half understood your comment last time.
>>>>>
>>>>> BTW, I tried hot-plugging out a GPU when rocm application is still 
>>>>> running.
>>>>> From dmesg, application is still trying to access the removed kfd 
>>>>> device, and are met with some errors.
>>>>
>>>>
>>>> The application is supposed to keep running; it holds the drm_device
>>>> reference as long as it has an open
>>>> FD to the device, and final cleanup will come only after the app
>>>> dies, thus releasing the FD and the last
>>>> drm_device reference.
>>>>
>>>>> Application would hang and not exiting in this case.
>>>>
>>>
>>> Actually I tried kill -7 $pid, and the process exits. The dmesg has
>>> some warnings though.
>>>
>>> [  711.769977] WARNING: CPU: 23 PID: 344 at 
>>> .../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336 
>>> amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
>>> [  711.770528] Modules linked in: amdgpu(OE) amdttm(OE) 
>>> amd_sched(OE) amdkcl(OE) iommu_v2 nf_conntrack_netlink nfnetlink 
>>> xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM 
>>> iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack 
>>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 
>>> xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter 
>>> ip6_tables iptable_filter overlay binfmt_misc intel_rapl_msr i40iw 
>>> intel_rapl_common skx_edac nfit x86_pkg_temp_thermal 
>>> intel_powerclamp coretemp kvm_intel rpcrdma kvm sunrpc ipmi_ssif 
>>> ib_umad ib_ipoib rdma_ucm irqbypass rapl joydev acpi_ipmi input_leds 
>>> intel_cstate ipmi_si ipmi_devintf mei_me mei intel_pch_thermal 
>>> ipmi_msghandler ioatdma mac_hid lpc_ich dca acpi_power_meter 
>>> acpi_pad sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp 
>>> libiscsi_tcp libiscsi scsi_transport_iscsi pci_stub ip_tables 
>>> x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 
>>> async_raid6_recov async_memcpy async_pq async_xor async_tx xor
>>> [  711.779359]  raid6_pq libcrc32c raid1 raid0 multipath linear 
>>> mlx5_ib ib_uverbs ib_core ast drm_vram_helper i2c_algo_bit 
>>> drm_ttm_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul 
>>> crc32_pclmul ghash_clmulni_intel sysfillrect uas hid_generic 
>>> sysimgblt aesni_intel mlx5_core fb_sys_fops crypto_simd usbhid 
>>> cryptd drm i40e pci_hyperv_intf usb_storage glue_helper mlxfw hid 
>>> ahci libahci wmi
>>> [  711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: G       
>>>  W  OE     5.11.0+ #1
>>> [  711.779755] Hardware name: Supermicro 
>>> SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
>>> [  711.779756] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
>>> [  711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
>>> [  711.780141] Code: e8 b5 af 34 f4 e9 1f ff ff ff 48 39 c2 74 07 0f 
>>> 0b e9 69 ff ff ff 4c 89 e7 e8 3c b4 16 00 e9 5c ff ff ff e8 a2 ce fd 
>>> f3 eb cf <0f> 0b eb cb e8 d7 02 34 f4 0f 1f 80 00 00 00 00 0f 1f 44 
>>> 00 00 55
>>> [  711.780143] RSP: 0018:ffffa8100dd67c30 EFLAGS: 00010282
>>> [  711.780145] RAX: 00000000ffffffea RBX: ffff89980e792058 RCX: 
>>> 0000000000000000
>>> [  711.780147] RDX: 0000000000000000 RSI: ffff89a8f9ad8870 RDI: 
>>> ffff89a8f9ad8870
>>> [  711.780148] RBP: ffffa8100dd67c50 R08: 0000000000000000 R09: 
>>> fffffffffff99b18
>>> [  711.780149] R10: ffffa8100dd67bd0 R11: ffffa8100dd67908 R12: 
>>> ffff89980e792000
>>> [  711.780151] R13: ffff89980e792058 R14: ffff89980e7921bc R15: 
>>> dead000000000100
>>> [  711.780152] FS:  0000000000000000(0000) GS:ffff89a8f9ac0000(0000) 
>>> knlGS:0000000000000000
>>> [  711.780154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  711.780156] CR2: 00007ffddac6f71f CR3: 00000030bb80a003 CR4: 
>>> 00000000007706e0
>>> [  711.780157] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
>>> 0000000000000000
>>> [  711.780159] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
>>> 0000000000000400
>>> [  711.780160] PKRU: 55555554
>>> [  711.780161] Call Trace:
>>> [  711.780163]  ttm_bo_release+0x2ae/0x320 [amdttm]
>>> [  711.780169]  amdttm_bo_put+0x30/0x40 [amdttm]
>>> [  711.780357]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
>>> [  711.780543]  amdgpu_gem_object_free+0x8b/0x160 [amdgpu]
>>> [  711.781119]  drm_gem_object_free+0x1d/0x30 [drm]
>>> [  711.781489]  amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x34a/0x380 
>>> [amdgpu]
>>> [  711.782044]  kfd_process_device_free_bos+0xe0/0x130 [amdgpu]
>>> [  711.782611]  kfd_process_wq_release+0x286/0x380 [amdgpu]
>>> [  711.783172]  process_one_work+0x236/0x420
>>> [  711.783543]  worker_thread+0x34/0x400
>>> [  711.783911]  ? process_one_work+0x420/0x420
>>> [  711.784279]  kthread+0x126/0x140
>>> [  711.784653]  ? kthread_park+0x90/0x90
>>> [  711.785018]  ret_from_fork+0x22/0x30
>>> [  711.785387] ---[ end trace d8f50f6594817c84 ]---
>>> [  711.798716] [drm] amdgpu: ttm finalized
>>
>>
>> So it means the process was stuck in some wait_event_killable (maybe
>> here drm_sched_entity_flush) - you can try
>> 'cat /proc/$process_pid/stack' before
>> you kill it to see where it was stuck, so we can go from there.
>>
>>
>>>
>>>>
>>>> For graphics apps what I usually see is a crash because of SIGSEGV when
>>>> the app tries to access
>>>> an unmapped MMIO region on the device. I haven't tested the compute
>>>> stack, so there might
>>>> be something I haven't covered. A hang could mean, for example, waiting
>>>> on a
>>>> fence which is not being
>>>> signaled - please provide the full dmesg from this case.
>>>>
>>>>>
>>>>> Do you have any good suggestions on how to fix it down the line? 
>>>>> (HIP runtime/libhsakmt or driver)
>>>>>
>>>>> [64036.631333] amdgpu: amdgpu_vm_bo_update failed
>>>>> [64036.631702] amdgpu: validate_invalid_user_pages: update PTE failed
>>>>> [64036.640754] amdgpu: amdgpu_vm_bo_update failed
>>>>> [64036.641120] amdgpu: validate_invalid_user_pages: update PTE failed
>>>>> [64036.650394] amdgpu: amdgpu_vm_bo_update failed
>>>>> [64036.650765] amdgpu: validate_invalid_user_pages: update PTE failed
>>>>
>>>
>>> The full dmesg is just the repetition of those two messages:
>>> [186885.764079] amdgpu 0000:43:00.0: amdgpu: amdgpu: finishing device.
>>> [186885.766916] [drm] free PSP TMR buffer
>>> [186893.868173] amdgpu: amdgpu_vm_bo_update failed
>>> [186893.868235] amdgpu: validate_invalid_user_pages: update PTE failed
>>> [186893.876154] amdgpu: amdgpu_vm_bo_update failed
>>> [186893.876190] amdgpu: validate_invalid_user_pages: update PTE failed
>>> [186893.884150] amdgpu: amdgpu_vm_bo_update failed
>>> [186893.884185] amdgpu: validate_invalid_user_pages: update PTE failed
>>>
>>>>
>>>> This probably just means trying to update PTEs after the physical
>>>> device
>>>> is gone - we usually avoid this by
>>>> first trying to do all HW shutdowns early, before PCI remove completes,
>>>> and when that's really tricky, by
>>>> protecting HW access sections with a drm_dev_enter/exit scope.
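
For reference, the drm_dev_enter/exit scoping mentioned above looks roughly like this (a minimal sketch; the -ENODEV return and the body are placeholders, not code from this thread):

```c
int idx;

/* drm_dev_enter() returns false once drm_dev_unplug() has run,
 * so the HW is never touched after the device is gone. */
if (!drm_dev_enter(adev_to_drm(adev), &idx))
	return -ENODEV;

/* ... MMIO access / PTE updates go here ... */

drm_dev_exit(idx);
```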
>>>>
>>>> For this particular error it would be the best to flush
>>>> info->restore_userptr_work before the end of
>>>> amdgpu_pci_remove (rejecting new process creation and calling
>>>> cancel_delayed_work_sync(&process_info->restore_userptr_work) for all
>>>> running processes)
>>>> somewhere in amdgpu_pci_remove.
>>>>
>>> I tried something like *kfd_process_ref_release* which I think did 
>>> what you described, but it did not work.
>>
>>
>> I don't see how kfd_process_ref_release is the same as I mentioned 
>> above, what i meant is calling the code above within kgd2kfd_suspend 
>> (where you tried to call kfd_kill_all_user_processes bellow)
>>
> Yes, you are right. It was not called by it.
>>
>>
>>>
>>> Instead I tried to kill the process from the kernel, but the amdgpu
>>> could **only** be hot-plugged back in successfully if there was
>>> no ROCm kernel running when it was plugged out. Otherwise, amdgpu_probe
>>> will just hang later. (Maybe because amdgpu was plugged out in a
>>> running state, it left bad HW state which causes probe to hang.)
>>
>>
>> We usually do asic_reset during probe to reset all HW state (check
>> if amdgpu_device_init->amdgpu_asic_reset is running when you plug
>> back).
>>
> OK
>>
>>
>>>
>>> I don’t know if this is a viable solution worth pursuing, but I 
>>> attached the diff anyway.
>>>
>>> Another solution could be let compute stack user mode detect a 
>>> topology change via generation_count change, and abort gracefully there.
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index 4e7d9cb09a69..79b4c9b84cd0 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -697,12 +697,15 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool 
>>> run_pm, bool force)
>>>                 return;
>>>
>>>         /* for runtime suspend, skip locking kfd */
>>> -       if (!run_pm) {
>>> +       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>                 /* For first KFD device suspend all the KFD processes */
>>>                 if (atomic_inc_return(&kfd_locked) == 1)
>>> kfd_suspend_all_processes(force);
>>>         }
>>>
>>> +       if (drm_dev_is_unplugged(kfd->ddev))
>>> + kfd_kill_all_user_processes();
>>> +
>>> kfd->dqm->ops.stop(kfd->dqm);
>>>         kfd_iommu_suspend(kfd);
>>>  }
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> index 55c9e1922714..84cbcd857856 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> @@ -1065,6 +1065,7 @@ void kfd_unref_process(struct kfd_process *p);
>>>  int kfd_process_evict_queues(struct kfd_process *p, bool force);
>>>  int kfd_process_restore_queues(struct kfd_process *p);
>>>  void kfd_suspend_all_processes(bool force);
>>> +void kfd_kill_all_user_processes(void);
>>>  /*
>>>   * kfd_resume_all_processes:
>>>   *     bool sync: If kfd_resume_all_processes() should wait for the
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> index 6cdc855abb6d..fb0c753b682c 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> @@ -2206,6 +2206,24 @@ void kfd_suspend_all_processes(bool force)
>>> srcu_read_unlock(&kfd_processes_srcu, idx);
>>>  }
>>>
>>> +
>>> +void kfd_kill_all_user_processes(void)
>>> +{
>>> +       struct kfd_process *p;
>>> +       struct amdkfd_process_info *p_info;
>>> +       unsigned int temp;
>>> +       int idx = srcu_read_lock(&kfd_processes_srcu);
>>> +
>>> +       pr_info("Killing all processes\n");
>>> + hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>>> +               p_info = p->kgd_process_info;
>>> +               pr_info("Killing  processes, pid = %d", 
>>> pid_nr(p_info->pid));
>>> +               kill_pid(p_info->pid, SIGBUS, 1);
>>
>>
>> From looking into kill_pid I see it only sends a signal but doesn't 
>> wait for completion, it would make sense to wait for completion here. 
>> In any case I would actually try to put here
>>
> I have made a version which does that with some atomic counters. 
> Please read later in the diff.
>>
>>
>> hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes)
>>     cancel_delayed_work_sync(&p->kgd_process_info->restore_userptr_work);
>>
>> instead; at least that's what I meant in the previous mail.
>>
> I actually tried that earlier, and it did not work. The application
> keeps running, and you have to send a kill signal to the user process.
>
> I have made the following version. It waits for processes to terminate 
> synchronously after sending SIGBUS. After that it does the real work 
> of amdgpu_pci_remove.
> However, it hangs at amdgpu_device_ip_fini_early when it is trying to
> deinit ip_block 6 <sdma_v4_0>
> (https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L2818).
> I assume there is still some in-flight DMA, which makes the fini of
> this IP block hang?
>
> The following is an excerpt of the dmesg; please excuse my own
> pr_info calls, but I hope they show where it hangs.
>
> [  392.344735] amdgpu: all processes has been fully released
> [  392.346557] amdgpu: amdgpu_acpi_fini done
> [  392.346568] amdgpu 0000:b3:00.0: amdgpu: amdgpu: finishing device.
> [  392.349238] amdgpu: amdgpu_device_ip_fini_early enter ip_blocks = 9
> [  392.349248] amdgpu: Free mem_obj = 000000007bf54275, range_start = 
> 14, range_end = 14
> [  392.350299] amdgpu: Free mem_obj = 00000000a85bc878, range_start = 
> 12, range_end = 12
> [  392.350304] amdgpu: Free mem_obj = 00000000b8019e32, range_start = 
> 13, range_end = 13
> [  392.350308] amdgpu: Free mem_obj = 000000002d296168, range_start = 
> 4, range_end = 11
> [  392.350313] amdgpu: Free mem_obj = 000000001fc4f934, range_start = 
> 0, range_end = 3
> [  392.350322] amdgpu: amdgpu_amdkfd_suspend(adev, false) done
> [  392.350672] amdgpu: hw_fini of IP block[8] <jpeg_v2_5> done 0
> [  392.350679] amdgpu: hw_fini of IP block[7] <vcn_v2_5> done 0


I just remembered that the idea of actively killing and waiting for
running user processes during unplug was rejected
as a bad idea in the first iteration of the unplug work I did (I don't
remember why now, need to look), so this is a no-go.
Our policy is to let zombie processes (zombie in the sense that the
underlying device is gone) live as long as they want
(as long as you are able to terminate them - which you can, so that's OK),
and the system should finish PCI remove gracefully and be able to
hot-plug the device back.  Hence, I suggest dropping
this direction of forcing all user processes to be killed; confirm you
have a graceful shutdown and removal of the device
from the PCI topology, and then concentrate on why it hangs when you
plug back. First, confirm whether an ASIC reset happens on the
next init. Second, please confirm whether the timing of manually killing
the user process has an impact on whether you get a hang
on the next plug-back (does it make a difference if you kill before or
after plugging back?).

Andrey


>
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 8fa9b86ac9d2..c0b27f722281 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -188,6 +188,12 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device 
> *adev,
> kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
>  }
> +void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev)
> +{
> +if (adev->kfd.dev)
> +kgd2kfd_kill_all_user_processes(adev->kfd.dev);
> +}
> +
>  void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
>  {
> if (adev->kfd.dev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 27c74fcec455..f4e485d60442 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -141,6 +141,7 @@ struct amdkfd_process_info {
>  int amdgpu_amdkfd_init(void);
>  void amdgpu_amdkfd_fini(void);
> +void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev);
>  void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>  int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
>  int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm, 
> bool sync);
> @@ -405,6 +406,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
> const struct kgd2kfd_shared_resources *gpu_resources);
>  void kgd2kfd_device_exit(struct kfd_dev *kfd);
>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force);
> +void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd);
>  int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
>  int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm, bool sync);
>  int kgd2kfd_pre_reset(struct kfd_dev *kfd);
> @@ -443,6 +445,9 @@ static inline void kgd2kfd_suspend(struct kfd_dev 
> *kfd, bool run_pm, bool force)
>  {
>  }
> +static inline void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd){
> +}
> +
>  static int __maybe_unused kgd2kfd_resume_iommu(struct kfd_dev *kfd)
>  {
> return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 3d5fc0751829..af6fe5080cfa 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2101,6 +2101,9 @@ amdgpu_pci_remove(struct pci_dev *pdev)
>  {
> struct drm_device *dev = pci_get_drvdata(pdev);
> +/* kill all kfd processes before drm_dev_unplug */
> +amdgpu_amdkfd_kill_all_processes(drm_to_adev(dev));
> +
>  #ifdef HAVE_DRM_DEV_UNPLUG
> drm_dev_unplug(dev);
>  #else
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 5504a18b5a45..480c23bef5e2 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -691,6 +691,12 @@ bool kfd_is_locked(void)
> return  (atomic_read(&kfd_locked) > 0);
>  }
> +inline void kgd2kfd_kill_all_user_processes(struct kfd_dev* dev)
> +{
> +kfd_kill_all_user_processes();
> +}
> +
> +
>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
>  {
> if (!kfd->init_complete)
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index 55c9e1922714..a35a2cb5bb9f 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -1064,6 +1064,7 @@ static inline struct kfd_process_device 
> *kfd_process_device_from_gpuidx(
>  void kfd_unref_process(struct kfd_process *p);
>  int kfd_process_evict_queues(struct kfd_process *p, bool force);
>  int kfd_process_restore_queues(struct kfd_process *p);
> +void kfd_kill_all_user_processes(void);
>  void kfd_suspend_all_processes(bool force);
>  /*
>   * kfd_resume_all_processes:
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index 6cdc855abb6d..17e769e6951d 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -46,6 +46,9 @@ struct mm_struct;
>  #include "kfd_trace.h"
>  #include "kfd_debug.h"
> +static atomic_t kfd_process_locked = ATOMIC_INIT(0);
> +static atomic_t kfd_inflight_kills = ATOMIC_INIT(0);
> +
>  /*
>   * List of struct kfd_process (field kfd_process).
>   * Unique/indexed by mm_struct*
> @@ -802,6 +805,9 @@ struct kfd_process *kfd_create_process(struct task_struct *thread)
>  	struct kfd_process *process;
>  	int ret;
> +	if (atomic_read(&kfd_process_locked) > 0)
> +		return ERR_PTR(-EINVAL);
> +
>  	if (!(thread->mm && mmget_not_zero(thread->mm)))
>  		return ERR_PTR(-EINVAL);
> @@ -1126,6 +1132,10 @@ static void kfd_process_wq_release(struct work_struct *work)
>  	put_task_struct(p->lead_thread);
>  	kfree(p);
> +
> +	if (atomic_read(&kfd_process_locked) > 0)
> +		atomic_dec(&kfd_inflight_kills);
>  }
>  static void kfd_process_ref_release(struct kref *ref)
> @@ -2186,6 +2196,35 @@ static void restore_process_worker(struct work_struct *work)
>  	pr_err("Failed to restore queues of pasid 0x%x\n", p->pasid);
>  }
> +void kfd_kill_all_user_processes(void)
> +{
> +	struct kfd_process *p;
> +	unsigned int temp;
> +	int idx;
> +
> +	atomic_inc(&kfd_process_locked);
> +
> +	idx = srcu_read_lock(&kfd_processes_srcu);
> +	pr_info("Killing all processes\n");
> +	hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
> +		dev_warn(kfd_device,
> +			 "Sending SIGBUS to process %d (pasid 0x%x)",
> +			 p->lead_thread->pid, p->pasid);
> +		send_sig(SIGBUS, p->lead_thread, 0);
> +		atomic_inc(&kfd_inflight_kills);
> +	}
> +	srcu_read_unlock(&kfd_processes_srcu, idx);
> +
> +	while (atomic_read(&kfd_inflight_kills) > 0) {
> +		dev_warn(kfd_device,
> +			 "kfd_processes_table is not empty, going to sleep for 10ms\n");
> +		msleep(10);
> +	}
> +
> +	atomic_dec(&kfd_process_locked);
> +	pr_info("all processes have been fully released\n");
> +}
> +
> +
>  void kfd_suspend_all_processes(bool force)
>  {
> struct kfd_process *p;
>
>
>
> Regards,
> Shuotao
>
>> Andrey
>>
>>> +       }
>>> + srcu_read_unlock(&kfd_processes_srcu, idx);
>>> +}
>>> +
>>> +
>>>  int kfd_resume_all_processes(bool sync)
>>>  {
>>>         struct kfd_process *p;
>>>
>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>> Really appreciate your help!
>>>>>
>>>>> Best,
>>>>> Shuotao
>>>>>
>>>>>>> 2. Remove redundant p2p/io links in sysfs when device is hotplugged
>>>>>>> out.
>>>>>>>
>>>>>>> 3. New kfd node_id is not properly assigned after a new device is
>>>>>>> added after a gpu is hotplugged out in a system. libhsakmt will
>>>>>>> find this anomaly, (i.e. node_from != <dev node id> in iolinks),
>>>>>>> when taking a topology_snapshot, thus returns fault to the rocm
>>>>>>> stack.
>>>>>>>
>>>>>>> -- This patch fixes issue 1; another patch by Mukul fixes issues 
>>>>>>> 2&3.
>>>>>>> -- Tested on a 4-GPU MI100 node with kernel 5.13.0-kfd; kernel
>>>>>>> 5.16.0-kfd is unstable out of the box for MI100.
>>>>>>> ---
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 +++++
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 7 +++++++
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 13 +++++++++++++
>>>>>>> 4 files changed, 26 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>> index c18c4be1e4ac..d50011bdb5c4 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>> @@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct 
>>>>>>> amdgpu_device *adev, bool run_pm)
>>>>>>> return r;
>>>>>>> }
>>>>>>>
>>>>>>> +int amdgpu_amdkfd_resume_processes(void)
>>>>>>> +{
>>>>>>> + return kgd2kfd_resume_processes();
>>>>>>> +}
>>>>>>> +
>>>>>>> int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
>>>>>>> {
>>>>>>> int r = 0;
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>> index f8b9f27adcf5..803306e011c3 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>> @@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
>>>>>>> void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>>>>>>> int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
>>>>>>> int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>>>>>>> +int amdgpu_amdkfd_resume_processes(void);
>>>>>>> void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>>>>>> const void *ih_ring_entry);
>>>>>>> void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>>>>>>> @@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
>>>>>>> void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>>>>>>> int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
>>>>>>> int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>>>>>>> +int kgd2kfd_resume_processes(void);
>>>>>>> int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>>>>>> int kgd2kfd_post_reset(struct kfd_dev *kfd);
>>>>>>> void kgd2kfd_interrupt(struct kfd_dev *kfd, const void 
>>>>>>> *ih_ring_entry);
>>>>>>> @@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct 
>>>>>>> kfd_dev *kfd, bool run_pm)
>>>>>>> return 0;
>>>>>>> }
>>>>>>>
>>>>>>> +static inline int kgd2kfd_resume_processes(void)
>>>>>>> +{
>>>>>>> + return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>>>>>> {
>>>>>>> return 0;
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index fa4a9f13c922..5827b65b7489 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct 
>>>>>>> amdgpu_device *adev)
>>>>>>> if (drm_dev_is_unplugged(adev_to_drm(adev)))
>>>>>>> amdgpu_device_unmap_mmio(adev);
>>>>>>>
>>>>>>> + amdgpu_amdkfd_resume_processes();
>>>>>>> }
>>>>>>>
>>>>>>> void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> index 62aa6c9d5123..ef05aae9255e 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> @@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, 
>>>>>>> bool run_pm)
>>>>>>> return ret;
>>>>>>> }
>>>>>>>
>>>>>>> +/* for non-runtime resume only */
>>>>>>> +int kgd2kfd_resume_processes(void)
>>>>>>> +{
>>>>>>> + int count;
>>>>>>> +
>>>>>>> + count = atomic_dec_return(&kfd_locked);
>>>>>>> + WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>>>>>> + if (count == 0)
>>>>>>> + return kfd_resume_all_processes();
>>>>>>> +
>>>>>>> + return 0;
>>>>>>> +}
>>>>>>
>>>>>> It doesn't make sense to me to just increment kfd_locked in
>>>>>> kgd2kfd_suspend to only decrement it again a few functions down the
>>>>>> road.
>>>>>>
>>>>>> I suggest this instead - you only increment if not during PCI remove
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>> index 1c2cf3a33c1f..7754f77248a4 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>> @@ -952,11 +952,12 @@ bool kfd_is_locked(void)
>>>>>>
>>>>>> void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>>>> {
>>>>>> +
>>>>>> if (!kfd->init_complete)
>>>>>> return;
>>>>>>
>>>>>> /* for runtime suspend, skip locking kfd */
>>>>>> - if (!run_pm) {
>>>>>> + if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>>>> /* For first KFD device suspend all the KFD processes */
>>>>>> if (atomic_inc_return(&kfd_locked) == 1)
>>>>>> kfd_suspend_all_processes();
>>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>
>>>>>>> +
>>>>>>> int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
>>>>>>> {
>>>>>>> int err = 0;
>>>
>

[-- Attachment #2: Type: text/html, Size: 104271 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-14 15:13               ` Andrey Grodzovsky
@ 2022-04-15 10:12                 ` Shuotao Xu
  2022-04-15 16:43                   ` Andrey Grodzovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Shuotao Xu @ 2022-04-15 10:12 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 32606 bytes --]

Hi Andrey,

First I really appreciate the discussion! It helped me understand the driver code greatly. Thank you so much:)
Please see my inline comments.

On Apr 14, 2022, at 11:13 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:



On 2022-04-14 10:00, Shuotao Xu wrote:


On Apr 14, 2022, at 1:31 AM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:



On 2022-04-13 12:03, Shuotao Xu wrote:


On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:

On 2022-04-08 21:28, Shuotao Xu wrote:

On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:

On 2022-04-08 04:45, Shuotao Xu wrote:
Adding PCIe Hotplug Support for AMDKFD: the support of hot-plug of GPU
devices can open doors for many advanced applications in data center
in the next few years, such as for GPU resource
disaggregation. Current AMDKFD does not support hotplug out because of
the following reasons:

1. During PCIe removal, decrement KFD lock which was incremented at
the beginning of hw fini; otherwise kfd_open later is going to
fail.
I assumed you read my comment last time; still, you take the same approach.
More details below.
Aha, I like your fix:) I was not familiar with the drm APIs, so I only half understood your comment last time.

BTW, I tried hot-plugging out a GPU when rocm application is still running.
From dmesg, application is still trying to access the removed kfd device, and are met with some errors.


The application is supposed to keep running; it holds the drm_device
reference as long as it has an open
FD to the device, and final cleanup will come only after the app dies,
thus releasing the FD and the last
drm_device reference.

The application would hang and not exit in this case.


Actually I tried kill -7 $pid (SIGBUS), and the process exits. The dmesg has some warnings though.

[  711.769977] WARNING: CPU: 23 PID: 344 at .../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336 amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.770528] Modules linked in: amdgpu(OE) amdttm(OE) amd_sched(OE) amdkcl(OE) iommu_v2 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay binfmt_misc intel_rapl_msr i40iw intel_rapl_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel rpcrdma kvm sunrpc ipmi_ssif ib_umad ib_ipoib rdma_ucm irqbypass rapl joydev acpi_ipmi input_leds intel_cstate ipmi_si ipmi_devintf mei_me mei intel_pch_thermal ipmi_msghandler ioatdma mac_hid lpc_ich dca acpi_power_meter acpi_pad sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi pci_stub ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
[  711.779359]  raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core ast drm_vram_helper i2c_algo_bit drm_ttm_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sysfillrect uas hid_generic sysimgblt aesni_intel mlx5_core fb_sys_fops crypto_simd usbhid cryptd drm i40e pci_hyperv_intf usb_storage glue_helper mlxfw hid ahci libahci wmi
[  711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: G        W  OE     5.11.0+ #1
[  711.779755] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  711.779756] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[  711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.780141] Code: e8 b5 af 34 f4 e9 1f ff ff ff 48 39 c2 74 07 0f 0b e9 69 ff ff ff 4c 89 e7 e8 3c b4 16 00 e9 5c ff ff ff e8 a2 ce fd f3 eb cf <0f> 0b eb cb e8 d7 02 34 f4 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55
[  711.780143] RSP: 0018:ffffa8100dd67c30 EFLAGS: 00010282
[  711.780145] RAX: 00000000ffffffea RBX: ffff89980e792058 RCX: 0000000000000000
[  711.780147] RDX: 0000000000000000 RSI: ffff89a8f9ad8870 RDI: ffff89a8f9ad8870
[  711.780148] RBP: ffffa8100dd67c50 R08: 0000000000000000 R09: fffffffffff99b18
[  711.780149] R10: ffffa8100dd67bd0 R11: ffffa8100dd67908 R12: ffff89980e792000
[  711.780151] R13: ffff89980e792058 R14: ffff89980e7921bc R15: dead000000000100
[  711.780152] FS:  0000000000000000(0000) GS:ffff89a8f9ac0000(0000) knlGS:0000000000000000
[  711.780154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  711.780156] CR2: 00007ffddac6f71f CR3: 00000030bb80a003 CR4: 00000000007706e0
[  711.780157] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  711.780159] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  711.780160] PKRU: 55555554
[  711.780161] Call Trace:
[  711.780163]  ttm_bo_release+0x2ae/0x320 [amdttm]
[  711.780169]  amdttm_bo_put+0x30/0x40 [amdttm]
[  711.780357]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[  711.780543]  amdgpu_gem_object_free+0x8b/0x160 [amdgpu]
[  711.781119]  drm_gem_object_free+0x1d/0x30 [drm]
[  711.781489]  amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x34a/0x380 [amdgpu]
[  711.782044]  kfd_process_device_free_bos+0xe0/0x130 [amdgpu]
[  711.782611]  kfd_process_wq_release+0x286/0x380 [amdgpu]
[  711.783172]  process_one_work+0x236/0x420
[  711.783543]  worker_thread+0x34/0x400
[  711.783911]  ? process_one_work+0x420/0x420
[  711.784279]  kthread+0x126/0x140
[  711.784653]  ? kthread_park+0x90/0x90
[  711.785018]  ret_from_fork+0x22/0x30
[  711.785387] ---[ end trace d8f50f6594817c84 ]---
[  711.798716] [drm] amdgpu: ttm finalized


So it means the process was stuck in some wait_event_killable (maybe here drm_sched_entity_flush) - you can try 'cat /proc/$process_pid/stack' before
you kill it to see where it was stuck, so we can go from there.



For graphics apps, what I usually see is a crash because of SIGSEGV when
the app tries to access
an unmapped MMIO region on the device. I haven't tested the compute
stack, so there might
be something I haven't covered. A hang could mean, for example, waiting on a
fence which is not being
signaled - please provide the full dmesg from this case.


Do you have any good suggestions on how to fix it down the line? (HIP runtime/libhsakmt or driver)

[64036.631333] amdgpu: amdgpu_vm_bo_update failed
[64036.631702] amdgpu: validate_invalid_user_pages: update PTE failed
[64036.640754] amdgpu: amdgpu_vm_bo_update failed
[64036.641120] amdgpu: validate_invalid_user_pages: update PTE failed
[64036.650394] amdgpu: amdgpu_vm_bo_update failed
[64036.650765] amdgpu: validate_invalid_user_pages: update PTE failed


The full dmesg is just the repetition of those two messages:
[186885.764079] amdgpu 0000:43:00.0: amdgpu: amdgpu: finishing device.
[186885.766916] [drm] free PSP TMR buffer
[186893.868173] amdgpu: amdgpu_vm_bo_update failed
[186893.868235] amdgpu: validate_invalid_user_pages: update PTE failed
[186893.876154] amdgpu: amdgpu_vm_bo_update failed
[186893.876190] amdgpu: validate_invalid_user_pages: update PTE failed
[186893.884150] amdgpu: amdgpu_vm_bo_update failed
[186893.884185] amdgpu: validate_invalid_user_pages: update PTE failed


This probably just means we are trying to update PTEs after the physical device
is gone - we usually avoid this by
first trying to do all HW shutdowns early, before PCI remove completion,
and when that's really tricky, by
protecting HW access sections with drm_dev_enter/exit scope.

For this particular error it would be best to flush
info->restore_userptr_work before the end of
amdgpu_pci_remove (rejecting new process creation and calling
cancel_delayed_work_sync(&process_info->restore_userptr_work) for all
running processes).

I tried something like *kfd_process_ref_release* which I think did what you described, but it did not work.


I don't see how kfd_process_ref_release is the same as what I mentioned above; what I meant is calling the code above within kgd2kfd_suspend (where you tried to call kfd_kill_all_user_processes below).

Yes, you are right. It was not called by it.


Instead I tried to kill the process from the kernel, but amdgpu could be hot-plugged back in successfully **only** if there was no ROCm kernel running when it was plugged out. If not, amdgpu_probe will just hang later. (Maybe because amdgpu was plugged out in a running state, it left bad HW state which causes the probe to hang.)


We usually do asic_reset during probe to reset all HW state (check if amdgpu_device_init->amdgpu_asic_reset is running when you plug back).

OK



I don’t know if this is a viable solution worth pursuing, but I attached the diff anyway.

Another solution could be let compute stack user mode detect a topology change via generation_count change, and abort gracefully there.

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 4e7d9cb09a69..79b4c9b84cd0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -697,12 +697,15 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
                return;

        /* for runtime suspend, skip locking kfd */
-       if (!run_pm) {
+       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
                /* For first KFD device suspend all the KFD processes */
                if (atomic_inc_return(&kfd_locked) == 1)
                        kfd_suspend_all_processes(force);
        }

+       if (drm_dev_is_unplugged(kfd->ddev))
+               kfd_kill_all_user_processes();
+
        kfd->dqm->ops.stop(kfd->dqm);
        kfd_iommu_suspend(kfd);
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 55c9e1922714..84cbcd857856 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1065,6 +1065,7 @@ void kfd_unref_process(struct kfd_process *p);
 int kfd_process_evict_queues(struct kfd_process *p, bool force);
 int kfd_process_restore_queues(struct kfd_process *p);
 void kfd_suspend_all_processes(bool force);
+void kfd_kill_all_user_processes(void);
 /*
  * kfd_resume_all_processes:
  *     bool sync: If kfd_resume_all_processes() should wait for the
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 6cdc855abb6d..fb0c753b682c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -2206,6 +2206,24 @@ void kfd_suspend_all_processes(bool force)
        srcu_read_unlock(&kfd_processes_srcu, idx);
 }

+
+void kfd_kill_all_user_processes(void)
+{
+       struct kfd_process *p;
+       struct amdkfd_process_info *p_info;
+       unsigned int temp;
+       int idx = srcu_read_lock(&kfd_processes_srcu);
+
+       pr_info("Killing all processes\n");
+       hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+               p_info = p->kgd_process_info;
+               pr_info("Killing  processes, pid = %d", pid_nr(p_info->pid));
+               kill_pid(p_info->pid, SIGBUS, 1);


From looking into kill_pid, I see it only sends a signal but doesn't wait for completion; it would make sense to wait for completion here. In any case I would actually try to put here

I have made a version which does that with some atomic counters. Please read later in the diff.


hash_for_each_rcu(p_info)
    cancel_delayed_work_sync(&p_info->restore_userptr_work)

instead - at least that's what I meant in the previous mail.

I actually tried that earlier, and it did not work. The application still keeps running, and you have to send a kill signal to the user process.

I have made the following version. It waits for processes to terminate synchronously after sending SIGBUS. After that it does the real work of amdgpu_pci_remove.
However, it hangs at amdgpu_device_ip_fini_early when it is trying to deinit ip_block 6 <sdma_v4_0> (https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L2818). I assume that there is still some in-flight DMA, so fini of this ip block hangs?

The following is an excerpt of the dmesg; please excuse my own pr_info messages, but I hope you get the point of where it hangs.

[  392.344735] amdgpu: all processes has been fully released
[  392.346557] amdgpu: amdgpu_acpi_fini done
[  392.346568] amdgpu 0000:b3:00.0: amdgpu: amdgpu: finishing device.
[  392.349238] amdgpu: amdgpu_device_ip_fini_early enter ip_blocks = 9
[  392.349248] amdgpu: Free mem_obj = 000000007bf54275, range_start = 14, range_end = 14
[  392.350299] amdgpu: Free mem_obj = 00000000a85bc878, range_start = 12, range_end = 12
[  392.350304] amdgpu: Free mem_obj = 00000000b8019e32, range_start = 13, range_end = 13
[  392.350308] amdgpu: Free mem_obj = 000000002d296168, range_start = 4, range_end = 11
[  392.350313] amdgpu: Free mem_obj = 000000001fc4f934, range_start = 0, range_end = 3
[  392.350322] amdgpu: amdgpu_amdkfd_suspend(adev, false) done
[  392.350672] amdgpu: hw_fini of IP block[8] <jpeg_v2_5> done 0
[  392.350679] amdgpu: hw_fini of IP block[7] <vcn_v2_5> done 0


I just remembered that the idea to actively kill and wait for running user processes during unplug was rejected
as a bad idea in the first iteration of the unplug work I did (I don't remember why now, need to look), so this is a no-go.

Maybe an application has kfd open but was not accessing the dev, so killing it at unplug could kill the process unnecessarily.
However, the latest version I had with the sleep function got rid of the IP block fini hang.

Our policy is to let zombie processes (zombie in the sense that the underlying device is gone) live as long as they want
(as long as you are able to terminate them - which you can, so that's ok),
and the system should finish PCI remove gracefully and be able to hot-plug the device back.  Hence, I suggest dropping
this direction of forcing all user processes to be killed, confirming you have graceful shutdown and removal of the device
from the PCI topology, and then concentrating on why it hangs when you plug it back.

So I basically revert back to the original solution which you suggested.

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 4e7d9cb09a69..5504a18b5a45 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -697,7 +697,7 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
                return;

        /* for runtime suspend, skip locking kfd */
-       if (!run_pm) {
+       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
                /* For first KFD device suspend all the KFD processes */
                if (atomic_inc_return(&kfd_locked) == 1)
                        kfd_suspend_all_processes(force);


First confirm if ASIC reset happens on
next init.

This patch works great for planned plugout, where all the ROCm processes are killed before plugout, and the device can be added back without a problem.
However, unplanned plugout while ROCm processes are running just does not work.

Second, please confirm whether the timing of manually killing the user process has an impact on whether you get a hang
on the next plug back (if you kill before

Scenario 0: Kill before plug back

1. echo 1 > /sys/bus/pci/…/remove would finish,
but the application won't exit until there is a kill signal.

2. Kill the process. The application does several things and seems to trigger drm_release in the kernel, which is met with a kernel NULL pointer dereference related to sysfs_remove. Then the whole fs just freezes.

[  +0.002498] BUG: kernel NULL pointer dereference, address: 0000000000000098
[  +0.000486] #PF: supervisor read access in kernel mode
[  +0.000545] #PF: error_code(0x0000) - not-present page
[  +0.000551] PGD 0 P4D 0
[  +0.000553] Oops: 0000 [#1] SMP NOPTI
[  +0.000540] CPU: 75 PID: 4911 Comm: kworker/75:2 Tainted: G        W   E     5.13.0-kfd #1
[  +0.000559] Hardware name: INGRASYS         TURING  /MB      , BIOS K71FQ28A 10/05/2021
[  +0.000567] Workqueue: events delayed_fput
[  +0.000563] RIP: 0010:kernfs_find_ns+0x1b/0x100
[  +0.000569] Code: ff ff e8 88 59 9f 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 8b 05 df f0 7b 01 41 56 41 55 49 89 f5 41 54 55 48 89 fd 53 <44> 0f b7 b7 98 00 00 00 48 89 d3 4c 8b 67 70 66 41 83 e6 20 41 0f
[  +0.001193] RSP: 0018:ffffb9875db5fc98 EFLAGS: 00010246
[  +0.000602] RAX: 0000000000000000 RBX: ffffa101f79507d8 RCX: 0000000000000000
[  +0.000612] RDX: 0000000000000000 RSI: ffffffffc09a9417 RDI: 0000000000000000
[  +0.000604] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[  +0.000760] R10: ffffb9875db5fcd0 R11: 0000000000000000 R12: 0000000000000000
[  +0.000597] R13: ffffffffc09a9417 R14: ffffa08363fb2d18 R15: 0000000000000000
[  +0.000702] FS:  0000000000000000(0000) GS:ffffa0ffbfcc0000(0000) knlGS:0000000000000000
[  +0.000666] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000658] CR2: 0000000000000098 CR3: 0000005747812005 CR4: 0000000000770ee0
[  +0.000715] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000655] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000592] PKRU: 55555554
[  +0.000580] Call Trace:
[  +0.000591]  kernfs_find_and_get_ns+0x2f/0x50
[  +0.000584]  sysfs_remove_file_from_group+0x20/0x50
[  +0.000580]  amdgpu_ras_sysfs_remove+0x3d/0xd0 [amdgpu]
[  +0.000737]  amdgpu_ras_late_fini+0x1d/0x40 [amdgpu]
[  +0.000750]  amdgpu_sdma_ras_fini+0x96/0xb0 [amdgpu]
[  +0.000742]  ? gfx_v10_0_resume+0x10/0x10 [amdgpu]
[  +0.000738]  sdma_v4_0_sw_fini+0x23/0x90 [amdgpu]
[  +0.000717]  amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
[  +0.000704]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000687]  drm_dev_release+0x20/0x40 [drm]
[  +0.000583]  drm_release+0xa8/0xf0 [drm]
[  +0.000584]  __fput+0xa5/0x250
[  +0.000621]  delayed_fput+0x1f/0x30
[  +0.000726]  process_one_work+0x26e/0x580
[  +0.000581]  ? process_one_work+0x580/0x580
[  +0.000611]  worker_thread+0x4d/0x3d0
[  +0.000611]  ? process_one_work+0x580/0x580
[  +0.000605]  kthread+0x117/0x150
[  +0.000611]  ? kthread_park+0x90/0x90
[  +0.000619]  ret_from_fork+0x1f/0x30
[  +0.000625] Modules linked in: amdgpu(E) xt_conntrack xt_MASQUERADE nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter x86_pkg_temp_thermal cdc_ether usbnet acpi_pad msr ip_tables x_tables ast drm_vram_helper iommu_v2 drm_ttm_helper gpu_sched ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientati
on_quirks [last unloaded: amdgpu]

3. echo 1 > /sys/bus/pci/rescan. This would just hang. I assume sysfs is broken.

Based on 1 & 2, it seems that 1 won't let amdgpu exit gracefully, because 2 does some cleanup that maybe should have happened before 1.

or after plug back, does it make a difference).

Scenario 2: Kill after plug back

If I perform the rescan before the kill, the driver seems to probe fine. But the kill hits the same issue, messing up sysfs the same way as in Scenario 0.


Final Comments:

0. cancel_delayed_work_sync(&p_info->restore_userptr_work) would make the repeated amdgpu_vm_bo_update failures go away, but it does not solve the issues in those scenarios.

1. For planned hotplug, this patch should work as long as you follow some protocol, i.e., kill before plugout. Is this patch acceptable, since it provides added functionality compared to before?

2. For unplanned hotplug while a ROCm app is running, the patch that kills all processes and waits for 5 sec would work consistently. But it seems to be an unacceptable solution for an official release, so I can hold it for our own internal usage. Killing after removal causes problems, and I don't know if there is a quick fix I can make, given my limited understanding of the amdgpu driver. Maybe AMD has a quick fix, or maybe it is really a difficult one. This feature may or may not be a blocking issue in our GPU disaggregation research down the road. Please let us know in either case; we would like to learn and help as much as we can!

Thank you so much!

Best regards,
Shuotao

Andrey



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 8fa9b86ac9d2..c0b27f722281 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -188,6 +188,12 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
  kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
 }

+void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev)
+{
+ if (adev->kfd.dev)
+ kgd2kfd_kill_all_user_processes(adev->kfd.dev);
+}
+
 void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
 {
  if (adev->kfd.dev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 27c74fcec455..f4e485d60442 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -141,6 +141,7 @@ struct amdkfd_process_info {
 int amdgpu_amdkfd_init(void);
 void amdgpu_amdkfd_fini(void);

+void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev);
 void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
 int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
 int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm, bool sync);
@@ -405,6 +406,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
  const struct kgd2kfd_shared_resources *gpu_resources);
 void kgd2kfd_device_exit(struct kfd_dev *kfd);
 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force);
+void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd);
 int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
 int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm, bool sync);
 int kgd2kfd_pre_reset(struct kfd_dev *kfd);
@@ -443,6 +445,9 @@ static inline void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
 {
 }

+static inline void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd)
+{
+}
+
 static int __maybe_unused kgd2kfd_resume_iommu(struct kfd_dev *kfd)
 {
  return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 3d5fc0751829..af6fe5080cfa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2101,6 +2101,9 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 {
  struct drm_device *dev = pci_get_drvdata(pdev);

+ /* kill all kfd processes before drm_dev_unplug */
+ amdgpu_amdkfd_kill_all_processes(drm_to_adev(dev));
+
 #ifdef HAVE_DRM_DEV_UNPLUG
  drm_dev_unplug(dev);
 #else
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 5504a18b5a45..480c23bef5e2 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -691,6 +691,12 @@ bool kfd_is_locked(void)
  return  (atomic_read(&kfd_locked) > 0);
 }

+void kgd2kfd_kill_all_user_processes(struct kfd_dev *dev)
+{
+ kfd_kill_all_user_processes();
+}
+
+
 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
 {
  if (!kfd->init_complete)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 55c9e1922714..a35a2cb5bb9f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1064,6 +1064,7 @@ static inline struct kfd_process_device *kfd_process_device_from_gpuidx(
 void kfd_unref_process(struct kfd_process *p);
 int kfd_process_evict_queues(struct kfd_process *p, bool force);
 int kfd_process_restore_queues(struct kfd_process *p);
+void kfd_kill_all_user_processes(void);
 void kfd_suspend_all_processes(bool force);
 /*
  * kfd_resume_all_processes:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 6cdc855abb6d..17e769e6951d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -46,6 +46,9 @@ struct mm_struct;
 #include "kfd_trace.h"
 #include "kfd_debug.h"

+static atomic_t kfd_process_locked = ATOMIC_INIT(0);
+static atomic_t kfd_inflight_kills = ATOMIC_INIT(0);
+
 /*
  * List of struct kfd_process (field kfd_process).
  * Unique/indexed by mm_struct*
@@ -802,6 +805,9 @@ struct kfd_process *kfd_create_process(struct task_struct *thread)
  struct kfd_process *process;
  int ret;

+ if (atomic_read(&kfd_process_locked) > 0)
+ return ERR_PTR(-EINVAL);
+
  if (!(thread->mm && mmget_not_zero(thread->mm)))
  return ERR_PTR(-EINVAL);

@@ -1126,6 +1132,10 @@ static void kfd_process_wq_release(struct work_struct *work)
  put_task_struct(p->lead_thread);

  kfree(p);
+
+ if (atomic_read(&kfd_process_locked) > 0) {
+ atomic_dec(&kfd_inflight_kills);
+ }
 }

 static void kfd_process_ref_release(struct kref *ref)
@@ -2186,6 +2196,35 @@ static void restore_process_worker(struct work_struct *work)
  pr_err("Failed to restore queues of pasid 0x%x\n", p->pasid);
 }

+void kfd_kill_all_user_processes(void)
+{
+ struct kfd_process *p;
+ unsigned int temp;
+ int idx;
+
+ atomic_inc(&kfd_process_locked);
+
+ idx = srcu_read_lock(&kfd_processes_srcu);
+ pr_info("Killing all processes\n");
+ hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+ dev_warn(kfd_device,
+ "Sending SIGBUS to process %d (pasid 0x%x)\n",
+ p->lead_thread->pid, p->pasid);
+ send_sig(SIGBUS, p->lead_thread, 0);
+ atomic_inc(&kfd_inflight_kills);
+ }
+ srcu_read_unlock(&kfd_processes_srcu, idx);
+
+ while (atomic_read(&kfd_inflight_kills) > 0) {
+ dev_warn(kfd_device,
+ "kfd_processes_table is not empty, going to sleep for 10ms\n");
+ msleep(10);
+ }
+
+ atomic_dec(&kfd_process_locked);
+ pr_info("all processes have been fully released\n");
+}
+
 void kfd_suspend_all_processes(bool force)
 {
  struct kfd_process *p;



Regards,
Shuotao



Andrey


+       }
+       srcu_read_unlock(&kfd_processes_srcu, idx);
+}
+
+
 int kfd_resume_all_processes(bool sync)
 {
        struct kfd_process *p;


Andrey



Really appreciate your help!

Best,
Shuotao

2. Remove redundant p2p/io links in sysfs when a device is hotplugged
out.

3. A new kfd node_id is not properly assigned when a device is added
after a GPU has been hotplugged out of the system. libhsakmt will
detect this anomaly (i.e. node_from != <dev node id> in iolinks)
when taking a topology_snapshot, and thus return a fault to the ROCm
stack.

-- This patch fixes issue 1; another patch by Mukul fixes issues 2 & 3.
-- Tested on a 4-GPU MI100 node with kernel 5.13.0-kfd; kernel
5.16.0-kfd is unstable out of the box for MI100.
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 +++++
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 7 +++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 13 +++++++++++++
4 files changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index c18c4be1e4ac..d50011bdb5c4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
return r;
}

+int amdgpu_amdkfd_resume_processes(void)
+{
+	return kgd2kfd_resume_processes();
+}
+
int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
{
int r = 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index f8b9f27adcf5..803306e011c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
+int amdgpu_amdkfd_resume_processes(void);
void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
const void *ih_ring_entry);
void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
@@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
+int kgd2kfd_resume_processes(void);
int kgd2kfd_pre_reset(struct kfd_dev *kfd);
int kgd2kfd_post_reset(struct kfd_dev *kfd);
void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
@@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
return 0;
}

+static inline int kgd2kfd_resume_processes(void)
+{
+	return 0;
+}
+
static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
{
return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index fa4a9f13c922..5827b65b7489 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
if (drm_dev_is_unplugged(adev_to_drm(adev)))
amdgpu_device_unmap_mmio(adev);

+ amdgpu_amdkfd_resume_processes();
}

void amdgpu_device_fini_sw(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 62aa6c9d5123..ef05aae9255e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
return ret;
}

+/* for non-runtime resume only */
+int kgd2kfd_resume_processes(void)
+{
+	int count;
+
+	count = atomic_dec_return(&kfd_locked);
+	WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
+	if (count == 0)
+		return kfd_resume_all_processes();
+
+	return 0;
+}

It doesn't make sense to me to just increment kfd_locked in
kgd2kfd_suspend only to decrement it again a few functions down the
road.

I suggest this instead - you only increment if not during PCI remove:

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 1c2cf3a33c1f..7754f77248a4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -952,11 +952,12 @@ bool kfd_is_locked(void)

void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
{
+
if (!kfd->init_complete)
return;

/* for runtime suspend, skip locking kfd */
- if (!run_pm) {
+ if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
/* For first KFD device suspend all the KFD processes */
if (atomic_inc_return(&kfd_locked) == 1)
kfd_suspend_all_processes();


Andrey



+
int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
{
int err = 0;




[-- Attachment #2: Type: text/html, Size: 98546 bytes --]

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-15 10:12                 ` Shuotao Xu
@ 2022-04-15 16:43                   ` Andrey Grodzovsky
  2022-04-18 13:22                     ` Shuotao Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Andrey Grodzovsky @ 2022-04-15 16:43 UTC (permalink / raw)
  To: Shuotao Xu
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 37941 bytes --]


On 2022-04-15 06:12, Shuotao Xu wrote:
> Hi Andrey,
>
> First I really appreciate the discussion! It helped me understand the 
> driver code greatly. Thank you so much:)
> Please see my inline comments.
>
>> On Apr 14, 2022, at 11:13 PM, Andrey Grodzovsky 
>> <andrey.grodzovsky@amd.com> wrote:
>>
>>
>> On 2022-04-14 10:00, Shuotao Xu wrote:
>>>
>>>
>>>> On Apr 14, 2022, at 1:31 AM, Andrey Grodzovsky 
>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>
>>>>
>>>> On 2022-04-13 12:03, Shuotao Xu wrote:
>>>>>
>>>>>
>>>>>> On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky 
>>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>>
>>>>>> [Some people who received this message don't often get email 
>>>>>> from andrey.grodzovsky@amd.com. Learn why this is important
>>>>>> at http://aka.ms/LearnAboutSenderIdentification.]
>>>>>>
>>>>>> On 2022-04-08 21:28, Shuotao Xu wrote:
>>>>>>>
>>>>>>>> On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky 
>>>>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>>>>
>>>>>>>> [Some people who received this message don't often get email 
>>>>>>>> from andrey.grodzovsky@amd.com. Learn why this is important at 
>>>>>>>> http://aka.ms/LearnAboutSenderIdentification.]
>>>>>>>>
>>>>>>>> On 2022-04-08 04:45, Shuotao Xu wrote:
>>>>>>>>> Adding PCIe Hotplug Support for AMDKFD: the support of 
>>>>>>>>> hot-plug of GPU
>>>>>>>>> devices can open doors for many advanced applications in data 
>>>>>>>>> center
>>>>>>>>> in the next few years, such as for GPU resource
>>>>>>>>> disaggregation. Current AMDKFD does not support hotplug out 
>>>>>>>>> b/o the
>>>>>>>>> following reasons:
>>>>>>>>>
>>>>>>>>> 1. During PCIe removal, decrement KFD lock which was 
>>>>>>>>> incremented at
>>>>>>>>> the beginning of hw fini; otherwise kfd_open later is going to
>>>>>>>>> fail.
>>>>>>>> I assumed you read my comment last time, still you do same 
>>>>>>>> approach.
>>>>>>>> More in details bellow
>>>>>>> Aha, I like your fix:) I was not familiar with drm APIs so just 
>>>>>>> only half understood your comment last time.
>>>>>>>
>>>>>>> BTW, I tried hot-plugging out a GPU when rocm application is 
>>>>>>> still running.
>>>>>>> From dmesg, application is still trying to access the removed 
>>>>>>> kfd device, and are met with some errors.
>>>>>>
>>>>>>
>>>>>> The application is supposed to keep running; it holds the drm_device
>>>>>> reference as long as it has an open
>>>>>> FD to the device, and final cleanup will come only after the app
>>>>>> dies,
>>>>>> thus releasing the FD and the last
>>>>>> drm_device reference.
>>>>>>
>>>>>>> Application would hang and not exiting in this case.
>>>>>>
>>>>>
>>>>> Actually I tried kill -7 $pid, and the process exits. The dmesg
>>>>> has some warnings though.
>>>>>
>>>>> [  711.769977] WARNING: CPU: 23 PID: 344 at 
>>>>> .../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336 
>>>>> amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
>>>>> [  711.770528] Modules linked in: amdgpu(OE) amdttm(OE) 
>>>>> amd_sched(OE) amdkcl(OE) iommu_v2 nf_conntrack_netlink nfnetlink 
>>>>> xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM 
>>>>> iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack 
>>>>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT 
>>>>> nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables 
>>>>> ip6table_filter ip6_tables iptable_filter overlay binfmt_misc 
>>>>> intel_rapl_msr i40iw intel_rapl_common skx_edac nfit 
>>>>> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel rpcrdma 
>>>>> kvm sunrpc ipmi_ssif ib_umad ib_ipoib rdma_ucm irqbypass rapl 
>>>>> joydev acpi_ipmi input_leds intel_cstate ipmi_si ipmi_devintf 
>>>>> mei_me mei intel_pch_thermal ipmi_msghandler ioatdma mac_hid 
>>>>> lpc_ich dca acpi_power_meter acpi_pad sch_fq_codel ib_iser rdma_cm 
>>>>> iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi 
>>>>> pci_stub ip_tables x_tables autofs4 btrfs blake2b_generic 
>>>>> zstd_compress raid10 raid456 async_raid6_recov async_memcpy 
>>>>> async_pq async_xor async_tx xor
>>>>> [  711.779359]  raid6_pq libcrc32c raid1 raid0 multipath linear 
>>>>> mlx5_ib ib_uverbs ib_core ast drm_vram_helper i2c_algo_bit 
>>>>> drm_ttm_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul 
>>>>> crc32_pclmul ghash_clmulni_intel sysfillrect uas hid_generic 
>>>>> sysimgblt aesni_intel mlx5_core fb_sys_fops crypto_simd usbhid 
>>>>> cryptd drm i40e pci_hyperv_intf usb_storage glue_helper mlxfw hid 
>>>>> ahci libahci wmi
>>>>> [  711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: G  W 
>>>>>  OE     5.11.0+ #1
>>>>> [  711.779755] Hardware name: Supermicro 
>>>>> SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
>>>>> [  711.779756] Workqueue: kfd_process_wq kfd_process_wq_release 
>>>>> [amdgpu]
>>>>> [  711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
>>>>> [  711.780141] Code: e8 b5 af 34 f4 e9 1f ff ff ff 48 39 c2 74 07 
>>>>> 0f 0b e9 69 ff ff ff 4c 89 e7 e8 3c b4 16 00 e9 5c ff ff ff e8 a2 
>>>>> ce fd f3 eb cf <0f> 0b eb cb e8 d7 02 34 f4 0f 1f 80 00 00 00 00 
>>>>> 0f 1f 44 00 00 55
>>>>> [  711.780143] RSP: 0018:ffffa8100dd67c30 EFLAGS: 00010282
>>>>> [  711.780145] RAX: 00000000ffffffea RBX: ffff89980e792058 RCX: 
>>>>> 0000000000000000
>>>>> [  711.780147] RDX: 0000000000000000 RSI: ffff89a8f9ad8870 RDI: 
>>>>> ffff89a8f9ad8870
>>>>> [  711.780148] RBP: ffffa8100dd67c50 R08: 0000000000000000 R09: 
>>>>> fffffffffff99b18
>>>>> [  711.780149] R10: ffffa8100dd67bd0 R11: ffffa8100dd67908 R12: 
>>>>> ffff89980e792000
>>>>> [  711.780151] R13: ffff89980e792058 R14: ffff89980e7921bc R15: 
>>>>> dead000000000100
>>>>> [  711.780152] FS:  0000000000000000(0000) 
>>>>> GS:ffff89a8f9ac0000(0000) knlGS:0000000000000000
>>>>> [  711.780154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [  711.780156] CR2: 00007ffddac6f71f CR3: 00000030bb80a003 CR4: 
>>>>> 00000000007706e0
>>>>> [  711.780157] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
>>>>> 0000000000000000
>>>>> [  711.780159] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
>>>>> 0000000000000400
>>>>> [  711.780160] PKRU: 55555554
>>>>> [  711.780161] Call Trace:
>>>>> [  711.780163]  ttm_bo_release+0x2ae/0x320 [amdttm]
>>>>> [  711.780169]  amdttm_bo_put+0x30/0x40 [amdttm]
>>>>> [  711.780357]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
>>>>> [  711.780543]  amdgpu_gem_object_free+0x8b/0x160 [amdgpu]
>>>>> [  711.781119]  drm_gem_object_free+0x1d/0x30 [drm]
>>>>> [  711.781489]  amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x34a/0x380 
>>>>> [amdgpu]
>>>>> [  711.782044]  kfd_process_device_free_bos+0xe0/0x130 [amdgpu]
>>>>> [  711.782611]  kfd_process_wq_release+0x286/0x380 [amdgpu]
>>>>> [  711.783172]  process_one_work+0x236/0x420
>>>>> [  711.783543]  worker_thread+0x34/0x400
>>>>> [  711.783911]  ? process_one_work+0x420/0x420
>>>>> [  711.784279]  kthread+0x126/0x140
>>>>> [  711.784653]  ? kthread_park+0x90/0x90
>>>>> [  711.785018]  ret_from_fork+0x22/0x30
>>>>> [  711.785387] ---[ end trace d8f50f6594817c84 ]---
>>>>> [  711.798716] [drm] amdgpu: ttm finalized
>>>>
>>>>
>>>> So it means the process was stuck in some wait_event_killable 
>>>> (maybe here drm_sched_entity_flush) - you can try 
>>>> 'cat /proc/$process_pid/stack' maybe before
>>>> you kill it to see where it was stuck so we can go from there.
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> For graphics apps what I usually see is a crash because of a SIGSEGV when
>>>>>> the app tries to access
>>>>>> an unmapped MMIO region on the device. I haven't tested for compute
>>>>>> stack and so there might
>>>>>> be something I haven't covered. Hang could mean for example 
>>>>>> waiting on a
>>>>>> fence which is not being
>>>>>> signaled - please provide full dmesg from this case.
>>>>>>
>>>>>>>
>>>>>>> Do you have any good suggestions on how to fix it down the line? 
>>>>>>> (HIP runtime/libhsakmt or driver)
>>>>>>>
>>>>>>> [64036.631333] amdgpu: amdgpu_vm_bo_update failed
>>>>>>> [64036.631702] amdgpu: validate_invalid_user_pages: update PTE 
>>>>>>> failed
>>>>>>> [64036.640754] amdgpu: amdgpu_vm_bo_update failed
>>>>>>> [64036.641120] amdgpu: validate_invalid_user_pages: update PTE 
>>>>>>> failed
>>>>>>> [64036.650394] amdgpu: amdgpu_vm_bo_update failed
>>>>>>> [64036.650765] amdgpu: validate_invalid_user_pages: update PTE 
>>>>>>> failed
>>>>>>
>>>>>
>>>>> The full dmesg is just the repetition of those two messages:
>>>>> [186885.764079] amdgpu 0000:43:00.0: amdgpu: amdgpu: finishing device.
>>>>> [186885.766916] [drm] free PSP TMR buffer
>>>>> [186893.868173] amdgpu: amdgpu_vm_bo_update failed
>>>>> [186893.868235] amdgpu: validate_invalid_user_pages: update PTE failed
>>>>> [186893.876154] amdgpu: amdgpu_vm_bo_update failed
>>>>> [186893.876190] amdgpu: validate_invalid_user_pages: update PTE failed
>>>>> [186893.884150] amdgpu: amdgpu_vm_bo_update failed
>>>>> [186893.884185] amdgpu: validate_invalid_user_pages: update PTE failed
>>>>>
>>>>>>
>>>>>> This just probably means trying to update PTEs after the physical 
>>>>>> device
>>>>>> is gone - we usually avoid this by
>>>>>> first trying to do all HW shutdowns early before PCI remove 
>>>>>> completion
>>>>>> or, when it's really tricky, by
>>>>>> protecting HW access sections with drm_dev_enter/exit scope.
>>>>>>
>>>>>> For this particular error it would be best to flush
>>>>>> info->restore_userptr_work before the end of
>>>>>> amdgpu_pci_remove (rejecting new process creation and calling
>>>>>> cancel_delayed_work_sync(&process_info->restore_userptr_work) for all
>>>>>> running processes)
>>>>>> somewhere in amdgpu_pci_remove.
>>>>>>
>>>>> I tried something like *kfd_process_ref_release* which I think did 
>>>>> what you described, but it did not work.
>>>>
>>>>
>>>> I don't see how kfd_process_ref_release is the same as what I mentioned
>>>> above; what I meant is calling the code above within
>>>> kgd2kfd_suspend (where you tried to call
>>>> kfd_kill_all_user_processes below).
>>>>
>>> Yes, you are right. It was not called by it.
>>>>
>>>>
>>>>>
>>>>> Instead I tried to kill the process from the kernel, but the
>>>>> amdgpu could **only** be hot-plugged back in successfully if
>>>>> there was no ROCm kernel running when it was plugged out. If not,
>>>>> amdgpu_probe will just hang later. (Maybe because amdgpu was
>>>>> plugged out in a running state, it leaves bad HW state which
>>>>> causes the probe to hang.)
>>>>
>>>>
>>>> We usually do an asic_reset during probe to reset all HW state (check
>>>> if amdgpu_device_init->amdgpu_asic_reset is running when you plug
>>>> back).
>>>>
>>> OK
>>>>
>>>>
>>>>>
>>>>> I don’t know if this is a viable solution worth pursuing, but I 
>>>>> attached the diff anyway.
>>>>>
>>>>> Another solution could be to let the compute stack's user mode detect a
>>>>> topology change via a generation_count change, and abort gracefully
>>>>> there.
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> index 4e7d9cb09a69..79b4c9b84cd0 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> @@ -697,12 +697,15 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, 
>>>>> bool run_pm, bool force)
>>>>>                 return;
>>>>>
>>>>>         /* for runtime suspend, skip locking kfd */
>>>>> -       if (!run_pm) {
>>>>> +       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>>>                 /* For first KFD device suspend all the KFD 
>>>>> processes */
>>>>>                 if (atomic_inc_return(&kfd_locked) == 1)
>>>>> kfd_suspend_all_processes(force);
>>>>>         }
>>>>>
>>>>> +       if (drm_dev_is_unplugged(kfd->ddev))
>>>>> + kfd_kill_all_user_processes();
>>>>> +
>>>>> kfd->dqm->ops.stop(kfd->dqm);
>>>>> kfd_iommu_suspend(kfd);
>>>>>  }
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>> index 55c9e1922714..84cbcd857856 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>> @@ -1065,6 +1065,7 @@ void kfd_unref_process(struct kfd_process *p);
>>>>>  int kfd_process_evict_queues(struct kfd_process *p, bool force);
>>>>>  int kfd_process_restore_queues(struct kfd_process *p);
>>>>>  void kfd_suspend_all_processes(bool force);
>>>>> +void kfd_kill_all_user_processes(void);
>>>>>  /*
>>>>>   * kfd_resume_all_processes:
>>>>>   *     bool sync: If kfd_resume_all_processes() should wait for the
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> index 6cdc855abb6d..fb0c753b682c 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> @@ -2206,6 +2206,24 @@ void kfd_suspend_all_processes(bool force)
>>>>> srcu_read_unlock(&kfd_processes_srcu, idx);
>>>>>  }
>>>>>
>>>>> +
>>>>> +void kfd_kill_all_user_processes(void)
>>>>> +{
>>>>> +       struct kfd_process *p;
>>>>> +       struct amdkfd_process_info *p_info;
>>>>> +       unsigned int temp;
>>>>> +       int idx = srcu_read_lock(&kfd_processes_srcu);
>>>>> +
>>>>> +       pr_info("Killing all processes\n");
>>>>> + hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>>>>> +               p_info = p->kgd_process_info;
>>>>> + pr_info("Killing  processes, pid = %d", pid_nr(p_info->pid));
>>>>> + kill_pid(p_info->pid, SIGBUS, 1);
>>>>
>>>>
>>>> From looking into kill_pid I see it only sends a signal but doesn't
>>>> wait for completion; it would make sense to wait for completion
>>>> here. In any case I would actually try to put here
>>>>
>>> I have made a version which does that with some atomic counters. 
>>> Please read later in the diff.
>>>>
>>>>
>>>> hash_for_each_rcu(p_info)
>>>>     cancel_delayed_work_sync(&p_info->restore_userptr_work)
>>>>
>>>> instead  at least that what i meant in the previous mail.
>>>>
>>> I actually tried that earlier, and it did not work. The application
>>> still keeps running, and you have to send a kill to the user process.
>>>
>>> I have made the following version. It waits for processes to 
>>> terminate synchronously after sending SIGBUS. After that it does the 
>>> real work of amdgpu_pci_remove.
>>> However, it hangs at amdgpu_device_ip_fini_early when it is trying 
>>> to deinit ip_block 6 <sdma_v4_0> 
>>> (https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L2818 
>>> ).
>>> I assume that there is still some in-flight DMA, and the fini of
>>> this IP block therefore hangs?
>>>
>>> The following is an excerpt of the dmesg: please excuse for putting 
>>> my own pr_info, but I hope you get my point of where it hangs.
>>>
>>> [  392.344735] amdgpu: all processes has been fully released
>>> [  392.346557] amdgpu: amdgpu_acpi_fini done
>>> [  392.346568] amdgpu 0000:b3:00.0: amdgpu: amdgpu: finishing device.
>>> [  392.349238] amdgpu: amdgpu_device_ip_fini_early enter ip_blocks = 9
>>> [  392.349248] amdgpu: Free mem_obj = 000000007bf54275, range_start 
>>> = 14, range_end = 14
>>> [  392.350299] amdgpu: Free mem_obj = 00000000a85bc878, range_start 
>>> = 12, range_end = 12
>>> [  392.350304] amdgpu: Free mem_obj = 00000000b8019e32, range_start 
>>> = 13, range_end = 13
>>> [  392.350308] amdgpu: Free mem_obj = 000000002d296168, range_start 
>>> = 4, range_end = 11
>>> [  392.350313] amdgpu: Free mem_obj = 000000001fc4f934, range_start 
>>> = 0, range_end = 3
>>> [  392.350322] amdgpu: amdgpu_amdkfd_suspend(adev, false) done
>>> [  392.350672] amdgpu: hw_fini of IP block[8] <jpeg_v2_5> done 0
>>> [  392.350679] amdgpu: hw_fini of IP block[7] <vcn_v2_5> done 0
>>
>>
>> I just remembered that the idea to actively kill and wait for running 
>> user processes during unplug was rejected
>> as a bad idea in the first iteration of unplug work I did (don't 
>> remember why now, need to look) so this is a no go.
>>
> Maybe an application has kfd open but is not accessing the device, so
> killing it at unplug could terminate the process unnecessarily.
> However, the latest version I had, with the sleep-based wait, got rid
> of the IP-block fini hang.
>>
>> Our policy is to let zombie processes (zombie in a sense that the 
>> underlying device is gone) live as long as they want
>> (as long as you are able to terminate them - which you do, so that's OK)
>> and the system should finish PCI remove gracefully and be able to hot 
>> plug back the device.  Hence, i suggest dropping
>> this direction of forcing all user processes to be killed, confirm 
>> you have graceful shutdown and remove of device
>> from PCI topology and then concentrate on why when you plug back it 
>> hangs.
>>
> So I basically revert back to the original solution which you suggested.
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 4e7d9cb09a69..5504a18b5a45 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -697,7 +697,7 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool 
> run_pm, bool force)
>                 return;
>
>         /* for runtime suspend, skip locking kfd */
> -       if (!run_pm) {
> +       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>                 /* For first KFD device suspend all the KFD processes */
>                 if (atomic_inc_return(&kfd_locked) == 1)
>                         kfd_suspend_all_processes(force);
>
>> First confirm if ASIC reset happens on
>> next init.
>>
> This patch works great for *planned* plugout, where all the ROCm
> processes are killed before the plugout. The device can then be added
> back without a problem.
> However, *unplanned* plugout while ROCm processes are still running
> just doesn't work.


Still, I am not clear whether the ASIC reset happens on plug back or not -
can you trace this please?


>> Second please confirm if the timing you kill manually the user 
>> process has impact on whether you have a hang
>> on next plug back (if you kill before
>>
> *Scenario 0: Kill before plug back*
>
> 1. echo 1 > /sys/bus/pci/…/remove, would finish.
> But the application won’t exit until there is a kill signal.


Why do you think it must exit?


>
> 2. Kill the process. The application does several things and seems to
> trigger drm_release in the kernel, which hits a kernel NULL
> pointer dereference related to sysfs_remove. Then the whole fs just freezes.
>
> [  +0.002498] BUG: kernel NULL pointer dereference, address: 
> 0000000000000098
> [  +0.000486] #PF: supervisor read access in kernel mode
> [  +0.000545] #PF: error_code(0x0000) - not-present page
> [  +0.000551] PGD 0 P4D 0
> [  +0.000553] Oops: 0000 [#1] SMP NOPTI
> [  +0.000540] CPU: 75 PID: 4911 Comm: kworker/75:2 Tainted: G        W 
>   E     5.13.0-kfd #1
> [  +0.000559] Hardware name: INGRASYS         TURING  /MB      , BIOS 
> K71FQ28A 10/05/2021
> [  +0.000567] Workqueue: events delayed_fput
> [  +0.000563] RIP: 0010:kernfs_find_ns+0x1b/0x100
> [  +0.000569] Code: ff ff e8 88 59 9f 00 0f 1f 84 00 00 00 00 00 0f 1f 
> 44 00 00 41 57 8b 05 df f0 7b 01 41 56 41 55 49 89 f5 41 54 55 48 89 
> fd 53 <44> 0f b7 b7 98 00 00 00 48 89 d3 4c 8b 67 70 66 41 83 e6 20 41 0f
> [  +0.001193] RSP: 0018:ffffb9875db5fc98 EFLAGS: 00010246
> [  +0.000602] RAX: 0000000000000000 RBX: ffffa101f79507d8 RCX: 
> 0000000000000000
> [  +0.000612] RDX: 0000000000000000 RSI: ffffffffc09a9417 RDI: 
> 0000000000000000
> [  +0.000604] RBP: 0000000000000000 R08: 0000000000000001 R09: 
> 0000000000000000
> [  +0.000760] R10: ffffb9875db5fcd0 R11: 0000000000000000 R12: 
> 0000000000000000
> [  +0.000597] R13: ffffffffc09a9417 R14: ffffa08363fb2d18 R15: 
> 0000000000000000
> [  +0.000702] FS:  0000000000000000(0000) GS:ffffa0ffbfcc0000(0000) 
> knlGS:0000000000000000
> [  +0.000666] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  +0.000658] CR2: 0000000000000098 CR3: 0000005747812005 CR4: 
> 0000000000770ee0
> [  +0.000715] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> 0000000000000000
> [  +0.000655] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
> 0000000000000400
> [  +0.000592] PKRU: 55555554
> [  +0.000580] Call Trace:
> [  +0.000591]  kernfs_find_and_get_ns+0x2f/0x50
> [  +0.000584]  sysfs_remove_file_from_group+0x20/0x50
> [  +0.000580]  amdgpu_ras_sysfs_remove+0x3d/0xd0 [amdgpu]
> [  +0.000737]  amdgpu_ras_late_fini+0x1d/0x40 [amdgpu]
> [  +0.000750]  amdgpu_sdma_ras_fini+0x96/0xb0 [amdgpu]
> [  +0.000742]  ? gfx_v10_0_resume+0x10/0x10 [amdgpu]
> [  +0.000738]  sdma_v4_0_sw_fini+0x23/0x90 [amdgpu]
> [  +0.000717]  amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
> [  +0.000704]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
> [  +0.000687]  drm_dev_release+0x20/0x40 [drm]
> [  +0.000583]  drm_release+0xa8/0xf0 [drm]
> [  +0.000584]  __fput+0xa5/0x250
> [  +0.000621]  delayed_fput+0x1f/0x30
> [  +0.000726]  process_one_work+0x26e/0x580
> [  +0.000581]  ? process_one_work+0x580/0x580
> [  +0.000611]  worker_thread+0x4d/0x3d0
> [  +0.000611]  ? process_one_work+0x580/0x580
> [  +0.000605]  kthread+0x117/0x150
> [  +0.000611]  ? kthread_park+0x90/0x90
> [  +0.000619]  ret_from_fork+0x1f/0x30
> [  +0.000625] Modules linked in: amdgpu(E) xt_conntrack xt_MASQUERADE 
> nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack 
> nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter x86_pkg_temp_thermal 
> cdc_ether usbnet acpi_pad msr ip_tables x_tables ast drm_vram_helper 
> iommu_v2 drm_ttm_helper gpu_sched ttm drm_kms_helper cfbfillrect 
> syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea 
> drm drm_panel_orientati
> on_quirks [last unloaded: amdgpu]


This is a known regression: all sysfs components must be removed before
the pci_remove code runs, otherwise you get either warnings for single-file
removals or OOPSes for sysfs group removals, like here. Please try to move
amdgpu_ras_sysfs_remove from amdgpu_ras_late_fini to the end of
amdgpu_ras_pre_fini (which happens before pci remove).


>
> 3.  echo 1 > /sys/bus/pci/rescan. This would just hang. I assume the 
> sysfs is broken.
>
> Based on 1 & 2, it seems that step 1 won't let amdgpu exit gracefully,
> because step 2 does some cleanup that maybe should have happened before 1.
>>
>> or you kill after plug back - does it make a difference).
>>
> *Scenario 2: Kill after plug back*
>
> If I perform the rescan before the kill, the driver seems to probe fine.
> But the kill hits the same issue that messed up the sysfs, the same
> way as in Scenario 0.
>
>
> *Final Comments:*
>
> 0. cancel_delayed_work_sync(&p_info->restore_userptr_work) would make
> the repeated amdgpu_vm_bo_update failures go away, but it does not
> solve the issues in those scenarios.


Still - it's better to do it this way, even if just to make those failures go away.


>
> 1. For planned hotplug, this patch should work as long as you follow
> some protocol, i.e. kill before plugout. Is this patch acceptable,
> since it provides added functionality compared to before?


Let's try to fix more as I advised above.


>
> 2. For unplanned hotplug while a rocm app is running, the patch 
> that kills all processes and waits for 5 sec works consistently. 
> But it seems that it is an unacceptable solution for an official 
> release. I can hold it for our own internal usage. It seems that kill 
> after removal causes problems, and I don’t know if there is a quick fix 
> by me because of my limited understanding of the amdgpu driver. Maybe 
> AMD could have a quick fix; or it is really a difficult one. This 
> feature may or may not be a blocking issue in our GPU disaggregation 
> research down the road. Please let us know in either case, and we 
> would like to learn and help as much as we can!


I am currently not sure why it helps. I will need to set up my own ROCm 
stack and retest hot plug to check this in more depth, but currently I 
have higher priorities. Please try to confirm that an ASIC reset always 
takes place on plug back,
and fix the sysfs oops as I advised above, to clear up at least some of 
the issues. Also please describe to me exactly what steps you take to 
reproduce this scenario, so later I might be able to do it myself.

Also, we have a hotplug test suite in libdrm (graphics stack), so maybe you 
can install libdrm and run that test suite to see if it exposes more issues.

Andrey


>
> Thank you so much!
>
> Best regards,
> Shuotao
>>
>> Andrey
>>
>>
>>>
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> index 8fa9b86ac9d2..c0b27f722281 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> @@ -188,6 +188,12 @@ void amdgpu_amdkfd_interrupt(struct 
>>> amdgpu_device *adev,
>>> kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
>>>  }
>>> +void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev)
>>> +{
>>> +	if (adev->kfd.dev)
>>> +		kgd2kfd_kill_all_user_processes(adev->kfd.dev);
>>> +}
>>> +
>>>  void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
>>>  {
>>> if (adev->kfd.dev)
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> index 27c74fcec455..f4e485d60442 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> @@ -141,6 +141,7 @@ struct amdkfd_process_info {
>>>  int amdgpu_amdkfd_init(void);
>>>  void amdgpu_amdkfd_fini(void);
>>> +void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev);
>>>  void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>>>  int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
>>>  int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm, 
>>> bool sync);
>>> @@ -405,6 +406,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>> const struct kgd2kfd_shared_resources *gpu_resources);
>>>  void kgd2kfd_device_exit(struct kfd_dev *kfd);
>>>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force);
>>> +void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd);
>>>  int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
>>>  int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm, bool sync);
>>>  int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>> @@ -443,6 +445,9 @@ static inline void kgd2kfd_suspend(struct 
>>> kfd_dev *kfd, bool run_pm, bool force)
>>>  {
>>>  }
>>> +static inline void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd)
>>> +{
>>> +}
>>> +
>>>  static int __maybe_unused kgd2kfd_resume_iommu(struct kfd_dev *kfd)
>>>  {
>>> return 0;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> index 3d5fc0751829..af6fe5080cfa 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> @@ -2101,6 +2101,9 @@ amdgpu_pci_remove(struct pci_dev *pdev)
>>>  {
>>> struct drm_device *dev = pci_get_drvdata(pdev);
>>> +	/* kill all kfd processes before drm_dev_unplug */
>>> +	amdgpu_amdkfd_kill_all_processes(drm_to_adev(dev));
>>> +
>>>  #ifdef HAVE_DRM_DEV_UNPLUG
>>> drm_dev_unplug(dev);
>>>  #else
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index 5504a18b5a45..480c23bef5e2 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -691,6 +691,12 @@ bool kfd_is_locked(void)
>>> return  (atomic_read(&kfd_locked) > 0);
>>>  }
>>> +inline void kgd2kfd_kill_all_user_processes(struct kfd_dev *dev)
>>> +{
>>> +	kfd_kill_all_user_processes();
>>> +}
>>> +
>>> +
>>>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
>>>  {
>>> if (!kfd->init_complete)
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> index 55c9e1922714..a35a2cb5bb9f 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> @@ -1064,6 +1064,7 @@ static inline struct kfd_process_device 
>>> *kfd_process_device_from_gpuidx(
>>>  void kfd_unref_process(struct kfd_process *p);
>>>  int kfd_process_evict_queues(struct kfd_process *p, bool force);
>>>  int kfd_process_restore_queues(struct kfd_process *p);
>>> +void kfd_kill_all_user_processes(void);
>>>  void kfd_suspend_all_processes(bool force);
>>>  /*
>>>   * kfd_resume_all_processes:
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> index 6cdc855abb6d..17e769e6951d 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> @@ -46,6 +46,9 @@ struct mm_struct;
>>>  #include "kfd_trace.h"
>>>  #include "kfd_debug.h"
>>> +static atomic_t kfd_process_locked = ATOMIC_INIT(0);
>>> +static atomic_t kfd_inflight_kills = ATOMIC_INIT(0);
>>> +
>>>  /*
>>>   * List of struct kfd_process (field kfd_process).
>>>   * Unique/indexed by mm_struct*
>>> @@ -802,6 +805,9 @@ struct kfd_process *kfd_create_process(struct 
>>> task_struct *thread)
>>> struct kfd_process *process;
>>> int ret;
>>> +	if (atomic_read(&kfd_process_locked) > 0)
>>> +		return ERR_PTR(-EINVAL);
>>> +
>>> if (!(thread->mm && mmget_not_zero(thread->mm)))
>>> return ERR_PTR(-EINVAL);
>>> @@ -1126,6 +1132,10 @@ static void kfd_process_wq_release(struct 
>>> work_struct *work)
>>> put_task_struct(p->lead_thread);
>>> kfree(p);
>>> +
>>> +	if (atomic_read(&kfd_process_locked) > 0)
>>> +		atomic_dec(&kfd_inflight_kills);
>>>  }
>>>  static void kfd_process_ref_release(struct kref *ref)
>>> @@ -2186,6 +2196,35 @@ static void restore_process_worker(struct 
>>> work_struct *work)
>>> pr_err("Failed to restore queues of pasid 0x%x\n", p->pasid);
>>>  }
>>> +void kfd_kill_all_user_processes(void)
>>> +{
>>> +	struct kfd_process *p;
>>> +	unsigned int temp;
>>> +	int idx;
>>> +
>>> +	atomic_inc(&kfd_process_locked);
>>> +
>>> +	idx = srcu_read_lock(&kfd_processes_srcu);
>>> +	pr_info("Killing all processes\n");
>>> +	hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>>> +		dev_warn(kfd_device,
>>> +			 "Sending SIGBUS to process %d (pasid 0x%x)",
>>> +			 p->lead_thread->pid, p->pasid);
>>> +		send_sig(SIGBUS, p->lead_thread, 0);
>>> +		atomic_inc(&kfd_inflight_kills);
>>> +	}
>>> +	srcu_read_unlock(&kfd_processes_srcu, idx);
>>> +
>>> +	while (atomic_read(&kfd_inflight_kills) > 0) {
>>> +		dev_warn(kfd_device,
>>> +			 "kfd_processes_table is not empty, going to sleep for 10ms\n");
>>> +		msleep(10);
>>> +	}
>>> +
>>> +	atomic_dec(&kfd_process_locked);
>>> +	pr_info("All processes have been fully released\n");
>>> +}
>>> +
>>> +
>>>  void kfd_suspend_all_processes(bool force)
>>>  {
>>> struct kfd_process *p;
>>>
>>>
>>>
>>> Regards,
>>> Shuotao
>>>
>>>>
>>>> Andrey
>>>>
>>>>> +       }
>>>>> + srcu_read_unlock(&kfd_processes_srcu, idx);
>>>>> +}
>>>>> +
>>>>> +
>>>>>  int kfd_resume_all_processes(bool sync)
>>>>>  {
>>>>>         struct kfd_process *p;
>>>>>
>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Really appreciate your help!
>>>>>>>
>>>>>>> Best,
>>>>>>> Shuotao
>>>>>>>
>>>>>>>>> 2. Remove redundant p2p/io links in sysfs when a device is hotplugged
>>>>>>>>> out.
>>>>>>>>>
>>>>>>>>> 3. New kfd node_id is not properly assigned after a new device is
>>>>>>>>> added after a gpu is hotplugged out in a system. libhsakmt will
>>>>>>>>> find this anomaly, (i.e. node_from != <dev node id> in iolinks),
>>>>>>>>> when taking a topology_snapshot, thus returns fault to the rocm
>>>>>>>>> stack.
>>>>>>>>>
>>>>>>>>> -- This patch fixes issue 1; another patch by Mukul fixes 
>>>>>>>>> issues 2&3.
>>>>>>>>> -- Tested on a 4-GPU MI100 node with kernel 5.13.0-kfd; kernel
>>>>>>>>> 5.16.0-kfd is unstable out of the box for MI100.
>>>>>>>>> ---
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 +++++
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 7 +++++++
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>>>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 13 +++++++++++++
>>>>>>>>> 4 files changed, 26 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>>>> index c18c4be1e4ac..d50011bdb5c4 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>>>> @@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct 
>>>>>>>>> amdgpu_device *adev, bool run_pm)
>>>>>>>>> return r;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> +int amdgpu_amdkfd_resume_processes(void)
>>>>>>>>> +{
>>>>>>>>> + return kgd2kfd_resume_processes();
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
>>>>>>>>> {
>>>>>>>>> int r = 0;
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>>>> index f8b9f27adcf5..803306e011c3 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>>>> @@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
>>>>>>>>> void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool 
>>>>>>>>> run_pm);
>>>>>>>>> int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
>>>>>>>>> int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>>>>>>>>> +int amdgpu_amdkfd_resume_processes(void);
>>>>>>>>> void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>>>>>>>> const void *ih_ring_entry);
>>>>>>>>> void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>>>>>>>>> @@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
>>>>>>>>> void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>>>>>>>>> int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
>>>>>>>>> int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>>>>>>>>> +int kgd2kfd_resume_processes(void);
>>>>>>>>> int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>>>>>>>> int kgd2kfd_post_reset(struct kfd_dev *kfd);
>>>>>>>>> void kgd2kfd_interrupt(struct kfd_dev *kfd, const void 
>>>>>>>>> *ih_ring_entry);
>>>>>>>>> @@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct 
>>>>>>>>> kfd_dev *kfd, bool run_pm)
>>>>>>>>> return 0;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> +static inline int kgd2kfd_resume_processes(void)
>>>>>>>>> +{
>>>>>>>>> + return 0;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>>>>>>>> {
>>>>>>>>> return 0;
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> index fa4a9f13c922..5827b65b7489 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> @@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct 
>>>>>>>>> amdgpu_device *adev)
>>>>>>>>> if (drm_dev_is_unplugged(adev_to_drm(adev)))
>>>>>>>>> amdgpu_device_unmap_mmio(adev);
>>>>>>>>>
>>>>>>>>> + amdgpu_amdkfd_resume_processes();
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>> index 62aa6c9d5123..ef05aae9255e 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>> @@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, 
>>>>>>>>> bool run_pm)
>>>>>>>>> return ret;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> +/* for non-runtime resume only */
>>>>>>>>> +int kgd2kfd_resume_processes(void)
>>>>>>>>> +{
>>>>>>>>> + int count;
>>>>>>>>> +
>>>>>>>>> + count = atomic_dec_return(&kfd_locked);
>>>>>>>>> + WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>>>>>>>> + if (count == 0)
>>>>>>>>> + return kfd_resume_all_processes();
>>>>>>>>> +
>>>>>>>>> + return 0;
>>>>>>>>> +}
>>>>>>>>
>>>>>>>> It doesn't make sense to me to just increment kfd_locked in
>>>>>>>> kgd2kfd_suspend to only decrement it again a few functions down the
>>>>>>>> road.
>>>>>>>>
>>>>>>>> I suggest this instead - you only increment if not during PCI remove:
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>> index 1c2cf3a33c1f..7754f77248a4 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>> @@ -952,11 +952,12 @@ bool kfd_is_locked(void)
>>>>>>>>
>>>>>>>> void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>>>>>> {
>>>>>>>> +
>>>>>>>> if (!kfd->init_complete)
>>>>>>>> return;
>>>>>>>>
>>>>>>>> /* for runtime suspend, skip locking kfd */
>>>>>>>> - if (!run_pm) {
>>>>>>>> + if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>>>>>> /* For first KFD device suspend all the KFD processes */
>>>>>>>> if (atomic_inc_return(&kfd_locked) == 1)
>>>>>>>> kfd_suspend_all_processes();
>>>>>>>>
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> +
>>>>>>>>> int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
>>>>>>>>> {
>>>>>>>>> int err = 0;
>>>>>
>>>
>

[-- Attachment #2: Type: text/html, Size: 153092 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-15 16:43                   ` Andrey Grodzovsky
@ 2022-04-18 13:22                     ` Shuotao Xu
  2022-04-18 15:23                       ` Andrey Grodzovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Shuotao Xu @ 2022-04-18 13:22 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 45535 bytes --]



On Apr 16, 2022, at 12:43 AM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:



On 2022-04-15 06:12, Shuotao Xu wrote:
Hi Andrey,

First I really appreciate the discussion! It helped me understand the driver code greatly. Thank you so much:)
Please see my inline comments.

On Apr 14, 2022, at 11:13 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:



On 2022-04-14 10:00, Shuotao Xu wrote:


On Apr 14, 2022, at 1:31 AM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:



On 2022-04-13 12:03, Shuotao Xu wrote:


On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:


On 2022-04-08 21:28, Shuotao Xu wrote:

On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:


On 2022-04-08 04:45, Shuotao Xu wrote:
Adding PCIe Hotplug Support for AMDKFD: support for hot-plug of GPU
devices can open doors for many advanced applications in the data center
in the next few years, such as GPU resource
disaggregation. Current AMDKFD does not support hotplug out because of the
following reasons:

1. During PCIe removal, decrement KFD lock which was incremented at
the beginning of hw fini; otherwise kfd_open later is going to
fail.
I assumed you read my comment last time; still you take the same approach.
More details below.
Aha, I like your fix:) I was not familiar with the drm APIs, so I only half understood your comment last time.

BTW, I tried hot-plugging out a GPU while a rocm application was still running.
From dmesg, the application is still trying to access the removed kfd device, and is met with some errors.


The application is supposed to keep running; it holds the drm_device
reference as long as it has an open
FD to the device, and final cleanup will come only after the app dies,
thus releasing the FD and the last
drm_device reference.

The application would hang and not exit in this case.


Actually I tried kill -7 $pid, and the process exits. The dmesg has some warnings though.

[  711.769977] WARNING: CPU: 23 PID: 344 at .../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336 amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.770528] Modules linked in: amdgpu(OE) amdttm(OE) amd_sched(OE) amdkcl(OE) iommu_v2 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay binfmt_misc intel_rapl_msr i40iw intel_rapl_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel rpcrdma kvm sunrpc ipmi_ssif ib_umad ib_ipoib rdma_ucm irqbypass rapl joydev acpi_ipmi input_leds intel_cstate ipmi_si ipmi_devintf mei_me mei intel_pch_thermal ipmi_msghandler ioatdma mac_hid lpc_ich dca acpi_power_meter acpi_pad sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi pci_stub ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
[  711.779359]  raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core ast drm_vram_helper i2c_algo_bit drm_ttm_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sysfillrect uas hid_generic sysimgblt aesni_intel mlx5_core fb_sys_fops crypto_simd usbhid cryptd drm i40e pci_hyperv_intf usb_storage glue_helper mlxfw hid ahci libahci wmi
[  711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: G        W  OE     5.11.0+ #1
[  711.779755] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  711.779756] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[  711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.780141] Code: e8 b5 af 34 f4 e9 1f ff ff ff 48 39 c2 74 07 0f 0b e9 69 ff ff ff 4c 89 e7 e8 3c b4 16 00 e9 5c ff ff ff e8 a2 ce fd f3 eb cf <0f> 0b eb cb e8 d7 02 34 f4 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55
[  711.780143] RSP: 0018:ffffa8100dd67c30 EFLAGS: 00010282
[  711.780145] RAX: 00000000ffffffea RBX: ffff89980e792058 RCX: 0000000000000000
[  711.780147] RDX: 0000000000000000 RSI: ffff89a8f9ad8870 RDI: ffff89a8f9ad8870
[  711.780148] RBP: ffffa8100dd67c50 R08: 0000000000000000 R09: fffffffffff99b18
[  711.780149] R10: ffffa8100dd67bd0 R11: ffffa8100dd67908 R12: ffff89980e792000
[  711.780151] R13: ffff89980e792058 R14: ffff89980e7921bc R15: dead000000000100
[  711.780152] FS:  0000000000000000(0000) GS:ffff89a8f9ac0000(0000) knlGS:0000000000000000
[  711.780154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  711.780156] CR2: 00007ffddac6f71f CR3: 00000030bb80a003 CR4: 00000000007706e0
[  711.780157] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  711.780159] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  711.780160] PKRU: 55555554
[  711.780161] Call Trace:
[  711.780163]  ttm_bo_release+0x2ae/0x320 [amdttm]
[  711.780169]  amdttm_bo_put+0x30/0x40 [amdttm]
[  711.780357]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[  711.780543]  amdgpu_gem_object_free+0x8b/0x160 [amdgpu]
[  711.781119]  drm_gem_object_free+0x1d/0x30 [drm]
[  711.781489]  amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x34a/0x380 [amdgpu]
[  711.782044]  kfd_process_device_free_bos+0xe0/0x130 [amdgpu]
[  711.782611]  kfd_process_wq_release+0x286/0x380 [amdgpu]
[  711.783172]  process_one_work+0x236/0x420
[  711.783543]  worker_thread+0x34/0x400
[  711.783911]  ? process_one_work+0x420/0x420
[  711.784279]  kthread+0x126/0x140
[  711.784653]  ? kthread_park+0x90/0x90
[  711.785018]  ret_from_fork+0x22/0x30
[  711.785387] ---[ end trace d8f50f6594817c84 ]---
[  711.798716] [drm] amdgpu: ttm finalized


So it means the process was stuck in some wait_event_killable (maybe here, drm_sched_entity_flush) - you can try 'cat /proc/$process_pid/stack', maybe before
you kill it, to see where it was stuck so we can go from there.
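For reference, that inspection can be done from a shell like this (the PID is whatever process is stuck; reading another user's kernel stack typically requires root):

```shell
pid=$$                         # substitute the stuck process's PID here
head -n 1 /proc/${pid}/status  # confirm which task this is (Name: field)
# Then, as root, dump its kernel-side stack and look for wait_event_killable
# or a fence wait:
#   cat /proc/${pid}/stack
```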



For graphics apps what I usually see is a crash because of a sigsegv when
the app tries to access
an unmapped MMIO region on the device. I haven't tested the compute
stack, so there might
be something I haven't covered. A hang could mean, for example, waiting on a
fence which is not being
signaled - please provide the full dmesg from this case.


Do you have any good suggestions on how to fix it down the line? (HIP runtime/libhsakmt or driver)

[64036.631333] amdgpu: amdgpu_vm_bo_update failed
[64036.631702] amdgpu: validate_invalid_user_pages: update PTE failed
[64036.640754] amdgpu: amdgpu_vm_bo_update failed
[64036.641120] amdgpu: validate_invalid_user_pages: update PTE failed
[64036.650394] amdgpu: amdgpu_vm_bo_update failed
[64036.650765] amdgpu: validate_invalid_user_pages: update PTE failed


The full dmesg is just the repetition of those two messages:
[186885.764079] amdgpu 0000:43:00.0: amdgpu: amdgpu: finishing device.
[186885.766916] [drm] free PSP TMR buffer
[186893.868173] amdgpu: amdgpu_vm_bo_update failed
[186893.868235] amdgpu: validate_invalid_user_pages: update PTE failed
[186893.876154] amdgpu: amdgpu_vm_bo_update failed
[186893.876190] amdgpu: validate_invalid_user_pages: update PTE failed
[186893.884150] amdgpu: amdgpu_vm_bo_update failed
[186893.884185] amdgpu: validate_invalid_user_pages: update PTE failed


This probably just means trying to update PTEs after the physical device
is gone - we usually avoid this by
first trying to do all HW shutdowns early, before PCI remove completion,
or, when that's really tricky, by
protecting HW access sections with a drm_dev_enter/exit scope.

For this particular error it would be best to flush
info->restore_userptr_work before the end of
amdgpu_pci_remove (rejecting new process creation and calling
cancel_delayed_work_sync(&process_info->restore_userptr_work) for all
running processes)
somewhere in amdgpu_pci_remove.
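The drm_dev_enter/exit guard mentioned above generally takes the following shape. This is a hedged sketch only - the function name and its body are illustrative, not actual amdgpu code:

```c
/* Guarding a HW access path against device unplug: drm_dev_enter()
 * returns false once drm_dev_unplug() has run, so nothing touches
 * MMIO or PTEs after the physical device is gone. */
static void example_hw_access(struct amdgpu_device *adev)
{
	int idx;

	if (!drm_dev_enter(adev_to_drm(adev), &idx))
		return;		/* device unplugged - skip HW access */

	/* ... program registers / update PTEs here ... */

	drm_dev_exit(idx);
}
```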

I tried something like *kfd_process_ref_release* which I think did what you described, but it did not work.


I don't see how kfd_process_ref_release is the same as what I mentioned above; what I meant is calling the code above within kgd2kfd_suspend (where you tried to call kfd_kill_all_user_processes below).

Yes, you are right. It was not called by it.


Instead I tried to kill the process from the kernel, but amdgpu could be hot-plugged back in successfully only if there was no rocm kernel running when it was plugged out. If not, amdgpu_probe will just hang later. (Maybe because amdgpu was plugged out while in a running state, it leaves a bad HW state which causes probe to hang.)


We usually do an asic_reset during probe to reset all HW state (check if amdgpu_device_init->amdgpu_asic_reset is running when you plug back).

OK



I don’t know if this is a viable solution worth pursuing, but I attached the diff anyway.

Another solution could be to let the compute stack's user mode detect a topology change via a generation_count change, and abort gracefully there.

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 4e7d9cb09a69..79b4c9b84cd0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -697,12 +697,15 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
                return;

        /* for runtime suspend, skip locking kfd */
-       if (!run_pm) {
+       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
                /* For first KFD device suspend all the KFD processes */
                if (atomic_inc_return(&kfd_locked) == 1)
                        kfd_suspend_all_processes(force);
        }

+       if (drm_dev_is_unplugged(kfd->ddev))
+               kfd_kill_all_user_processes();
+
        kfd->dqm->ops.stop(kfd->dqm);
        kfd_iommu_suspend(kfd);
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 55c9e1922714..84cbcd857856 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1065,6 +1065,7 @@ void kfd_unref_process(struct kfd_process *p);
 int kfd_process_evict_queues(struct kfd_process *p, bool force);
 int kfd_process_restore_queues(struct kfd_process *p);
 void kfd_suspend_all_processes(bool force);
+void kfd_kill_all_user_processes(void);
 /*
  * kfd_resume_all_processes:
  *     bool sync: If kfd_resume_all_processes() should wait for the
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 6cdc855abb6d..fb0c753b682c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -2206,6 +2206,24 @@ void kfd_suspend_all_processes(bool force)
        srcu_read_unlock(&kfd_processes_srcu, idx);
 }

+
+void kfd_kill_all_user_processes(void)
+{
+       struct kfd_process *p;
+       struct amdkfd_process_info *p_info;
+       unsigned int temp;
+       int idx = srcu_read_lock(&kfd_processes_srcu);
+
+       pr_info("Killing all processes\n");
+       hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+               p_info = p->kgd_process_info;
+               pr_info("Killing  processes, pid = %d", pid_nr(p_info->pid));
+               kill_pid(p_info->pid, SIGBUS, 1);


From looking into kill_pid I see it only sends a signal but doesn't wait for completion; it would make sense to wait for completion here. In any case I would actually try to put here

I have made a version which does that with some atomic counters. Please read later in the diff.


hash_for_each_rcu(p_info)
    cancel_delayed_work_sync(&p_info->restore_userptr_work)

instead; at least that's what I meant in the previous mail.

I actually tried that earlier, and it did not work. The application still keeps running, and you have to send a kill to the user process.

I have made the following version. It waits for processes to terminate synchronously after sending SIGBUS. After that it does the real work of amdgpu_pci_remove.
However, it hangs at amdgpu_device_ip_fini_early when it is trying to deinit ip_block 6 <sdma_v4_0> (https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L2818). I assume that there is still some in-flight DMA, and the fini of this IP block thus hangs?

The following is an excerpt of the dmesg; please excuse my own pr_info additions, but I hope you get my point of where it hangs.

[  392.344735] amdgpu: all processes has been fully released
[  392.346557] amdgpu: amdgpu_acpi_fini done
[  392.346568] amdgpu 0000:b3:00.0: amdgpu: amdgpu: finishing device.
[  392.349238] amdgpu: amdgpu_device_ip_fini_early enter ip_blocks = 9
[  392.349248] amdgpu: Free mem_obj = 000000007bf54275, range_start = 14, range_end = 14
[  392.350299] amdgpu: Free mem_obj = 00000000a85bc878, range_start = 12, range_end = 12
[  392.350304] amdgpu: Free mem_obj = 00000000b8019e32, range_start = 13, range_end = 13
[  392.350308] amdgpu: Free mem_obj = 000000002d296168, range_start = 4, range_end = 11
[  392.350313] amdgpu: Free mem_obj = 000000001fc4f934, range_start = 0, range_end = 3
[  392.350322] amdgpu: amdgpu_amdkfd_suspend(adev, false) done
[  392.350672] amdgpu: hw_fini of IP block[8] <jpeg_v2_5> done 0
[  392.350679] amdgpu: hw_fini of IP block[7] <vcn_v2_5> done 0


I just remembered that the idea of actively killing and waiting for running user processes during unplug was rejected
as a bad idea in the first iteration of the unplug work I did (don't remember why now, need to look), so this is a no-go.

Maybe an application has kfd open but was not accessing the device, so killing it at unplug could kill the process unnecessarily.
However, the latest version I had, with the sleep function, got rid of the IP block fini hang.

Our policy is to let zombie processes (zombie in the sense that the underlying device is gone) live as long as they want
(as long as you are able to terminate them - which you can, so that's OK),
and the system should finish PCI remove gracefully and be able to hot plug back the device. Hence, I suggest dropping
this direction of forcing all user processes to be killed, confirming you have a graceful shutdown and removal of the device
from the PCI topology, and then concentrating on why it hangs when you plug back.

So I basically reverted to the original solution you suggested.

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 4e7d9cb09a69..5504a18b5a45 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -697,7 +697,7 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
                return;

        /* for runtime suspend, skip locking kfd */
-       if (!run_pm) {
+       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
                /* For first KFD device suspend all the KFD processes */
                if (atomic_inc_return(&kfd_locked) == 1)
                        kfd_suspend_all_processes(force);


First confirm if ASIC reset happens on
next init.

This patch works great for planned plugout, where all the rocm processes are killed before plugout, and the device can be added back without a problem.
However, unplanned plugout while rocm processes are running just doesn't work.


Still, I am not clear on whether an ASIC reset happens on plug back or not; can you trace this please?


I tried adding pr_info calls to the asic_reset functions, but could not trace any upon plug-back.


Second, please confirm whether the timing of when you manually kill the user process has an impact on whether you get a hang
on the next plug back (if you kill before

Scenario 0: Kill before plug back

1. echo 1 > /sys/bus/pci/…/remove would finish,
but the application won’t exit until there is a kill signal.


Why do you think it must exit?

Because rocm needs to release the drm file descriptor to get amdgpu_amdkfd_device_fini_sw called, which would eventually get kgd2kfd_device_exit called. This would clean up the kfd topology at least. Otherwise I don’t see how the device could be added back without messing up the kfd topology, to say the least.

However, this is all based on my own observations. Please explain why it does not need to exit, if you believe so.


2. Kill the process. The application does several things that seem to trigger drm_release in the kernel, which hits a kernel NULL pointer dereference related to sysfs_remove. Then the whole fs just freezes.

[  +0.002498] BUG: kernel NULL pointer dereference, address: 0000000000000098
[  +0.000486] #PF: supervisor read access in kernel mode
[  +0.000545] #PF: error_code(0x0000) - not-present page
[  +0.000551] PGD 0 P4D 0
[  +0.000553] Oops: 0000 [#1] SMP NOPTI
[  +0.000540] CPU: 75 PID: 4911 Comm: kworker/75:2 Tainted: G        W   E     5.13.0-kfd #1
[  +0.000559] Hardware name: INGRASYS         TURING  /MB      , BIOS K71FQ28A 10/05/2021
[  +0.000567] Workqueue: events delayed_fput
[  +0.000563] RIP: 0010:kernfs_find_ns+0x1b/0x100
[  +0.000569] Code: ff ff e8 88 59 9f 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 8b 05 df f0 7b 01 41 56 41 55 49 89 f5 41 54 55 48 89 fd 53 <44> 0f b7 b7 98 00 00 00 48 89 d3 4c 8b 67 70 66 41 83 e6 20 41 0f
[  +0.001193] RSP: 0018:ffffb9875db5fc98 EFLAGS: 00010246
[  +0.000602] RAX: 0000000000000000 RBX: ffffa101f79507d8 RCX: 0000000000000000
[  +0.000612] RDX: 0000000000000000 RSI: ffffffffc09a9417 RDI: 0000000000000000
[  +0.000604] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[  +0.000760] R10: ffffb9875db5fcd0 R11: 0000000000000000 R12: 0000000000000000
[  +0.000597] R13: ffffffffc09a9417 R14: ffffa08363fb2d18 R15: 0000000000000000
[  +0.000702] FS:  0000000000000000(0000) GS:ffffa0ffbfcc0000(0000) knlGS:0000000000000000
[  +0.000666] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000658] CR2: 0000000000000098 CR3: 0000005747812005 CR4: 0000000000770ee0
[  +0.000715] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000655] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000592] PKRU: 55555554
[  +0.000580] Call Trace:
[  +0.000591]  kernfs_find_and_get_ns+0x2f/0x50
[  +0.000584]  sysfs_remove_file_from_group+0x20/0x50
[  +0.000580]  amdgpu_ras_sysfs_remove+0x3d/0xd0 [amdgpu]
[  +0.000737]  amdgpu_ras_late_fini+0x1d/0x40 [amdgpu]
[  +0.000750]  amdgpu_sdma_ras_fini+0x96/0xb0 [amdgpu]
[  +0.000742]  ? gfx_v10_0_resume+0x10/0x10 [amdgpu]
[  +0.000738]  sdma_v4_0_sw_fini+0x23/0x90 [amdgpu]
[  +0.000717]  amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
[  +0.000704]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000687]  drm_dev_release+0x20/0x40 [drm]
[  +0.000583]  drm_release+0xa8/0xf0 [drm]
[  +0.000584]  __fput+0xa5/0x250
[  +0.000621]  delayed_fput+0x1f/0x30
[  +0.000726]  process_one_work+0x26e/0x580
[  +0.000581]  ? process_one_work+0x580/0x580
[  +0.000611]  worker_thread+0x4d/0x3d0
[  +0.000611]  ? process_one_work+0x580/0x580
[  +0.000605]  kthread+0x117/0x150
[  +0.000611]  ? kthread_park+0x90/0x90
[  +0.000619]  ret_from_fork+0x1f/0x30
[  +0.000625] Modules linked in: amdgpu(E) xt_conntrack xt_MASQUERADE nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter x86_pkg_temp_thermal cdc_ether usbnet acpi_pad msr ip_tables x_tables ast drm_vram_helper iommu_v2 drm_ttm_helper gpu_sched ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks [last unloaded: amdgpu]


This is a known regression: all sysfs components must be removed before the pci_remove code runs; otherwise you get either warnings for single-file removals, or
oopses for sysfs group removals like here. Please try to move amdgpu_ras_sysfs_remove from amdgpu_ras_late_fini to the end of amdgpu_ras_pre_fini (which happens before PCI remove).


I fixed it in the newer patch; please see it below.
I first plug out the device, then kill the ROCm user process. Then there are other oopses related to ttm_bo_cleanup_refs.

[  +0.000006] BUG: kernel NULL pointer dereference, address: 0000000000000010
[  +0.000349] #PF: supervisor read access in kernel mode
[  +0.000340] #PF: error_code(0x0000) - not-present page
[  +0.000341] PGD 0 P4D 0
[  +0.000336] Oops: 0000 [#1] SMP NOPTI
[  +0.000345] CPU: 9 PID: 95 Comm: kworker/9:1 Tainted: G        W   E     5.13.0-kfd #1
[  +0.000367] Hardware name: INGRASYS         TURING  /MB      , BIOS K71FQ28A 10/05/2021
[  +0.000376] Workqueue: events delayed_fput
[  +0.000422] RIP: 0010:ttm_resource_free+0x24/0x40 [ttm]
[  +0.000464] Code: 00 00 0f 1f 40 00 0f 1f 44 00 00 53 48 89 f3 48 8b 36 48 85 f6 74 21 48 8b 87 28 02 00 00 48 63 56 10 48 8b bc d0 b8 00 00 00 <48> 8b 47 10 ff 50 08 48 c7 03 00 00 00 00 5b c3 66 66 2e 0f 1f 84
[  +0.001009] RSP: 0018:ffffb21c59413c98 EFLAGS: 00010282
[  +0.000515] RAX: ffff8b1aa4285f68 RBX: ffff8b1a823b5ea0 RCX: 00000000002a000c
[  +0.000536] RDX: 0000000000000000 RSI: ffff8b1acb84db80 RDI: 0000000000000000
[  +0.000539] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffffc03c3e00
[  +0.000543] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b1a823b5ec8
[  +0.000542] R13: 0000000000000000 R14: ffff8b1a823b5d90 R15: ffff8b1a823b5ec8
[  +0.000544] FS:  0000000000000000(0000) GS:ffff8b187f440000(0000) knlGS:0000000000000000
[  +0.000559] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000575] CR2: 0000000000000010 CR3: 00000076e6812004 CR4: 0000000000770ee0
[  +0.000575] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000579] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000575] PKRU: 55555554
[  +0.000568] Call Trace:
[  +0.000567]  ttm_bo_cleanup_refs+0xe4/0x290 [ttm]
[  +0.000588]  ttm_bo_delayed_delete+0x147/0x250 [ttm]
[  +0.000589]  ttm_device_fini+0xad/0x1b0 [ttm]
[  +0.000590]  amdgpu_ttm_fini+0x2a7/0x310 [amdgpu]
[  +0.000730]  gmc_v9_0_sw_fini+0x3a/0x40 [amdgpu]
[  +0.000753]  amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
[  +0.000734]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000737]  drm_dev_release+0x20/0x40 [drm]
[  +0.000626]  drm_release+0xa8/0xf0 [drm]
[  +0.000625]  __fput+0xa5/0x250
[  +0.000606]  delayed_fput+0x1f/0x30
[  +0.000607]  process_one_work+0x26e/0x580
[  +0.000608]  ? process_one_work+0x580/0x580
[  +0.000616]  worker_thread+0x4d/0x3d0
[  +0.000614]  ? process_one_work+0x580/0x580
[  +0.000617]  kthread+0x117/0x150
[  +0.000615]  ? kthread_park+0x90/0x90
[  +0.000621]  ret_from_fork+0x1f/0x30
[  +0.000603] Modules linked in: amdgpu(E) xt_conntrack xt_MASQUERADE nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter x86_pkg_temp_thermal cdc_ether usbnet acpi_pad msr ip_tables x_tables ast drm_vram_helper drm_ttm_helper iommu_v2 ttm gpu_sched drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks [last unloaded: amdgpu]
[  +0.002840] CR2: 0000000000000010
[  +0.000755] ---[ end trace 9737737402551e39 ]--


3.  echo 1 > /sys/bus/pci/rescan. This just hangs; I assume the sysfs is broken.

Based on 1 & 2, it seems that 1 doesn't let amdgpu exit gracefully, because 2 does some cleanup that maybe should have happened before 1.

or if you kill after plug-back, does it make a difference?).

Scenario 2: Kill after plug back

If I perform the rescan before the kill, the driver seems to probe fine. But the kill hits the same issue, messing up the sysfs the same way as in Scenario 0.


Final Comments:

0. cancel_delayed_work_sync(&p_info->restore_userptr_work) would make the repetition of the amdgpu_vm_bo_update failure go away, but it does not solve the issues in those scenarios.


Still - it's better to do it this way, even just for those failures to go away.

cancel_delayed_work is insufficient; you will need to make sure the work won't be processed after plug-out. Please see my patch.


1. For planned hotplug, this patch should work as long as you follow some protocol, i.e. kill before plug-out. Is this patch acceptable, given that it provides more functionality than before?


Let's try to fix more as I advised above.


2. For unplanned hotplug while a ROCm app is running, the patch that kills all processes and waits for 5 seconds works consistently. But it seems to be an unacceptable solution for an official release; I can hold it for our own internal usage. It seems that killing after removal causes problems, and I don't know if there is a quick fix on my side, because of my limited understanding of the amdgpu driver. Maybe AMD could have a quick fix; or it may really be a difficult one. This feature may or may not be a blocking issue in our GPU disaggregation research down the road. Please let us know in either case; we would like to learn and help as much as we can!


I am currently not sure why it helps. I will need to set up my own ROCm environment and retest hot-plug to check this in more depth, but currently I have higher priorities. Please try to confirm that an ASIC reset always takes place on plug-back,
and fix the sysfs oops as I advised above, to clear up at least some of the issues. Also, please describe to me exactly what steps you take to reproduce this scenario, so later I might be able to do it myself.

I can still try to help fix the bug in my spare time. My setup is as follows:


  1.  I have a server with 4 AMD MI100 GPUs. I think 1 GPU would also work.
  2.  I used https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/tree/roc-5.0.x as the starting point, and applied Mukul’s patch and my patch.
  3.  Then I ran a tensorflow benchmark from a docker.
     *   docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/tensorflow:rocm4.5.2-tf1.15-dev
     *   And run the following benchmark in the docker:  python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=4 --batch_size=32 --model=resnet50 --variable_update=parameter_server
        *   You might need to adjust the num_gpus parameter based on your setup
  4.  Remove a GPU at a random time.
  5.  Do whatever is needed before plug-back, and verify the benchmark can still run.

Also, we have a hotplug test suite in libdrm (graphics stack), so maybe you can install libdrm and run that test suite to see if it exposes more issues.

OK, I could try it some time.

The following is the new diff.

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 182b7eae598a..48c3cd4054de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1327,7 +1327,7 @@ int emu_soc_asic_init(struct amdgpu_device *adev);
  * ASICs macro.
  */
 #define amdgpu_asic_set_vga_state(adev, state) (adev)->asic_funcs->set_vga_state((adev), (state))
-#define amdgpu_asic_reset(adev) (adev)->asic_funcs->reset((adev))
+#define amdgpu_asic_reset(adev) ({int r; pr_info("performing amdgpu_asic_reset\n"); r = (adev)->asic_funcs->reset((adev));r;})
 #define amdgpu_asic_reset_method(adev) (adev)->asic_funcs->reset_method((adev))
 #define amdgpu_asic_get_xclk(adev) (adev)->asic_funcs->get_xclk((adev))
 #define amdgpu_asic_set_uvd_clocks(adev, v, d) (adev)->asic_funcs->set_uvd_clocks((adev), (v), (d))
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 27c74fcec455..842abd7150a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -134,6 +134,7 @@ struct amdkfd_process_info {

  /* MMU-notifier related fields */
  atomic_t evicted_bos;
+ atomic_t invalid;
  struct delayed_work restore_userptr_work;
  struct pid *pid;
 };
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 99d2b15bcbf3..2a588eb9f456 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1325,6 +1325,7 @@ static int init_kfd_vm(struct amdgpu_vm *vm, void **process_info,

  info->pid = get_task_pid(current->group_leader, PIDTYPE_PID);
  atomic_set(&info->evicted_bos, 0);
+ atomic_set(&info->invalid, 0);
  INIT_DELAYED_WORK(&info->restore_userptr_work,
   amdgpu_amdkfd_restore_userptr_worker);

@@ -2693,6 +2694,9 @@ static void amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
  struct mm_struct *mm;
  int evicted_bos;

+ if (atomic_read(&process_info->invalid))
+ return;
+
  evicted_bos = atomic_read(&process_info->evicted_bos);
  if (!evicted_bos)
  return;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ec38517ab33f..e7d85d8d282d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1054,6 +1054,7 @@ void amdgpu_device_program_register_sequence(struct amdgpu_device *adev,
  */
 void amdgpu_device_pci_config_reset(struct amdgpu_device *adev)
 {
+ pr_debug("%s called\n",__func__);
  pci_write_config_dword(adev->pdev, 0x7c, AMDGPU_ASIC_RESET_DATA);
 }

@@ -1066,6 +1067,7 @@ void amdgpu_device_pci_config_reset(struct amdgpu_device *adev)
  */
 int amdgpu_device_pci_reset(struct amdgpu_device *adev)
 {
+ pr_debug("%s called\n",__func__);
  return pci_reset_function(adev->pdev);
 }

@@ -4702,6 +4704,8 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
  bool need_full_reset, skip_hw_reset, vram_lost = false;
  int r = 0;

+ pr_debug("%s called\n",__func__);
+
  /* Try reset handler method first */
  tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
     reset_list);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 49bdf9ff7350..b469acb65c1e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2518,7 +2518,6 @@ void amdgpu_ras_late_fini(struct amdgpu_device *adev,
  if (!ras_block || !ih_info)
  return;

- amdgpu_ras_sysfs_remove(adev, ras_block);
  if (ih_info->cb)
  amdgpu_ras_interrupt_remove_handler(adev, ih_info);
 }
@@ -2577,6 +2576,7 @@ void amdgpu_ras_suspend(struct amdgpu_device *adev)
 int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
 {
  struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+ struct ras_manager *obj, *tmp;

  if (!adev->ras_enabled || !con)
  return 0;
@@ -2585,6 +2585,13 @@ int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
  /* Need disable ras on all IPs here before ip [hw/sw]fini */
  amdgpu_ras_disable_all_features(adev, 0);
  amdgpu_ras_recovery_fini(adev);
+
+ /* remove sysfs before pci_remove to avoid OOPSES from sysfs_remove_groups */
+ list_for_each_entry_safe(obj, tmp, &con->head, node) {
+ amdgpu_ras_sysfs_remove(adev, &obj->head);
+ put_obj(obj);
+ }
+
  return 0;
 }

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 4e7d9cb09a69..0fa806a78e39 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -693,16 +693,35 @@ bool kfd_is_locked(void)

 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
 {
+ struct kfd_process *p;
+ struct amdkfd_process_info *p_info;
+ unsigned int temp;
+
  if (!kfd->init_complete)
  return;

  /* for runtime suspend, skip locking kfd */
- if (!run_pm) {
+ if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
  /* For first KFD device suspend all the KFD processes */
  if (atomic_inc_return(&kfd_locked) == 1)
  kfd_suspend_all_processes(force);
  }

+ if (drm_dev_is_unplugged(kfd->ddev)){
+ int idx = srcu_read_lock(&kfd_processes_srcu);
+ pr_debug("cancel restore_userptr_work\n");
+ hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+ if ( kfd_process_gpuidx_from_gpuid(p, kfd->id) >= 0 ) {
+ p_info = p->kgd_process_info;
+ pr_debug("cancel processes, pid = %d for gpu_id = %d", pid_nr(p_info->pid), kfd->id);
+ cancel_delayed_work_sync(&p_info->restore_userptr_work);
+ /* block all future restore_userptr_work */
+ atomic_inc(&p_info->invalid);
+ }
+ }
+ srcu_read_unlock(&kfd_processes_srcu, idx);
+ }
+
  kfd->dqm->ops.stop(kfd->dqm);
  kfd_iommu_suspend(kfd);
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index 600ba2a728ea..7e3d1848eccc 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -669,11 +669,12 @@ static void kfd_remove_sysfs_node_entry(struct kfd_topology_device *dev)
 #ifdef HAVE_AMD_IOMMU_PC_SUPPORTED
  if (dev->kobj_perf) {
  list_for_each_entry(perf, &dev->perf_props, list) {
+ sysfs_remove_group(dev->kobj_perf, perf->attr_group);
  kfree(perf->attr_group);
  perf->attr_group = NULL;
  }
  kobject_del(dev->kobj_perf);
- kobject_put(dev->kobj_perf);
+ /* kobject_put(dev->kobj_perf); */
  dev->kobj_perf = NULL;
  }
 #endif

Thank you so much! Looking forward to your comments!

Regards,
Shuotao

Andrey


Thank you so much!

Best regards,
Shuotao

Andrey



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 8fa9b86ac9d2..c0b27f722281 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -188,6 +188,12 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
  kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
 }

+void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev)
+{
+ if (adev->kfd.dev)
+ kgd2kfd_kill_all_user_processes(adev->kfd.dev);
+}
+
 void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
 {
  if (adev->kfd.dev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 27c74fcec455..f4e485d60442 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -141,6 +141,7 @@ struct amdkfd_process_info {
 int amdgpu_amdkfd_init(void);
 void amdgpu_amdkfd_fini(void);

+void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev);
 void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
 int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
 int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm, bool sync);
@@ -405,6 +406,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
  const struct kgd2kfd_shared_resources *gpu_resources);
 void kgd2kfd_device_exit(struct kfd_dev *kfd);
 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force);
+void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd);
 int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
 int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm, bool sync);
 int kgd2kfd_pre_reset(struct kfd_dev *kfd);
@@ -443,6 +445,9 @@ static inline void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
 {
 }

+static inline void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd){
+}
+
 static int __maybe_unused kgd2kfd_resume_iommu(struct kfd_dev *kfd)
 {
  return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 3d5fc0751829..af6fe5080cfa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2101,6 +2101,9 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 {
  struct drm_device *dev = pci_get_drvdata(pdev);

+ /* kill all kfd processes before drm_dev_unplug */
+ amdgpu_amdkfd_kill_all_processes(drm_to_adev(dev));
+
 #ifdef HAVE_DRM_DEV_UNPLUG
  drm_dev_unplug(dev);
 #else
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 5504a18b5a45..480c23bef5e2 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -691,6 +691,12 @@ bool kfd_is_locked(void)
  return  (atomic_read(&kfd_locked) > 0);
 }

+inline void kgd2kfd_kill_all_user_processes(struct kfd_dev* dev)
+{
+ kfd_kill_all_user_processes();
+}
+
+
 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
 {
  if (!kfd->init_complete)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 55c9e1922714..a35a2cb5bb9f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1064,6 +1064,7 @@ static inline struct kfd_process_device *kfd_process_device_from_gpuidx(
 void kfd_unref_process(struct kfd_process *p);
 int kfd_process_evict_queues(struct kfd_process *p, bool force);
 int kfd_process_restore_queues(struct kfd_process *p);
+void kfd_kill_all_user_processes(void);
 void kfd_suspend_all_processes(bool force);
 /*
  * kfd_resume_all_processes:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 6cdc855abb6d..17e769e6951d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -46,6 +46,9 @@ struct mm_struct;
 #include "kfd_trace.h"
 #include "kfd_debug.h"

+static atomic_t kfd_process_locked = ATOMIC_INIT(0);
+static atomic_t kfd_inflight_kills = ATOMIC_INIT(0);
+
 /*
  * List of struct kfd_process (field kfd_process).
  * Unique/indexed by mm_struct*
@@ -802,6 +805,9 @@ struct kfd_process *kfd_create_process(struct task_struct *thread)
  struct kfd_process *process;
  int ret;

+ if ( atomic_read(&kfd_process_locked) > 0 )
+ return ERR_PTR(-EINVAL);
+
  if (!(thread->mm && mmget_not_zero(thread->mm)))
  return ERR_PTR(-EINVAL);

@@ -1126,6 +1132,10 @@ static void kfd_process_wq_release(struct work_struct *work)
  put_task_struct(p->lead_thread);

  kfree(p);
+
+ if ( atomic_read(&kfd_process_locked) > 0 ){
+ atomic_dec(&kfd_inflight_kills);
+ }
 }

 static void kfd_process_ref_release(struct kref *ref)
@@ -2186,6 +2196,35 @@ static void restore_process_worker(struct work_struct *work)
  pr_err("Failed to restore queues of pasid 0x%x\n", p->pasid);
 }

+void kfd_kill_all_user_processes(void)
+{
+ struct kfd_process *p;
+ /* struct amdkfd_process_info *p_info; */
+ unsigned int temp;
+ int idx;
+ atomic_inc(&kfd_process_locked);
+
+ idx = srcu_read_lock(&kfd_processes_srcu);
+ pr_info("Killing all processes\n");
+ hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+ dev_warn(kfd_device,
+ "Sending SIGBUS to process %d (pasid 0x%x)",
+ p->lead_thread->pid, p->pasid);
+ send_sig(SIGBUS, p->lead_thread, 0);
+ atomic_inc(&kfd_inflight_kills);
+ }
+ srcu_read_unlock(&kfd_processes_srcu, idx);
+
+ while ( atomic_read(&kfd_inflight_kills) > 0 ){
+ dev_warn(kfd_device,
+ "kfd_processes_table is not empty, going to sleep for 10ms\n");
+ msleep(10);
+ }
+
+ atomic_dec(&kfd_process_locked);
+ pr_info("all processes have been fully released\n");
+}
+
 void kfd_suspend_all_processes(bool force)
 {
  struct kfd_process *p;



Regards,
Shuotao



Andrey


+       }
+       srcu_read_unlock(&kfd_processes_srcu, idx);
+}
+
+
 int kfd_resume_all_processes(bool sync)
 {
        struct kfd_process *p;


Andrey



Really appreciate your help!

Best,
Shuotao

2. Remove redundant p2p/io links in sysfs when a device is hotplugged
out.

3. A new kfd node_id is not properly assigned after a new device is
added following a GPU hot-plug-out in a system. libhsakmt will
find this anomaly (i.e. node_from != <dev node id> in iolinks)
when taking a topology_snapshot, and thus returns a fault to the
rocm stack.

-- This patch fixes issue 1; another patch by Mukul fixes issues 2&3.
-- Tested on a 4-GPU MI100 node with kernel 5.13.0-kfd; kernel
5.16.0-kfd is unstable out of the box for MI100.
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 +++++
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 7 +++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 13 +++++++++++++
4 files changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index c18c4be1e4ac..d50011bdb5c4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
return r;
}

+int amdgpu_amdkfd_resume_processes(void)
+{
+ return kgd2kfd_resume_processes();
+}
+
int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
{
int r = 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index f8b9f27adcf5..803306e011c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
+int amdgpu_amdkfd_resume_processes(void);
void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
const void *ih_ring_entry);
void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
@@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
+int kgd2kfd_resume_processes(void);
int kgd2kfd_pre_reset(struct kfd_dev *kfd);
int kgd2kfd_post_reset(struct kfd_dev *kfd);
void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
@@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
return 0;
}

+static inline int kgd2kfd_resume_processes(void)
+{
+ return 0;
+}
+
static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
{
return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index fa4a9f13c922..5827b65b7489 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
if (drm_dev_is_unplugged(adev_to_drm(adev)))
amdgpu_device_unmap_mmio(adev);

+ amdgpu_amdkfd_resume_processes();
}

void amdgpu_device_fini_sw(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 62aa6c9d5123..ef05aae9255e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
return ret;
}

+/* for non-runtime resume only */
+int kgd2kfd_resume_processes(void)
+{
+ int count;
+
+ count = atomic_dec_return(&kfd_locked);
+ WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
+ if (count == 0)
+ return kfd_resume_all_processes();
+
+ return 0;
+}

It doesn't make sense to me to just increment kfd_locked in
kgd2kfd_suspend only to decrement it again a few functions down the
road.

I suggest this instead - you only increment if not during PCI remove:

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 1c2cf3a33c1f..7754f77248a4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -952,11 +952,12 @@ bool kfd_is_locked(void)

void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
{
+
if (!kfd->init_complete)
return;

/* for runtime suspend, skip locking kfd */
- if (!run_pm) {
+ if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
/* For first KFD device suspend all the KFD processes */
if (atomic_inc_return(&kfd_locked) == 1)
kfd_suspend_all_processes();


Andrey



+
int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
{
int err = 0;






* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-18 13:22                     ` Shuotao Xu
@ 2022-04-18 15:23                       ` Andrey Grodzovsky
  2022-04-19  7:41                         ` Shuotao Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Andrey Grodzovsky @ 2022-04-18 15:23 UTC (permalink / raw)
  To: Shuotao Xu
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang



On 2022-04-18 09:22, Shuotao Xu wrote:
>
>
>> On Apr 16, 2022, at 12:43 AM, Andrey Grodzovsky 
>> <andrey.grodzovsky@amd.com> wrote:
>>
>>
>> On 2022-04-15 06:12, Shuotao Xu wrote:
>>> Hi Andrey,
>>>
>>> First I really appreciate the discussion! It helped me understand 
>>> the driver code greatly. Thank you so much:)
>>> Please see my inline comments.
>>>
>>>> On Apr 14, 2022, at 11:13 PM, Andrey Grodzovsky 
>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>
>>>>
>>>> On 2022-04-14 10:00, Shuotao Xu wrote:
>>>>>
>>>>>
>>>>>> On Apr 14, 2022, at 1:31 AM, Andrey Grodzovsky 
>>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>>
>>>>>>
>>>>>> On 2022-04-13 12:03, Shuotao Xu wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky 
>>>>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2022-04-08 21:28, Shuotao Xu wrote:
>>>>>>>>>
>>>>>>>>>> On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky 
>>>>>>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2022-04-08 04:45, Shuotao Xu wrote:
>>>>>>>>>>> Adding PCIe Hotplug Support for AMDKFD: the support of 
>>>>>>>>>>> hot-plug of GPU
>>>>>>>>>>> devices can open doors for many advanced applications in 
>>>>>>>>>>> data center
>>>>>>>>>>> in the next few years, such as for GPU resource
>>>>>>>>>>> disaggregation. Current AMDKFD does not support hotplug out 
>>>>>>>>>>> b/o the
>>>>>>>>>>> following reasons:
>>>>>>>>>>>
>>>>>>>>>>> 1. During PCIe removal, decrement KFD lock which was 
>>>>>>>>>>> incremented at
>>>>>>>>>>> the beginning of hw fini; otherwise kfd_open later is going to
>>>>>>>>>>> fail.
>>>>>>>>>> I assumed you read my comment last time; still, you take the
>>>>>>>>>> same approach. More details below.
>>>>>>>>> Aha, I like your fix :) I was not familiar with the drm APIs, so
>>>>>>>>> I only half understood your comment last time.
>>>>>>>>>
>>>>>>>>> BTW, I tried hot-plugging out a GPU while a rocm application
>>>>>>>>> was still running.
>>>>>>>>> From dmesg, the application is still trying to access the removed
>>>>>>>>> kfd device, and is met with some errors.
>>>>>>>>
>>>>>>>>
>>>>>>>> The application is supposed to keep running; it holds the
>>>>>>>> drm_device reference as long as it has an open FD to the
>>>>>>>> device, and final cleanup will come only after the app dies,
>>>>>>>> thus releasing the FD and the last drm_device reference.
>>>>>>>>
>>>>>>>>> The application would hang and not exit in this case.
>>>>>>>>
>>>>>>>
>>>>>>> Actually I tried kill -7 $pid, and the process exits. The dmesg
>>>>>>> has some warnings though.
>>>>>>>
>>>>>>> [  711.769977] WARNING: CPU: 23 PID: 344 at 
>>>>>>> .../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336 
>>>>>>> amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
>>>>>>> [  711.770528] Modules linked in: amdgpu(OE) amdttm(OE) 
>>>>>>> amd_sched(OE) amdkcl(OE) iommu_v2 nf_conntrack_netlink nfnetlink 
>>>>>>> xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM 
>>>>>>> iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack 
>>>>>>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT 
>>>>>>> nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables 
>>>>>>> ip6table_filter ip6_tables iptable_filter overlay binfmt_misc 
>>>>>>> intel_rapl_msr i40iw intel_rapl_common skx_edac nfit 
>>>>>>> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel rpcrdma 
>>>>>>> kvm sunrpc ipmi_ssif ib_umad ib_ipoib rdma_ucm irqbypass rapl 
>>>>>>> joydev acpi_ipmi input_leds intel_cstate ipmi_si ipmi_devintf 
>>>>>>> mei_me mei intel_pch_thermal ipmi_msghandler ioatdma mac_hid 
>>>>>>> lpc_ich dca acpi_power_meter acpi_pad sch_fq_codel ib_iser 
>>>>>>> rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi 
>>>>>>> scsi_transport_iscsi pci_stub ip_tables x_tables autofs4 btrfs 
>>>>>>> blake2b_generic zstd_compress raid10 raid456 async_raid6_recov 
>>>>>>> async_memcpy async_pq async_xor async_tx xor
>>>>>>> [  711.779359]  raid6_pq libcrc32c raid1 raid0 multipath linear 
>>>>>>> mlx5_ib ib_uverbs ib_core ast drm_vram_helper i2c_algo_bit 
>>>>>>> drm_ttm_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul 
>>>>>>> crc32_pclmul ghash_clmulni_intel sysfillrect uas hid_generic 
>>>>>>> sysimgblt aesni_intel mlx5_core fb_sys_fops crypto_simd usbhid 
>>>>>>> cryptd drm i40e pci_hyperv_intf usb_storage glue_helper mlxfw 
>>>>>>> hid ahci libahci wmi
>>>>>>> [  711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: G  W 
>>>>>>>  OE     5.11.0+ #1
>>>>>>> [  711.779755] Hardware name: Supermicro 
>>>>>>> SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
>>>>>>> [  711.779756] Workqueue: kfd_process_wq kfd_process_wq_release 
>>>>>>> [amdgpu]
>>>>>>> [  711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160 
>>>>>>> [amdgpu]
>>>>>>> [  711.780141] Code: e8 b5 af 34 f4 e9 1f ff ff ff 48 39 c2 74 
>>>>>>> 07 0f 0b e9 69 ff ff ff 4c 89 e7 e8 3c b4 16 00 e9 5c ff ff ff 
>>>>>>> e8 a2 ce fd f3 eb cf <0f> 0b eb cb e8 d7 02 34 f4 0f 1f 80 00 00 
>>>>>>> 00 00 0f 1f 44 00 00 55
>>>>>>> [  711.780143] RSP: 0018:ffffa8100dd67c30 EFLAGS: 00010282
>>>>>>> [  711.780145] RAX: 00000000ffffffea RBX: ffff89980e792058 RCX: 
>>>>>>> 0000000000000000
>>>>>>> [  711.780147] RDX: 0000000000000000 RSI: ffff89a8f9ad8870 RDI: 
>>>>>>> ffff89a8f9ad8870
>>>>>>> [  711.780148] RBP: ffffa8100dd67c50 R08: 0000000000000000 R09: 
>>>>>>> fffffffffff99b18
>>>>>>> [  711.780149] R10: ffffa8100dd67bd0 R11: ffffa8100dd67908 R12: 
>>>>>>> ffff89980e792000
>>>>>>> [  711.780151] R13: ffff89980e792058 R14: ffff89980e7921bc R15: 
>>>>>>> dead000000000100
>>>>>>> [  711.780152] FS:  0000000000000000(0000) 
>>>>>>> GS:ffff89a8f9ac0000(0000) knlGS:0000000000000000
>>>>>>> [  711.780154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>> [  711.780156] CR2: 00007ffddac6f71f CR3: 00000030bb80a003 CR4: 
>>>>>>> 00000000007706e0
>>>>>>> [  711.780157] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
>>>>>>> 0000000000000000
>>>>>>> [  711.780159] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
>>>>>>> 0000000000000400
>>>>>>> [  711.780160] PKRU: 55555554
>>>>>>> [  711.780161] Call Trace:
>>>>>>> [  711.780163]  ttm_bo_release+0x2ae/0x320 [amdttm]
>>>>>>> [  711.780169]  amdttm_bo_put+0x30/0x40 [amdttm]
>>>>>>> [  711.780357]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
>>>>>>> [  711.780543]  amdgpu_gem_object_free+0x8b/0x160 [amdgpu]
>>>>>>> [  711.781119]  drm_gem_object_free+0x1d/0x30 [drm]
>>>>>>> [  711.781489] 
>>>>>>>  amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x34a/0x380 [amdgpu]
>>>>>>> [  711.782044]  kfd_process_device_free_bos+0xe0/0x130 [amdgpu]
>>>>>>> [  711.782611]  kfd_process_wq_release+0x286/0x380 [amdgpu]
>>>>>>> [  711.783172]  process_one_work+0x236/0x420
>>>>>>> [  711.783543]  worker_thread+0x34/0x400
>>>>>>> [  711.783911]  ? process_one_work+0x420/0x420
>>>>>>> [  711.784279]  kthread+0x126/0x140
>>>>>>> [  711.784653]  ? kthread_park+0x90/0x90
>>>>>>> [  711.785018]  ret_from_fork+0x22/0x30
>>>>>>> [  711.785387] ---[ end trace d8f50f6594817c84 ]---
>>>>>>> [  711.798716] [drm] amdgpu: ttm finalized
>>>>>>
>>>>>>
>>>>>> So it means the process was stuck in some wait_event_killable 
>>>>>> (maybe here drm_sched_entity_flush) - you can try 
>>>>>> 'cat /proc/$process_pid/stack' maybe before
>>>>>> you kill it to see where it was stuck so we can go from there.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> For graphics apps what I usually see is a crash because of a
>>>>>>>> SIGSEGV when
>>>>>>>> the app tries to access
>>>>>>>> an unmapped MMIO region on the device. I haven't tested the compute
>>>>>>>> stack, so there might
>>>>>>>> be something I haven't covered. A hang could mean, for example,
>>>>>>>> waiting on a
>>>>>>>> fence which is not being
>>>>>>>> signaled - please provide the full dmesg from this case.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Do you have any good suggestions on how to fix it down the 
>>>>>>>>> line? (HIP runtime/libhsakmt or driver)
>>>>>>>>>
>>>>>>>>> [64036.631333] amdgpu: amdgpu_vm_bo_update failed
>>>>>>>>> [64036.631702] amdgpu: validate_invalid_user_pages: update PTE 
>>>>>>>>> failed
>>>>>>>>> [64036.640754] amdgpu: amdgpu_vm_bo_update failed
>>>>>>>>> [64036.641120] amdgpu: validate_invalid_user_pages: update PTE 
>>>>>>>>> failed
>>>>>>>>> [64036.650394] amdgpu: amdgpu_vm_bo_update failed
>>>>>>>>> [64036.650765] amdgpu: validate_invalid_user_pages: update PTE 
>>>>>>>>> failed
>>>>>>>>
>>>>>>>
>>>>>>> The full dmesg is just the repetition of those two messages:
>>>>>>> [186885.764079] amdgpu 0000:43:00.0: amdgpu: amdgpu: finishing 
>>>>>>> device.
>>>>>>> [186885.766916] [drm] free PSP TMR buffer
>>>>>>> [186893.868173] amdgpu: amdgpu_vm_bo_update failed
>>>>>>> [186893.868235] amdgpu: validate_invalid_user_pages: update PTE 
>>>>>>> failed
>>>>>>> [186893.876154] amdgpu: amdgpu_vm_bo_update failed
>>>>>>> [186893.876190] amdgpu: validate_invalid_user_pages: update PTE 
>>>>>>> failed
>>>>>>> [186893.884150] amdgpu: amdgpu_vm_bo_update failed
>>>>>>> [186893.884185] amdgpu: validate_invalid_user_pages: update PTE 
>>>>>>> failed
>>>>>>>
>>>>>>>>
>>>>>>>> This probably just means trying to update PTEs after the
>>>>>>>> physical device
>>>>>>>> is gone - we usually avoid this by
>>>>>>>> first trying to do all HW shutdowns early before PCI remove
>>>>>>>> completion,
>>>>>>>> or, when that's really tricky, by
>>>>>>>> protecting HW access sections with a drm_dev_enter/exit scope.
>>>>>>>>
>>>>>>>> For this particular error it would be best to flush
>>>>>>>> info->restore_userptr_work before the end of
>>>>>>>> amdgpu_pci_remove (rejecting new process creation and calling
>>>>>>>> cancel_delayed_work_sync(&process_info->restore_userptr_work) 
>>>>>>>> for all
>>>>>>>> running processes)
>>>>>>>> somewhere in amdgpu_pci_remove.
>>>>>>>>
>>>>>>> I tried something like *kfd_process_ref_release* which I think 
>>>>>>> did what you described, but it did not work.
>>>>>>
>>>>>>
>>>>>> I don't see how kfd_process_ref_release is the same as what I
>>>>>> mentioned above; what I meant is calling the code above within
>>>>>> kgd2kfd_suspend (where you tried to call
>>>>>> kfd_kill_all_user_processes below)
>>>>>>
>>>>> Yes, you are right. It was not called by it.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Instead I tried to kill the process from the kernel, but amdgpu
>>>>>>> could be hot-plugged back in successfully only
>>>>>>> if there was no rocm kernel running when it was plugged out. If
>>>>>>> not, amdgpu_probe will just hang later. (Maybe because amdgpu
>>>>>>> was plugged out in a running state, it left a bad HW state
>>>>>>> which causes probe to hang.)
>>>>>>
>>>>>>
>>>>>> We usually do an asic_reset during probe to reset all HW state
>>>>>> (check if amdgpu_device_init->amdgpu_asic_reset is running when
>>>>>> you plug back).
>>>>>>
>>>>> OK
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I don’t know if this is a viable solution worth pursuing, but I 
>>>>>>> attached the diff anyway.
>>>>>>>
>>>>>>> Another solution could be to let the compute-stack user mode detect
>>>>>>> a topology change via a generation_count change, and abort
>>>>>>> gracefully there.
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> index 4e7d9cb09a69..79b4c9b84cd0 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> @@ -697,12 +697,15 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, 
>>>>>>> bool run_pm, bool force)
>>>>>>> return;
>>>>>>>
>>>>>>>         /* for runtime suspend, skip locking kfd */
>>>>>>> -       if (!run_pm) {
>>>>>>> +       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>>>>>                 /* For first KFD device suspend all the KFD processes */
>>>>>>>                 if (atomic_inc_return(&kfd_locked) == 1)
>>>>>>>                         kfd_suspend_all_processes(force);
>>>>>>>         }
>>>>>>>
>>>>>>> +       if (drm_dev_is_unplugged(kfd->ddev))
>>>>>>> +               kfd_kill_all_user_processes();
>>>>>>> +
>>>>>>>         kfd->dqm->ops.stop(kfd->dqm);
>>>>>>>         kfd_iommu_suspend(kfd);
>>>>>>>  }
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>>>> index 55c9e1922714..84cbcd857856 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>>>> @@ -1065,6 +1065,7 @@ void kfd_unref_process(struct kfd_process *p);
>>>>>>>  int kfd_process_evict_queues(struct kfd_process *p, bool force);
>>>>>>>  int kfd_process_restore_queues(struct kfd_process *p);
>>>>>>>  void kfd_suspend_all_processes(bool force);
>>>>>>> +void kfd_kill_all_user_processes(void);
>>>>>>>  /*
>>>>>>>   * kfd_resume_all_processes:
>>>>>>>   *     bool sync: If kfd_resume_all_processes() should wait for the
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>>>> index 6cdc855abb6d..fb0c753b682c 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>>>> @@ -2206,6 +2206,24 @@ void kfd_suspend_all_processes(bool force)
>>>>>>> srcu_read_unlock(&kfd_processes_srcu, idx);
>>>>>>>  }
>>>>>>>
>>>>>>> +
>>>>>>> +void kfd_kill_all_user_processes(void)
>>>>>>> +{
>>>>>>> +       struct kfd_process *p;
>>>>>>> +       struct amdkfd_process_info *p_info;
>>>>>>> +       unsigned int temp;
>>>>>>> +       int idx = srcu_read_lock(&kfd_processes_srcu);
>>>>>>> +
>>>>>>> +       pr_info("Killing all processes\n");
>>>>>>> +       hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>>>>>>> +               p_info = p->kgd_process_info;
>>>>>>> +               pr_info("Killing process, pid = %d", pid_nr(p_info->pid));
>>>>>>> +               kill_pid(p_info->pid, SIGBUS, 1);
>>>>>>
>>>>>>
>>>>>> From looking into kill_pid I see it only sends a signal but
>>>>>> doesn't wait for completion; it would make sense to wait for
>>>>>> completion here. In any case I would actually try to put here
>>>>>>
>>>>> I have made a version which does that with some atomic counters. 
>>>>> Please see it later in the diff.
>>>>>>
>>>>>>
>>>>>> hash_for_each_rcu(p_info)
>>>>>> cancel_delayed_work_sync(&p_info->restore_userptr_work)
>>>>>>
>>>>>> instead - at least that's what I meant in the previous mail.
>>>>>>
>>>>> I actually tried that earlier, and it did not work. The application
>>>>> still keeps running, and you have to send a kill to the user process.
>>>>>
>>>>> I have made the following version. It waits for processes to 
>>>>> terminate synchronously after sending SIGBUS. After that it does 
>>>>> the real work of amdgpu_pci_remove.
>>>>> However, it hangs at amdgpu_device_ip_fini_early when it is trying
>>>>> to deinit ip_block 6 <sdma_v4_0>
>>>>> (https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L2818).
>>>>> I assume there is still some in-flight DMA, and fini of
>>>>> this IP block therefore hangs?
>>>>>
>>>>> The following is an excerpt of the dmesg: please excuse my own
>>>>> pr_info messages, but I hope you get the point of where it hangs.
>>>>>
>>>>> [  392.344735] amdgpu: all processes has been fully released
>>>>> [  392.346557] amdgpu: amdgpu_acpi_fini done
>>>>> [  392.346568] amdgpu 0000:b3:00.0: amdgpu: amdgpu: finishing device.
>>>>> [  392.349238] amdgpu: amdgpu_device_ip_fini_early enter ip_blocks = 9
>>>>> [  392.349248] amdgpu: Free mem_obj = 000000007bf54275, 
>>>>> range_start = 14, range_end = 14
>>>>> [  392.350299] amdgpu: Free mem_obj = 00000000a85bc878, 
>>>>> range_start = 12, range_end = 12
>>>>> [  392.350304] amdgpu: Free mem_obj = 00000000b8019e32, 
>>>>> range_start = 13, range_end = 13
>>>>> [  392.350308] amdgpu: Free mem_obj = 000000002d296168, 
>>>>> range_start = 4, range_end = 11
>>>>> [  392.350313] amdgpu: Free mem_obj = 000000001fc4f934, 
>>>>> range_start = 0, range_end = 3
>>>>> [  392.350322] amdgpu: amdgpu_amdkfd_suspend(adev, false) done
>>>>> [  392.350672] amdgpu: hw_fini of IP block[8] <jpeg_v2_5> done 0
>>>>> [  392.350679] amdgpu: hw_fini of IP block[7] <vcn_v2_5> done 0
>>>>
>>>>
>>>> I just remembered that the idea of actively killing and waiting for
>>>> running user processes during unplug was rejected
>>>> as a bad idea in the first iteration of the unplug work I did (don't
>>>> remember why now, need to look), so this is a no-go.
>>>>
>>> Maybe an application has kfd open but is not accessing the device, so
>>> killing it at unplug could kill the process unnecessarily.
>>> However, the latest version I had with the sleep function got rid of
>>> the IP block fini hang.
>>>>
>>>> Our policy is to let zombie processes (zombie in the sense that the
>>>> underlying device is gone) live as long as they want
>>>> (as long as you are able to terminate them - which you can, so that's ok)
>>>> and the system should finish PCI remove gracefully and be able to
>>>> hot plug back the device. Hence, I suggest dropping
>>>> this direction of forcing all user processes to be killed, confirming
>>>> that you have graceful shutdown and removal of the device
>>>> from the PCI topology, and then concentrating on why it hangs when
>>>> you plug back.
>>>>
>>> So I basically reverted back to the original solution you suggested.
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index 4e7d9cb09a69..5504a18b5a45 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -697,7 +697,7 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool 
>>> run_pm, bool force)
>>>                 return;
>>>
>>>         /* for runtime suspend, skip locking kfd */
>>> -       if (!run_pm) {
>>> +       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>                 /* For first KFD device suspend all the KFD processes */
>>>                 if (atomic_inc_return(&kfd_locked) == 1)
>>>                         kfd_suspend_all_processes(force);
>>>
>>>> First confirm if ASIC reset happens on
>>>> next init.
>>>>
>>> This patch works great for a *planned* plugout, where all the rocm
>>> processes are killed before plugout, and the device can be added back
>>> without a problem.
>>> However, an *unplanned* plugout while rocm processes are running
>>> just doesn't work.
>>
>>
>> Still, I am not clear whether an ASIC reset happens on plug back or
>> not; can you trace this please?
>>
>>
>
> I tried adding pr_info into the asic_reset functions, but could not
> trace any call upon plug-back.


This could possibly explain the hang on plug back. Can you see why we
don't get there?


>>
>>>> Second, please confirm whether the timing of when you manually kill
>>>> the user process has an impact on whether you get a hang
>>>> on the next plug back (if you kill before
>>>>
>>> *Scenario 0: Kill before plug back*
>>>
>>> 1. echo 1 > /sys/bus/pci/…/remove would finish.
>>> But the application won't exit until there is a kill signal.
>>
>>
>> Why do you think it must exit?
>>
> Because rocm needs to release the drm descriptor to
> get amdgpu_amdkfd_device_fini_sw called, which would eventually get
> kgd2kfd_device_exit called. This would clean up the kfd topology at
> least. Otherwise I don't see how the device could be added back without
> messing up the kfd topology, to say the least.
>
> However, those are all based on my own observations. Please explain why
> it does not need to exit if you believe so.


Note that when you add back a new device, a pci device and a drm device
are created. I am not an expert on the KFD code, but I believe a new KFD
device is also created, independent of the old one, and so the topology
should see just 2 device instances (one old zombie and one real new
one). I know at least this wasn't an issue for the graphics stack in the
exact same scenario, and the libdrm tests I pointed to test exactly this
scenario. Also note that even with a running graphics stack there is
always a KFD device and KFD topology present, but of course probably not
the same as when you run a KFD-facing process, so there could be some
issues there.

Also note that because of this patch
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=267d51d77fdae8708b94e1a24b8e5d961297edb7
all MMIO accesses from such zombie/orphan user processes will be
remapped to the zero page, so they will not necessarily experience a
segfault when device removal happens, but rather maybe a crash later due
to NULL data being read from MMIO and used in some manner.


>>>
>>> 2. Kill the process. The application does several things and
>>> seems to trigger drm_release in the kernel, which is met with a kernel
>>> NULL pointer dereference related to sysfs_remove. Then the whole fs
>>> just freezes.
>>>
>>> [  +0.002498] BUG: kernel NULL pointer dereference, address: 
>>> 0000000000000098
>>> [  +0.000486] #PF: supervisor read access in kernel mode
>>> [  +0.000545] #PF: error_code(0x0000) - not-present page
>>> [  +0.000551] PGD 0 P4D 0
>>> [  +0.000553] Oops: 0000 [#1] SMP NOPTI
>>> [  +0.000540] CPU: 75 PID: 4911 Comm: kworker/75:2 Tainted: G       
>>>  W   E 5.13.0-kfd #1
>>> [  +0.000559] Hardware name: INGRASYS         TURING  /MB      , 
>>> BIOS K71FQ28A 10/05/2021
>>> [  +0.000567] Workqueue: events delayed_fput
>>> [  +0.000563] RIP: 0010:kernfs_find_ns+0x1b/0x100
>>> [  +0.000569] Code: ff ff e8 88 59 9f 00 0f 1f 84 00 00 00 00 00 0f 
>>> 1f 44 00 00 41 57 8b 05 df f0 7b 01 41 56 41 55 49 89 f5 41 54 55 48 
>>> 89 fd 53 <44> 0f b7 b7 98 00 00 00 48 89 d3 4c 8b 67 70 66 41 83 e6 
>>> 20 41 0f
>>> [  +0.001193] RSP: 0018:ffffb9875db5fc98 EFLAGS: 00010246
>>> [  +0.000602] RAX: 0000000000000000 RBX: ffffa101f79507d8 RCX: 
>>> 0000000000000000
>>> [  +0.000612] RDX: 0000000000000000 RSI: ffffffffc09a9417 RDI: 
>>> 0000000000000000
>>> [  +0.000604] RBP: 0000000000000000 R08: 0000000000000001 R09: 
>>> 0000000000000000
>>> [  +0.000760] R10: ffffb9875db5fcd0 R11: 0000000000000000 R12: 
>>> 0000000000000000
>>> [  +0.000597] R13: ffffffffc09a9417 R14: ffffa08363fb2d18 R15: 
>>> 0000000000000000
>>> [  +0.000702] FS:  0000000000000000(0000) GS:ffffa0ffbfcc0000(0000) 
>>> knlGS:0000000000000000
>>> [  +0.000666] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  +0.000658] CR2: 0000000000000098 CR3: 0000005747812005 CR4: 
>>> 0000000000770ee0
>>> [  +0.000715] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
>>> 0000000000000000
>>> [  +0.000655] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
>>> 0000000000000400
>>> [  +0.000592] PKRU: 55555554
>>> [  +0.000580] Call Trace:
>>> [  +0.000591]  kernfs_find_and_get_ns+0x2f/0x50
>>> [  +0.000584]  sysfs_remove_file_from_group+0x20/0x50
>>> [  +0.000580]  amdgpu_ras_sysfs_remove+0x3d/0xd0 [amdgpu]
>>> [  +0.000737]  amdgpu_ras_late_fini+0x1d/0x40 [amdgpu]
>>> [  +0.000750]  amdgpu_sdma_ras_fini+0x96/0xb0 [amdgpu]
>>> [  +0.000742]  ? gfx_v10_0_resume+0x10/0x10 [amdgpu]
>>> [  +0.000738]  sdma_v4_0_sw_fini+0x23/0x90 [amdgpu]
>>> [  +0.000717]  amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
>>> [  +0.000704]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
>>> [  +0.000687]  drm_dev_release+0x20/0x40 [drm]
>>> [  +0.000583]  drm_release+0xa8/0xf0 [drm]
>>> [  +0.000584]  __fput+0xa5/0x250
>>> [  +0.000621]  delayed_fput+0x1f/0x30
>>> [  +0.000726]  process_one_work+0x26e/0x580
>>> [  +0.000581]  ? process_one_work+0x580/0x580
>>> [  +0.000611]  worker_thread+0x4d/0x3d0
>>> [  +0.000611]  ? process_one_work+0x580/0x580
>>> [  +0.000605]  kthread+0x117/0x150
>>> [  +0.000611]  ? kthread_park+0x90/0x90
>>> [  +0.000619]  ret_from_fork+0x1f/0x30
>>> [  +0.000625] Modules linked in: amdgpu(E) xt_conntrack 
>>> xt_MASQUERADE nfnetlink xt_addrtype iptable_filter iptable_nat 
>>> nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter 
>>> x86_pkg_temp_thermal cdc_ether usbnet acpi_pad msr ip_tables 
>>> x_tables ast drm_vram_helper iommu_v2 drm_ttm_helper gpu_sched ttm 
>>> drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect 
>>> sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientati
>>> on_quirks [last unloaded: amdgpu]
>>
>>
>> This is a known regression; all SYSFS components must be removed
>> before the pci_remove code runs, otherwise you get either warnings for
>> single file removals or
>> Oopses for sysfs group removals like here. Please try to move
>> amdgpu_ras_sysfs_remove from amdgpu_ras_late_fini to the end of
>> amdgpu_ras_pre_fini (which happens before pci remove).
>>
>>
>
> I fixed it in the newer patch, please see it below.



> I first plug out the device, then kill the rocm user process. Then it
> hits other Oopses related to ttm_bo_cleanup_refs.
>
> [  +0.000006] BUG: kernel NULL pointer dereference, address: 
> 0000000000000010
> [  +0.000349] #PF: supervisor read access in kernel mode
> [  +0.000340] #PF: error_code(0x0000) - not-present page
> [  +0.000341] PGD 0 P4D 0
> [  +0.000336] Oops: 0000 [#1] SMP NOPTI
> [  +0.000345] CPU: 9 PID: 95 Comm: kworker/9:1 Tainted: G  W   E     
> 5.13.0-kfd #1
> [  +0.000367] Hardware name: INGRASYS         TURING  /MB      , BIOS 
> K71FQ28A 10/05/2021
> [  +0.000376] Workqueue: events delayed_fput
> [  +0.000422] RIP: 0010:ttm_resource_free+0x24/0x40 [ttm]
> [  +0.000464] Code: 00 00 0f 1f 40 00 0f 1f 44 00 00 53 48 89 f3 48 8b 
> 36 48 85 f6 74 21 48 8b 87 28 02 00 00 48 63 56 10 48 8b bc d0 b8 00 
> 00 00 <48> 8b 47 10 ff 50 08 48 c7 03 00 00 00 00 5b c3 66 66 2e 0f 1f 84
> [  +0.001009] RSP: 0018:ffffb21c59413c98 EFLAGS: 00010282
> [  +0.000515] RAX: ffff8b1aa4285f68 RBX: ffff8b1a823b5ea0 RCX: 
> 00000000002a000c
> [  +0.000536] RDX: 0000000000000000 RSI: ffff8b1acb84db80 RDI: 
> 0000000000000000
> [  +0.000539] RBP: 0000000000000001 R08: 0000000000000000 R09: 
> ffffffffc03c3e00
> [  +0.000543] R10: 0000000000000000 R11: 0000000000000000 R12: 
> ffff8b1a823b5ec8
> [  +0.000542] R13: 0000000000000000 R14: ffff8b1a823b5d90 R15: 
> ffff8b1a823b5ec8
> [  +0.000544] FS:  0000000000000000(0000) GS:ffff8b187f440000(0000) 
> knlGS:0000000000000000
> [  +0.000559] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  +0.000575] CR2: 0000000000000010 CR3: 00000076e6812004 CR4: 
> 0000000000770ee0
> [  +0.000575] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> 0000000000000000
> [  +0.000579] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
> 0000000000000400
> [  +0.000575] PKRU: 55555554
> [  +0.000568] Call Trace:
> [  +0.000567]  ttm_bo_cleanup_refs+0xe4/0x290 [ttm]
> [  +0.000588]  ttm_bo_delayed_delete+0x147/0x250 [ttm]
> [  +0.000589]  ttm_device_fini+0xad/0x1b0 [ttm]
> [  +0.000590]  amdgpu_ttm_fini+0x2a7/0x310 [amdgpu]
> [  +0.000730]  gmc_v9_0_sw_fini+0x3a/0x40 [amdgpu]
> [  +0.000753]  amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
> [  +0.000734]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
> [  +0.000737]  drm_dev_release+0x20/0x40 [drm]
> [  +0.000626]  drm_release+0xa8/0xf0 [drm]
> [  +0.000625]  __fput+0xa5/0x250
> [  +0.000606]  delayed_fput+0x1f/0x30
> [  +0.000607]  process_one_work+0x26e/0x580
> [  +0.000608]  ? process_one_work+0x580/0x580
> [  +0.000616]  worker_thread+0x4d/0x3d0
> [  +0.000614]  ? process_one_work+0x580/0x580
> [  +0.000617]  kthread+0x117/0x150
> [  +0.000615]  ? kthread_park+0x90/0x90
> [  +0.000621]  ret_from_fork+0x1f/0x30
> [  +0.000603] Modules linked in: amdgpu(E) xt_conntrack xt_MASQUERADE 
> nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack 
> nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter x86_pkg_temp_thermal 
> cdc_ether usbnet acpi_pad msr ip_tables x_tables ast drm_vram_helper 
> drm_ttm_helper iommu_v2 ttm gpu_sched drm_kms_helper cfbfillrect 
> syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea 
> drm drm_panel_orientation_quirks [last unloaded: amdgpu]
> [  +0.002840] CR2: 0000000000000010
> [  +0.000755] ---[ end trace 9737737402551e39 ]--


This looks like another regression - try to see where the NULL
dereference comes from and then we can see how to avoid it.


>
>>>
>>> 3. echo 1 > /sys/bus/pci/rescan. This would just hang. I assume the
>>> sysfs is broken.
>>>
>>> Based on 1 & 2, it seems that 1 won't let amdgpu exit
>>> gracefully, because 2 does some cleanup that maybe should have
>>> happened before 1.
>>>>
>>>> or after plug back, does it make a difference).
>>>>
>>> *Scenario 2: Kill after plug back*
>>>
>>> If I perform the rescan before the kill, then the driver seems to
>>> probe fine. But the kill hits the same issue, which messes up the
>>> sysfs the same way as in Scenario 0.
>>>
>>>
>>> *Final Comments:*
>>>
>>> 0. cancel_delayed_work_sync(&p_info->restore_userptr_work) 
>>> would make the repetition of the amdgpu_vm_bo_update failures go away, but
>>> it does not solve the issues in those scenarios.
>>
>>
>> Still - it's better to do it this way even just for those failures to go away.
>>
>>
> cancel_delayed_work is insufficient; you will need to make sure the
> work won't be processed after plugout. Please see my patch.


Saw, see my comment.


>>
>>>
>>> 1. For planned hotplug, this patch should work as long as you follow
>>> some protocol, i.e. kill before plugout. Is this patch acceptable,
>>> since it provides some added functionality over before?
>>
>>
>> Let's try to fix more as I advised above.
>>
>>
>>>
>>> 2. For unplanned hotplug while rocm apps are running, the patch
>>> that kills all processes and waits 5 sec works consistently.
>>> But it seems that it is an unacceptable solution for an official
>>> release; I can hold it for our own internal usage. It seems that
>>> killing after removal causes problems, and I don't know if there
>>> is a quick fix by me, because of my limited understanding of the
>>> amdgpu driver. Maybe AMD could have a quick fix, or it is really a
>>> difficult one. This feature may or may not be a blocking issue in
>>> our GPU disaggregation research down the road. Please let us know in
>>> either case, and we would like to learn and help as much as we can!
>>
>>
>> I am currently not sure why it helps. I will need to set up my own
>> ROCm setup and retest hot plug to check this in more depth, but
>> currently I have higher priorities. Please try to confirm that an
>> ASIC reset always takes place on plug back,
>> and fix the sysfs Oops as I advised above, to clear up at least some
>> of the issues. Also please describe to me exactly what steps you take
>> to reproduce this scenario so later I might be able to do it myself.
>>
>>
> I can still try to help fix the bug in my spare time. My setup is
> as follows:
>
>  1. I have a server with 4 AMD MI100 GPUs. I think 1 GPU would also work.
>  2. I used
>     https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/tree/roc-5.0.x
>     as the starting point, and applied Mukul’s patch and my patch.
>  3. Then I run a tensorflow benchmark from a docker.
>       * docker run -it --device=/dev/kfd --device=/dev/dri --group-add
>         video rocm/tensorflow:rocm4.5.2-tf1.15-dev
>       * And run the following benchmark in the docker:  python
>         benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
>         --num_gpus=4 --batch_size=32 --model=resnet50
>         --variable_update=parameter_server
>           o Might need to adjust the num_gpus parameter based on your setup
>  4. Remove a GPU at random time.
>  5. Do whatever is needed before plugback and verify that the
>     benchmark can still run.
>
>> Also, we have a hotplug test suite in libdrm (graphics stack), so
>> maybe you can install libdrm and run that test suite to see if it
>> exposes more issues.
>>
> OK I could try it some time.
>
> The following is the new diff.
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 182b7eae598a..48c3cd4054de 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1327,7 +1327,7 @@ int emu_soc_asic_init(struct amdgpu_device *adev);
>   * ASICs macro.
>   */
>  #define amdgpu_asic_set_vga_state(adev, state) 
> (adev)->asic_funcs->set_vga_state((adev), (state))
> -#define amdgpu_asic_reset(adev) (adev)->asic_funcs->reset((adev))
> +#define amdgpu_asic_reset(adev) ({int r; pr_info("performing 
> amdgpu_asic_reset\n"); r = (adev)->asic_funcs->reset((adev));r;})
>  #define amdgpu_asic_reset_method(adev) 
> (adev)->asic_funcs->reset_method((adev))
>  #define amdgpu_asic_get_xclk(adev) (adev)->asic_funcs->get_xclk((adev))
>  #define amdgpu_asic_set_uvd_clocks(adev, v, d) 
> (adev)->asic_funcs->set_uvd_clocks((adev), (v), (d))
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 27c74fcec455..842abd7150a6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -134,6 +134,7 @@ struct amdkfd_process_info {
>  	/* MMU-notifier related fields */
>  	atomic_t evicted_bos;
> +	atomic_t invalid;
>  	struct delayed_work restore_userptr_work;
>  	struct pid *pid;
>  };
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 99d2b15bcbf3..2a588eb9f456 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1325,6 +1325,7 @@ static int init_kfd_vm(struct amdgpu_vm *vm, 
> void **process_info,
>  	info->pid = get_task_pid(current->group_leader, PIDTYPE_PID);
>  	atomic_set(&info->evicted_bos, 0);
> +	atomic_set(&info->invalid, 0);
>  	INIT_DELAYED_WORK(&info->restore_userptr_work,
>  			  amdgpu_amdkfd_restore_userptr_worker);
> @@ -2693,6 +2694,9 @@ static void 
> amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
>  	struct mm_struct *mm;
>  	int evicted_bos;
> +	if (atomic_read(&process_info->invalid))
> +		return;
> +


Probably better to again use a drm_dev_enter/exit guard pair instead of
this flag.


>  	evicted_bos = atomic_read(&process_info->evicted_bos);
>  	if (!evicted_bos)
>  		return;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index ec38517ab33f..e7d85d8d282d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1054,6 +1054,7 @@ void 
> amdgpu_device_program_register_sequence(struct amdgpu_device *adev,
>   */
>  void amdgpu_device_pci_config_reset(struct amdgpu_device *adev)
>  {
> +	pr_debug("%s called\n", __func__);
>  	pci_write_config_dword(adev->pdev, 0x7c, AMDGPU_ASIC_RESET_DATA);
>  }
> @@ -1066,6 +1067,7 @@ void amdgpu_device_pci_config_reset(struct 
> amdgpu_device *adev)
>   */
>  int amdgpu_device_pci_reset(struct amdgpu_device *adev)
>  {
> +	pr_debug("%s called\n", __func__);
>  	return pci_reset_function(adev->pdev);
>  }
> @@ -4702,6 +4704,8 @@ int amdgpu_do_asic_reset(struct list_head 
> *device_list_handle,
> bool need_full_reset, skip_hw_reset, vram_lost = false;
> int r = 0;
> +pr_debug("%s called\n",__func__);
> +
> /* Try reset handler method first */
> tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
>  reset_list);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 49bdf9ff7350..b469acb65c1e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2518,7 +2518,6 @@ void amdgpu_ras_late_fini(struct amdgpu_device 
> *adev,
> if (!ras_block || !ih_info)
> return;
> -amdgpu_ras_sysfs_remove(adev, ras_block);
> if (ih_info->cb)
> amdgpu_ras_interrupt_remove_handler(adev, ih_info);
>  }
> @@ -2577,6 +2576,7 @@ void amdgpu_ras_suspend(struct amdgpu_device *adev)
>  int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
>  {
> struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +struct ras_manager *obj, *tmp;
> if (!adev->ras_enabled || !con)
> return 0;
> @@ -2585,6 +2585,13 @@ int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
> /* Need disable ras on all IPs here before ip [hw/sw]fini */
> amdgpu_ras_disable_all_features(adev, 0);
> amdgpu_ras_recovery_fini(adev);
> +
> +/* remove sysfs before pci_remove to avoid OOPSES from 
> sysfs_remove_groups */
> +list_for_each_entry_safe(obj, tmp, &con->head, node) {
> +amdgpu_ras_sysfs_remove(adev, &obj->head);
> +put_obj(obj);
> +}
> +
> return 0;
>  }
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 4e7d9cb09a69..0fa806a78e39 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -693,16 +693,35 @@ bool kfd_is_locked(void)
>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
>  {
> +struct kfd_process *p;
> +struct amdkfd_process_info *p_info;
> +unsigned int temp;
> +
> if (!kfd->init_complete)
> return;
> /* for runtime suspend, skip locking kfd */
> -if (!run_pm) {
> +if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
> /* For first KFD device suspend all the KFD processes */
> if (atomic_inc_return(&kfd_locked) == 1)
> kfd_suspend_all_processes(force);
> }
> +if (drm_dev_is_unplugged(kfd->ddev)){
> +int idx = srcu_read_lock(&kfd_processes_srcu);
> +pr_debug("cancel restore_userptr_work\n");
> +hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
> +if ( kfd_process_gpuidx_from_gpuid(p, kfd->id) >= 0 ) {
> +p_info = p->kgd_process_info;
> +pr_debug("cancel processes, pid = %d for gpu_id = %d", 
> pid_nr(p_info->pid), kfd->id);
> +cancel_delayed_work_sync(&p_info->restore_userptr_work);
> +/* block all future restore_userptr_work */
> +atomic_inc(&p_info->invalid);


Same as I mentioned above with drm_dev_enter/exit

Andrey


> +}
> +}
> +srcu_read_unlock(&kfd_processes_srcu, idx);
> +}
> +
> kfd->dqm->ops.stop(kfd->dqm);
> kfd_iommu_suspend(kfd);
>  }
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> index 600ba2a728ea..7e3d1848eccc 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> @@ -669,11 +669,12 @@ static void kfd_remove_sysfs_node_entry(struct 
> kfd_topology_device *dev)
>  #ifdef HAVE_AMD_IOMMU_PC_SUPPORTED
> if (dev->kobj_perf) {
> list_for_each_entry(perf, &dev->perf_props, list) {
> +sysfs_remove_group(dev->kobj_perf, perf->attr_group);
> kfree(perf->attr_group);
> perf->attr_group = NULL;
> }
> kobject_del(dev->kobj_perf);
> -kobject_put(dev->kobj_perf);
> +/* kobject_put(dev->kobj_perf); */
> dev->kobj_perf = NULL;
> }
>  #endif
>
> Thank you so much! Looking forward to your comments!
>
> Regards,
> Shuotao
>>
>> Andrey
>>
>>
>>>
>>> Thank you so much!
>>>
>>> Best regards,
>>> Shuotao
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> index 8fa9b86ac9d2..c0b27f722281 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> @@ -188,6 +188,12 @@ void amdgpu_amdkfd_interrupt(struct 
>>>>> amdgpu_device *adev,
>>>>> kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
>>>>>  }
>>>>> +void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev)
>>>>> +{
>>>>> +if (adev->kfd.dev)
>>>>> +kgd2kfd_kill_all_user_processes(adev->kfd.dev);
>>>>> +}
>>>>> +
>>>>>  void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
>>>>>  {
>>>>> if (adev->kfd.dev)
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> index 27c74fcec455..f4e485d60442 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> @@ -141,6 +141,7 @@ struct amdkfd_process_info {
>>>>>  int amdgpu_amdkfd_init(void);
>>>>>  void amdgpu_amdkfd_fini(void);
>>>>> +void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev);
>>>>>  void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>>>>>  int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
>>>>>  int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm, 
>>>>> bool sync);
>>>>> @@ -405,6 +406,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>>>> const struct kgd2kfd_shared_resources *gpu_resources);
>>>>>  void kgd2kfd_device_exit(struct kfd_dev *kfd);
>>>>>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force);
>>>>> +void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd);
>>>>>  int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
>>>>>  int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm, bool sync);
>>>>>  int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>>>> @@ -443,6 +445,9 @@ static inline void kgd2kfd_suspend(struct 
>>>>> kfd_dev *kfd, bool run_pm, bool force)
>>>>>  {
>>>>>  }
>>>>> +void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd){
>>>>> +}
>>>>> +
>>>>>  static int __maybe_unused kgd2kfd_resume_iommu(struct kfd_dev *kfd)
>>>>>  {
>>>>> return 0;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> index 3d5fc0751829..af6fe5080cfa 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> @@ -2101,6 +2101,9 @@ amdgpu_pci_remove(struct pci_dev *pdev)
>>>>>  {
>>>>> struct drm_device *dev = pci_get_drvdata(pdev);
>>>>> +/* kill all kfd processes before drm_dev_unplug */
>>>>> +amdgpu_amdkfd_kill_all_processes(drm_to_adev(dev));
>>>>> +
>>>>>  #ifdef HAVE_DRM_DEV_UNPLUG
>>>>> drm_dev_unplug(dev);
>>>>>  #else
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> index 5504a18b5a45..480c23bef5e2 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> @@ -691,6 +691,12 @@ bool kfd_is_locked(void)
>>>>> return  (atomic_read(&kfd_locked) > 0);
>>>>>  }
>>>>> +inline void kgd2kfd_kill_all_user_processes(struct kfd_dev* dev)
>>>>> +{
>>>>> +kfd_kill_all_user_processes();
>>>>> +}
>>>>> +
>>>>> +
>>>>>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
>>>>>  {
>>>>> if (!kfd->init_complete)
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>> index 55c9e1922714..a35a2cb5bb9f 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>> @@ -1064,6 +1064,7 @@ static inline struct kfd_process_device 
>>>>> *kfd_process_device_from_gpuidx(
>>>>>  void kfd_unref_process(struct kfd_process *p);
>>>>>  int kfd_process_evict_queues(struct kfd_process *p, bool force);
>>>>>  int kfd_process_restore_queues(struct kfd_process *p);
>>>>> +void kfd_kill_all_user_processes(void);
>>>>>  void kfd_suspend_all_processes(bool force);
>>>>>  /*
>>>>>   * kfd_resume_all_processes:
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> index 6cdc855abb6d..17e769e6951d 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> @@ -46,6 +46,9 @@ struct mm_struct;
>>>>>  #include "kfd_trace.h"
>>>>>  #include "kfd_debug.h"
>>>>> +static atomic_t kfd_process_locked = ATOMIC_INIT(0);
>>>>> +static atomic_t kfd_inflight_kills = ATOMIC_INIT(0);
>>>>> +
>>>>>  /*
>>>>>   * List of struct kfd_process (field kfd_process).
>>>>>   * Unique/indexed by mm_struct*
>>>>> @@ -802,6 +805,9 @@ struct kfd_process *kfd_create_process(struct 
>>>>> task_struct *thread)
>>>>> struct kfd_process *process;
>>>>> int ret;
>>>>> +if ( atomic_read(&kfd_process_locked) > 0 )
>>>>> +return ERR_PTR(-EINVAL);
>>>>> +
>>>>> if (!(thread->mm && mmget_not_zero(thread->mm)))
>>>>> return ERR_PTR(-EINVAL);
>>>>> @@ -1126,6 +1132,10 @@ static void kfd_process_wq_release(struct 
>>>>> work_struct *work)
>>>>> put_task_struct(p->lead_thread);
>>>>> kfree(p);
>>>>> +
>>>>> +if ( atomic_read(&kfd_process_locked) > 0 ){
>>>>> +atomic_dec(&kfd_inflight_kills);
>>>>> +}
>>>>>  }
>>>>>  static void kfd_process_ref_release(struct kref *ref)
>>>>> @@ -2186,6 +2196,35 @@ static void restore_process_worker(struct 
>>>>> work_struct *work)
>>>>> pr_err("Failed to restore queues of pasid 0x%x\n", p->pasid);
>>>>>  }
>>>>> +void kfd_kill_all_user_processes(void)
>>>>> +{
>>>>> +struct kfd_process *p;
>>>>> +/* struct amdkfd_process_info *p_info; */
>>>>> +unsigned int temp;
>>>>> +int idx;
>>>>> +atomic_inc(&kfd_process_locked);
>>>>> +
>>>>> +idx = srcu_read_lock(&kfd_processes_srcu);
>>>>> +pr_info("Killing all processes\n");
>>>>> +hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>>>>> +dev_warn(kfd_device,
>>>>> +"Sending SIGBUS to process %d (pasid 0x%x)",
>>>>> +p->lead_thread->pid, p->pasid);
>>>>> +send_sig(SIGBUS, p->lead_thread, 0);
>>>>> +atomic_inc(&kfd_inflight_kills);
>>>>> +}
>>>>> +srcu_read_unlock(&kfd_processes_srcu, idx);
>>>>> +
>>>>> +while ( atomic_read(&kfd_inflight_kills) > 0 ){
>>>>> +dev_warn(kfd_device,
>>>>> +"kfd_processes_table is not empty, going to sleep for 10ms\n");
>>>>> +msleep(10);
>>>>> +}
>>>>> +
>>>>> +atomic_dec(&kfd_process_locked);
> +pr_info("all processes have been fully released");
>>>>> +}
>>>>> +
>>>>>  void kfd_suspend_all_processes(bool force)
>>>>>  {
>>>>> struct kfd_process *p;
>>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>> Shuotao
>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>> +       }
>>>>>>> + srcu_read_unlock(&kfd_processes_srcu, idx);
>>>>>>> +}
>>>>>>> +
>>>>>>> +
>>>>>>>  int kfd_resume_all_processes(bool sync)
>>>>>>>  {
>>>>>>>         struct kfd_process *p;
>>>>>>>
>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Really appreciate your help!
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Shuotao
>>>>>>>>>
>>>>>>>>>>> 2. Remove redundant p2p/io links in sysfs when device is 
>>>>>>>>>>> hotplugged
>>>>>>>>>>> out.
>>>>>>>>>>>
>>>>>>>>>>> 3. New kfd node_id is not properly assigned after a new 
>>>>>>>>>>> device is
>>>>>>>>>>> added after a gpu is hotplugged out in a system. libhsakmt will
>>>>>>>>>>> find this anomaly, (i.e. node_from != <dev node id> in iolinks),
>>>>>>>>>>> when taking a topology_snapshot, thus returns fault to the rocm
>>>>>>>>>>> stack.
>>>>>>>>>>>
>>>>>>>>>>> -- This patch fixes issue 1; another patch by Mukul fixes 
>>>>>>>>>>> issues 2&3.
>>>>>>>>>>> -- Tested on a 4-GPU MI100 gpu nodes with kernel 5.13.0-kfd; 
>>>>>>>>>>> kernel
>>>>>>>>>>> 5.16.0-kfd is unstable out of box for MI100.
>>>>>>>>>>> ---
>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 +++++
>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 7 +++++++
>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>>>>>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 13 +++++++++++++
>>>>>>>>>>> 4 files changed, 26 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>>>>>> index c18c4be1e4ac..d50011bdb5c4 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>>>>>> @@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct 
>>>>>>>>>>> amdgpu_device *adev, bool run_pm)
>>>>>>>>>>> return r;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> +int amdgpu_amdkfd_resume_processes(void)
>>>>>>>>>>> +{
>>>>>>>>>>> + return kgd2kfd_resume_processes();
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
>>>>>>>>>>> {
>>>>>>>>>>> int r = 0;
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>>>>>> index f8b9f27adcf5..803306e011c3 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>>>>>> @@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
>>>>>>>>>>> void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool 
>>>>>>>>>>> run_pm);
>>>>>>>>>>> int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
>>>>>>>>>>> int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool 
>>>>>>>>>>> run_pm);
>>>>>>>>>>> +int amdgpu_amdkfd_resume_processes(void);
>>>>>>>>>>> void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>>>>>>>>>> const void *ih_ring_entry);
>>>>>>>>>>> void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>>>>>>>>>>> @@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev 
>>>>>>>>>>> *kfd);
>>>>>>>>>>> void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>>>>>>>>>>> int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
>>>>>>>>>>> int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>>>>>>>>>>> +int kgd2kfd_resume_processes(void);
>>>>>>>>>>> int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>>>>>>>>>> int kgd2kfd_post_reset(struct kfd_dev *kfd);
>>>>>>>>>>> void kgd2kfd_interrupt(struct kfd_dev *kfd, const void 
>>>>>>>>>>> *ih_ring_entry);
>>>>>>>>>>> @@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct 
>>>>>>>>>>> kfd_dev *kfd, bool run_pm)
>>>>>>>>>>> return 0;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> +static inline int kgd2kfd_resume_processes(void)
>>>>>>>>>>> +{
>>>>>>>>>>> + return 0;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>>>>>>>>>> {
>>>>>>>>>>> return 0;
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>> index fa4a9f13c922..5827b65b7489 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>> @@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct 
>>>>>>>>>>> amdgpu_device *adev)
>>>>>>>>>>> if (drm_dev_is_unplugged(adev_to_drm(adev)))
>>>>>>>>>>> amdgpu_device_unmap_mmio(adev);
>>>>>>>>>>>
>>>>>>>>>>> + amdgpu_amdkfd_resume_processes();
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>>> index 62aa6c9d5123..ef05aae9255e 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>>> @@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, 
>>>>>>>>>>> bool run_pm)
>>>>>>>>>>> return ret;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> +/* for non-runtime resume only */
>>>>>>>>>>> +int kgd2kfd_resume_processes(void)
>>>>>>>>>>> +{
>>>>>>>>>>> + int count;
>>>>>>>>>>> +
>>>>>>>>>>> + count = atomic_dec_return(&kfd_locked);
>>>>>>>>>>> + WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>>>>>>>>>> + if (count == 0)
>>>>>>>>>>> + return kfd_resume_all_processes();
>>>>>>>>>>> +
>>>>>>>>>>> + return 0;
>>>>>>>>>>> +}
>>>>>>>>>>
>>>>>>>>>> It doesn't make sense to me to just increment kfd_locked in
>>>>>>>>>> kgd2kfd_suspend to only decrement it again a few functions 
>>>>>>>>>> down the
>>>>>>>>>> road.
>>>>>>>>>>
>>>>>>>>>> I suggest this instead - you only increment if not during PCI 
>>>>>>>>>> remove
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>> index 1c2cf3a33c1f..7754f77248a4 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>> @@ -952,11 +952,12 @@ bool kfd_is_locked(void)
>>>>>>>>>>
>>>>>>>>>> void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>>>>>>>> {
>>>>>>>>>> +
>>>>>>>>>> if (!kfd->init_complete)
>>>>>>>>>> return;
>>>>>>>>>>
>>>>>>>>>> /* for runtime suspend, skip locking kfd */
>>>>>>>>>> - if (!run_pm) {
>>>>>>>>>> + if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>>>>>>>> /* For first KFD device suspend all the KFD processes */
>>>>>>>>>> if (atomic_inc_return(&kfd_locked) == 1)
>>>>>>>>>> kfd_suspend_all_processes();
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> +
>>>>>>>>>>> int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
>>>>>>>>>>> {
>>>>>>>>>>> int err = 0;
>>>>>>>
>>>>>
>>>
>

[-- Attachment #2: Type: text/html, Size: 232812 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-18 15:23                       ` Andrey Grodzovsky
@ 2022-04-19  7:41                         ` Shuotao Xu
  2022-04-19 16:01                           ` Andrey Grodzovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Shuotao Xu @ 2022-04-19  7:41 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 59699 bytes --]



On Apr 18, 2022, at 11:23 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:



On 2022-04-18 09:22, Shuotao Xu wrote:


On Apr 16, 2022, at 12:43 AM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:



On 2022-04-15 06:12, Shuotao Xu wrote:
Hi Andrey,

First I really appreciate the discussion! It helped me understand the driver code greatly. Thank you so much:)
Please see my inline comments.

On Apr 14, 2022, at 11:13 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:



On 2022-04-14 10:00, Shuotao Xu wrote:


On Apr 14, 2022, at 1:31 AM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:



On 2022-04-13 12:03, Shuotao Xu wrote:


On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:

[Some people who received this message don't often get email from andrey.grodzovsky@amd.com. Learn why this is important at http://aka.ms/LearnAboutSenderIdentification.]

On 2022-04-08 21:28, Shuotao Xu wrote:

On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:


On 2022-04-08 04:45, Shuotao Xu wrote:
Adding PCIe Hotplug Support for AMDKFD: the support of hot-plug of GPU
devices can open doors for many advanced applications in data center
in the next few years, such as for GPU resource
disaggregation. Current AMDKFD does not support hotplug out because of the
following reasons:

1. During PCIe removal, decrement KFD lock which was incremented at
the beginning of hw fini; otherwise kfd_open later is going to
fail.
I assumed you read my comment last time, but you still take the same approach.
More details below.
Aha, I like your fix :) I was not familiar with the DRM APIs, so I only half understood your comment last time.

BTW, I tried hot-plugging out a GPU while a ROCm application was still running.
From dmesg, the application is still trying to access the removed KFD device and is met with some errors.


The application is supposed to keep running; it holds the drm_device
reference as long as it has an open
FD to the device, and final cleanup will come only after the app dies,
thus releasing the FD and the last
drm_device reference.

The application would hang and not exit in this case.


Actually I tried kill -7 $pid, and the process exits. The dmesg has some warnings though.

[  711.769977] WARNING: CPU: 23 PID: 344 at .../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336 amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.770528] Modules linked in: amdgpu(OE) amdttm(OE) amd_sched(OE) amdkcl(OE) iommu_v2 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay binfmt_misc intel_rapl_msr i40iw intel_rapl_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel rpcrdma kvm sunrpc ipmi_ssif ib_umad ib_ipoib rdma_ucm irqbypass rapl joydev acpi_ipmi input_leds intel_cstate ipmi_si ipmi_devintf mei_me mei intel_pch_thermal ipmi_msghandler ioatdma mac_hid lpc_ich dca acpi_power_meter acpi_pad sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi pci_stub ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
[  711.779359]  raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core ast drm_vram_helper i2c_algo_bit drm_ttm_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sysfillrect uas hid_generic sysimgblt aesni_intel mlx5_core fb_sys_fops crypto_simd usbhid cryptd drm i40e pci_hyperv_intf usb_storage glue_helper mlxfw hid ahci libahci wmi
[  711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: G        W  OE     5.11.0+ #1
[  711.779755] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  711.779756] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[  711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.780141] Code: e8 b5 af 34 f4 e9 1f ff ff ff 48 39 c2 74 07 0f 0b e9 69 ff ff ff 4c 89 e7 e8 3c b4 16 00 e9 5c ff ff ff e8 a2 ce fd f3 eb cf <0f> 0b eb cb e8 d7 02 34 f4 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55
[  711.780143] RSP: 0018:ffffa8100dd67c30 EFLAGS: 00010282
[  711.780145] RAX: 00000000ffffffea RBX: ffff89980e792058 RCX: 0000000000000000
[  711.780147] RDX: 0000000000000000 RSI: ffff89a8f9ad8870 RDI: ffff89a8f9ad8870
[  711.780148] RBP: ffffa8100dd67c50 R08: 0000000000000000 R09: fffffffffff99b18
[  711.780149] R10: ffffa8100dd67bd0 R11: ffffa8100dd67908 R12: ffff89980e792000
[  711.780151] R13: ffff89980e792058 R14: ffff89980e7921bc R15: dead000000000100
[  711.780152] FS:  0000000000000000(0000) GS:ffff89a8f9ac0000(0000) knlGS:0000000000000000
[  711.780154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  711.780156] CR2: 00007ffddac6f71f CR3: 00000030bb80a003 CR4: 00000000007706e0
[  711.780157] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  711.780159] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  711.780160] PKRU: 55555554
[  711.780161] Call Trace:
[  711.780163]  ttm_bo_release+0x2ae/0x320 [amdttm]
[  711.780169]  amdttm_bo_put+0x30/0x40 [amdttm]
[  711.780357]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[  711.780543]  amdgpu_gem_object_free+0x8b/0x160 [amdgpu]
[  711.781119]  drm_gem_object_free+0x1d/0x30 [drm]
[  711.781489]  amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x34a/0x380 [amdgpu]
[  711.782044]  kfd_process_device_free_bos+0xe0/0x130 [amdgpu]
[  711.782611]  kfd_process_wq_release+0x286/0x380 [amdgpu]
[  711.783172]  process_one_work+0x236/0x420
[  711.783543]  worker_thread+0x34/0x400
[  711.783911]  ? process_one_work+0x420/0x420
[  711.784279]  kthread+0x126/0x140
[  711.784653]  ? kthread_park+0x90/0x90
[  711.785018]  ret_from_fork+0x22/0x30
[  711.785387] ---[ end trace d8f50f6594817c84 ]---
[  711.798716] [drm] amdgpu: ttm finalized


So it means the process was stuck in some wait_event_killable (maybe here drm_sched_entity_flush) - you can try 'cat /proc/$process_pid/stack' before
you kill it to see where it was stuck, so we can go from there.



For graphics apps what I usually see is a crash because of SIGSEGV when
the app tries to access
an unmapped MMIO region on the device. I haven't tested the compute
stack, so there might
be something I haven't covered. A hang could mean, for example, waiting on a
fence which is not being
signaled - please provide the full dmesg from this case.


Do you have any good suggestions on how to fix it down the line? (HIP runtime/libhsakmt or driver)

[64036.631333] amdgpu: amdgpu_vm_bo_update failed
[64036.631702] amdgpu: validate_invalid_user_pages: update PTE failed
[64036.640754] amdgpu: amdgpu_vm_bo_update failed
[64036.641120] amdgpu: validate_invalid_user_pages: update PTE failed
[64036.650394] amdgpu: amdgpu_vm_bo_update failed
[64036.650765] amdgpu: validate_invalid_user_pages: update PTE failed


The full dmesg is just the repetition of those two messages:
[186885.764079] amdgpu 0000:43:00.0: amdgpu: amdgpu: finishing device.
[186885.766916] [drm] free PSP TMR buffer
[186893.868173] amdgpu: amdgpu_vm_bo_update failed
[186893.868235] amdgpu: validate_invalid_user_pages: update PTE failed
[186893.876154] amdgpu: amdgpu_vm_bo_update failed
[186893.876190] amdgpu: validate_invalid_user_pages: update PTE failed
[186893.884150] amdgpu: amdgpu_vm_bo_update failed
[186893.884185] amdgpu: validate_invalid_user_pages: update PTE failed


This probably just means trying to update PTEs after the physical device
is gone - we usually avoid this by
first trying to do all HW shutdowns early, before PCI remove completes,
and, when that's really tricky, by
protecting HW access sections with a drm_dev_enter/exit scope.

For this particular error it would be best to flush
info->restore_userptr_work before the end of
amdgpu_pci_remove (rejecting new process creation and calling
cancel_delayed_work_sync(&process_info->restore_userptr_work) for all
running processes)
somewhere in amdgpu_pci_remove.
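
Spelled out, the flush described above might look like the sketch below. This is a hypothetical, untested sketch, not a patch: the placement and the helper name are invented, while kfd_processes_table, kfd_processes_srcu, and restore_userptr_work are the names used elsewhere in this thread.

```c
/* Hypothetical sketch only: cancel every process's pending
 * userptr-restore work before the device disappears, so the worker
 * never tries to update PTEs of an unplugged device. */
static void amdgpu_amdkfd_flush_restore_work(void)
{
	struct kfd_process *p;
	unsigned int tmp;
	int idx = srcu_read_lock(&kfd_processes_srcu);

	hash_for_each_rcu(kfd_processes_table, tmp, p, kfd_processes)
		/* waits for a running worker to finish, then cancels;
		 * sleeping is legal inside an SRCU read-side section */
		cancel_delayed_work_sync(
			&p->kgd_process_info->restore_userptr_work);

	srcu_read_unlock(&kfd_processes_srcu, idx);
}
```

New process creation would still need to be rejected first, otherwise a new worker can be queued right after the flush.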

I tried something like *kfd_process_ref_release* which I think did what you described, but it did not work.


I don't see how kfd_process_ref_release is the same as what I mentioned above; what I meant is calling the code above within kgd2kfd_suspend (where you tried to call kfd_kill_all_user_processes below)

Yes, you are right. It was not called by it.


Instead I tried to kill the process from the kernel, but amdgpu could only be hot-plugged back successfully if there was no ROCm kernel running when it was plugged out. If not, amdgpu_probe will just hang later. (Maybe because amdgpu was plugged out while in a running state, it left a bad HW state which causes probe to hang.)


We usually do asic_reset during probe to reset all HW state (check if amdgpu_device_init->amdgpu_asic_reset is running when you plug back).

OK



I don’t know if this is a viable solution worth pursuing, but I attached the diff anyway.

Another solution could be to let the compute-stack user mode detect a topology change via a generation_count change and abort gracefully there.

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 4e7d9cb09a69..79b4c9b84cd0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -697,12 +697,15 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
                return;

        /* for runtime suspend, skip locking kfd */
-       if (!run_pm) {
+       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
                /* For first KFD device suspend all the KFD processes */
                if (atomic_inc_return(&kfd_locked) == 1)
                        kfd_suspend_all_processes(force);
        }

+       if (drm_dev_is_unplugged(kfd->ddev))
+               kfd_kill_all_user_processes();
+
        kfd->dqm->ops.stop(kfd->dqm);
        kfd_iommu_suspend(kfd);
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 55c9e1922714..84cbcd857856 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1065,6 +1065,7 @@ void kfd_unref_process(struct kfd_process *p);
 int kfd_process_evict_queues(struct kfd_process *p, bool force);
 int kfd_process_restore_queues(struct kfd_process *p);
 void kfd_suspend_all_processes(bool force);
+void kfd_kill_all_user_processes(void);
 /*
  * kfd_resume_all_processes:
  *     bool sync: If kfd_resume_all_processes() should wait for the
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 6cdc855abb6d..fb0c753b682c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -2206,6 +2206,24 @@ void kfd_suspend_all_processes(bool force)
        srcu_read_unlock(&kfd_processes_srcu, idx);
 }

+
+void kfd_kill_all_user_processes(void)
+{
+       struct kfd_process *p;
+       struct amdkfd_process_info *p_info;
+       unsigned int temp;
+       int idx = srcu_read_lock(&kfd_processes_srcu);
+
+       pr_info("Killing all processes\n");
+       hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+               p_info = p->kgd_process_info;
+               pr_info("Killing  processes, pid = %d", pid_nr(p_info->pid));
+               kill_pid(p_info->pid, SIGBUS, 1);


From looking into kill_pid, I see it only sends a signal but doesn't wait for completion; it would make sense to wait for completion here. In any case I would actually try to put here

I have made a version which does that with some atomic counters. Please read later in the diff.


hash_for_each_rcu(p_info)
    cancel_delayed_work_sync(&p_info->restore_userptr_work)

instead - at least that's what I meant in the previous mail.

I actually tried that earlier, and it did not work. The application still keeps running, and you have to send a kill to the user process.

I have made the following version. It waits for processes to terminate synchronously after sending SIGBUS. After that it does the real work of amdgpu_pci_remove.
However, it hangs at amdgpu_device_ip_fini_early when it is trying to deinit ip_block 6 <sdma_v4_0> (https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L2818). I assume there is still some in-flight DMA, and the fini of this IP block thus hangs?

The following is an excerpt of the dmesg. Please excuse my own pr_info additions, but I hope you get my point about where it hangs.

[  392.344735] amdgpu: all processes has been fully released
[  392.346557] amdgpu: amdgpu_acpi_fini done
[  392.346568] amdgpu 0000:b3:00.0: amdgpu: amdgpu: finishing device.
[  392.349238] amdgpu: amdgpu_device_ip_fini_early enter ip_blocks = 9
[  392.349248] amdgpu: Free mem_obj = 000000007bf54275, range_start = 14, range_end = 14
[  392.350299] amdgpu: Free mem_obj = 00000000a85bc878, range_start = 12, range_end = 12
[  392.350304] amdgpu: Free mem_obj = 00000000b8019e32, range_start = 13, range_end = 13
[  392.350308] amdgpu: Free mem_obj = 000000002d296168, range_start = 4, range_end = 11
[  392.350313] amdgpu: Free mem_obj = 000000001fc4f934, range_start = 0, range_end = 3
[  392.350322] amdgpu: amdgpu_amdkfd_suspend(adev, false) done
[  392.350672] amdgpu: hw_fini of IP block[8] <jpeg_v2_5> done 0
[  392.350679] amdgpu: hw_fini of IP block[7] <vcn_v2_5> done 0


I just remembered that the idea of actively killing and waiting for running user processes during unplug was rejected
as a bad idea in the first iteration of the unplug work I did (I don't remember why now, need to look), so this is a no-go.

Maybe an application has KFD open but was not accessing the device, so killing it at unplug could terminate the process unnecessarily.
However, the latest version I had with the sleep function got rid of the IP block fini hang.

Our policy is to let zombie processes (zombie in the sense that the underlying device is gone) live as long as they want
(as long as you are able to terminate them, which you can, so that is OK),
and the system should finish the PCI remove gracefully and be able to hot-plug the device back. Hence, I suggest dropping
this direction of forcing all user processes to be killed; confirm you have a graceful shutdown and removal of the device
from the PCI topology, and then concentrate on why it hangs when you plug the device back.

So I basically reverted to the original solution you suggested.

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 4e7d9cb09a69..5504a18b5a45 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -697,7 +697,7 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
                return;

        /* for runtime suspend, skip locking kfd */
-       if (!run_pm) {
+       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
                /* For first KFD device suspend all the KFD processes */
                if (atomic_inc_return(&kfd_locked) == 1)
                        kfd_suspend_all_processes(force);


First, confirm whether an ASIC reset happens on the next init.

This patch works great for a planned plug-out, where all the ROCm processes are killed before removal, and the device can be added back without a problem.
However, an unplanned plug-out while ROCm processes are still running just doesn't work.


Still, I am not clear whether an ASIC reset happens on plug-back or not; can you trace this, please?


I tried adding pr_info to the asic_reset functions, but could not trace any reset upon plug-back.


This could possibly explain the hang on plug-back. Can you see why we don't get there?


Is amdgpu supposed to do an ASIC reset each time it is probed? Right now it seems to probe OK (it did not hang). I will trace back further.


Second, please confirm whether the timing of manually killing the user process has an impact on whether you get a hang
on the next plug-back (if you kill before

Scenario 0: Kill before plug back

1. echo 1 > /sys/bus/pci/…/remove finishes.
But the application won't exit until there is a kill signal.


Why do you think it must exit?

Because ROCm needs to release the DRM file descriptor to get amdgpu_amdkfd_device_fini_sw called, which would eventually call kgd2kfd_device_exit. That would clean up the KFD topology at least. Otherwise I don't see how the device could be added back without messing up the KFD topology, to say the least.

However, this is all based on my own observations. Please explain why it does not need to exit, if you believe so.


Note that when you add back a new device, a PCI device and a DRM device are created. I am not an expert on the KFD code, but I believe a new KFD device is also created, independent of the old one, so the topology should just see two device instances (one old zombie and one real new one). At least this wasn't an issue for the graphics stack in the exact same scenario, and the libdrm tests I pointed to exercise exactly this scenario.

Yes, regardless of the OOPS in ttm_bo_cleanup_refs, I plugged the GPU back, and I think it got probed all right; however, the old KFD node is still there.
I can pass the libdrm basic test suite on the plugged-back device. The bo test hangs out of the box even without hotplug (see dmesg below).

 kernel:[ 1609.029125] watchdog: BUG: soft lockup - CPU#39 stuck for 89s! [amdgpu_test:36407]
[  +0.000407] Code: 48 89 47 18 48 89 47 20 48 89 47 28 48 89 47 30 48 89 47 38 48 8d 7f 40 75 d9 90 c3 0f 1f 80 00 00 00 00 b9 00 10 00 00 31 c0 <f3> aa c3 cc cc cc cc cc cc 0f 1f 44 00 00 48 85 ff 0f 84 f2 00 00
[  +0.000856] RSP: 0018:ffffb996b57b3c40 EFLAGS: 00010246
[  +0.000434] RAX: 0000000000000000 RBX: ffff9cc7f8706e88 RCX: 0000000000000980
[  +0.000436] RDX: fffff935b17d9140 RSI: fffff935b17e0000 RDI: ffff9c831f645680
[  +0.000439] RBP: 0000000000000400 R08: fffff935b17d0000 R09: 0000000000000000
[  +0.000447] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000000a
[  +0.000437] R13: ffff9cc783980a20 R14: 000000000b5dbc00 R15: ffff9cc7f8706078
[  +0.000438] FS:  00007ff1ef611300(0000) GS:ffff9d453efc0000(0000) knlGS:0000000000000000
[  +0.000445] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000462] CR2: 00007f418bbb9320 CR3: 000000819fa84006 CR4: 0000000000770ee0
[  +0.000466] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000455] PKRU: 55555554
[  +0.000451] Call Trace:
[  +0.000448]  ttm_pool_free+0x110/0x230 [ttm]
[  +0.000451]  ttm_tt_unpopulate+0x5e/0xb0 [ttm]
[  +0.000454]  ttm_tt_destroy_common+0xe/0x30 [ttm]
[  +0.000453]  amdgpu_ttm_backend_destroy+0x1e/0x70 [amdgpu]
[  +0.000569]  ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
[  +0.000458]  ttm_bo_release+0x286/0x500 [ttm]
[  +0.000450]  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[  +0.000544]  amdgpu_gem_object_free+0xad/0x160 [amdgpu]
[  +0.000534]  drm_gem_object_release_handle+0x6a/0x80 [drm]
[  +0.000476]  drm_gem_handle_delete+0x5b/0xa0 [drm]
[  +0.000465]  ? drm_gem_handle_create+0x40/0x40 [drm]
[  +0.000469]  drm_ioctl_kernel+0xab/0xf0 [drm]
[  +0.000458]  drm_ioctl+0x1ec/0x390 [drm]
[  +0.000440]  ? drm_gem_handle_create+0x40/0x40 [drm]
[  +0.000438]  ? selinux_file_ioctl+0x17d/0x220
[  +0.000423]  ? lock_release+0x1ce/0x270
[  +0.000416]  ? trace_hardirqs_on+0x1b/0xd0
[  +0.000418]  ? _raw_spin_unlock_irqrestore+0x2d/0x40
[  +0.000419]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[  +0.000499]  __x64_sys_ioctl+0x80/0xb0
[  +0.000414]  do_syscall_64+0x3a/0x70
[  +0.000400]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000387] RIP: 0033:0x7ff1ef7263db
[  +0.000371] Code: 0f 1e fa 48 8b 05 b5 7a 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 85 7a 0d 00 f7 d8 64 89 01 48
[  +0.000763] RSP: 002b:00007ffdf1cd0278 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.000386] RAX: ffffffffffffffda RBX: 00007ffdf1cd02b0 RCX: 00007ff1ef7263db
[  +0.000383] RDX: 00007ffdf1cd02b0 RSI: 0000000040086409 RDI: 0000000000000007
[  +0.000396] RBP: 0000000040086409 R08: 00005574eefd5c60 R09: 00005574eefdd360
[  +0.000391] R10: 00005574eefd4010 R11: 0000000000000246 R12: 00005574eefd66d8
[  +0.000386] R13: 0000000000000007 R14: 0000000000000000 R15: 00007ff1ef830143


I also tried to run the TF benchmark on the newly plugged nodes (one of the nodes is a dummy), but it failed.
Can we get confirmation from the KFD team that they have considered a zombie KFD node?


Also note that even with the graphics stack running there is always a KFD device and KFD topology present, but of course probably not in the same state as when you run a KFD-facing process, so there could be some issues there.

Also note that because of this patch https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=267d51d77fdae8708b94e1a24b8e5d961297edb7 all MMIO accesses from such zombie/orphan user processes are remapped to the zero page, so they will not necessarily get a segfault when device removal happens, but rather maybe a crash due to NULL data being read from MMIO by the process and used in some manner.

It depends on where the application is when the device is unplugged.

For example, in one case the application keeps reporting out-of-memory but won't exit; in the other case, it waits for a signal.

2022-04-18 12:42:38.939303: E tensorflow/stream_executor/rocm/rocm_driver.cc:692] failed to alloc 2304 bytes on host: HIP_ERROR_OutOfMemory
2022-04-18 12:42:38.939322: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 2304
2022-04-18 12:42:38.940772: E tensorflow/stream_executor/rocm/rocm_driver.cc:692] failed to alloc 2304 bytes on host: HIP_ERROR_OutOfMemory
2022-04-18 12:42:38.940791: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 2304
2022-04-18 12:42:38.942379: E tensorflow/stream_executor/rocm/rocm_driver.cc:692] failed to alloc 2304 bytes on host: HIP_ERROR_OutOfMemory
2022-04-18 12:42:38.942399: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 2304
2022-04-18 12:42:38.943829: E tensorflow/stream_executor/rocm/rocm_driver.cc:692] failed to alloc 2304 bytes on host: HIP_ERROR_OutOfMemory
2022-04-18 12:42:38.943849: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 2304
2022-04-18 12:42:38.945272: E tensorflow/stream_executor/rocm/rocm_driver.cc:692] failed to alloc 2304 bytes on host: HIP_ERROR_OutOfMemory
2022-04-18 12:42:38.945292: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 2304



2. Kill the process. The application does several things and seems to trigger drm_release in the kernel, which is met with a kernel NULL pointer dereference related to sysfs_remove. Then the whole fs just freezes.

[  +0.002498] BUG: kernel NULL pointer dereference, address: 0000000000000098
[  +0.000486] #PF: supervisor read access in kernel mode
[  +0.000545] #PF: error_code(0x0000) - not-present page
[  +0.000551] PGD 0 P4D 0
[  +0.000553] Oops: 0000 [#1] SMP NOPTI
[  +0.000540] CPU: 75 PID: 4911 Comm: kworker/75:2 Tainted: G        W   E     5.13.0-kfd #1
[  +0.000559] Hardware name: INGRASYS         TURING  /MB      , BIOS K71FQ28A 10/05/2021
[  +0.000567] Workqueue: events delayed_fput
[  +0.000563] RIP: 0010:kernfs_find_ns+0x1b/0x100
[  +0.000569] Code: ff ff e8 88 59 9f 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 8b 05 df f0 7b 01 41 56 41 55 49 89 f5 41 54 55 48 89 fd 53 <44> 0f b7 b7 98 00 00 00 48 89 d3 4c 8b 67 70 66 41 83 e6 20 41 0f
[  +0.001193] RSP: 0018:ffffb9875db5fc98 EFLAGS: 00010246
[  +0.000602] RAX: 0000000000000000 RBX: ffffa101f79507d8 RCX: 0000000000000000
[  +0.000612] RDX: 0000000000000000 RSI: ffffffffc09a9417 RDI: 0000000000000000
[  +0.000604] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[  +0.000760] R10: ffffb9875db5fcd0 R11: 0000000000000000 R12: 0000000000000000
[  +0.000597] R13: ffffffffc09a9417 R14: ffffa08363fb2d18 R15: 0000000000000000
[  +0.000702] FS:  0000000000000000(0000) GS:ffffa0ffbfcc0000(0000) knlGS:0000000000000000
[  +0.000666] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000658] CR2: 0000000000000098 CR3: 0000005747812005 CR4: 0000000000770ee0
[  +0.000715] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000655] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000592] PKRU: 55555554
[  +0.000580] Call Trace:
[  +0.000591]  kernfs_find_and_get_ns+0x2f/0x50
[  +0.000584]  sysfs_remove_file_from_group+0x20/0x50
[  +0.000580]  amdgpu_ras_sysfs_remove+0x3d/0xd0 [amdgpu]
[  +0.000737]  amdgpu_ras_late_fini+0x1d/0x40 [amdgpu]
[  +0.000750]  amdgpu_sdma_ras_fini+0x96/0xb0 [amdgpu]
[  +0.000742]  ? gfx_v10_0_resume+0x10/0x10 [amdgpu]
[  +0.000738]  sdma_v4_0_sw_fini+0x23/0x90 [amdgpu]
[  +0.000717]  amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
[  +0.000704]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000687]  drm_dev_release+0x20/0x40 [drm]
[  +0.000583]  drm_release+0xa8/0xf0 [drm]
[  +0.000584]  __fput+0xa5/0x250
[  +0.000621]  delayed_fput+0x1f/0x30
[  +0.000726]  process_one_work+0x26e/0x580
[  +0.000581]  ? process_one_work+0x580/0x580
[  +0.000611]  worker_thread+0x4d/0x3d0
[  +0.000611]  ? process_one_work+0x580/0x580
[  +0.000605]  kthread+0x117/0x150
[  +0.000611]  ? kthread_park+0x90/0x90
[  +0.000619]  ret_from_fork+0x1f/0x30
[  +0.000625] Modules linked in: amdgpu(E) xt_conntrack xt_MASQUERADE nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter x86_pkg_temp_thermal cdc_ether usbnet acpi_pad msr ip_tables x_tables ast drm_vram_helper iommu_v2 drm_ttm_helper gpu_sched ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks [last unloaded: amdgpu]


This is a known regression: all sysfs components must be removed before the pci_remove code runs, otherwise you get either warnings for single-file removals or
OOPSes for sysfs group removals like here. Please try moving amdgpu_ras_sysfs_remove from amdgpu_ras_late_fini to the end of amdgpu_ras_pre_fini (which happens before the PCI remove).


I fixed it in the newer patch, please see it below.



I first plug out the device, then kill the ROCm user process. Then there are other OOPSes related to ttm_bo_cleanup_refs.

[  +0.000006] BUG: kernel NULL pointer dereference, address: 0000000000000010
[  +0.000349] #PF: supervisor read access in kernel mode
[  +0.000340] #PF: error_code(0x0000) - not-present page
[  +0.000341] PGD 0 P4D 0
[  +0.000336] Oops: 0000 [#1] SMP NOPTI
[  +0.000345] CPU: 9 PID: 95 Comm: kworker/9:1 Tainted: G        W   E     5.13.0-kfd #1
[  +0.000367] Hardware name: INGRASYS         TURING  /MB      , BIOS K71FQ28A 10/05/2021
[  +0.000376] Workqueue: events delayed_fput
[  +0.000422] RIP: 0010:ttm_resource_free+0x24/0x40 [ttm]
[  +0.000464] Code: 00 00 0f 1f 40 00 0f 1f 44 00 00 53 48 89 f3 48 8b 36 48 85 f6 74 21 48 8b 87 28 02 00 00 48 63 56 10 48 8b bc d0 b8 00 00 00 <48> 8b 47 10 ff 50 08 48 c7 03 00 00 00 00 5b c3 66 66 2e 0f 1f 84
[  +0.001009] RSP: 0018:ffffb21c59413c98 EFLAGS: 00010282
[  +0.000515] RAX: ffff8b1aa4285f68 RBX: ffff8b1a823b5ea0 RCX: 00000000002a000c
[  +0.000536] RDX: 0000000000000000 RSI: ffff8b1acb84db80 RDI: 0000000000000000
[  +0.000539] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffffc03c3e00
[  +0.000543] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b1a823b5ec8
[  +0.000542] R13: 0000000000000000 R14: ffff8b1a823b5d90 R15: ffff8b1a823b5ec8
[  +0.000544] FS:  0000000000000000(0000) GS:ffff8b187f440000(0000) knlGS:0000000000000000
[  +0.000559] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000575] CR2: 0000000000000010 CR3: 00000076e6812004 CR4: 0000000000770ee0
[  +0.000575] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000579] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000575] PKRU: 55555554
[  +0.000568] Call Trace:
[  +0.000567]  ttm_bo_cleanup_refs+0xe4/0x290 [ttm]
[  +0.000588]  ttm_bo_delayed_delete+0x147/0x250 [ttm]
[  +0.000589]  ttm_device_fini+0xad/0x1b0 [ttm]
[  +0.000590]  amdgpu_ttm_fini+0x2a7/0x310 [amdgpu]
[  +0.000730]  gmc_v9_0_sw_fini+0x3a/0x40 [amdgpu]
[  +0.000753]  amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
[  +0.000734]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000737]  drm_dev_release+0x20/0x40 [drm]
[  +0.000626]  drm_release+0xa8/0xf0 [drm]
[  +0.000625]  __fput+0xa5/0x250
[  +0.000606]  delayed_fput+0x1f/0x30
[  +0.000607]  process_one_work+0x26e/0x580
[  +0.000608]  ? process_one_work+0x580/0x580
[  +0.000616]  worker_thread+0x4d/0x3d0
[  +0.000614]  ? process_one_work+0x580/0x580
[  +0.000617]  kthread+0x117/0x150
[  +0.000615]  ? kthread_park+0x90/0x90
[  +0.000621]  ret_from_fork+0x1f/0x30
[  +0.000603] Modules linked in: amdgpu(E) xt_conntrack xt_MASQUERADE nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter x86_pkg_temp_thermal cdc_ether usbnet acpi_pad msr ip_tables x_tables ast drm_vram_helper drm_ttm_helper iommu_v2 ttm gpu_sched drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks [last unloaded: amdgpu]
[  +0.002840] CR2: 0000000000000010
[  +0.000755] ---[ end trace 9737737402551e39 ]---


This looks like another regression; try to see where the NULL reference is, and then we can see how to avoid it.


These are the lines of code.

(gdb) l *(ttm_bo_cleanup_refs+0xe4)
0x19c4 is in ttm_bo_cleanup_refs (drivers/gpu/drm/ttm/ttm_bo.c:360).
355             ttm_bo_move_to_pinned(bo);
356             list_del_init(&bo->ddestroy);
357             spin_unlock(&bo->bdev->lru_lock);
358             ttm_bo_cleanup_memtype_use(bo);
359
360             if (unlock_resv)
361                     dma_resv_unlock(amdkcl_ttm_resvp(bo));
362
363             ttm_bo_put(bo);
364
(gdb) l *(ttm_resource_free+0x24)
0x57f4 is in ttm_resource_free (drivers/gpu/drm/ttm/ttm_resource.c:65).
60
61              if (!*res)
62                      return;
63
64              man = ttm_manager_type(bo->bdev, (*res)->mem_type);
65              man->func->free(man, *res);
66              *res = NULL;
67      }
68      EXPORT_SYMBOL(ttm_resource_free);
69


3.  echo 1 > /sys/bus/pci/rescan. This just hangs. I assume sysfs is broken.

Based on 1 & 2, it seems that 1 won't let amdgpu exit gracefully, because 2 does some cleanup that maybe should have happened before 1.

or you kill after plug-back, does it make a difference).

Scenario 2: Kill after plug back

If I perform a rescan before the kill, then the driver seems to probe fine. But the kill hits the same issue, messing up sysfs the same way as in Scenario 0.


Final Comments:

0. cancel_delayed_work_sync(&p_info->restore_userptr_work) makes the repeated amdgpu_vm_bo_update failures go away, but it does not solve the issues in those scenarios.


Still, it's better to do it this way, even just to make those failures go away.

cancel_delayed_work is insufficient; you need to make sure the work won't be processed after plug-out. Please see my patch.


I saw it; see my comment.



1. For planned hotplug, this patch should work as long as you follow some protocol, i.e. kill before plug-out. Is this patch acceptable, since it provides an added feature compared to before?


Let's try to fix more as I advised above.


2. For unplanned hotplug while a ROCm app is running, the patch that kills all processes and waits for 5 seconds works consistently. But it seems to be an unacceptable solution for an official release; I can hold it for our own internal usage. It seems that killing after removal causes problems, and I don't know if there is a quick fix from my side because of my limited understanding of the amdgpu driver. Maybe AMD has a quick fix, or maybe it is really a difficult one. This feature may or may not be a blocking issue in our GPU disaggregation research down the road. Please let us know in either case, and we would like to learn and help as much as we can!


I am currently not sure why it helps. I will need to set up my own ROCm environment and retest hotplug to check this in more depth, but currently I have higher priorities. Please try to confirm that an ASIC reset always takes place on plug-back,
and fix the sysfs OOPSes as I advised above, to clear up at least some of the issues. Also, please describe to me exactly what your steps are to reproduce this scenario, so that later I might be able to do it myself.

I can still try to help fix the bug in my spare time. My setup is as follows:


  1.  I have a server with 4 AMD MI100 GPUs. I think 1 GPU would also work.
  2.  I used https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/tree/roc-5.0.x as the starting point, and applied Mukul's patch and my patch.
  3.  Then I run a tensorflow benchmark from a docker.
     *   docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/tensorflow:rocm4.5.2-tf1.15-dev
     *   And run the following benchmark in the docker:  python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=4 --batch_size=32 --model=resnet50 --variable_update=parameter_server
        *   You might need to adjust the num_gpus parameter based on your setup
  4.  Remove a GPU at random time.
  5.  Do whatever is needed before plug-back, and verify the benchmark can still run.

Also, we have a hotplug test suite in libdrm (graphics stack), so maybe you can install libdrm and run that test suite to see if it exposes more issues.

OK, I could try it some time.


I tried suite 13, the hot-unplug test, but it got killed. There are some OOPSes in dmesg during ttm_pool_free_page.

Userspace log:

$ sudo ./tests/amdgpu/amdgpu_test -f -s 13


The ASIC NOT support UVD, suite disabled


The ASIC NOT support VCE, suite disabled


The ASIC NOT support UVD ENC, suite disabled.


Don't support TMZ (trust memory zone), security suite disabled


     CUnit - A unit testing framework for C - Version 2.1-3
     http://cunit.sourceforge.net/


Suite: Hotunplug Tests
  Test: Unplug card and rescan the bus to plug it back …Killed

Dmesg log:
[  +0.000479] BUG: unable to handle page fault for address: ffffc01343fc81b4
[  +0.000054] #PF: supervisor write access in kernel mode
[  +0.000033] #PF: error_code(0x0002) - not-present page
[  +0.000032] PGD 807ffc1067 P4D 807ffc1067 PUD 807ffc0067 PMD 0
[  +0.000038] Oops: 0002 [#1] SMP NOPTI
[  +0.000025] CPU: 92 PID: 7534 Comm: amdgpu_test Tainted: G        W   E     5.13.0-kfd #1
[  +0.000048] Hardware name: INGRASYS         TURING  /MB      , BIOS K71FQ28A 10/05/2021
[  +0.000045] RIP: 0010:__free_pages+0xc/0x80
[  +0.000031] Code: 01 00 74 0f 0f b6 77 51 85 f6 74 07 31 d2 e9 3b dc ff ff e9 66 ff ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 41 54 55 48 89 fd 53 <f0> ff 4f 34 74 46 48 8b 07 a9 00 00 01 00 75 54 44 8d 66 ff 85 f6
[  +0.000103] RSP: 0018:ffff96f71ba6fd60 EFLAGS: 00010246
[  +0.000032] RAX: 00000000ffffffff RBX: ffff89f1ccf86078 RCX: 0000000003fc8180
[  +0.000041] RDX: ffff89f1b4746000 RSI: 0000000000000000 RDI: ffffc01343fc8180
[  +0.000042] RBP: ffffc01343fc8180 R08: 0000000000000000 R09: 0000000000000246
[  +0.000040] R10: 00000080b4746fff R11: 0000000000000003 R12: 0000000000000000
[  +0.000041] R13: ffff89f1ccf85f80 R14: ffff89f1ccf86ef8 R15: ffff8972293b0000
[  +0.000042] FS:  00007fcfb843a300(0000) GS:ffff89ef80100000(0000) knlGS:0000000000000000
[  +0.000046] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000033] CR2: ffffc01343fc81b4 CR3: 0000000178154006 CR4: 0000000000770ee0
[  +0.000041] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000041] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000041] PKRU: 55555554
[  +0.000017] Call Trace:
[  +0.000018]  ttm_pool_free_page+0x69/0x90 [ttm]
[  +0.000038]  ttm_pool_type_fini+0x58/0x70 [ttm]
[  +0.000034]  ttm_pool_fini+0x30/0x50 [ttm]
[  +0.000031]  ttm_device_fini+0xf3/0x1b0 [ttm]
[  +0.000032]  amdgpu_ttm_fini+0x2a7/0x310 [amdgpu]
[  +0.000265]  gmc_v9_0_sw_fini+0x3a/0x40 [amdgpu]
[  +0.000246]  amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
[  +0.000219]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000219]  drm_dev_release+0x20/0x40 [drm]
[  +0.000059]  drm_release+0xa8/0xf0 [drm]
[  +0.000053]  __fput+0xa5/0x250
[  +0.000023]  task_work_run+0x5c/0xa0
[  +0.000026]  exit_to_user_mode_prepare+0x1db/0x1e0
[  +0.000033]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000030]  do_syscall_64+0x47/0x70
[  +0.000018]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000025] RIP: 0033:0x7fcfb86403d7
[  +0.000869] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 f3 fb ff ff
[  +0.001788] RSP: 002b:00007ffc8fc26c28 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000888] RAX: 0000000000000000 RBX: 000055d67a05b6a0 RCX: 00007fcfb86403d7
[  +0.000867] RDX: 00007fcfb8627be0 RSI: 0000000000000000 RDI: 0000000000000003
[  +0.000846] RBP: 000055d67a05b8a0 R08: 0000000000000007 R09: 0000000000000000
[  +0.000816] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000791] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fcfb8659143
[  +0.000770] Modules linked in: amdgpu(E) xt_conntrack xt_MASQUERADE nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter x86_pkg_temp_thermal cdc_ether usbnet acpi_pad msr ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm iommu_v2 gpu_sched drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks [last unloaded: amdgpu]
[  +0.003303] CR2: ffffc01343fc81b4
[  +0.000799] ---[ end trace 2360927435b19009 ]---



The following is the new diff.

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 182b7eae598a..48c3cd4054de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1327,7 +1327,7 @@ int emu_soc_asic_init(struct amdgpu_device *adev);
  * ASICs macro.
  */
 #define amdgpu_asic_set_vga_state(adev, state) (adev)->asic_funcs->set_vga_state((adev), (state))
-#define amdgpu_asic_reset(adev) (adev)->asic_funcs->reset((adev))
+#define amdgpu_asic_reset(adev) ({int r; pr_info("performing amdgpu_asic_reset\n"); r = (adev)->asic_funcs->reset((adev));r;})
 #define amdgpu_asic_reset_method(adev) (adev)->asic_funcs->reset_method((adev))
 #define amdgpu_asic_get_xclk(adev) (adev)->asic_funcs->get_xclk((adev))
 #define amdgpu_asic_set_uvd_clocks(adev, v, d) (adev)->asic_funcs->set_uvd_clocks((adev), (v), (d))
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 27c74fcec455..842abd7150a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -134,6 +134,7 @@ struct amdkfd_process_info {

  /* MMU-notifier related fields */
  atomic_t evicted_bos;
+ atomic_t invalid;
  struct delayed_work restore_userptr_work;
  struct pid *pid;
 };
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 99d2b15bcbf3..2a588eb9f456 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1325,6 +1325,7 @@ static int init_kfd_vm(struct amdgpu_vm *vm, void **process_info,

  info->pid = get_task_pid(current->group_leader, PIDTYPE_PID);
  atomic_set(&info->evicted_bos, 0);
+ atomic_set(&info->invalid, 0);
  INIT_DELAYED_WORK(&info->restore_userptr_work,
   amdgpu_amdkfd_restore_userptr_worker);

@@ -2693,6 +2694,9 @@ static void amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
  struct mm_struct *mm;
  int evicted_bos;

+ if (atomic_read(&process_info->invalid))
+ return;
+


Probably better to again use the drm_dev_enter/exit guard pair instead of this flag.


I don't know if I can use drm_dev_enter/exit efficiently, because a process can have multiple drm_devs open, and I don't know how to efficiently recover/reference the drm_dev(s) in the worker function in order to use drm_dev_enter/exit.
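For reference, the guard pattern being suggested would look roughly like the sketch below inside the worker (hypothetical; how the worker obtains the right drm_device is exactly the open question above):

```c
/* Sketch: drm_dev_enter() returns false once drm_dev_unplug() has been
 * called, so the worker body is skipped for an unplugged device. */
static void amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
{
	int idx;

	if (!drm_dev_enter(ddev, &idx))	/* ddev: the device's drm_device */
		return;

	/* ... existing restore work ... */

	drm_dev_exit(idx);
}
```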


  evicted_bos = atomic_read(&process_info->evicted_bos);
  if (!evicted_bos)
  return;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ec38517ab33f..e7d85d8d282d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1054,6 +1054,7 @@ void amdgpu_device_program_register_sequence(struct amdgpu_device *adev,
  */
 void amdgpu_device_pci_config_reset(struct amdgpu_device *adev)
 {
+ pr_debug("%s called\n",__func__);
  pci_write_config_dword(adev->pdev, 0x7c, AMDGPU_ASIC_RESET_DATA);
 }

@@ -1066,6 +1067,7 @@ void amdgpu_device_pci_config_reset(struct amdgpu_device *adev)
  */
 int amdgpu_device_pci_reset(struct amdgpu_device *adev)
 {
+ pr_debug("%s called\n",__func__);
  return pci_reset_function(adev->pdev);
 }

@@ -4702,6 +4704,8 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
  bool need_full_reset, skip_hw_reset, vram_lost = false;
  int r = 0;

+ pr_debug("%s called\n",__func__);
+
  /* Try reset handler method first */
  tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
     reset_list);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 49bdf9ff7350..b469acb65c1e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2518,7 +2518,6 @@ void amdgpu_ras_late_fini(struct amdgpu_device *adev,
  if (!ras_block || !ih_info)
  return;

- amdgpu_ras_sysfs_remove(adev, ras_block);
  if (ih_info->cb)
  amdgpu_ras_interrupt_remove_handler(adev, ih_info);
 }
@@ -2577,6 +2576,7 @@ void amdgpu_ras_suspend(struct amdgpu_device *adev)
 int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
 {
  struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+ struct ras_manager *obj, *tmp;

  if (!adev->ras_enabled || !con)
  return 0;
@@ -2585,6 +2585,13 @@ int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
  /* Need disable ras on all IPs here before ip [hw/sw]fini */
  amdgpu_ras_disable_all_features(adev, 0);
  amdgpu_ras_recovery_fini(adev);
+
+ /* remove sysfs before pci_remove to avoid OOPSES from sysfs_remove_groups */
+ list_for_each_entry_safe(obj, tmp, &con->head, node) {
+ amdgpu_ras_sysfs_remove(adev, &obj->head);
+ put_obj(obj);
+ }
+
  return 0;
 }

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 4e7d9cb09a69..0fa806a78e39 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -693,16 +693,35 @@ bool kfd_is_locked(void)

 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
 {
+ struct kfd_process *p;
+ struct amdkfd_process_info *p_info;
+ unsigned int temp;
+
  if (!kfd->init_complete)
  return;

  /* for runtime suspend, skip locking kfd */
- if (!run_pm) {
+ if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
  /* For first KFD device suspend all the KFD processes */
  if (atomic_inc_return(&kfd_locked) == 1)
  kfd_suspend_all_processes(force);
  }

+ if (drm_dev_is_unplugged(kfd->ddev)){
+ int idx = srcu_read_lock(&kfd_processes_srcu);
+ pr_debug("cancel restore_userptr_work\n");
+ hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+ if ( kfd_process_gpuidx_from_gpuid(p, kfd->id) >= 0 ) {
+ p_info = p->kgd_process_info;
+ pr_debug("cancel processes, pid = %d for gpu_id = %d", pid_nr(p_info->pid), kfd->id);
+ cancel_delayed_work_sync(&p_info->restore_userptr_work);
+ /* block all future restore_userptr_work */
+ atomic_inc(&p_info->invalid);


Same as I mentioned above with drm_dev_enter/exit.

Same as I mentioned: the process can have many drm_dev files open.

Final comments:

I suspect that my Linux kernel version might not have all the fixes you did for hotplug. Can you point me to the lowest Linux kernel version (5.14 from the kernel repo? amd-drm-staging-next does not work for MI100 out of the box) that would pass all libdrm tests, including the hotplug tests (some tests hang and some fail now)?
P.S. I cloned and built libdrm from source (https://gitlab.freedesktop.org/mesa/drm)

Thank you so much!


Andrey


+ }
+ }
+ srcu_read_unlock(&kfd_processes_srcu, idx);
+ }
+
  kfd->dqm->ops.stop(kfd->dqm);
  kfd_iommu_suspend(kfd);
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index 600ba2a728ea..7e3d1848eccc 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -669,11 +669,12 @@ static void kfd_remove_sysfs_node_entry(struct kfd_topology_device *dev)
 #ifdef HAVE_AMD_IOMMU_PC_SUPPORTED
  if (dev->kobj_perf) {
  list_for_each_entry(perf, &dev->perf_props, list) {
+ sysfs_remove_group(dev->kobj_perf, perf->attr_group);
  kfree(perf->attr_group);
  perf->attr_group = NULL;
  }
  kobject_del(dev->kobj_perf);
- kobject_put(dev->kobj_perf);
+ /* kobject_put(dev->kobj_perf); */
  dev->kobj_perf = NULL;
  }
 #endif

Thank you so much! Looking forward to your comments!

Regards,
Shuotao

Andrey


Thank you so much!

Best regards,
Shuotao

Andrey



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 8fa9b86ac9d2..c0b27f722281 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -188,6 +188,12 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
  kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
 }

+void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev)
+{
+ if (adev->kfd.dev)
+ kgd2kfd_kill_all_user_processes(adev->kfd.dev);
+}
+
 void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
 {
  if (adev->kfd.dev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 27c74fcec455..f4e485d60442 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -141,6 +141,7 @@ struct amdkfd_process_info {
 int amdgpu_amdkfd_init(void);
 void amdgpu_amdkfd_fini(void);

+void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev);
 void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
 int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
 int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm, bool sync);
@@ -405,6 +406,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
  const struct kgd2kfd_shared_resources *gpu_resources);
 void kgd2kfd_device_exit(struct kfd_dev *kfd);
 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force);
+void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd);
 int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
 int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm, bool sync);
 int kgd2kfd_pre_reset(struct kfd_dev *kfd);
@@ -443,6 +445,9 @@ static inline void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
 {
 }

+static inline void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd)
+{
+}
+
 static int __maybe_unused kgd2kfd_resume_iommu(struct kfd_dev *kfd)
 {
  return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 3d5fc0751829..af6fe5080cfa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2101,6 +2101,9 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 {
  struct drm_device *dev = pci_get_drvdata(pdev);

+ /* kill all kfd processes before drm_dev_unplug */
+ amdgpu_amdkfd_kill_all_processes(drm_to_adev(dev));
+
 #ifdef HAVE_DRM_DEV_UNPLUG
  drm_dev_unplug(dev);
 #else
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 5504a18b5a45..480c23bef5e2 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -691,6 +691,12 @@ bool kfd_is_locked(void)
  return  (atomic_read(&kfd_locked) > 0);
 }

+void kgd2kfd_kill_all_user_processes(struct kfd_dev *dev)
+{
+ kfd_kill_all_user_processes();
+}
+
 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
 {
  if (!kfd->init_complete)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 55c9e1922714..a35a2cb5bb9f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1064,6 +1064,7 @@ static inline struct kfd_process_device *kfd_process_device_from_gpuidx(
 void kfd_unref_process(struct kfd_process *p);
 int kfd_process_evict_queues(struct kfd_process *p, bool force);
 int kfd_process_restore_queues(struct kfd_process *p);
+void kfd_kill_all_user_processes(void);
 void kfd_suspend_all_processes(bool force);
 /*
  * kfd_resume_all_processes:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 6cdc855abb6d..17e769e6951d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -46,6 +46,9 @@ struct mm_struct;
 #include "kfd_trace.h"
 #include "kfd_debug.h"

+static atomic_t kfd_process_locked = ATOMIC_INIT(0);
+static atomic_t kfd_inflight_kills = ATOMIC_INIT(0);
+
 /*
  * List of struct kfd_process (field kfd_process).
  * Unique/indexed by mm_struct*
@@ -802,6 +805,9 @@ struct kfd_process *kfd_create_process(struct task_struct *thread)
  struct kfd_process *process;
  int ret;

+ if (atomic_read(&kfd_process_locked) > 0)
+ return ERR_PTR(-EINVAL);
+
  if (!(thread->mm && mmget_not_zero(thread->mm)))
  return ERR_PTR(-EINVAL);

@@ -1126,6 +1132,10 @@ static void kfd_process_wq_release(struct work_struct *work)
  put_task_struct(p->lead_thread);

  kfree(p);
+
+ if (atomic_read(&kfd_process_locked) > 0) {
+ atomic_dec(&kfd_inflight_kills);
+ }
 }

 static void kfd_process_ref_release(struct kref *ref)
@@ -2186,6 +2196,35 @@ static void restore_process_worker(struct work_struct *work)
  pr_err("Failed to restore queues of pasid 0x%x\n", p->pasid);
 }

+void kfd_kill_all_user_processes(void)
+{
+ struct kfd_process *p;
+ unsigned int temp;
+ int idx;
+
+ atomic_inc(&kfd_process_locked);
+
+ idx = srcu_read_lock(&kfd_processes_srcu);
+ pr_info("Killing all processes\n");
+ hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+ dev_warn(kfd_device,
+ "Sending SIGBUS to process %d (pasid 0x%x)\n",
+ p->lead_thread->pid, p->pasid);
+ send_sig(SIGBUS, p->lead_thread, 0);
+ atomic_inc(&kfd_inflight_kills);
+ }
+ srcu_read_unlock(&kfd_processes_srcu, idx);
+
+ while (atomic_read(&kfd_inflight_kills) > 0) {
+ dev_warn(kfd_device,
+ "kfd_processes_table is not empty, going to sleep for 10ms\n");
+ msleep(10);
+ }
+
+ atomic_dec(&kfd_process_locked);
+ pr_info("all processes have been fully released\n");
+}
+
 void kfd_suspend_all_processes(bool force)
 {
  struct kfd_process *p;



Regards,
Shuotao



Andrey


+       }
+       srcu_read_unlock(&kfd_processes_srcu, idx);
+}
+
+
 int kfd_resume_all_processes(bool sync)
 {
        struct kfd_process *p;


Andrey



Really appreciate your help!

Best,
Shuotao

2. Remove redundant p2p/io links in sysfs when a device is hotplugged
out.

3. The new kfd node_id is not properly assigned after a new device is
added once a GPU has been hotplugged out of the system. libhsakmt
will detect this anomaly (i.e. node_from != <dev node id> in iolinks)
when taking a topology snapshot, and thus return a fault to the ROCm
stack.

-- This patch fixes issue 1; another patch by Mukul fixes issues 2 & 3.
-- Tested on a 4-GPU MI100 node with kernel 5.13.0-kfd; kernel
5.16.0-kfd is unstable out of the box for MI100.
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 +++++
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 7 +++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 13 +++++++++++++
4 files changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index c18c4be1e4ac..d50011bdb5c4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
return r;
}

+int amdgpu_amdkfd_resume_processes(void)
+{
+ return kgd2kfd_resume_processes();
+}
+
int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
{
int r = 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index f8b9f27adcf5..803306e011c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
+int amdgpu_amdkfd_resume_processes(void);
void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
const void *ih_ring_entry);
void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
@@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
+int kgd2kfd_resume_processes(void);
int kgd2kfd_pre_reset(struct kfd_dev *kfd);
int kgd2kfd_post_reset(struct kfd_dev *kfd);
void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
@@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
return 0;
}

+static inline int kgd2kfd_resume_processes(void)
+{
+ return 0;
+}
+
static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
{
return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index fa4a9f13c922..5827b65b7489 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
if (drm_dev_is_unplugged(adev_to_drm(adev)))
amdgpu_device_unmap_mmio(adev);

+ amdgpu_amdkfd_resume_processes();
}

void amdgpu_device_fini_sw(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 62aa6c9d5123..ef05aae9255e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
return ret;
}

+/* for non-runtime resume only */
+int kgd2kfd_resume_processes(void)
+{
+ int count;
+
+ count = atomic_dec_return(&kfd_locked);
+ WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
+ if (count == 0)
+ return kfd_resume_all_processes();
+
+ return 0;
+}

It doesn't make sense to me to just increment kfd_locked in
kgd2kfd_suspend only to decrement it again a few functions down the
road.

I suggest this instead - you only increment if not during PCI remove:

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 1c2cf3a33c1f..7754f77248a4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -952,11 +952,12 @@ bool kfd_is_locked(void)

void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
{
+
if (!kfd->init_complete)
return;

/* for runtime suspend, skip locking kfd */
- if (!run_pm) {
+ if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
/* For first KFD device suspend all the KFD processes */
if (atomic_inc_return(&kfd_locked) == 1)
kfd_suspend_all_processes();


Andrey



+
int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
{
int err = 0;






[-- Attachment #2: Type: text/html, Size: 189042 bytes --]

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-19  7:41                         ` Shuotao Xu
@ 2022-04-19 16:01                           ` Andrey Grodzovsky
  2022-04-19 16:18                             ` Felix Kuehling
  2022-04-20  9:24                             ` Shuotao Xu
  0 siblings, 2 replies; 31+ messages in thread
From: Andrey Grodzovsky @ 2022-04-19 16:01 UTC (permalink / raw)
  To: Shuotao Xu
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 72428 bytes --]


On 2022-04-19 03:41, Shuotao Xu wrote:
>
>
>> On Apr 18, 2022, at 11:23 PM, Andrey Grodzovsky 
>> <andrey.grodzovsky@amd.com> wrote:
>>
>>
>> On 2022-04-18 09:22, Shuotao Xu wrote:
>>>
>>>
>>>> On Apr 16, 2022, at 12:43 AM, Andrey Grodzovsky 
>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>
>>>>
>>>> On 2022-04-15 06:12, Shuotao Xu wrote:
>>>>> Hi Andrey,
>>>>>
>>>>> First I really appreciate the discussion! It helped me understand 
>>>>> the driver code greatly. Thank you so much:)
>>>>> Please see my inline comments.
>>>>>
>>>>>> On Apr 14, 2022, at 11:13 PM, Andrey Grodzovsky 
>>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>>
>>>>>>
>>>>>> On 2022-04-14 10:00, Shuotao Xu wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Apr 14, 2022, at 1:31 AM, Andrey Grodzovsky 
>>>>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2022-04-13 12:03, Shuotao Xu wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky 
>>>>>>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>>>>>>
>>>>>>>>>> [Some people who received this message don't often get email 
>>>>>>>>>> from andrey.grodzovsky@amd.com. Learn why this is important 
>>>>>>>>>> at http://aka.ms/LearnAboutSenderIdentification.]
>>>>>>>>>>
>>>>>>>>>> On 2022-04-08 21:28, Shuotao Xu wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky 
>>>>>>>>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> [Some people who received this message don't often get 
>>>>>>>>>>>> email from andrey.grodzovsky@amd.com. Learn why this is 
>>>>>>>>>>>> important at http://aka.ms/LearnAboutSenderIdentification.]
>>>>>>>>>>>>
>>>>>>>>>>>> On 2022-04-08 04:45, Shuotao Xu wrote:
>>>>>>>>>>>>> Adding PCIe Hotplug Support for AMDKFD: the support of 
>>>>>>>>>>>>> hot-plug of GPU
>>>>>>>>>>>>> devices can open doors for many advanced applications in 
>>>>>>>>>>>>> data center
>>>>>>>>>>>>> in the next few years, such as for GPU resource
>>>>>>>>>>>>> disaggregation. Current AMDKFD does not support hotplug 
>>>>>>>>>>>>> out b/o the
>>>>>>>>>>>>> following reasons:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. During PCIe removal, decrement KFD lock which was 
>>>>>>>>>>>>> incremented at
>>>>>>>>>>>>> the beginning of hw fini; otherwise kfd_open later is going to
>>>>>>>>>>>>> fail.
>>>>>>>>>>>> I assumed you read my comment last time, still you do same 
>>>>>>>>>>>> approach.
>>>>>>>>>>>> More in details bellow
>>>>>>>>>>> Aha, I like your fix:) I was not familiar with drm APIs so 
>>>>>>>>>>> just only half understood your comment last time.
>>>>>>>>>>>
>>>>>>>>>>> BTW, I tried hot-plugging out a GPU when rocm application is 
>>>>>>>>>>> still running.
>>>>>>>>>>> From dmesg, application is still trying to access the 
>>>>>>>>>>> removed kfd device, and are met with some errors.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Application us supposed to keep running, it holds the drm_device
>>>>>>>>>> reference as long as it has an open
>>>>>>>>>> FD to the device and final cleanup will come only after the 
>>>>>>>>>> app will die
>>>>>>>>>> thus releasing the FD and the last
>>>>>>>>>> drm_device reference.
>>>>>>>>>>
>>>>>>>>>>> Application would hang and not exiting in this case.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Actually I tried kill -7 $pid, and the process exists. The 
>>>>>>>>> dmesg has some warning though.
>>>>>>>>>
>>>>>>>>> [  711.769977] WARNING: CPU: 23 PID: 344 at 
>>>>>>>>> .../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336 
>>>>>>>>> amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
>>>>>>>>> [  711.770528] Modules linked in: amdgpu(OE) amdttm(OE) 
>>>>>>>>> amd_sched(OE) amdkcl(OE) iommu_v2 nf_conntrack_netlink 
>>>>>>>>> nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter 
>>>>>>>>> xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat 
>>>>>>>>> xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 
>>>>>>>>> ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc 
>>>>>>>>> ebtable_filter ebtables ip6table_filter ip6_tables 
>>>>>>>>> iptable_filter overlay binfmt_misc intel_rapl_msr i40iw 
>>>>>>>>> intel_rapl_common skx_edac nfit x86_pkg_temp_thermal 
>>>>>>>>> intel_powerclamp coretemp kvm_intel rpcrdma kvm sunrpc 
>>>>>>>>> ipmi_ssif ib_umad ib_ipoib rdma_ucm irqbypass rapl joydev 
>>>>>>>>> acpi_ipmi input_leds intel_cstate ipmi_si ipmi_devintf mei_me 
>>>>>>>>> mei intel_pch_thermal ipmi_msghandler ioatdma mac_hid lpc_ich 
>>>>>>>>> dca acpi_power_meter acpi_pad sch_fq_codel ib_iser rdma_cm 
>>>>>>>>> iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi 
>>>>>>>>> scsi_transport_iscsi pci_stub ip_tables x_tables autofs4 btrfs 
>>>>>>>>> blake2b_generic zstd_compress raid10 raid456 async_raid6_recov 
>>>>>>>>> async_memcpy async_pq async_xor async_tx xor
>>>>>>>>> [  711.779359]  raid6_pq libcrc32c raid1 raid0 multipath 
>>>>>>>>> linear mlx5_ib ib_uverbs ib_core ast drm_vram_helper 
>>>>>>>>> i2c_algo_bit drm_ttm_helper ttm drm_kms_helper syscopyarea 
>>>>>>>>> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sysfillrect 
>>>>>>>>> uas hid_generic sysimgblt aesni_intel mlx5_core fb_sys_fops 
>>>>>>>>> crypto_simd usbhid cryptd drm i40e pci_hyperv_intf usb_storage 
>>>>>>>>> glue_helper mlxfw hid ahci libahci wmi
>>>>>>>>> [  711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: G 
>>>>>>>>>        W  OE     5.11.0+ #1
>>>>>>>>> [  711.779755] Hardware name: Supermicro 
>>>>>>>>> SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
>>>>>>>>> [  711.779756] Workqueue: kfd_process_wq 
>>>>>>>>> kfd_process_wq_release [amdgpu]
>>>>>>>>> [  711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160 
>>>>>>>>> [amdgpu]
>>>>>>>>> [  711.780141] Code: e8 b5 af 34 f4 e9 1f ff ff ff 48 39 c2 74 
>>>>>>>>> 07 0f 0b e9 69 ff ff ff 4c 89 e7 e8 3c b4 16 00 e9 5c ff ff ff 
>>>>>>>>> e8 a2 ce fd f3 eb cf <0f> 0b eb cb e8 d7 02 34 f4 0f 1f 80 00 
>>>>>>>>> 00 00 00 0f 1f 44 00 00 55
>>>>>>>>> [  711.780143] RSP: 0018:ffffa8100dd67c30 EFLAGS: 00010282
>>>>>>>>> [  711.780145] RAX: 00000000ffffffea RBX: ffff89980e792058 
>>>>>>>>> RCX: 0000000000000000
>>>>>>>>> [  711.780147] RDX: 0000000000000000 RSI: ffff89a8f9ad8870 
>>>>>>>>> RDI: ffff89a8f9ad8870
>>>>>>>>> [  711.780148] RBP: ffffa8100dd67c50 R08: 0000000000000000 
>>>>>>>>> R09: fffffffffff99b18
>>>>>>>>> [  711.780149] R10: ffffa8100dd67bd0 R11: ffffa8100dd67908 
>>>>>>>>> R12: ffff89980e792000
>>>>>>>>> [  711.780151] R13: ffff89980e792058 R14: ffff89980e7921bc 
>>>>>>>>> R15: dead000000000100
>>>>>>>>> [  711.780152] FS:  0000000000000000(0000) 
>>>>>>>>> GS:ffff89a8f9ac0000(0000) knlGS:0000000000000000
>>>>>>>>> [  711.780154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>> [  711.780156] CR2: 00007ffddac6f71f CR3: 00000030bb80a003 
>>>>>>>>> CR4: 00000000007706e0
>>>>>>>>> [  711.780157] DR0: 0000000000000000 DR1: 0000000000000000 
>>>>>>>>> DR2: 0000000000000000
>>>>>>>>> [  711.780159] DR3: 0000000000000000 DR6: 00000000fffe0ff0 
>>>>>>>>> DR7: 0000000000000400
>>>>>>>>> [  711.780160] PKRU: 55555554
>>>>>>>>> [  711.780161] Call Trace:
>>>>>>>>> [  711.780163]  ttm_bo_release+0x2ae/0x320 [amdttm]
>>>>>>>>> [  711.780169]  amdttm_bo_put+0x30/0x40 [amdttm]
>>>>>>>>> [  711.780357]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
>>>>>>>>> [  711.780543]  amdgpu_gem_object_free+0x8b/0x160 [amdgpu]
>>>>>>>>> [  711.781119]  drm_gem_object_free+0x1d/0x30 [drm]
>>>>>>>>> [  711.781489] 
>>>>>>>>>  amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x34a/0x380 [amdgpu]
>>>>>>>>> [  711.782044]  kfd_process_device_free_bos+0xe0/0x130 [amdgpu]
>>>>>>>>> [  711.782611]  kfd_process_wq_release+0x286/0x380 [amdgpu]
>>>>>>>>> [  711.783172]  process_one_work+0x236/0x420
>>>>>>>>> [  711.783543]  worker_thread+0x34/0x400
>>>>>>>>> [  711.783911]  ? process_one_work+0x420/0x420
>>>>>>>>> [  711.784279]  kthread+0x126/0x140
>>>>>>>>> [  711.784653]  ? kthread_park+0x90/0x90
>>>>>>>>> [  711.785018]  ret_from_fork+0x22/0x30
>>>>>>>>> [  711.785387] ---[ end trace d8f50f6594817c84 ]---
>>>>>>>>> [  711.798716] [drm] amdgpu: ttm finalized
>>>>>>>>
>>>>>>>>
>>>>>>>> So it means the process was stuck in some wait_event_killable 
>>>>>>>> (maybe here drm_sched_entity_flush) - you can try 
>>>>>>>> 'cat /proc/$process_pid/stack' maybe before
>>>>>>>> you kill it to see where it was stuck so we can go from there.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> For graphic apps what i usually see is a crash because of 
>>>>>>>>>> sigsev when
>>>>>>>>>> the app tries to access
>>>>>>>>>> an unmapped MMIO region on the device. I haven't tested for 
>>>>>>>>>> compute
>>>>>>>>>> stack and so there might
>>>>>>>>>> be something I haven't covered. Hang could mean for example 
>>>>>>>>>> waiting on a
>>>>>>>>>> fence which is not being
>>>>>>>>>> signaled - please provide full dmesg from this case.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Do you have any good suggestions on how to fix it down the 
>>>>>>>>>>> line? (HIP runtime/libhsakmt or driver)
>>>>>>>>>>>
>>>>>>>>>>> [64036.631333] amdgpu: amdgpu_vm_bo_update failed
>>>>>>>>>>> [64036.631702] amdgpu: validate_invalid_user_pages: update 
>>>>>>>>>>> PTE failed
>>>>>>>>>>> [64036.640754] amdgpu: amdgpu_vm_bo_update failed
>>>>>>>>>>> [64036.641120] amdgpu: validate_invalid_user_pages: update 
>>>>>>>>>>> PTE failed
>>>>>>>>>>> [64036.650394] amdgpu: amdgpu_vm_bo_update failed
>>>>>>>>>>> [64036.650765] amdgpu: validate_invalid_user_pages: update 
>>>>>>>>>>> PTE failed
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The full dmesg will just the repetition of those two messages,
>>>>>>>>> [186885.764079] amdgpu 0000:43:00.0: amdgpu: amdgpu: finishing 
>>>>>>>>> device.
>>>>>>>>> [186885.766916] [drm] free PSP TMR buffer
>>>>>>>>> [186893.868173] amdgpu: amdgpu_vm_bo_update failed
>>>>>>>>> [186893.868235] amdgpu: validate_invalid_user_pages: update 
>>>>>>>>> PTE failed
>>>>>>>>> [186893.876154] amdgpu: amdgpu_vm_bo_update failed
>>>>>>>>> [186893.876190] amdgpu: validate_invalid_user_pages: update 
>>>>>>>>> PTE failed
>>>>>>>>> [186893.884150] amdgpu: amdgpu_vm_bo_update failed
>>>>>>>>> [186893.884185] amdgpu: validate_invalid_user_pages: update 
>>>>>>>>> PTE failed
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This just probably means trying to update PTEs after the 
>>>>>>>>>> physical device
>>>>>>>>>> is gone - we usually avoid this by
>>>>>>>>>> first trying to do all HW shutdowns early before PCI remove 
>>>>>>>>>> completion
>>>>>>>>>> but when it's really tricky by
>>>>>>>>>> protecting HW access sections with drm_dev_enter/exit scope.
>>>>>>>>>>
>>>>>>>>>> For this particular error it would be the best to flush
>>>>>>>>>> info->restore_userptr_work before the end of
>>>>>>>>>> amdgpu_pci_remove (rejecting new process creation and calling
>>>>>>>>>> cancel_delayed_work_sync(&process_info->restore_userptr_work) 
>>>>>>>>>> for all
>>>>>>>>>> running processes)
>>>>>>>>>> somewhere in amdgpu_pci_remove.
>>>>>>>>>>
>>>>>>>>> I tried something like *kfd_process_ref_release* which I think 
>>>>>>>>> did what you described, but it did not work.
>>>>>>>>
>>>>>>>>
>>>>>>>> I don't see how kfd_process_ref_release is the same as I 
>>>>>>>> mentioned above, what i meant is calling the code above within 
>>>>>>>> kgd2kfd_suspend (where you tried to call 
>>>>>>>> kfd_kill_all_user_processes bellow)
>>>>>>>>
>>>>>>> Yes, you are right. It was not called by it.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Instead I tried to kill the process from the kernel, but the 
>>>>>>>>> amdgpu could **only** be hot-plugged in back successfully only 
>>>>>>>>> if there was no rocm kernel running when it was plugged out. 
>>>>>>>>> If not, amdgpu_probe will just hang later. (Maybe because 
>>>>>>>>> amdgpu was plugged out while running state, it leaves a bad HW 
>>>>>>>>> state which causes probe to hang).
>>>>>>>>
>>>>>>>>
>>>>>>>> We usually do asic_reset during probe to reset all HW state 
>>>>>>>> (checlk if amdgpu_device_init->amdgpu_asic_reset is running 
>>>>>>>> when you plug back).
>>>>>>>>
>>>>>>> OK
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I don’t know if this is a viable solution worth pursuing, but 
>>>>>>>>> I attached the diff anyway.
>>>>>>>>>
>>>>>>>>> Another solution could be let compute stack user mode detect a 
>>>>>>>>> topology change via generation_count change, and abort 
>>>>>>>>> gracefully there.
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>> index 4e7d9cb09a69..79b4c9b84cd0 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>> @@ -697,12 +697,15 @@ void kgd2kfd_suspend(struct kfd_dev 
>>>>>>>>> *kfd, bool run_pm, bool force)
>>>>>>>>>       return;
>>>>>>>>>
>>>>>>>>>         /* for runtime suspend, skip locking kfd */
>>>>>>>>> -       if (!run_pm) {
>>>>>>>>> +       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>>>>>>>       /* For first KFD device suspend all the KFD processes */
>>>>>>>>>       if (atomic_inc_return(&kfd_locked) == 1)
>>>>>>>>> kfd_suspend_all_processes(force);
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>> +       if (drm_dev_is_unplugged(kfd->ddev))
>>>>>>>>> + kfd_kill_all_user_processes();
>>>>>>>>> +
>>>>>>>>> kfd->dqm->ops.stop(kfd->dqm);
>>>>>>>>> kfd_iommu_suspend(kfd);
>>>>>>>>>  }
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
>>>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>>>>>> index 55c9e1922714..84cbcd857856 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>>>>>> @@ -1065,6 +1065,7 @@ void kfd_unref_process(struct 
>>>>>>>>> kfd_process *p);
>>>>>>>>>  int kfd_process_evict_queues(struct kfd_process *p, bool force);
>>>>>>>>>  int kfd_process_restore_queues(struct kfd_process *p);
>>>>>>>>>  void kfd_suspend_all_processes(bool force);
>>>>>>>>> +void kfd_kill_all_user_processes(void);
>>>>>>>>>  /*
>>>>>>>>>   * kfd_resume_all_processes:
>>>>>>>>>   * bool sync: If kfd_resume_all_processes() should wait for the
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
>>>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>>>>>> index 6cdc855abb6d..fb0c753b682c 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>>>>>> @@ -2206,6 +2206,24 @@ void kfd_suspend_all_processes(bool force)
>>>>>>>>> srcu_read_unlock(&kfd_processes_srcu, idx);
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>> +
>>>>>>>>> +void kfd_kill_all_user_processes(void)
>>>>>>>>> +{
>>>>>>>>> + struct kfd_process *p;
>>>>>>>>> + struct amdkfd_process_info *p_info;
>>>>>>>>> + unsigned int temp;
>>>>>>>>> + int idx = srcu_read_lock(&kfd_processes_srcu);
>>>>>>>>> +
>>>>>>>>> + pr_info("Killing all processes\n");
>>>>>>>>> + hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>>>>>>>>> +       p_info = p->kgd_process_info;
>>>>>>>>> +       pr_info("Killing  processes, pid = %d", 
>>>>>>>>> pid_nr(p_info->pid));
>>>>>>>>> + kill_pid(p_info->pid, SIGBUS, 1);
>>>>>>>>
>>>>>>>>
>>>>>>>> From looking into kill_pid I see it only sends a signal but 
>>>>>>>> doesn't wait for completion, it would make sense to wait for 
>>>>>>>> completion here. In any case I would actually try to put here
>>>>>>>>
>>>>>>> I have made a version which does that with some atomic counters. 
>>>>>>> Please read later in the diff.
>>>>>>>>
>>>>>>>>
>>>>>>>> hash_for_each_rcu(p_info)
>>>>>>>> cancel_delayed_work_sync(&p_info->restore_userptr_work)
>>>>>>>>
>>>>>>>> instead at least that what i meant in the previous mail.
>>>>>>>>
>>>>>>> I actually tried that earlier, and it did not work. Application 
>>>>>>> still keeps running, and you have to send a kill to the user 
>>>>>>> process.
>>>>>>>
>>>>>>> I have made the following version. It waits for processes to 
>>>>>>> terminate synchronously after sending SIGBUS. After that it does 
>>>>>>> the real work of amdgpu_pci_remove.
>>>>>>> However, it hangs at amdgpu_device_ip_fini_early when it is 
>>>>>>> trying to deinit ip_block 6 <sdma_v4_0> 
>>>>>>> (https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L2818). 
>>>>>>> I assume that there are still some inflight dma, therefore fini 
>>>>>>> of this ip block thus hangs?
>>>>>>>
>>>>>>> The following is an excerpt of the dmesg: please excuse for 
>>>>>>> putting my own pr_info, but I hope you get my point of where it 
>>>>>>> hangs.
>>>>>>>
>>>>>>> [  392.344735] amdgpu: all processes has been fully released
>>>>>>> [  392.346557] amdgpu: amdgpu_acpi_fini done
>>>>>>> [  392.346568] amdgpu 0000:b3:00.0: amdgpu: amdgpu: finishing 
>>>>>>> device.
>>>>>>> [  392.349238] amdgpu: amdgpu_device_ip_fini_early enter 
>>>>>>> ip_blocks = 9
>>>>>>> [  392.349248] amdgpu: Free mem_obj = 000000007bf54275, 
>>>>>>> range_start = 14, range_end = 14
>>>>>>> [  392.350299] amdgpu: Free mem_obj = 00000000a85bc878, 
>>>>>>> range_start = 12, range_end = 12
>>>>>>> [  392.350304] amdgpu: Free mem_obj = 00000000b8019e32, 
>>>>>>> range_start = 13, range_end = 13
>>>>>>> [  392.350308] amdgpu: Free mem_obj = 000000002d296168, 
>>>>>>> range_start = 4, range_end = 11
>>>>>>> [  392.350313] amdgpu: Free mem_obj = 000000001fc4f934, 
>>>>>>> range_start = 0, range_end = 3
>>>>>>> [  392.350322] amdgpu: amdgpu_amdkfd_suspend(adev, false) done
>>>>>>> [  392.350672] amdgpu: hw_fini of IP block[8] <jpeg_v2_5> done 0
>>>>>>> [  392.350679] amdgpu: hw_fini of IP block[7] <vcn_v2_5> done 0
>>>>>>
>>>>>>
>>>>>> I just remembered that the idea to actively kill and wait for 
>>>>>> running user processes during unplug was rejected
>>>>>> as a bad idea in the first iteration of unplug work I did (don't 
>>>>>> remember why now, need to look) so this is a no go.
>>>>>>
>>>>> Maybe an application has kfd open, but was not accessing the dev. 
>>>>> So killing it at unplug could kill the process unnecessarily.
>>>>> However, the latest version I had with the sleep function got rid 
>>>>> of the IP block fini hang.
>>>>>>
>>>>>> Our policy is to let zombie processes (zombie in the sense that the
>>>>>> underlying device is gone) live as long as they want
>>>>>> (as long as you are able to terminate them - which you can, so that's ok)
>>>>>> and the system should finish PCI remove gracefully and be able to
>>>>>> hot plug the device back.  Hence, I suggest dropping
>>>>>> this direction of forcing all user processes to be killed,
>>>>>> confirming you have a graceful shutdown and removal of the device
>>>>>> from the PCI topology, and then concentrating on why it hangs
>>>>>> when you plug it back.
>>>>>>
>>>>> So I basically reverted to the original solution which you
>>>>> suggested.
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> index 4e7d9cb09a69..5504a18b5a45 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> @@ -697,7 +697,7 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool 
>>>>> run_pm, bool force)
>>>>>                 return;
>>>>>
>>>>>         /* for runtime suspend, skip locking kfd */
>>>>> -       if (!run_pm) {
>>>>> +       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>>>                 /* For first KFD device suspend all the KFD 
>>>>> processes */
>>>>>                 if (atomic_inc_return(&kfd_locked) == 1)
>>>>> kfd_suspend_all_processes(force);
>>>>>
>>>>>> First confirm if ASIC reset happens on
>>>>>> next init.
>>>>>>
>>>>> This patch works great for a *planned* plugout, where all the ROCm
>>>>> processes are killed before the plugout, and the device can be added
>>>>> back without a problem.
>>>>> However, an *unplanned* plugout while ROCm processes are
>>>>> running just doesn’t work.
>>>>
>>>>
>>>> Still, I am not clear whether an ASIC reset happens on plug back or
>>>> not; can you trace this please?
>>>>
>>>>
>>>
>>> I tried adding pr_info into the asic_reset functions, but cannot trace
>>> any of them being hit upon plug-back.
>>
>>
>> This could possibly explain the hang on plug back. Can you see why we 
>> don't get there ?
>>
>>
> Is amdgpu supposed to do an ASIC reset each time it is probed? Right
> now it seems to probe ok (it did not hang). I will trace back further.


Yep


>>>>
>>>>>> Second please confirm if the timing you kill manually the user 
>>>>>> process has impact on whether you have a hang
>>>>>> on next plug back (if you kill before
>>>>>>
>>>>> *Scenario 0: Kill before plug back*
>>>>>
>>>>> 1. echo 1 > /sys/bus/pci/…/remove, would finish.
>>>>> But the application won’t exit until there is a kill signal.
>>>>
>>>>
>>>> Why do you think it must exit?
>>>>
>>> Because ROCm will need to release the drm descriptor to
>>> get amdgpu_amdkfd_device_fini_sw called, which would eventually get
>>> kgd2kfd_device_exit called. This would clean up kfd_topology at
>>> least. Otherwise I don’t see how it would be added back without
>>> messing up the kfd topology, to say the least.
>>>
>>> However, those are all based on my own observations. Please explain
>>> why it does not need to exit, if you believe so.
>>
>>
>> Note that when you add back a new device, a pci device and drm device
>> are created. I am not an expert on KFD code, but I believe a new
>> KFD device is also created, independent of the old one, and so the
>> topology should see just 2 device instances (one old zombie and one
>> real new one).  I know at least this wasn't an issue for the graphics
>> stack in the exact same scenario, and the libdrm tests I pointed to
>> test exactly this scenario.
>>
> Yes, regardless of the OOPS in ttm_bo_cleanup_refs, I plugged back the
> gpu, and I think it got probed all right; however, the old kfd node is
> still there.
> I passed the libdrm basic test suite on the plugged-back device. The bo
> test hangs out of the box even without hotplug (see dmesg below).
>
>  kernel:[ 1609.029125] watchdog: BUG: soft lockup - CPU#39 stuck for 
> 89s! [amdgpu_test:36407]
> [  +0.000407] Code: 48 89 47 18 48 89 47 20 48 89 47 28 48 89 47 30 48 
> 89 47 38 48 8d 7f 40 75 d9 90 c3 0f 1f 80 00 00 00 00 b9 00 10 00 00 
> 31 c0 <f3> aa c3 cc cc cc cc cc cc 0f 1f 44 00 00 48 85 ff 0f 84 f2 00 00
> [  +0.000856] RSP: 0018:ffffb996b57b3c40 EFLAGS: 00010246
> [  +0.000434] RAX: 0000000000000000 RBX: ffff9cc7f8706e88 RCX: 
> 0000000000000980
> [  +0.000436] RDX: fffff935b17d9140 RSI: fffff935b17e0000 RDI: 
> ffff9c831f645680
> [  +0.000439] RBP: 0000000000000400 R08: fffff935b17d0000 R09: 
> 0000000000000000
> [  +0.000447] R10: 0000000000000000 R11: 0000000000000000 R12: 
> 000000000000000a
> [  +0.000437] R13: ffff9cc783980a20 R14: 000000000b5dbc00 R15: 
> ffff9cc7f8706078
> [  +0.000438] FS:  00007ff1ef611300(0000) GS:ffff9d453efc0000(0000) 
> knlGS:0000000000000000
> [  +0.000445] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  +0.000462] CR2: 00007f418bbb9320 CR3: 000000819fa84006 CR4: 
> 0000000000770ee0
> [  +0.000466] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> 0000000000000000
> [  +0.000451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
> 0000000000000400
> [  +0.000455] PKRU: 55555554
> [  +0.000451] Call Trace:
> [  +0.000448]  ttm_pool_free+0x110/0x230 [ttm]
> [  +0.000451]  ttm_tt_unpopulate+0x5e/0xb0 [ttm]
> [  +0.000454]  ttm_tt_destroy_common+0xe/0x30 [ttm]
> [  +0.000453]  amdgpu_ttm_backend_destroy+0x1e/0x70 [amdgpu]
> [  +0.000569]  ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
> [  +0.000458]  ttm_bo_release+0x286/0x500 [ttm]
> [  +0.000450]  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
> [  +0.000544]  amdgpu_gem_object_free+0xad/0x160 [amdgpu]
> [  +0.000534]  drm_gem_object_release_handle+0x6a/0x80 [drm]
> [  +0.000476]  drm_gem_handle_delete+0x5b/0xa0 [drm]
> [  +0.000465]  ? drm_gem_handle_create+0x40/0x40 [drm]
> [  +0.000469]  drm_ioctl_kernel+0xab/0xf0 [drm]
> [  +0.000458]  drm_ioctl+0x1ec/0x390 [drm]
> [  +0.000440]  ? drm_gem_handle_create+0x40/0x40 [drm]
> [  +0.000438]  ? selinux_file_ioctl+0x17d/0x220
> [  +0.000423]  ? lock_release+0x1ce/0x270
> [  +0.000416]  ? trace_hardirqs_on+0x1b/0xd0
> [  +0.000418]  ? _raw_spin_unlock_irqrestore+0x2d/0x40
> [  +0.000419]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
> [  +0.000499]  __x64_sys_ioctl+0x80/0xb0
> [  +0.000414]  do_syscall_64+0x3a/0x70
> [  +0.000400]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  +0.000387] RIP: 0033:0x7ff1ef7263db
> [  +0.000371] Code: 0f 1e fa 48 8b 05 b5 7a 0d 00 64 c7 00 26 00 00 00 
> 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 
> 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 85 7a 0d 00 f7 d8 64 89 01 48
> [  +0.000763] RSP: 002b:00007ffdf1cd0278 EFLAGS: 00000246 ORIG_RAX: 
> 0000000000000010
> [  +0.000386] RAX: ffffffffffffffda RBX: 00007ffdf1cd02b0 RCX: 
> 00007ff1ef7263db
> [  +0.000383] RDX: 00007ffdf1cd02b0 RSI: 0000000040086409 RDI: 
> 0000000000000007
> [  +0.000396] RBP: 0000000040086409 R08: 00005574eefd5c60 R09: 
> 00005574eefdd360
> [  +0.000391] R10: 00005574eefd4010 R11: 0000000000000246 R12: 
> 00005574eefd66d8
> [  +0.000386] R13: 0000000000000007 R14: 0000000000000000 R15: 
> 00007ff1ef830143
>
>
> I also tried to run a tf benchmark on the newly plugged nodes (one of
> the nodes is a dummy), but it failed.
> Can we have some confirmation from the kfd team that they have
> considered a zombie kfd node?
>
>> Also note that even with a running graphics stack there is always a KFD
>> device and KFD topology present, but of course probably not the same
>> as when you run a KFD-facing process, so there could be some issues there.
>>
>> Also note that because of this patch
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=267d51d77fdae8708b94e1a24b8e5d961297edb7
>> all MMIO accesses from such zombie/orphan user processes will be
>> remapped to the zero page, and so they will not necessarily experience a
>> segfault when device removal happens, but rather maybe some crash due
>> to NULL data read from MMIO by the process and used in some manner.
>>
> It depends on where the application is when the device is plugged out.
>
> In one case, the application keeps reporting out-of-memory
> but won’t exit. In another case, it would wait for a signal.
>
> 2022-04-18 12:42:38.939303: E 
> tensorflow/stream_executor/rocm/rocm_driver.cc:692]
> failed to alloc 2304 bytes on host: HIP_ERROR_OutOfMemory
> 2022-04-18 12:42:38.939322: W 
> ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could 
> not allocate pinned host memory of size: 2304
> 2022-04-18 12:42:38.940772: E 
> tensorflow/stream_executor/rocm/rocm_driver.cc:692]
> failed to alloc 2304 bytes on host: HIP_ERROR_OutOfMemory
> 2022-04-18 12:42:38.940791: W 
> ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could 
> not allocate pinned host memory of size: 2304
> 2022-04-18 12:42:38.942379: E 
> tensorflow/stream_executor/rocm/rocm_driver.cc:692]
> failed to alloc 2304 bytes on host: HIP_ERROR_OutOfMemory
> 2022-04-18 12:42:38.942399: W 
> ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could 
> not allocate pinned host memory of size: 2304
> 2022-04-18 12:42:38.943829: E 
> tensorflow/stream_executor/rocm/rocm_driver.cc:692]
> failed to alloc 2304 bytes on host: HIP_ERROR_OutOfMemory
> 2022-04-18 12:42:38.943849: W 
> ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could 
> not allocate pinned host memory of size: 2304
> 2022-04-18 12:42:38.945272: E 
> tensorflow/stream_executor/rocm/rocm_driver.cc:692]
> failed to alloc 2304 bytes on host: HIP_ERROR_OutOfMemory
> 2022-04-18 12:42:38.945292: W 
> ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could 
> not allocate pinned host memory of size: 2304
>
>>
>>>>>
>>>>> 2. Kill the process. The application does several things and
>>>>> seems to trigger drm_release in the kernel, which is met with a
>>>>> kernel NULL pointer dereference related to sysfs_remove. Then the
>>>>> whole fs just freezes.
>>>>>
>>>>> [  +0.002498] BUG: kernel NULL pointer dereference, address: 
>>>>> 0000000000000098
>>>>> [  +0.000486] #PF: supervisor read access in kernel mode
>>>>> [  +0.000545] #PF: error_code(0x0000) - not-present page
>>>>> [  +0.000551] PGD 0 P4D 0
>>>>> [  +0.000553] Oops: 0000 [#1] SMP NOPTI
>>>>> [  +0.000540] CPU: 75 PID: 4911 Comm: kworker/75:2 Tainted: G    W 
>>>>>   E     5.13.0-kfd #1
>>>>> [  +0.000559] Hardware name: INGRASYS         TURING  /MB  , BIOS 
>>>>> K71FQ28A 10/05/2021
>>>>> [  +0.000567] Workqueue: events delayed_fput
>>>>> [  +0.000563] RIP: 0010:kernfs_find_ns+0x1b/0x100
>>>>> [  +0.000569] Code: ff ff e8 88 59 9f 00 0f 1f 84 00 00 00 00 00 
>>>>> 0f 1f 44 00 00 41 57 8b 05 df f0 7b 01 41 56 41 55 49 89 f5 41 54 
>>>>> 55 48 89 fd 53 <44> 0f b7 b7 98 00 00 00 48 89 d3 4c 8b 67 70 66 
>>>>> 41 83 e6 20 41 0f
>>>>> [  +0.001193] RSP: 0018:ffffb9875db5fc98 EFLAGS: 00010246
>>>>> [  +0.000602] RAX: 0000000000000000 RBX: ffffa101f79507d8 RCX: 
>>>>> 0000000000000000
>>>>> [  +0.000612] RDX: 0000000000000000 RSI: ffffffffc09a9417 RDI: 
>>>>> 0000000000000000
>>>>> [  +0.000604] RBP: 0000000000000000 R08: 0000000000000001 R09: 
>>>>> 0000000000000000
>>>>> [  +0.000760] R10: ffffb9875db5fcd0 R11: 0000000000000000 R12: 
>>>>> 0000000000000000
>>>>> [  +0.000597] R13: ffffffffc09a9417 R14: ffffa08363fb2d18 R15: 
>>>>> 0000000000000000
>>>>> [  +0.000702] FS:  0000000000000000(0000) 
>>>>> GS:ffffa0ffbfcc0000(0000) knlGS:0000000000000000
>>>>> [  +0.000666] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [  +0.000658] CR2: 0000000000000098 CR3: 0000005747812005 CR4: 
>>>>> 0000000000770ee0
>>>>> [  +0.000715] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
>>>>> 0000000000000000
>>>>> [  +0.000655] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
>>>>> 0000000000000400
>>>>> [  +0.000592] PKRU: 55555554
>>>>> [  +0.000580] Call Trace:
>>>>> [  +0.000591]  kernfs_find_and_get_ns+0x2f/0x50
>>>>> [  +0.000584]  sysfs_remove_file_from_group+0x20/0x50
>>>>> [  +0.000580]  amdgpu_ras_sysfs_remove+0x3d/0xd0 [amdgpu]
>>>>> [  +0.000737]  amdgpu_ras_late_fini+0x1d/0x40 [amdgpu]
>>>>> [  +0.000750]  amdgpu_sdma_ras_fini+0x96/0xb0 [amdgpu]
>>>>> [  +0.000742]  ? gfx_v10_0_resume+0x10/0x10 [amdgpu]
>>>>> [  +0.000738]  sdma_v4_0_sw_fini+0x23/0x90 [amdgpu]
>>>>> [  +0.000717]  amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
>>>>> [  +0.000704]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
>>>>> [  +0.000687]  drm_dev_release+0x20/0x40 [drm]
>>>>> [  +0.000583]  drm_release+0xa8/0xf0 [drm]
>>>>> [  +0.000584]  __fput+0xa5/0x250
>>>>> [  +0.000621]  delayed_fput+0x1f/0x30
>>>>> [  +0.000726]  process_one_work+0x26e/0x580
>>>>> [  +0.000581]  ? process_one_work+0x580/0x580
>>>>> [  +0.000611]  worker_thread+0x4d/0x3d0
>>>>> [  +0.000611]  ? process_one_work+0x580/0x580
>>>>> [  +0.000605]  kthread+0x117/0x150
>>>>> [  +0.000611]  ? kthread_park+0x90/0x90
>>>>> [  +0.000619]  ret_from_fork+0x1f/0x30
>>>>> [  +0.000625] Modules linked in: amdgpu(E) xt_conntrack 
>>>>> xt_MASQUERADE nfnetlink xt_addrtype iptable_filter iptable_nat 
>>>>> nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter 
>>>>> x86_pkg_temp_thermal cdc_ether usbnet acpi_pad msr ip_tables 
>>>>> x_tables ast drm_vram_helper iommu_v2 drm_ttm_helper gpu_sched ttm 
>>>>> drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect 
>>>>> sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientati
>>>>> on_quirks [last unloaded: amdgpu]
>>>>
>>>>
>>>> This is a known regression: all sysfs components must be removed
>>>> before the pci_remove code runs, otherwise you get either warnings
>>>> for single-file removals or
>>>> OOPSes for sysfs group removals like here. Please try to move
>>>> amdgpu_ras_sysfs_remove from amdgpu_ras_late_fini to the end of
>>>> amdgpu_ras_pre_fini (which happens before pci remove).
>>>>
>>>>
>>>
>>> I fixed it in the newer patch, please see it below.
>>
>>
>>
>>> I first plug out the device, then kill the ROCm user process. Then it
>>> hits other OOPSes related to ttm_bo_cleanup_refs.
>>>
>>> [  +0.000006] BUG: kernel NULL pointer dereference, address: 
>>> 0000000000000010
>>> [  +0.000349] #PF: supervisor read access in kernel mode
>>> [  +0.000340] #PF: error_code(0x0000) - not-present page
>>> [  +0.000341] PGD 0 P4D 0
>>> [  +0.000336] Oops: 0000 [#1] SMP NOPTI
>>> [  +0.000345] CPU: 9 PID: 95 Comm: kworker/9:1 Tainted: G        W   
>>> E     5.13.0-kfd #1
>>> [  +0.000367] Hardware name: INGRASYS         TURING  /MB      , 
>>> BIOS K71FQ28A 10/05/2021
>>> [  +0.000376] Workqueue: events delayed_fput
>>> [  +0.000422] RIP: 0010:ttm_resource_free+0x24/0x40 [ttm]
>>> [  +0.000464] Code: 00 00 0f 1f 40 00 0f 1f 44 00 00 53 48 89 f3 48 
>>> 8b 36 48 85 f6 74 21 48 8b 87 28 02 00 00 48 63 56 10 48 8b bc d0 b8 
>>> 00 00 00 <48> 8b 47 10 ff 50 08 48 c7 03 00 00 00 00 5b c3 66 66 2e 
>>> 0f 1f 84
>>> [  +0.001009] RSP: 0018:ffffb21c59413c98 EFLAGS: 00010282
>>> [  +0.000515] RAX: ffff8b1aa4285f68 RBX: ffff8b1a823b5ea0 RCX: 
>>> 00000000002a000c
>>> [  +0.000536] RDX: 0000000000000000 RSI: ffff8b1acb84db80 RDI: 
>>> 0000000000000000
>>> [  +0.000539] RBP: 0000000000000001 R08: 0000000000000000 R09: 
>>> ffffffffc03c3e00
>>> [  +0.000543] R10: 0000000000000000 R11: 0000000000000000 R12: 
>>> ffff8b1a823b5ec8
>>> [  +0.000542] R13: 0000000000000000 R14: ffff8b1a823b5d90 R15: 
>>> ffff8b1a823b5ec8
>>> [  +0.000544] FS:  0000000000000000(0000) GS:ffff8b187f440000(0000) 
>>> knlGS:0000000000000000
>>> [  +0.000559] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  +0.000575] CR2: 0000000000000010 CR3: 00000076e6812004 CR4: 
>>> 0000000000770ee0
>>> [  +0.000575] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
>>> 0000000000000000
>>> [  +0.000579] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
>>> 0000000000000400
>>> [  +0.000575] PKRU: 55555554
>>> [  +0.000568] Call Trace:
>>> [  +0.000567]  ttm_bo_cleanup_refs+0xe4/0x290 [ttm]
>>> [  +0.000588]  ttm_bo_delayed_delete+0x147/0x250 [ttm]
>>> [  +0.000589]  ttm_device_fini+0xad/0x1b0 [ttm]
>>> [  +0.000590]  amdgpu_ttm_fini+0x2a7/0x310 [amdgpu]
>>> [  +0.000730]  gmc_v9_0_sw_fini+0x3a/0x40 [amdgpu]
>>> [  +0.000753]  amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
>>> [  +0.000734]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
>>> [  +0.000737]  drm_dev_release+0x20/0x40 [drm]
>>> [  +0.000626]  drm_release+0xa8/0xf0 [drm]
>>> [  +0.000625]  __fput+0xa5/0x250
>>> [  +0.000606]  delayed_fput+0x1f/0x30
>>> [  +0.000607]  process_one_work+0x26e/0x580
>>> [  +0.000608]  ? process_one_work+0x580/0x580
>>> [  +0.000616]  worker_thread+0x4d/0x3d0
>>> [  +0.000614]  ? process_one_work+0x580/0x580
>>> [  +0.000617]  kthread+0x117/0x150
>>> [  +0.000615]  ? kthread_park+0x90/0x90
>>> [  +0.000621]  ret_from_fork+0x1f/0x30
>>> [  +0.000603] Modules linked in: amdgpu(E) xt_conntrack 
>>> xt_MASQUERADE nfnetlink xt_addrtype iptable_filter iptable_nat 
>>> nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter 
>>> x86_pkg_temp_thermal cdc_ether usbnet acpi_pad msr ip_tables 
>>> x_tables ast drm_vram_helper drm_ttm_helper iommu_v2 ttm gpu_sched 
>>> drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect 
>>> sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks 
>>> [last unloaded: amdgpu]
>>> [  +0.002840] CR2: 0000000000000010
>>> [  +0.000755] ---[ end trace 9737737402551e39 ]--
>>
>>
>> This looks like another regression - try seeing where the NULL
>> reference is, and then we can see how to avoid this.
>>
>>
> These are the lines of code.
>
> (gdb) l *(ttm_bo_cleanup_refs+0xe4)
> 0x19c4 is in ttm_bo_cleanup_refs (drivers/gpu/drm/ttm/ttm_bo.c:360).
> 355             ttm_bo_move_to_pinned(bo);
> 356             list_del_init(&bo->ddestroy);
> 357             spin_unlock(&bo->bdev->lru_lock);
> 358 ttm_bo_cleanup_memtype_use(bo);
> 359
> 360             if (unlock_resv)
> 361 dma_resv_unlock(amdkcl_ttm_resvp(bo));
> 362
> 363             ttm_bo_put(bo);
> 364
> (gdb) l *(ttm_resource_free+0x24)
> 0x57f4 is in ttm_resource_free (drivers/gpu/drm/ttm/ttm_resource.c:65).
> 60
> 61              if (!*res)
> 62                      return;
> 63
> 64              man = ttm_manager_type(bo->bdev, (*res)->mem_type);
> 65  man->func->free(man, *res);
> 66              *res = NULL;
> 67      }
> 68      EXPORT_SYMBOL(ttm_resource_free);
> 69
>>>
>>>>>
>>>>> 3.  echo 1 > /sys/bus/pci/rescan. This would just hang. I assume 
>>>>> the sysfs is broken.
>>>>>
>>>>> Based on 1 & 2, it seems that 1 won’t let amdgpu exit
>>>>> gracefully, because 2 does some cleanup that maybe should have
>>>>> happened before 1.
>>>>>>
>>>>>> or you kill after plug back, does it make a difference).
>>>>>>
>>>>> *Scenario 2: Kill after plug back*
>>>>>
>>>>> If I perform the rescan before the kill, then the driver seems to
>>>>> probe fine. But the kill has the same issue, messing up the sysfs
>>>>> the same way as in Scenario 0.
>>>>>
>>>>>
>>>>> *Final Comments:*
>>>>>
>>>>> 0. cancel_delayed_work_sync(&p_info->restore_userptr_work)
>>>>> would make the repetition of the amdgpu_vm_bo_update failure go
>>>>> away, but it does not solve the issues in those scenarios.
>>>>
>>>>
>>>> Still - it's better to do it this way, even just for those failures
>>>> to go away.
>>>>
>>>>
>>> cancel_delayed_work is insufficient; you will need to make sure the
>>> work won’t be processed after plugout. Please see my patch.
>>
>>
>> Saw, see my comment.
>>
>>
>>>>
>>>>>
>>>>> 1. For planned hotplug, this patch should work as long as you
>>>>> follow some protocol, i.e. kill before plugout. Is this patch
>>>>> acceptable, since it provides more functionality than before?
>>>>
>>>>
>>>> Let's try to fix more as I advised above.
>>>>
>>>>
>>>>>
>>>>> 2. For unplanned hotplug while a ROCm app is running, the patch
>>>>> that kills all processes and waits for 5 sec works
>>>>> consistently. But it seems that it is an unacceptable solution for
>>>>> an official release, so I can hold it for our own internal usage. It
>>>>> seems that killing after removal causes problems, and I don’t
>>>>> know if there is a quick fix by me, because of my limited
>>>>> understanding of the amdgpu driver. Maybe AMD could have a quick
>>>>> fix, or it is really a difficult one. This feature may or may not
>>>>> be a blocking issue in our GPU disaggregation research down the
>>>>> road. Please let us know in either case, and we would like to
>>>>> learn and help as much as we can!
>>>>
>>>>
>>>> I am currently not sure why it helps. I will need to set up my own
>>>> ROCm environment and retest hot plug to check this in more depth,
>>>> but currently I have higher priorities. Please try to confirm that
>>>> an ASIC reset always takes place on plug back,
>>>> and fix the sysfs OOPS as I advised above, to clear up at least some
>>>> of the issues. Also please describe to me exactly what your steps
>>>> are to reproduce this scenario, so later I might be able to do it
>>>> myself.
>>>>
>>>>
>>> I can still try to help to fix the bug in my spare time. My setup is 
>>> as follows
>>>
>>>  1. I have a server with 4 AMD MI100 GPUs. I think 1 GPU would also
>>>     work.
>>>  2. I used the
>>>     https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/tree/roc-5.0.x
>>>     as the starting point, and apply Mukul’s patch and my patch.
>>>  3. Then I run a tensorflow benchmark from a docker.
>>>       * docker run -it --device=/dev/kfd --device=/dev/dri
>>>         --group-add video rocm/tensorflow:rocm4.5.2-tf1.15-dev
>>>       * And run the following benchmark in the docker:  python
>>>         benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
>>>         --num_gpus=4 --batch_size=32 --model=resnet50
>>>         --variable_update=parameter_server
>>>           o Might need to adjust the num_gpus parameter based on
>>>             your setup
>>>  4. Remove a GPU at random time.
>>>  5. Do whatever is needed to before plugback and reverify the
>>>     benchmark can still run.
>>>
>>>> Also, we have a hotplug test suite in libdrm (graphics stack), so
>>>> maybe you can install libdrm and run that test suite to see if it
>>>> exposes more issues.
>>>>
>>> OK I could try it some time.
>>>
>
> I tried suite 13, the hotplug-out test, but it says it got killed.
> There are some oopses in dmesg during ttm_pool_free_page.
>
> Userspace log:
>
> $ sudo ./tests/amdgpu/amdgpu_test -f -s 13
>
>
> The ASIC NOT support UVD, suite disabled
>
>
> The ASIC NOT support VCE, suite disabled
>
>
> The ASIC NOT support UVD ENC, suite disabled.
>
>
> Don't support TMZ (trust memory zone), security suite disabled
>
>
>      CUnit - A unit testing framework for C - Version 2.1-3
http://cunit.sourceforge.net/
>
>
> Suite: Hotunplug Tests
>   Test: Unplug card and rescan the bus to plug it back …Killed
>
> Dmesg log:
> [  +0.000479] BUG: unable to handle page fault for address: 
> ffffc01343fc81b4
> [  +0.000054] #PF: supervisor write access in kernel mode
> [  +0.000033] #PF: error_code(0x0002) - not-present page
> [  +0.000032] PGD 807ffc1067 P4D 807ffc1067 PUD 807ffc0067 PMD 0
> [  +0.000038] Oops: 0002 [#1] SMP NOPTI
> [  +0.000025] CPU: 92 PID: 7534 Comm: amdgpu_test Tainted: G        W 
>   E     5.13.0-kfd #1
> [  +0.000048] Hardware name: INGRASYS         TURING  /MB      , BIOS 
> K71FQ28A 10/05/2021
> [  +0.000045] RIP: 0010:__free_pages+0xc/0x80
> [  +0.000031] Code: 01 00 74 0f 0f b6 77 51 85 f6 74 07 31 d2 e9 3b dc 
> ff ff e9 66 ff ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 41 54 55 48 89 
> fd 53 <f0> ff 4f 34 74 46 48 8b 07 a9 00 00 01 00 75 54 44 8d 66 ff 85 f6
> [  +0.000103] RSP: 0018:ffff96f71ba6fd60 EFLAGS: 00010246
> [  +0.000032] RAX: 00000000ffffffff RBX: ffff89f1ccf86078 RCX: 
> 0000000003fc8180
> [  +0.000041] RDX: ffff89f1b4746000 RSI: 0000000000000000 RDI: 
> ffffc01343fc8180
> [  +0.000042] RBP: ffffc01343fc8180 R08: 0000000000000000 R09: 
> 0000000000000246
> [  +0.000040] R10: 00000080b4746fff R11: 0000000000000003 R12: 
> 0000000000000000
> [  +0.000041] R13: ffff89f1ccf85f80 R14: ffff89f1ccf86ef8 R15: 
> ffff8972293b0000
> [  +0.000042] FS:  00007fcfb843a300(0000) GS:ffff89ef80100000(0000) 
> knlGS:0000000000000000
> [  +0.000046] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  +0.000033] CR2: ffffc01343fc81b4 CR3: 0000000178154006 CR4: 
> 0000000000770ee0
> [  +0.000041] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> 0000000000000000
> [  +0.000041] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
> 0000000000000400
> [  +0.000041] PKRU: 55555554
> [  +0.000017] Call Trace:
> [  +0.000018]  ttm_pool_free_page+0x69/0x90 [ttm]
> [  +0.000038]  ttm_pool_type_fini+0x58/0x70 [ttm]
> [  +0.000034]  ttm_pool_fini+0x30/0x50 [ttm]
> [  +0.000031]  ttm_device_fini+0xf3/0x1b0 [ttm]
> [  +0.000032]  amdgpu_ttm_fini+0x2a7/0x310 [amdgpu]
> [  +0.000265]  gmc_v9_0_sw_fini+0x3a/0x40 [amdgpu]
> [  +0.000246]  amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
> [  +0.000219]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
> [  +0.000219]  drm_dev_release+0x20/0x40 [drm]
> [  +0.000059]  drm_release+0xa8/0xf0 [drm]
> [  +0.000053]  __fput+0xa5/0x250
> [  +0.000023]  task_work_run+0x5c/0xa0
> [  +0.000026]  exit_to_user_mode_prepare+0x1db/0x1e0
> [  +0.000033]  syscall_exit_to_user_mode+0x19/0x50
> [  +0.000030]  do_syscall_64+0x47/0x70
> [  +0.000018]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  +0.000025] RIP: 0033:0x7fcfb86403d7
> [  +0.000869] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 
> 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 
> 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 f3 fb ff ff
> [  +0.001788] RSP: 002b:00007ffc8fc26c28 EFLAGS: 00000246 ORIG_RAX: 
> 0000000000000003
> [  +0.000888] RAX: 0000000000000000 RBX: 000055d67a05b6a0 RCX: 
> 00007fcfb86403d7
> [  +0.000867] RDX: 00007fcfb8627be0 RSI: 0000000000000000 RDI: 
> 0000000000000003
> [  +0.000846] RBP: 000055d67a05b8a0 R08: 0000000000000007 R09: 
> 0000000000000000
> [  +0.000816] R10: 0000000000000000 R11: 0000000000000246 R12: 
> 0000000000000000
> [  +0.000791] R13: 0000000000000000 R14: 0000000000000000 R15: 
> 00007fcfb8659143
> [  +0.000770] Modules linked in: amdgpu(E) xt_conntrack xt_MASQUERADE 
> nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack 
> nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter x86_pkg_temp_thermal 
> cdc_ether usbnet acpi_pad msr ip_tables x_tables ast drm_vram_helper 
> drm_ttm_helper ttm iommu_v2 gpu_sched drm_kms_helper cfbfillrect 
> syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea 
> drm drm_panel_orientation_quirks [last unloaded: amdgpu]
> [  +0.003303] CR2: ffffc01343fc81b4
> [  +0.000799] ---[ end trace 2360927435b19009 ]—
>
>
>
>>> The following is the new diff.
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index 182b7eae598a..48c3cd4054de 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -1327,7 +1327,7 @@ int emu_soc_asic_init(struct amdgpu_device *adev);
>>>   * ASICs macro.
>>>   */
>>>  #define amdgpu_asic_set_vga_state(adev, state) 
>>> (adev)->asic_funcs->set_vga_state((adev), (state))
>>> -#define amdgpu_asic_reset(adev) (adev)->asic_funcs->reset((adev))
>>> +#define amdgpu_asic_reset(adev) ({int r; pr_info("performing 
>>> amdgpu_asic_reset\n"); r = (adev)->asic_funcs->reset((adev));r;})
>>>  #define amdgpu_asic_reset_method(adev) 
>>> (adev)->asic_funcs->reset_method((adev))
>>>  #define amdgpu_asic_get_xclk(adev) (adev)->asic_funcs->get_xclk((adev))
>>>  #define amdgpu_asic_set_uvd_clocks(adev, v, d) 
>>> (adev)->asic_funcs->set_uvd_clocks((adev), (v), (d))
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> index 27c74fcec455..842abd7150a6 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> @@ -134,6 +134,7 @@ struct amdkfd_process_info {
>>> /* MMU-notifier related fields */
>>> atomic_t evicted_bos;
>>> +atomic_t invalid;
>>> struct delayed_work restore_userptr_work;
>>> struct pid *pid;
>>>  };
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> index 99d2b15bcbf3..2a588eb9f456 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> @@ -1325,6 +1325,7 @@ static int init_kfd_vm(struct amdgpu_vm *vm, 
>>> void **process_info,
>>> info->pid = get_task_pid(current->group_leader, PIDTYPE_PID);
>>> atomic_set(&info->evicted_bos, 0);
>>> +atomic_set(&info->invalid, 0);
>>> INIT_DELAYED_WORK(&info->restore_userptr_work,
>>>  amdgpu_amdkfd_restore_userptr_worker);
>>> @@ -2693,6 +2694,9 @@ static void 
>>> amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
>>> struct mm_struct *mm;
>>> int evicted_bos;
>>> +if (atomic_read(&process_info->invalid))
>>> +return;
>>> +
>>
>>
>> Probably better  to again use drm_dev_enter/exit guard pair instead 
>> of this flag.
>>
>>
>
> I don’t know if I could use drm_dev_enter/exit efficiently, because a
> process can have multiple drm_devs open, and I don’t know how I can
> efficiently recover/reference the drm_dev(s) in the worker function in
> order to use drm_dev_enter/exit.


I think that within the KFD code each kfd device belongs to or points to 
one specific drm_device, so I don't think this is a problem.


>>
>>> evicted_bos = atomic_read(&process_info->evicted_bos);
>>> if (!evicted_bos)
>>> return;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index ec38517ab33f..e7d85d8d282d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -1054,6 +1054,7 @@ void 
>>> amdgpu_device_program_register_sequence(struct amdgpu_device *adev,
>>>   */
>>>  void amdgpu_device_pci_config_reset(struct amdgpu_device *adev)
>>>  {
>>> +pr_debug("%s called\n",__func__);
>>> pci_write_config_dword(adev->pdev, 0x7c, AMDGPU_ASIC_RESET_DATA);
>>>  }
>>> @@ -1066,6 +1067,7 @@ void amdgpu_device_pci_config_reset(struct 
>>> amdgpu_device *adev)
>>>   */
>>>  int amdgpu_device_pci_reset(struct amdgpu_device *adev)
>>>  {
>>> +pr_debug("%s called\n",__func__);
>>> return pci_reset_function(adev->pdev);
>>>  }
>>> @@ -4702,6 +4704,8 @@ int amdgpu_do_asic_reset(struct list_head 
>>> *device_list_handle,
>>> bool need_full_reset, skip_hw_reset, vram_lost = false;
>>> int r = 0;
>>> +pr_debug("%s called\n",__func__);
>>> +
>>> /* Try reset handler method first */
>>> tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
>>>  reset_list);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> index 49bdf9ff7350..b469acb65c1e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> @@ -2518,7 +2518,6 @@ void amdgpu_ras_late_fini(struct amdgpu_device 
>>> *adev,
>>> if (!ras_block || !ih_info)
>>> return;
>>> -amdgpu_ras_sysfs_remove(adev, ras_block);
>>> if (ih_info->cb)
>>> amdgpu_ras_interrupt_remove_handler(adev, ih_info);
>>>  }
>>> @@ -2577,6 +2576,7 @@ void amdgpu_ras_suspend(struct amdgpu_device 
>>> *adev)
>>>  int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
>>>  {
>>> struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>> +struct ras_manager *obj, *tmp;
>>> if (!adev->ras_enabled || !con)
>>> return 0;
>>> @@ -2585,6 +2585,13 @@ int amdgpu_ras_pre_fini(struct amdgpu_device 
>>> *adev)
>>> /* Need disable ras on all IPs here before ip [hw/sw]fini */
>>> amdgpu_ras_disable_all_features(adev, 0);
>>> amdgpu_ras_recovery_fini(adev);
>>> +
>>> +/* remove sysfs before pci_remove to avoid OOPSES from 
>>> sysfs_remove_groups */
>>> +list_for_each_entry_safe(obj, tmp, &con->head, node) {
>>> +amdgpu_ras_sysfs_remove(adev, &obj->head);
>>> +put_obj(obj);
>>> +}
>>> +
>>> return 0;
>>>  }
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index 4e7d9cb09a69..0fa806a78e39 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -693,16 +693,35 @@ bool kfd_is_locked(void)
>>>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
>>>  {
>>> +struct kfd_process *p;
>>> +struct amdkfd_process_info *p_info;
>>> +unsigned int temp;
>>> +
>>> if (!kfd->init_complete)
>>> return;
>>> /* for runtime suspend, skip locking kfd */
>>> -if (!run_pm) {
>>> +if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>> /* For first KFD device suspend all the KFD processes */
>>> if (atomic_inc_return(&kfd_locked) == 1)
>>> kfd_suspend_all_processes(force);
>>> }
>>> +if (drm_dev_is_unplugged(kfd->ddev)){
>>> +int idx = srcu_read_lock(&kfd_processes_srcu);
>>> +pr_debug("cancel restore_userptr_work\n");
>>> +hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>>> +if ( kfd_process_gpuidx_from_gpuid(p, kfd->id) >= 0 ) {
>>> +p_info = p->kgd_process_info;
>>> +pr_debug("cancel processes, pid = %d for gpu_id = %d", 
>>> pid_nr(p_info->pid), kfd->id);
>>> +cancel_delayed_work_sync(&p_info->restore_userptr_work);
>>> +/* block all future restore_userptr_work */
>>> +atomic_inc(&p_info->invalid);
>>
>>
>> Same as I mentioned above with drm_dev_enter/exit.
>>
> Same as I mentioned: the process can have many drm_devs open.
>
> Final comments:
>
> I suspect that my linux kernel version might not have all the
> fixes you did for hotplug. Can you give me a pointer to the lowest
> version of the linux kernel (5.14 from the linux kernel repo?
> amd-drm-staging-next does not work for MI100 out of the box) that would
> pass all libdrm tests including the hotplug tests (some tests hang, some
> fail now)?


That's a problem. The latest working baseline I tested and confirmed
passing the hotplug tests is this branch and commit
https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6
which is amd-staging-drm-next. 5.14 was the branch we upstreamed the
hotplug code on, but it accumulated a lot of regressions over time due to
new changes (that's why I added the hotplug test, to try and catch them
early). It would be best to run this branch on MI100 so we have a clean
baseline, and only after confirming that this particular branch at this
commit passes the libdrm tests should you start adding the KFD-specific
addons. Another option, if you can't work with MI100 and this branch, is
to try a different ASIC that does work with this branch (if possible).

Andrey


> p.s. I cloned and built libdrm from source
> (https://gitlab.freedesktop.org/mesa/drm).
>
> Thank you so much!
>
>> Andrey
>>
>>
>>> +}
>>> +}
>>> +srcu_read_unlock(&kfd_processes_srcu, idx);
>>> +}
>>> +
>>> kfd->dqm->ops.stop(kfd->dqm);
>>> kfd_iommu_suspend(kfd);
>>>  }
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>>> index 600ba2a728ea..7e3d1848eccc 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>>> @@ -669,11 +669,12 @@ static void kfd_remove_sysfs_node_entry(struct 
>>> kfd_topology_device *dev)
>>>  #ifdef HAVE_AMD_IOMMU_PC_SUPPORTED
>>> if (dev->kobj_perf) {
>>> list_for_each_entry(perf, &dev->perf_props, list) {
>>> +sysfs_remove_group(dev->kobj_perf, perf->attr_group);
>>> kfree(perf->attr_group);
>>> perf->attr_group = NULL;
>>> }
>>> kobject_del(dev->kobj_perf);
>>> -kobject_put(dev->kobj_perf);
>>> +/* kobject_put(dev->kobj_perf); */
>>> dev->kobj_perf = NULL;
>>> }
>>>  #endif
>>>
>>> Thank you so much! Looking forward to your comments!
>>>
>>> Regards,
>>> Shuotao
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>> Thank you so much!
>>>>>
>>>>> Best regards,
>>>>> Shuotao
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>> index 8fa9b86ac9d2..c0b27f722281 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>> @@ -188,6 +188,12 @@ void amdgpu_amdkfd_interrupt(struct 
>>>>>>> amdgpu_device *adev,
>>>>>>> kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
>>>>>>>  }
>>>>>>> +void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev)
>>>>>>> +{
>>>>>>> +if (adev->kfd.dev)
>>>>>>> +kgd2kfd_kill_all_user_processes(adev->kfd.dev);
>>>>>>> +}
>>>>>>> +
>>>>>>>  void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
>>>>>>>  {
>>>>>>> if (adev->kfd.dev)
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>> index 27c74fcec455..f4e485d60442 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>> @@ -141,6 +141,7 @@ struct amdkfd_process_info {
>>>>>>>  int amdgpu_amdkfd_init(void);
>>>>>>>  void amdgpu_amdkfd_fini(void);
>>>>>>> +void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev);
>>>>>>>  void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool 
>>>>>>> run_pm);
>>>>>>>  int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
>>>>>>>  int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool 
>>>>>>> run_pm, bool sync);
>>>>>>> @@ -405,6 +406,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>>>>>> const struct kgd2kfd_shared_resources *gpu_resources);
>>>>>>>  void kgd2kfd_device_exit(struct kfd_dev *kfd);
>>>>>>>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force);
>>>>>>> +void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd);
>>>>>>>  int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
>>>>>>>  int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm, bool sync);
>>>>>>>  int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>>>>>> @@ -443,6 +445,9 @@ static inline void kgd2kfd_suspend(struct 
>>>>>>> kfd_dev *kfd, bool run_pm, bool force)
>>>>>>>  {
>>>>>>>  }
>>>>>>> +void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd){
>>>>>>> +}
>>>>>>> +
>>>>>>>  static int __maybe_unused kgd2kfd_resume_iommu(struct kfd_dev *kfd)
>>>>>>>  {
>>>>>>> return 0;
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>> index 3d5fc0751829..af6fe5080cfa 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>> @@ -2101,6 +2101,9 @@ amdgpu_pci_remove(struct pci_dev *pdev)
>>>>>>>  {
>>>>>>> struct drm_device *dev = pci_get_drvdata(pdev);
>>>>>>> +/* kill all kfd processes before drm_dev_unplug */
>>>>>>> +amdgpu_amdkfd_kill_all_processes(drm_to_adev(dev));
>>>>>>> +
>>>>>>>  #ifdef HAVE_DRM_DEV_UNPLUG
>>>>>>> drm_dev_unplug(dev);
>>>>>>>  #else
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> index 5504a18b5a45..480c23bef5e2 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> @@ -691,6 +691,12 @@ bool kfd_is_locked(void)
>>>>>>> return  (atomic_read(&kfd_locked) > 0);
>>>>>>>  }
>>>>>>> +inline void kgd2kfd_kill_all_user_processes(struct kfd_dev* dev)
>>>>>>> +{
>>>>>>> +kfd_kill_all_user_processes();
>>>>>>> +}
>>>>>>> +
>>>>>>> +
>>>>>>>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
>>>>>>>  {
>>>>>>> if (!kfd->init_complete)
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>>>> index 55c9e1922714..a35a2cb5bb9f 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>>>> @@ -1064,6 +1064,7 @@ static inline struct kfd_process_device 
>>>>>>> *kfd_process_device_from_gpuidx(
>>>>>>>  void kfd_unref_process(struct kfd_process *p);
>>>>>>>  int kfd_process_evict_queues(struct kfd_process *p, bool force);
>>>>>>>  int kfd_process_restore_queues(struct kfd_process *p);
>>>>>>> +void kfd_kill_all_user_processes(void);
>>>>>>>  void kfd_suspend_all_processes(bool force);
>>>>>>>  /*
>>>>>>>   * kfd_resume_all_processes:
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>>>> index 6cdc855abb6d..17e769e6951d 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>>>> @@ -46,6 +46,9 @@ struct mm_struct;
>>>>>>>  #include "kfd_trace.h"
>>>>>>>  #include "kfd_debug.h"
>>>>>>> +static atomic_t kfd_process_locked = ATOMIC_INIT(0);
>>>>>>> +static atomic_t kfd_inflight_kills = ATOMIC_INIT(0);
>>>>>>> +
>>>>>>>  /*
>>>>>>>   * List of struct kfd_process (field kfd_process).
>>>>>>>   * Unique/indexed by mm_struct*
>>>>>>> @@ -802,6 +805,9 @@ struct kfd_process 
>>>>>>> *kfd_create_process(struct task_struct *thread)
>>>>>>> struct kfd_process *process;
>>>>>>> int ret;
>>>>>>> +if ( atomic_read(&kfd_process_locked) > 0 )
>>>>>>> +return ERR_PTR(-EINVAL);
>>>>>>> +
>>>>>>> if (!(thread->mm && mmget_not_zero(thread->mm)))
>>>>>>> return ERR_PTR(-EINVAL);
>>>>>>> @@ -1126,6 +1132,10 @@ static void kfd_process_wq_release(struct 
>>>>>>> work_struct *work)
>>>>>>> put_task_struct(p->lead_thread);
>>>>>>> kfree(p);
>>>>>>> +
>>>>>>> +if ( atomic_read(&kfd_process_locked) > 0 ){
>>>>>>> +atomic_dec(&kfd_inflight_kills);
>>>>>>> +}
>>>>>>>  }
>>>>>>>  static void kfd_process_ref_release(struct kref *ref)
>>>>>>> @@ -2186,6 +2196,35 @@ static void restore_process_worker(struct 
>>>>>>> work_struct *work)
>>>>>>> pr_err("Failed to restore queues of pasid 0x%x\n", p->pasid);
>>>>>>>  }
>>>>>>> +void kfd_kill_all_user_processes(void)
>>>>>>> +{
>>>>>>> +struct kfd_process *p;
>>>>>>> +/* struct amdkfd_process_info *p_info; */
>>>>>>> +unsigned int temp;
>>>>>>> +int idx;
>>>>>>> +atomic_inc(&kfd_process_locked);
>>>>>>> +
>>>>>>> +idx = srcu_read_lock(&kfd_processes_srcu);
>>>>>>> +pr_info("Killing all processes\n");
>>>>>>> +hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>>>>>>> +dev_warn(kfd_device,
>>>>>>> +"Sending SIGBUS to process %d (pasid 0x%x)",
>>>>>>> +p->lead_thread->pid, p->pasid);
>>>>>>> +send_sig(SIGBUS, p->lead_thread, 0);
>>>>>>> +atomic_inc(&kfd_inflight_kills);
>>>>>>> +}
>>>>>>> +srcu_read_unlock(&kfd_processes_srcu, idx);
>>>>>>> +
>>>>>>> +while ( atomic_read(&kfd_inflight_kills) > 0 ){
>>>>>>> +dev_warn(kfd_device,
>>>>>>> +"kfd_processes_table is not empty, going to sleep for 10ms\n");
>>>>>>> +msleep(10);
>>>>>>> +}
>>>>>>> +
>>>>>>> +atomic_dec(&kfd_process_locked);
>>>>>>> +pr_info("all processes have been fully released\n");
>>>>>>> +}
>>>>>>> +
>>>>>>>  void kfd_suspend_all_processes(bool force)
>>>>>>>  {
>>>>>>> struct kfd_process *p;
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Shuotao
>>>>>>>
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>> +       }
>>>>>>>>> + srcu_read_unlock(&kfd_processes_srcu, idx);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +
>>>>>>>>>  int kfd_resume_all_processes(bool sync)
>>>>>>>>>  {
>>>>>>>>> struct kfd_process *p;
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Really appreciate your help!
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Shuotao
>>>>>>>>>>>
>>>>>>>>>>>>> 2. Remove redudant p2p/io links in sysfs when device is 
>>>>>>>>>>>>> hotplugged
>>>>>>>>>>>>> out.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3. New kfd node_id is not properly assigned after a new 
>>>>>>>>>>>>> device is
>>>>>>>>>>>>> added after a gpu is hotplugged out in a system. libhsakmt 
>>>>>>>>>>>>> will
>>>>>>>>>>>>> find this anomaly, (i.e. node_from != <dev node id> in 
>>>>>>>>>>>>> iolinks),
>>>>>>>>>>>>> when taking a topology_snapshot, thus returns fault to the 
>>>>>>>>>>>>> rocm
>>>>>>>>>>>>> stack.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- This patch fixes issue 1; another patch by Mukul fixes 
>>>>>>>>>>>>> issues 2&3.
>>>>>>>>>>>>> -- Tested on a 4-GPU MI100 node with kernel
>>>>>>>>>>>>> 5.13.0-kfd; kernel
>>>>>>>>>>>>> 5.16.0-kfd is unstable out of the box for MI100.
>>>>>>>>>>>>> ---
>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 +++++
>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 7 +++++++
>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>>>>>>>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 13 +++++++++++++
>>>>>>>>>>>>> 4 files changed, 26 insertions(+)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>>>>>>>> index c18c4be1e4ac..d50011bdb5c4 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>>>>>>>> @@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct 
>>>>>>>>>>>>> amdgpu_device *adev, bool run_pm)
>>>>>>>>>>>>> return r;
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> +int amdgpu_amdkfd_resume_processes(void)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> + return kgd2kfd_resume_processes();
>>>>>>>>>>>>> +}
>>>>>>>>>>>>> +
>>>>>>>>>>>>> int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
>>>>>>>>>>>>> {
>>>>>>>>>>>>> int r = 0;
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>>>>>>>> index f8b9f27adcf5..803306e011c3 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>>>>>>>> @@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
>>>>>>>>>>>>> void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, 
>>>>>>>>>>>>> bool run_pm);
>>>>>>>>>>>>> int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
>>>>>>>>>>>>> int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool 
>>>>>>>>>>>>> run_pm);
>>>>>>>>>>>>> +int amdgpu_amdkfd_resume_processes(void);
>>>>>>>>>>>>> void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>>>>>>>>>>>> const void *ih_ring_entry);
>>>>>>>>>>>>> void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>>>>>>>>>>>>> @@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct 
>>>>>>>>>>>>> kfd_dev *kfd);
>>>>>>>>>>>>> void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>>>>>>>>>>>>> int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
>>>>>>>>>>>>> int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>>>>>>>>>>>>> +int kgd2kfd_resume_processes(void);
>>>>>>>>>>>>> int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>>>>>>>>>>>> int kgd2kfd_post_reset(struct kfd_dev *kfd);
>>>>>>>>>>>>> void kgd2kfd_interrupt(struct kfd_dev *kfd, const void 
>>>>>>>>>>>>> *ih_ring_entry);
>>>>>>>>>>>>> @@ -393,6 +395,11 @@ static inline int 
>>>>>>>>>>>>> kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>>>>>>>>>>> return 0;
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> +static inline int kgd2kfd_resume_processes(void)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> + return 0;
>>>>>>>>>>>>> +}
>>>>>>>>>>>>> +
>>>>>>>>>>>>> static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>>>>>>>>>>>> {
>>>>>>>>>>>>> return 0;
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>> index fa4a9f13c922..5827b65b7489 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>> @@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct 
>>>>>>>>>>>>> amdgpu_device *adev)
>>>>>>>>>>>>> if (drm_dev_is_unplugged(adev_to_drm(adev)))
>>>>>>>>>>>>> amdgpu_device_unmap_mmio(adev);
>>>>>>>>>>>>>
>>>>>>>>>>>>> + amdgpu_amdkfd_resume_processes();
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>>>>> index 62aa6c9d5123..ef05aae9255e 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>>>>> @@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev 
>>>>>>>>>>>>> *kfd, bool run_pm)
>>>>>>>>>>>>> return ret;
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> +/* for non-runtime resume only */
>>>>>>>>>>>>> +int kgd2kfd_resume_processes(void)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> + int count;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> + count = atomic_dec_return(&kfd_locked);
>>>>>>>>>>>>> + WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>>>>>>>>>>>> + if (count == 0)
>>>>>>>>>>>>> + return kfd_resume_all_processes();
>>>>>>>>>>>>> +
>>>>>>>>>>>>> + return 0;
>>>>>>>>>>>>> +}
>>>>>>>>>>>>
>>>>>>>>>>>> It doesn't make sense to me to just increment kfd_locked in
>>>>>>>>>>>> kgd2kfd_suspend to only decrement it again a few functions 
>>>>>>>>>>>> down the
>>>>>>>>>>>> road.
>>>>>>>>>>>>
>>>>>>>>>>>> I suggest this instead - you only increment if not during
>>>>>>>>>>>> PCI remove
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>>>> index 1c2cf3a33c1f..7754f77248a4 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>>>>>> @@ -952,11 +952,12 @@ bool kfd_is_locked(void)
>>>>>>>>>>>>
>>>>>>>>>>>> void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>>>>>>>>>> {
>>>>>>>>>>>> +
>>>>>>>>>>>> if (!kfd->init_complete)
>>>>>>>>>>>> return;
>>>>>>>>>>>>
>>>>>>>>>>>> /* for runtime suspend, skip locking kfd */
>>>>>>>>>>>> - if (!run_pm) {
>>>>>>>>>>>> + if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>>>>>>>>>> /* For first KFD device suspend all the KFD processes */
>>>>>>>>>>>> if (atomic_inc_return(&kfd_locked) == 1)
>>>>>>>>>>>> kfd_suspend_all_processes();
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Andrey
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> +
>>>>>>>>>>>>> int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
>>>>>>>>>>>>> {
>>>>>>>>>>>>> int err = 0;
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>

[-- Attachment #2: Type: text/html, Size: 361676 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-19 16:01                           ` Andrey Grodzovsky
@ 2022-04-19 16:18                             ` Felix Kuehling
  2022-04-20  9:24                             ` Shuotao Xu
  1 sibling, 0 replies; 31+ messages in thread
From: Felix Kuehling @ 2022-04-19 16:18 UTC (permalink / raw)
  To: Andrey Grodzovsky, Shuotao Xu
  Cc: Mukul.Joshi, Peng Cheng, amd-gfx, Lei Qu, Ran Shu, Ziyue Yang

On 2022-04-19 12:01, Andrey Grodzovsky wrote:
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>> @@ -134,6 +134,7 @@ struct amdkfd_process_info {
>>>> /* MMU-notifier related fields */
>>>> atomic_t evicted_bos;
>>>> +atomic_t invalid;
>>>> struct delayed_work restore_userptr_work;
>>>> struct pid *pid;
>>>>  };
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> index 99d2b15bcbf3..2a588eb9f456 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> @@ -1325,6 +1325,7 @@ static int init_kfd_vm(struct amdgpu_vm *vm, 
>>>> void **process_info,
>>>> info->pid = get_task_pid(current->group_leader, PIDTYPE_PID);
>>>> atomic_set(&info->evicted_bos, 0);
>>>> +atomic_set(&info->invalid, 0);
>>>> INIT_DELAYED_WORK(&info->restore_userptr_work,
>>>>  amdgpu_amdkfd_restore_userptr_worker);
>>>> @@ -2693,6 +2694,9 @@ static void 
>>>> amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
>>>> struct mm_struct *mm;
>>>> int evicted_bos;
>>>> +if (atomic_read(&process_info->invalid))
>>>> +return;
>>>> +
>>>
>>>
>>> Probably better  to again use drm_dev_enter/exit guard pair instead 
>>> of this flag.
>>>
>>>
>>
>> I don’t know if I could use drm_dev_enter/exit efficiently because a 
>> process can have multiple drm_dev open. And I don’t know how I can 
>> recover/refer drm_dev(s) efficiently in the worker function in order 
>> to use drm_dev_enter/exit.
>
>
> I think that within the KFD code each kfd device belongs or points to 
> one specific drm_device so I don't think this is a problem.
>
Sorry, I haven't been following this discussion in all its details. But 
I don't see why you need to check a flag in the worker. If the GPU is 
unplugged you already cancel any pending work. How is new work getting 
scheduled after the GPU is unplugged? Is it due to pending interrupts or 
something? Can you instead invalidate process_info->restore_userptr_work 
to prevent it from being scheduled again? Or add some check where it's 
scheduling the work, instead of in the worker.
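[Editor's note: a hedged sketch of Felix's suggestion — gating the place that (re)arms the delayed work rather than checking a flag inside the worker. `queue_restore_work()` is an illustrative wrapper, not a real amdgpu function, and the delay constant is assumed:]

```c
/* Never (re)schedule restore work against a device that is gone;
 * the check lives at the scheduling site, so the worker needs no flag. */
static void queue_restore_work(struct amdkfd_process_info *process_info,
			       struct drm_device *ddev)
{
	if (drm_dev_is_unplugged(ddev))
		return;

	schedule_delayed_work(&process_info->restore_userptr_work,
			      msecs_to_jiffies(AMDGPU_USERPTR_RESTORE_DELAY_MS));
}
```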

Regards,
   Felix



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-19 16:01                           ` Andrey Grodzovsky
  2022-04-19 16:18                             ` Felix Kuehling
@ 2022-04-20  9:24                             ` Shuotao Xu
  2022-04-20 15:44                               ` Andrey Grodzovsky
  1 sibling, 1 reply; 31+ messages in thread
From: Shuotao Xu @ 2022-04-20  9:24 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 271753 bytes --]



On Apr 20, 2022, at 12:01 AM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:



On 2022-04-19 03:41, Shuotao Xu wrote:


On Apr 18, 2022, at 11:23 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:



On 2022-04-18 09:22, Shuotao Xu wrote:


On Apr 16, 2022, at 12:43 AM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:



On 2022-04-15 06:12, Shuotao Xu wrote:
Hi Andrey,

First I really appreciate the discussion! It helped me understand the driver code greatly. Thank you so much:)
Please see my inline comments.

On Apr 14, 2022, at 11:13 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:



On 2022-04-14 10:00, Shuotao Xu wrote:


On Apr 14, 2022, at 1:31 AM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:



On 2022-04-13 12:03, Shuotao Xu wrote:


On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:


On 2022-04-08 21:28, Shuotao Xu wrote:

On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:


On 2022-04-08 04:45, Shuotao Xu wrote:
Adding PCIe Hotplug Support for AMDKFD: support for hot-plugging GPU
devices can open doors for many advanced applications in data centers
in the next few years, such as GPU resource
disaggregation. Current AMDKFD does not support hotplug-out due to the
following reasons:

1. During PCIe removal, decrement KFD lock which was incremented at
the beginning of hw fini; otherwise kfd_open later is going to
fail.
I assumed you read my comment last time; still, you take the same approach.
More details below.
Aha, I like your fix:) I was not familiar with the drm APIs so I only half understood your comment last time.

BTW, I tried hot-plugging out a GPU when rocm application is still running.
From dmesg, application is still trying to access the removed kfd device, and are met with some errors.


The application is supposed to keep running; it holds the drm_device
reference as long as it has an open
FD to the device, and final cleanup will come only after the app dies,
thus releasing the FD and the last
drm_device reference.
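[Editor's note: the lifetime rule described here — the device object stays alive while any open FD holds a reference, and final cleanup runs on the last put — can be modeled with a minimal kref-style sketch in userspace C11. All names are illustrative, not the real drm_device layout:]

```c
#include <stdatomic.h>

/* Minimal model of reference-counted device lifetime: the creator (the
 * driver) holds one reference, each open FD takes another, and the
 * release callback (final cleanup) fires only when the last reference
 * is dropped -- e.g. when the app dies and its FD is closed. */
struct dev_object {
	atomic_int refcount;
	void (*release)(struct dev_object *obj);
};

static void dev_init(struct dev_object *obj,
		     void (*release)(struct dev_object *obj))
{
	atomic_init(&obj->refcount, 1);	/* creator holds the first ref */
	obj->release = release;
}

static void dev_get(struct dev_object *obj)	/* e.g. on open(2) */
{
	atomic_fetch_add(&obj->refcount, 1);
}

static void dev_put(struct dev_object *obj)	/* e.g. on close(2) / unplug */
{
	/* fetch_sub returns the previous value; 1 means this was last */
	if (atomic_fetch_sub(&obj->refcount, 1) == 1)
		obj->release(obj);
}
```

This is why unplugging while an app holds the device must not free the object immediately: the driver's put on unplug only drops one of the outstanding references.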

The application would hang and not exit in this case.


Actually I tried kill -7 $pid, and the process exits. The dmesg has some warnings though.

[  711.769977] WARNING: CPU: 23 PID: 344 at .../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336 amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.770528] Modules linked in: amdgpu(OE) amdttm(OE) amd_sched(OE) amdkcl(OE) iommu_v2 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay binfmt_misc intel_rapl_msr i40iw intel_rapl_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel rpcrdma kvm sunrpc ipmi_ssif ib_umad ib_ipoib rdma_ucm irqbypass rapl joydev acpi_ipmi input_leds intel_cstate ipmi_si ipmi_devintf mei_me mei intel_pch_thermal ipmi_msghandler ioatdma mac_hid lpc_ich dca acpi_power_meter acpi_pad sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi pci_stub ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
[  711.779359]  raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core ast drm_vram_helper i2c_algo_bit drm_ttm_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sysfillrect uas hid_generic sysimgblt aesni_intel mlx5_core fb_sys_fops crypto_simd usbhid cryptd drm i40e pci_hyperv_intf usb_storage glue_helper mlxfw hid ahci libahci wmi
[  711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: G        W  OE     5.11.0+ #1
[  711.779755] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  711.779756] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[  711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.780141] Code: e8 b5 af 34 f4 e9 1f ff ff ff 48 39 c2 74 07 0f 0b e9 69 ff ff ff 4c 89 e7 e8 3c b4 16 00 e9 5c ff ff ff e8 a2 ce fd f3 eb cf <0f> 0b eb cb e8 d7 02 34 f4 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55
[  711.780143] RSP: 0018:ffffa8100dd67c30 EFLAGS: 00010282
[  711.780145] RAX: 00000000ffffffea RBX: ffff89980e792058 RCX: 0000000000000000
[  711.780147] RDX: 0000000000000000 RSI: ffff89a8f9ad8870 RDI: ffff89a8f9ad8870
[  711.780148] RBP: ffffa8100dd67c50 R08: 0000000000000000 R09: fffffffffff99b18
[  711.780149] R10: ffffa8100dd67bd0 R11: ffffa8100dd67908 R12: ffff89980e792000
[  711.780151] R13: ffff89980e792058 R14: ffff89980e7921bc R15: dead000000000100
[  711.780152] FS:  0000000000000000(0000) GS:ffff89a8f9ac0000(0000) knlGS:0000000000000000
[  711.780154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  711.780156] CR2: 00007ffddac6f71f CR3: 00000030bb80a003 CR4: 00000000007706e0
[  711.780157] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  711.780159] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  711.780160] PKRU: 55555554
[  711.780161] Call Trace:
[  711.780163]  ttm_bo_release+0x2ae/0x320 [amdttm]
[  711.780169]  amdttm_bo_put+0x30/0x40 [amdttm]
[  711.780357]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[  711.780543]  amdgpu_gem_object_free+0x8b/0x160 [amdgpu]
[  711.781119]  drm_gem_object_free+0x1d/0x30 [drm]
[  711.781489]  amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x34a/0x380 [amdgpu]
[  711.782044]  kfd_process_device_free_bos+0xe0/0x130 [amdgpu]
[  711.782611]  kfd_process_wq_release+0x286/0x380 [amdgpu]
[  711.783172]  process_one_work+0x236/0x420
[  711.783543]  worker_thread+0x34/0x400
[  711.783911]  ? process_one_work+0x420/0x420
[  711.784279]  kthread+0x126/0x140
[  711.784653]  ? kthread_park+0x90/0x90
[  711.785018]  ret_from_fork+0x22/0x30
[  711.785387] ---[ end trace d8f50f6594817c84 ]---
[  711.798716] [drm] amdgpu: ttm finalized


So it means the process was stuck in some wait_event_killable (maybe here drm_sched_entity_flush) - you can try 'cat /proc/$process_pid/stack' maybe before
you kill it to see where it was stuck so we can go from there.



For graphics apps what I usually see is a crash because of a SIGSEGV when
the app tries to access
an unmapped MMIO region on the device. I haven't tested the compute
stack and so there might
be something I haven't covered. A hang could mean for example waiting on a
fence which is not being
signaled - please provide the full dmesg from this case.


Do you have any good suggestions on how to fix it down the line? (HIP runtime/libhsakmt or driver)

[64036.631333] amdgpu: amdgpu_vm_bo_update failed
[64036.631702] amdgpu: validate_invalid_user_pages: update PTE failed
[64036.640754] amdgpu: amdgpu_vm_bo_update failed
[64036.641120] amdgpu: validate_invalid_user_pages: update PTE failed
[64036.650394] amdgpu: amdgpu_vm_bo_update failed
[64036.650765] amdgpu: validate_invalid_user_pages: update PTE failed


The full dmesg is just the repetition of those two messages:
[186885.764079] amdgpu 0000:43:00.0: amdgpu: amdgpu: finishing device.
[186885.766916] [drm] free PSP TMR buffer
[186893.868173] amdgpu: amdgpu_vm_bo_update failed
[186893.868235] amdgpu: validate_invalid_user_pages: update PTE failed
[186893.876154] amdgpu: amdgpu_vm_bo_update failed
[186893.876190] amdgpu: validate_invalid_user_pages: update PTE failed
[186893.884150] amdgpu: amdgpu_vm_bo_update failed
[186893.884185] amdgpu: validate_invalid_user_pages: update PTE failed


This probably just means we are trying to update PTEs after the physical
device is gone - we usually avoid this by doing all HW shutdowns early,
before PCI remove completes, or, when it's really tricky, by protecting
HW access sections with a drm_dev_enter/exit scope.

For this particular error it would be best to flush
info->restore_userptr_work before the end of amdgpu_pci_remove
(rejecting new process creation and calling
cancel_delayed_work_sync(&process_info->restore_userptr_work) for all
running processes) somewhere in amdgpu_pci_remove.

I tried something like *kfd_process_ref_release* which I think did what you described, but it did not work.


I don't see how kfd_process_ref_release is the same as what I mentioned above; what I meant is calling the code above within kgd2kfd_suspend (where you tried to call kfd_kill_all_user_processes below).

Yes, you are right. It was not called by it.


Instead I tried to kill the process from the kernel, but the amdgpu could **only** be hot-plugged back in successfully if there was no ROCm kernel running when it was plugged out. If not, amdgpu_probe will just hang later. (Maybe because amdgpu was plugged out while in a running state, it leaves a bad HW state which causes probe to hang.)


We usually do an asic_reset during probe to reset all HW state (check if amdgpu_device_init->amdgpu_asic_reset is running when you plug back).

OK



I don’t know if this is a viable solution worth pursuing, but I attached the diff anyway.

Another solution could be to let the compute stack user mode detect a topology change via a generation_count change, and abort gracefully there.

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 4e7d9cb09a69..79b4c9b84cd0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -697,12 +697,15 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
                return;

        /* for runtime suspend, skip locking kfd */
-       if (!run_pm) {
+       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
                /* For first KFD device suspend all the KFD processes */
                if (atomic_inc_return(&kfd_locked) == 1)
                        kfd_suspend_all_processes(force);
        }

+       if (drm_dev_is_unplugged(kfd->ddev))
+               kfd_kill_all_user_processes();
+
        kfd->dqm->ops.stop(kfd->dqm);
        kfd_iommu_suspend(kfd);
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 55c9e1922714..84cbcd857856 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1065,6 +1065,7 @@ void kfd_unref_process(struct kfd_process *p);
 int kfd_process_evict_queues(struct kfd_process *p, bool force);
 int kfd_process_restore_queues(struct kfd_process *p);
 void kfd_suspend_all_processes(bool force);
+void kfd_kill_all_user_processes(void);
 /*
  * kfd_resume_all_processes:
  *     bool sync: If kfd_resume_all_processes() should wait for the
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 6cdc855abb6d..fb0c753b682c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -2206,6 +2206,24 @@ void kfd_suspend_all_processes(bool force)
        srcu_read_unlock(&kfd_processes_srcu, idx);
 }

+
+void kfd_kill_all_user_processes(void)
+{
+       struct kfd_process *p;
+       struct amdkfd_process_info *p_info;
+       unsigned int temp;
+       int idx = srcu_read_lock(&kfd_processes_srcu);
+
+       pr_info("Killing all processes\n");
+       hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+               p_info = p->kgd_process_info;
+               pr_info("Killing  processes, pid = %d", pid_nr(p_info->pid));
+               kill_pid(p_info->pid, SIGBUS, 1);


From looking into kill_pid I see it only sends a signal but doesn't wait for completion; it would make sense to wait for completion here. In any case I would actually try to put here

I have made a version which does that with some atomic counters. Please read later in the diff.


hash_for_each_rcu(p_info)
    cancel_delayed_work_sync(&p_info->restore_userptr_work)

instead; at least that's what I meant in the previous mail.

I actually tried that earlier, and it did not work. The application still keeps running, and you have to send a kill to the user process.

I have made the following version. It waits for processes to terminate synchronously after sending SIGBUS. After that it does the real work of amdgpu_pci_remove.
However, it hangs at amdgpu_device_ip_fini_early when it is trying to deinit ip_block 6 <sdma_v4_0> (https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L2818). I assume there is still some in-flight DMA, and the fini of this IP block therefore hangs?

The following is an excerpt of the dmesg; please excuse my own pr_info lines, but I hope you get my point of where it hangs.

[  392.344735] amdgpu: all processes has been fully released
[  392.346557] amdgpu: amdgpu_acpi_fini done
[  392.346568] amdgpu 0000:b3:00.0: amdgpu: amdgpu: finishing device.
[  392.349238] amdgpu: amdgpu_device_ip_fini_early enter ip_blocks = 9
[  392.349248] amdgpu: Free mem_obj = 000000007bf54275, range_start = 14, range_end = 14
[  392.350299] amdgpu: Free mem_obj = 00000000a85bc878, range_start = 12, range_end = 12
[  392.350304] amdgpu: Free mem_obj = 00000000b8019e32, range_start = 13, range_end = 13
[  392.350308] amdgpu: Free mem_obj = 000000002d296168, range_start = 4, range_end = 11
[  392.350313] amdgpu: Free mem_obj = 000000001fc4f934, range_start = 0, range_end = 3
[  392.350322] amdgpu: amdgpu_amdkfd_suspend(adev, false) done
[  392.350672] amdgpu: hw_fini of IP block[8] <jpeg_v2_5> done 0
[  392.350679] amdgpu: hw_fini of IP block[7] <vcn_v2_5> done 0


I just remembered that the idea of actively killing and waiting for running user processes during unplug was rejected
as a bad idea in the first iteration of the unplug work I did (I don't remember why now, need to look), so this is a no-go.

Maybe an application has kfd open but was not accessing the device, so killing it at unplug could kill the process unnecessarily.
However, the latest version I had with the sleep function got rid of the IP block fini hang.

Our policy is to let zombie processes (zombie in the sense that the underlying device is gone) live as long as they want
(as long as you are able to terminate them - which you do, so that's OK)
and the system should finish PCI remove gracefully and be able to hot-plug the device back. Hence, I suggest dropping
this direction of forcing all user processes to be killed, confirming you have a graceful shutdown and removal of the device
from the PCI topology, and then concentrating on why it hangs when you plug back.

So I basically revert back to the original solution which you suggested.

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 4e7d9cb09a69..5504a18b5a45 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -697,7 +697,7 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
                return;

        /* for runtime suspend, skip locking kfd */
-       if (!run_pm) {
+       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
                /* For first KFD device suspend all the KFD processes */
                if (atomic_inc_return(&kfd_locked) == 1)
                        kfd_suspend_all_processes(force);


First confirm if ASIC reset happens on
next init.

This patch works great for a planned plugout, where all the ROCm processes are killed before plugout, and the device can be added back without a problem.
However, an unplanned plugout while ROCm processes are running just doesn't work.


Still I am not clear whether the ASIC reset happens on plug back or not; can you trace this please?


I tried adding pr_info into the asic_reset functions, but could not trace any of them upon plug-back.


This could possibly explain the hang on plug back. Can you see why we don't get there?


Is amdgpu supposed to do an asic_reset each time it is probed? Right now it seems to probe OK (it did not hang). I will trace back further.


Yep



Second, please confirm whether the timing of manually killing the user process has an impact on whether you get a hang
on the next plug back (if you kill before

Scenario 0: Kill before plug back

1. echo 1 > /sys/bus/pci/…/remove would finish,
but the application won't exit until there is a kill signal.


Why do you think it must exit?

Because ROCm needs to release the drm file descriptor to get amdgpu_amdkfd_device_fini_sw called, which would eventually call kgd2kfd_device_exit. This would clean up the kfd_topology at least. Otherwise I don't see how it could be added back without messing up the kfd topology, to say the least.

However, those are all based on my own observations. Please explain why it does not need to exit if you believe so.


Note that when you add back a new device, a pci device and a drm device are created. I am not an expert on the KFD code, but I believe a new KFD device is also created, independent of the old one, and so the topology should see just 2 device instances (one old zombie and one real new one). I know at least this wasn't an issue for the graphics stack in the exact same scenario, and the libdrm tests I pointed to test exactly this scenario.

Yes, regardless of the OOPS in ttm_bo_cleanup_refs, I plugged back the GPU, and I think it got probed all right; however, the old kfd node is still there.
I passed the libdrm basic test suite on the plugged-back device. The bo test hangs out of the box even without hotplug (see dmesg below).

 kernel:[ 1609.029125] watchdog: BUG: soft lockup - CPU#39 stuck for 89s! [amdgpu_test:36407]
[  +0.000407] Code: 48 89 47 18 48 89 47 20 48 89 47 28 48 89 47 30 48 89 47 38 48 8d 7f 40 75 d9 90 c3 0f 1f 80 00 00 00 00 b9 00 10 00 00 31 c0 <f3> aa c3 cc cc cc cc cc cc 0f 1f 44 00 00 48 85 ff 0f 84 f2 00 00
[  +0.000856] RSP: 0018:ffffb996b57b3c40 EFLAGS: 00010246
[  +0.000434] RAX: 0000000000000000 RBX: ffff9cc7f8706e88 RCX: 0000000000000980
[  +0.000436] RDX: fffff935b17d9140 RSI: fffff935b17e0000 RDI: ffff9c831f645680
[  +0.000439] RBP: 0000000000000400 R08: fffff935b17d0000 R09: 0000000000000000
[  +0.000447] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000000a
[  +0.000437] R13: ffff9cc783980a20 R14: 000000000b5dbc00 R15: ffff9cc7f8706078
[  +0.000438] FS:  00007ff1ef611300(0000) GS:ffff9d453efc0000(0000) knlGS:0000000000000000
[  +0.000445] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000462] CR2: 00007f418bbb9320 CR3: 000000819fa84006 CR4: 0000000000770ee0
[  +0.000466] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000455] PKRU: 55555554
[  +0.000451] Call Trace:
[  +0.000448]  ttm_pool_free+0x110/0x230 [ttm]
[  +0.000451]  ttm_tt_unpopulate+0x5e/0xb0 [ttm]
[  +0.000454]  ttm_tt_destroy_common+0xe/0x30 [ttm]
[  +0.000453]  amdgpu_ttm_backend_destroy+0x1e/0x70 [amdgpu]
[  +0.000569]  ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
[  +0.000458]  ttm_bo_release+0x286/0x500 [ttm]
[  +0.000450]  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[  +0.000544]  amdgpu_gem_object_free+0xad/0x160 [amdgpu]
[  +0.000534]  drm_gem_object_release_handle+0x6a/0x80 [drm]
[  +0.000476]  drm_gem_handle_delete+0x5b/0xa0 [drm]
[  +0.000465]  ? drm_gem_handle_create+0x40/0x40 [drm]
[  +0.000469]  drm_ioctl_kernel+0xab/0xf0 [drm]
[  +0.000458]  drm_ioctl+0x1ec/0x390 [drm]
[  +0.000440]  ? drm_gem_handle_create+0x40/0x40 [drm]
[  +0.000438]  ? selinux_file_ioctl+0x17d/0x220
[  +0.000423]  ? lock_release+0x1ce/0x270
[  +0.000416]  ? trace_hardirqs_on+0x1b/0xd0
[  +0.000418]  ? _raw_spin_unlock_irqrestore+0x2d/0x40
[  +0.000419]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[  +0.000499]  __x64_sys_ioctl+0x80/0xb0
[  +0.000414]  do_syscall_64+0x3a/0x70
[  +0.000400]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000387] RIP: 0033:0x7ff1ef7263db
[  +0.000371] Code: 0f 1e fa 48 8b 05 b5 7a 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 85 7a 0d 00 f7 d8 64 89 01 48
[  +0.000763] RSP: 002b:00007ffdf1cd0278 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.000386] RAX: ffffffffffffffda RBX: 00007ffdf1cd02b0 RCX: 00007ff1ef7263db
[  +0.000383] RDX: 00007ffdf1cd02b0 RSI: 0000000040086409 RDI: 0000000000000007
[  +0.000396] RBP: 0000000040086409 R08: 00005574eefd5c60 R09: 00005574eefdd360
[  +0.000391] R10: 00005574eefd4010 R11: 0000000000000246 R12: 00005574eefd66d8
[  +0.000386] R13: 0000000000000007 R14: 0000000000000000 R15: 00007ff1ef830143


I also tried to run the tf benchmark against the newly plugged nodes (one of the nodes is a dummy), but it failed.
Can we have some confirmation from the KFD team that they have considered a zombie kfd node?


Also note that even with a running graphics stack there is always a KFD device and KFD topology present, but of course probably not the same as when you run a KFD-facing process, so there could be some issues there.

Also note that because of this patch https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=267d51d77fdae8708b94e1a24b8e5d961297edb7 all MMIO accesses from such zombie/orphan user processes will be remapped to the zero page, so they will not necessarily experience a segfault when device removal happens, but rather maybe a crash due to NULL data read from MMIO by the process and used in some manner.

It depends on where the application is when the device is plugged out.

For example, in one case the application keeps saying out-of-memory but won't exit.
In another case, it would wait for a signal.

2022-04-18 12:42:38.939303: E tensorflow/stream_executor/rocm/rocm_driver.cc:692] failed to alloc 2304 bytes on host: HIP_ERROR_OutOfMemory
2022-04-18 12:42:38.939322: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 2304
2022-04-18 12:42:38.940772: E tensorflow/stream_executor/rocm/rocm_driver.cc:692] failed to alloc 2304 bytes on host: HIP_ERROR_OutOfMemory
2022-04-18 12:42:38.940791: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 2304
2022-04-18 12:42:38.942379: E tensorflow/stream_executor/rocm/rocm_driver.cc:692] failed to alloc 2304 bytes on host: HIP_ERROR_OutOfMemory
2022-04-18 12:42:38.942399: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 2304
2022-04-18 12:42:38.943829: E tensorflow/stream_executor/rocm/rocm_driver.cc:692] failed to alloc 2304 bytes on host: HIP_ERROR_OutOfMemory
2022-04-18 12:42:38.943849: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 2304
2022-04-18 12:42:38.945272: E tensorflow/stream_executor/rocm/rocm_driver.cc:692] failed to alloc 2304 bytes on host: HIP_ERROR_OutOfMemory
2022-04-18 12:42:38.945292: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 2304



2. Kill the process. The application does several things and seems to trigger drm_release in the kernel, which is met with a kernel NULL pointer dereference related to sysfs_remove. Then the whole fs just freezes.

[  +0.002498] BUG: kernel NULL pointer dereference, address: 0000000000000098
[  +0.000486] #PF: supervisor read access in kernel mode
[  +0.000545] #PF: error_code(0x0000) - not-present page
[  +0.000551] PGD 0 P4D 0
[  +0.000553] Oops: 0000 [#1] SMP NOPTI
[  +0.000540] CPU: 75 PID: 4911 Comm: kworker/75:2 Tainted: G        W   E     5.13.0-kfd #1
[  +0.000559] Hardware name: INGRASYS         TURING  /MB      , BIOS K71FQ28A 10/05/2021
[  +0.000567] Workqueue: events delayed_fput
[  +0.000563] RIP: 0010:kernfs_find_ns+0x1b/0x100
[  +0.000569] Code: ff ff e8 88 59 9f 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 8b 05 df f0 7b 01 41 56 41 55 49 89 f5 41 54 55 48 89 fd 53 <44> 0f b7 b7 98 00 00 00 48 89 d3 4c 8b 67 70 66 41 83 e6 20 41 0f
[  +0.001193] RSP: 0018:ffffb9875db5fc98 EFLAGS: 00010246
[  +0.000602] RAX: 0000000000000000 RBX: ffffa101f79507d8 RCX: 0000000000000000
[  +0.000612] RDX: 0000000000000000 RSI: ffffffffc09a9417 RDI: 0000000000000000
[  +0.000604] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[  +0.000760] R10: ffffb9875db5fcd0 R11: 0000000000000000 R12: 0000000000000000
[  +0.000597] R13: ffffffffc09a9417 R14: ffffa08363fb2d18 R15: 0000000000000000
[  +0.000702] FS:  0000000000000000(0000) GS:ffffa0ffbfcc0000(0000) knlGS:0000000000000000
[  +0.000666] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000658] CR2: 0000000000000098 CR3: 0000005747812005 CR4: 0000000000770ee0
[  +0.000715] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000655] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000592] PKRU: 55555554
[  +0.000580] Call Trace:
[  +0.000591]  kernfs_find_and_get_ns+0x2f/0x50
[  +0.000584]  sysfs_remove_file_from_group+0x20/0x50
[  +0.000580]  amdgpu_ras_sysfs_remove+0x3d/0xd0 [amdgpu]
[  +0.000737]  amdgpu_ras_late_fini+0x1d/0x40 [amdgpu]
[  +0.000750]  amdgpu_sdma_ras_fini+0x96/0xb0 [amdgpu]
[  +0.000742]  ? gfx_v10_0_resume+0x10/0x10 [amdgpu]
[  +0.000738]  sdma_v4_0_sw_fini+0x23/0x90 [amdgpu]
[  +0.000717]  amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
[  +0.000704]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000687]  drm_dev_release+0x20/0x40 [drm]
[  +0.000583]  drm_release+0xa8/0xf0 [drm]
[  +0.000584]  __fput+0xa5/0x250
[  +0.000621]  delayed_fput+0x1f/0x30
[  +0.000726]  process_one_work+0x26e/0x580
[  +0.000581]  ? process_one_work+0x580/0x580
[  +0.000611]  worker_thread+0x4d/0x3d0
[  +0.000611]  ? process_one_work+0x580/0x580
[  +0.000605]  kthread+0x117/0x150
[  +0.000611]  ? kthread_park+0x90/0x90
[  +0.000619]  ret_from_fork+0x1f/0x30
[  +0.000625] Modules linked in: amdgpu(E) xt_conntrack xt_MASQUERADE nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter x86_pkg_temp_thermal cdc_ether usbnet acpi_pad msr ip_tables x_tables ast drm_vram_helper iommu_v2 drm_ttm_helper gpu_sched ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks [last unloaded: amdgpu]


This is a known regression: all SYSFS components must be removed before the pci_remove code runs, otherwise you get either warnings for single file removals or
OOPSes for sysfs group removals like here. Please try to move amdgpu_ras_sysfs_remove from amdgpu_ras_late_fini to the end of amdgpu_ras_pre_fini (which happens before pci remove).


I fixed it in the newer patch, please see it below.



I first plug out the device, then kill the ROCm user process. Then it hits other OOPSes related to ttm_bo_cleanup_refs.

[  +0.000006] BUG: kernel NULL pointer dereference, address: 0000000000000010
[  +0.000349] #PF: supervisor read access in kernel mode
[  +0.000340] #PF: error_code(0x0000) - not-present page
[  +0.000341] PGD 0 P4D 0
[  +0.000336] Oops: 0000 [#1] SMP NOPTI
[  +0.000345] CPU: 9 PID: 95 Comm: kworker/9:1 Tainted: G        W   E     5.13.0-kfd #1
[  +0.000367] Hardware name: INGRASYS         TURING  /MB      , BIOS K71FQ28A 10/05/2021
[  +0.000376] Workqueue: events delayed_fput
[  +0.000422] RIP: 0010:ttm_resource_free+0x24/0x40 [ttm]
[  +0.000464] Code: 00 00 0f 1f 40 00 0f 1f 44 00 00 53 48 89 f3 48 8b 36 48 85 f6 74 21 48 8b 87 28 02 00 00 48 63 56 10 48 8b bc d0 b8 00 00 00 <48> 8b 47 10 ff 50 08 48 c7 03 00 00 00 00 5b c3 66 66 2e 0f 1f 84
[  +0.001009] RSP: 0018:ffffb21c59413c98 EFLAGS: 00010282
[  +0.000515] RAX: ffff8b1aa4285f68 RBX: ffff8b1a823b5ea0 RCX: 00000000002a000c
[  +0.000536] RDX: 0000000000000000 RSI: ffff8b1acb84db80 RDI: 0000000000000000
[  +0.000539] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffffc03c3e00
[  +0.000543] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b1a823b5ec8
[  +0.000542] R13: 0000000000000000 R14: ffff8b1a823b5d90 R15: ffff8b1a823b5ec8
[  +0.000544] FS:  0000000000000000(0000) GS:ffff8b187f440000(0000) knlGS:0000000000000000
[  +0.000559] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000575] CR2: 0000000000000010 CR3: 00000076e6812004 CR4: 0000000000770ee0
[  +0.000575] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000579] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000575] PKRU: 55555554
[  +0.000568] Call Trace:
[  +0.000567]  ttm_bo_cleanup_refs+0xe4/0x290 [ttm]
[  +0.000588]  ttm_bo_delayed_delete+0x147/0x250 [ttm]
[  +0.000589]  ttm_device_fini+0xad/0x1b0 [ttm]
[  +0.000590]  amdgpu_ttm_fini+0x2a7/0x310 [amdgpu]
[  +0.000730]  gmc_v9_0_sw_fini+0x3a/0x40 [amdgpu]
[  +0.000753]  amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
[  +0.000734]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000737]  drm_dev_release+0x20/0x40 [drm]
[  +0.000626]  drm_release+0xa8/0xf0 [drm]
[  +0.000625]  __fput+0xa5/0x250
[  +0.000606]  delayed_fput+0x1f/0x30
[  +0.000607]  process_one_work+0x26e/0x580
[  +0.000608]  ? process_one_work+0x580/0x580
[  +0.000616]  worker_thread+0x4d/0x3d0
[  +0.000614]  ? process_one_work+0x580/0x580
[  +0.000617]  kthread+0x117/0x150
[  +0.000615]  ? kthread_park+0x90/0x90
[  +0.000621]  ret_from_fork+0x1f/0x30
[  +0.000603] Modules linked in: amdgpu(E) xt_conntrack xt_MASQUERADE nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter x86_pkg_temp_thermal cdc_ether usbnet acpi_pad msr ip_tables x_tables ast drm_vram_helper drm_ttm_helper iommu_v2 ttm gpu_sched drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks [last unloaded: amdgpu]
[  +0.002840] CR2: 0000000000000010
[  +0.000755] ---[ end trace 9737737402551e39 ]--


This looks like another regression - try seeing where the NULL reference is and then we can see how to avoid it.


These are the lines of code:

(gdb) l *(ttm_bo_cleanup_refs+0xe4)
0x19c4 is in ttm_bo_cleanup_refs (drivers/gpu/drm/ttm/ttm_bo.c:360).
355             ttm_bo_move_to_pinned(bo);
356             list_del_init(&bo->ddestroy);
357             spin_unlock(&bo->bdev->lru_lock);
358             ttm_bo_cleanup_memtype_use(bo);
359
360             if (unlock_resv)
361                     dma_resv_unlock(amdkcl_ttm_resvp(bo));
362
363             ttm_bo_put(bo);
364
(gdb) l *(ttm_resource_free+0x24)
0x57f4 is in ttm_resource_free (drivers/gpu/drm/ttm/ttm_resource.c:65).
60
61              if (!*res)
62                      return;
63
64              man = ttm_manager_type(bo->bdev, (*res)->mem_type);
65              man->func->free(man, *res);
66              *res = NULL;
67      }
68      EXPORT_SYMBOL(ttm_resource_free);
69


3.  echo 1 > /sys/bus/pci/rescan. This would just hang. I assume the sysfs is broken.

Based on 1 & 2, it seems that 1 won't let amdgpu exit gracefully, because 2 does some cleanup that maybe should have happened before 1.

or if you kill after plug back, does it make a difference).

Scenario 2: Kill after plug back

If I perform the rescan before the kill, then the driver seems to probe fine. But the kill hits the same issue, which messes up the sysfs the same way as in Scenario 0.


Final Comments:

0. cancel_delayed_work_sync(&p_info->restore_userptr_work) makes the repetition of the amdgpu_vm_bo_update failures go away, but it does not solve the issues in those scenarios.


Still - it's better to do it this way, even just for those failures to go away.

cancel_delayed_work is insufficient; you need to make sure the work won't be processed after plugout. Please see my patch.


Saw, see my comment.



1. For a planned hotplug, this patch should work as long as you follow some protocol, i.e. kill before plugout. Is this patch acceptable, since it provides more functionality than before?


Let's try to fix more as I advised above.


2. For an unplanned hotplug while a ROCm app is running, the patch that kills all processes and waits for 5 sec works consistently. But it seems that it is an unacceptable solution for an official release; I can hold it for our own internal usage. It seems that killing after removal causes problems, and I don't know if there is a quick fix by me, because of my limited understanding of the amdgpu driver. Maybe AMD could have a quick fix, or it may really be a difficult one. This feature may or may not be a blocking issue in our GPU disaggregation research down the road. Please let us know in either case; we would like to learn and help as much as we can!


I am currently not sure why it helps. I will need to set up my own ROCm stack and retest hot plug to check this in more depth, but currently I have higher priorities. Please try to confirm that the ASIC reset always takes place on plug back,
and fix the sysfs OOPS as I advised above to clear up at least some of the issues. Also please describe to me exactly what steps you take to reproduce this scenario, so later I might be able to do it myself.

I can still try to help fix the bug in my spare time. My setup is as follows:


  1.  I have a server with 4 AMD MI100 GPUs. I think 1 GPU would also work.
  2.  I used https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/tree/roc-5.0.x as the starting point, and applied Mukul's patch and my patch.
  3.  Then I run a tensorflow benchmark from a docker.
     *   docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/tensorflow:rocm4.5.2-tf1.15-dev
     *   And run the following benchmark in the docker:  python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=4 --batch_size=32 --model=resnet50 --variable_update=parameter_server
        *   You might need to adjust the num_gpus parameter based on your setup
  4.  Remove a GPU at random time.
  5.  Do whatever is needed before plugback, and re-verify that the benchmark can still run.

Also, we have a hotplug test suite in libdrm (graphics stack), so maybe you can install libdrm and run that test suite to see if it exposes more issues.

OK I could try it some time.


I tried suite 13, the hot-unplug test, but it says it got killed. There are some oopses in dmesg during ttm_pool_free_page.

Userspace log:

$ sudo ./tests/amdgpu/amdgpu_test -f -s 13


The ASIC NOT support UVD, suite disabled


The ASIC NOT support VCE, suite disabled


The ASIC NOT support UVD ENC, suite disabled.


Don't support TMZ (trust memory zone), security suite disabled


     CUnit - A unit testing framework for C - Version 2.1-3
     http://cunit.sourceforge.net/


Suite: Hotunplug Tests
  Test: Unplug card and rescan the bus to plug it back …Killed

Dmesg log:
[  +0.000479] BUG: unable to handle page fault for address: ffffc01343fc81b4
[  +0.000054] #PF: supervisor write access in kernel mode
[  +0.000033] #PF: error_code(0x0002) - not-present page
[  +0.000032] PGD 807ffc1067 P4D 807ffc1067 PUD 807ffc0067 PMD 0
[  +0.000038] Oops: 0002 [#1] SMP NOPTI
[  +0.000025] CPU: 92 PID: 7534 Comm: amdgpu_test Tainted: G        W   E     5.13.0-kfd #1
[  +0.000048] Hardware name: INGRASYS         TURING  /MB      , BIOS K71FQ28A 10/05/2021
[  +0.000045] RIP: 0010:__free_pages+0xc/0x80
[  +0.000031] Code: 01 00 74 0f 0f b6 77 51 85 f6 74 07 31 d2 e9 3b dc ff ff e9 66 ff ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 41 54 55 48 89 fd 53 <f0> ff 4f 34 74 46 48 8b 07 a9 00 00 01 00 75 54 44 8d 66 ff 85 f6
[  +0.000103] RSP: 0018:ffff96f71ba6fd60 EFLAGS: 00010246
[  +0.000032] RAX: 00000000ffffffff RBX: ffff89f1ccf86078 RCX: 0000000003fc8180
[  +0.000041] RDX: ffff89f1b4746000 RSI: 0000000000000000 RDI: ffffc01343fc8180
[  +0.000042] RBP: ffffc01343fc8180 R08: 0000000000000000 R09: 0000000000000246
[  +0.000040] R10: 00000080b4746fff R11: 0000000000000003 R12: 0000000000000000
[  +0.000041] R13: ffff89f1ccf85f80 R14: ffff89f1ccf86ef8 R15: ffff8972293b0000
[  +0.000042] FS:  00007fcfb843a300(0000) GS:ffff89ef80100000(0000) knlGS:0000000000000000
[  +0.000046] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000033] CR2: ffffc01343fc81b4 CR3: 0000000178154006 CR4: 0000000000770ee0
[  +0.000041] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000041] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000041] PKRU: 55555554
[  +0.000017] Call Trace:
[  +0.000018]  ttm_pool_free_page+0x69/0x90 [ttm]
[  +0.000038]  ttm_pool_type_fini+0x58/0x70 [ttm]
[  +0.000034]  ttm_pool_fini+0x30/0x50 [ttm]
[  +0.000031]  ttm_device_fini+0xf3/0x1b0 [ttm]
[  +0.000032]  amdgpu_ttm_fini+0x2a7/0x310 [amdgpu]
[  +0.000265]  gmc_v9_0_sw_fini+0x3a/0x40 [amdgpu]
[  +0.000246]  amdgpu_device_fini_sw+0xae/0x260 [amdgpu]
[  +0.000219]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000219]  drm_dev_release+0x20/0x40 [drm]
[  +0.000059]  drm_release+0xa8/0xf0 [drm]
[  +0.000053]  __fput+0xa5/0x250
[  +0.000023]  task_work_run+0x5c/0xa0
[  +0.000026]  exit_to_user_mode_prepare+0x1db/0x1e0
[  +0.000033]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000030]  do_syscall_64+0x47/0x70
[  +0.000018]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000025] RIP: 0033:0x7fcfb86403d7
[  +0.000869] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 f3 fb ff ff
[  +0.001788] RSP: 002b:00007ffc8fc26c28 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000888] RAX: 0000000000000000 RBX: 000055d67a05b6a0 RCX: 00007fcfb86403d7
[  +0.000867] RDX: 00007fcfb8627be0 RSI: 0000000000000000 RDI: 0000000000000003
[  +0.000846] RBP: 000055d67a05b8a0 R08: 0000000000000007 R09: 0000000000000000
[  +0.000816] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000791] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fcfb8659143
[  +0.000770] Modules linked in: amdgpu(E) xt_conntrack xt_MASQUERADE nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter x86_pkg_temp_thermal cdc_ether usbnet acpi_pad msr ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm iommu_v2 gpu_sched drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks [last unloaded: amdgpu]
[  +0.003303] CR2: ffffc01343fc81b4
[  +0.000799] ---[ end trace 2360927435b19009 ]---



The following is the new diff.

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 182b7eae598a..48c3cd4054de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1327,7 +1327,7 @@ int emu_soc_asic_init(struct amdgpu_device *adev);
  * ASICs macro.
  */
 #define amdgpu_asic_set_vga_state(adev, state) (adev)->asic_funcs->set_vga_state((adev), (state))
-#define amdgpu_asic_reset(adev) (adev)->asic_funcs->reset((adev))
+#define amdgpu_asic_reset(adev) ({int r; pr_info("performing amdgpu_asic_reset\n"); r = (adev)->asic_funcs->reset((adev));r;})
 #define amdgpu_asic_reset_method(adev) (adev)->asic_funcs->reset_method((adev))
 #define amdgpu_asic_get_xclk(adev) (adev)->asic_funcs->get_xclk((adev))
 #define amdgpu_asic_set_uvd_clocks(adev, v, d) (adev)->asic_funcs->set_uvd_clocks((adev), (v), (d))
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 27c74fcec455..842abd7150a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -134,6 +134,7 @@ struct amdkfd_process_info {

  /* MMU-notifier related fields */
  atomic_t evicted_bos;
+ atomic_t invalid;
  struct delayed_work restore_userptr_work;
  struct pid *pid;
 };
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 99d2b15bcbf3..2a588eb9f456 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1325,6 +1325,7 @@ static int init_kfd_vm(struct amdgpu_vm *vm, void **process_info,

  info->pid = get_task_pid(current->group_leader, PIDTYPE_PID);
  atomic_set(&info->evicted_bos, 0);
+ atomic_set(&info->invalid, 0);
  INIT_DELAYED_WORK(&info->restore_userptr_work,
   amdgpu_amdkfd_restore_userptr_worker);

@@ -2693,6 +2694,9 @@ static void amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
  struct mm_struct *mm;
  int evicted_bos;

+ if (atomic_read(&process_info->invalid))
+ return;
+


Probably better to again use a drm_dev_enter/exit guard pair instead of this flag.


I don’t know if I could use drm_dev_enter/exit efficiently because a process can have multiple drm_devs open. And I don’t know how I can recover/refer to the drm_dev(s) efficiently in the worker function in order to use drm_dev_enter/exit.


I think that within the KFD code each kfd device belongs to or points to one specific drm_device, so I don't think this is a problem.



  evicted_bos = atomic_read(&process_info->evicted_bos);
  if (!evicted_bos)
  return;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ec38517ab33f..e7d85d8d282d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1054,6 +1054,7 @@ void amdgpu_device_program_register_sequence(struct amdgpu_device *adev,
  */
 void amdgpu_device_pci_config_reset(struct amdgpu_device *adev)
 {
+ pr_debug("%s called\n",__func__);
  pci_write_config_dword(adev->pdev, 0x7c, AMDGPU_ASIC_RESET_DATA);
 }

@@ -1066,6 +1067,7 @@ void amdgpu_device_pci_config_reset(struct amdgpu_device *adev)
  */
 int amdgpu_device_pci_reset(struct amdgpu_device *adev)
 {
+ pr_debug("%s called\n",__func__);
  return pci_reset_function(adev->pdev);
 }

@@ -4702,6 +4704,8 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
  bool need_full_reset, skip_hw_reset, vram_lost = false;
  int r = 0;

+ pr_debug("%s called\n",__func__);
+
  /* Try reset handler method first */
  tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
     reset_list);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 49bdf9ff7350..b469acb65c1e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2518,7 +2518,6 @@ void amdgpu_ras_late_fini(struct amdgpu_device *adev,
  if (!ras_block || !ih_info)
  return;

- amdgpu_ras_sysfs_remove(adev, ras_block);
  if (ih_info->cb)
  amdgpu_ras_interrupt_remove_handler(adev, ih_info);
 }
@@ -2577,6 +2576,7 @@ void amdgpu_ras_suspend(struct amdgpu_device *adev)
 int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
 {
  struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+ struct ras_manager *obj, *tmp;

  if (!adev->ras_enabled || !con)
  return 0;
@@ -2585,6 +2585,13 @@ int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
  /* Need disable ras on all IPs here before ip [hw/sw]fini */
  amdgpu_ras_disable_all_features(adev, 0);
  amdgpu_ras_recovery_fini(adev);
+
+ /* remove sysfs before pci_remove to avoid OOPSES from sysfs_remove_groups */
+ list_for_each_entry_safe(obj, tmp, &con->head, node) {
+ amdgpu_ras_sysfs_remove(adev, &obj->head);
+ put_obj(obj);
+ }
+
  return 0;
 }

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 4e7d9cb09a69..0fa806a78e39 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -693,16 +693,35 @@ bool kfd_is_locked(void)

 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
 {
+ struct kfd_process *p;
+ struct amdkfd_process_info *p_info;
+ unsigned int temp;
+
  if (!kfd->init_complete)
  return;

  /* for runtime suspend, skip locking kfd */
- if (!run_pm) {
+ if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
  /* For first KFD device suspend all the KFD processes */
  if (atomic_inc_return(&kfd_locked) == 1)
  kfd_suspend_all_processes(force);
  }

+ if (drm_dev_is_unplugged(kfd->ddev)){
+ int idx = srcu_read_lock(&kfd_processes_srcu);
+ pr_debug("cancel restore_userptr_work\n");
+ hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+ if ( kfd_process_gpuidx_from_gpuid(p, kfd->id) >= 0 ) {
+ p_info = p->kgd_process_info;
+ pr_debug("cancel processes, pid = %d for gpu_id = %d", pid_nr(p_info->pid), kfd->id);
+ cancel_delayed_work_sync(&p_info->restore_userptr_work);
+ /* block all future restore_userptr_work */
+ atomic_inc(&p_info->invalid);


Same as I mentioned above with drm_dev_enter/exit.

Same as I mentioned: the process can have many drm_devs open.

Final comments:

I suspect that my Linux kernel version might not have all the fixes you did for hotplug. Can you give me a pointer to the lowest kernel version (5.14 from the Linux kernel repo? amd-staging-drm-next does not work for MI100 out of the box) that would pass all libdrm tests including the hotplug tests (some tests hang, some fail now)?


That's a problem. The latest working baseline I tested and confirmed passing the hotplug tests is this branch and commit https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6, which is amd-staging-drm-next. 5.14 was the branch we upstreamed the hotplug code to, but it accumulated a lot of regressions over time due to new changes (that's why I added the hotplug test, to try and catch them early). It would be best to run this branch on MI100 so we have a clean baseline, and only after confirming that this particular branch at this commit passes the libdrm tests should you start adding the KFD-specific addons. Another option, if you can't work with MI100 and this branch, is to try a different ASIC that does work with this branch (if possible).

Andrey

OK, I tried both this commit and the HEAD of amd-staging-drm-next on two GPUs (MI100 and Radeon VII); neither passed the hotplugout libdrm test. I might be able to gain access to an MI200, but I suspect it would not work either.

I copied the complete dmesg logs as follows. I highlighted the oopses for you.

Radeon VII:

[Apr20 18:01] amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
[  +0.000509] amdgpu 0000:05:00.0: amdgpu: ras disable gfx failed poison:0 ret:-22
[  +0.014459] [drm] free PSP TMR buffer
[  +0.000003] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000230] ------------[ cut here ]------------
[  +0.000001] WARNING: CPU: 16 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000105] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000051]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000026] CPU: 16 PID: 2834 Comm: amdgpu_test Not tainted 5.16.0+ #3
[  +0.000003] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000001] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000103] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000002] RSP: 0018:ffffafed0ed27b28 EFLAGS: 00010282
[  +0.000002] RAX: 00000000ffffffea RBX: ffff9e0c589d9458 RCX: 0000000000000001
[  +0.000002] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000001] RBP: ffffafed0ed27b48 R08: 0000000000000000 R09: 0000000000000001
[  +0.000001] R10: ffffafed0ed27bd0 R11: ffffafed0ed277c0 R12: ffff9e0c589d9400
[  +0.000001] R13: ffff9e0c589d9458 R14: ffff9e0c589d9558 R15: ffff9e0c5b785360
[  +0.000001] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fc00000(0000) knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000001] CR2: 00007fff89079ea8 CR3: 000000010e764004 CR4: 00000000001706e0
[  +0.000002] Call Trace:
[  +0.000002]  <TASK>
[  +0.000003]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000008]  ? vprintk_default+0x1d/0x20
[  +0.000006]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000004]  amdgpu_bo_free_kernel+0xc7/0x120 [amdgpu]
[  +0.000102]  psp_tmr_terminate+0xa4/0xd0 [amdgpu]
[  +0.000125]  psp_hw_fini+0x6e/0x110 [amdgpu]
[  +0.000122]  amdgpu_device_fini_hw+0x1e5/0x3b0 [amdgpu]
[  +0.000098]  amdgpu_driver_unload_kms+0x4b/0x60 [amdgpu]
[  +0.000098]  amdgpu_pci_remove+0x46/0x60 [amdgpu]
[  +0.000096]  pci_device_remove+0x39/0xb0
[  +0.000008]  device_release_driver_internal+0xfe/0x1d0
[  +0.000005]  device_release_driver+0x12/0x20
[  +0.000001]  pci_stop_bus_device+0x68/0x90
[  +0.000003]  pci_stop_and_remove_bus_device_locked+0x1a/0x30
[  +0.000003]  remove_store+0x7c/0x90
[  +0.000003]  dev_attr_store+0x17/0x30
[  +0.000005]  sysfs_kf_write+0x3c/0x50
[  +0.000004]  kernfs_fop_write_iter+0x13c/0x1b0
[  +0.000004]  new_sync_write+0x11a/0x1b0
[  +0.000006]  vfs_write+0x247/0x2a0
[  +0.000003]  ksys_write+0xa7/0xe0
[  +0.000003]  ? fpregs_assert_state_consistent+0x23/0x50
[  +0.000005]  __x64_sys_write+0x1a/0x20
[  +0.000003]  do_syscall_64+0x3a/0xb0
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7fb8e76e8371
[  +0.000003] Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 69 8c 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 9a d0 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
[  +0.000001] RSP: 002b:00007fff89083f68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000003] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb8e76e8371
[  +0.000001] RDX: 0000000000000001 RSI: 0000562815835316 RDI: 0000000000000005
[  +0.000001] RBP: 0000000000000005 R08: 0000562815f732b0 R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000001] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000003]  </TASK>
[  +0.000001] ---[ end trace 7e6704984e7ed0f1 ]---
[  +0.020476] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000143] ------------[ cut here ]------------
[  +0.000001] WARNING: CPU: 16 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000104] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000044]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000021] CPU: 16 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000002] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000001] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000102] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000002] RSP: 0018:ffffafed0ed27b30 EFLAGS: 00010282
[  +0.000002] RAX: 00000000ffffffea RBX: ffff9e0c589de058 RCX: 0000000000000001
[  +0.000001] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000002] RBP: ffffafed0ed27b50 R08: 0000000000000000 R09: 0000000000000001
[  +0.000001] R10: ffffafed0df0fd80 R11: ffffafed0ed277c8 R12: ffff9e0c589de000
[  +0.000001] R13: ffff9e0c589de058 R14: ffff9e0c589de158 R15: ffff9e0c5b785360
[  +0.000001] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fc00000(0000) knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000001] CR2: 00007fff89079ea8 CR3: 000000010e764004 CR4: 00000000001706e0
[  +0.000001] Call Trace:
[  +0.000001]  <TASK>
[  +0.000002]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000005]  ? __cond_resched+0x1d/0x30
[  +0.000006]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000004]  amdgpu_bo_free_kernel+0xc7/0x120 [amdgpu]
[  +0.000102]  psp_v11_0_ring_destroy+0x41/0x50 [amdgpu]
[  +0.000126]  psp_hw_fini+0x86/0x110 [amdgpu]
[  +0.000129]  amdgpu_device_fini_hw+0x1e5/0x3b0 [amdgpu]
[  +0.000101]  amdgpu_driver_unload_kms+0x4b/0x60 [amdgpu]
[  +0.000102]  amdgpu_pci_remove+0x46/0x60 [amdgpu]
[  +0.000100]  pci_device_remove+0x39/0xb0
[  +0.000003]  device_release_driver_internal+0xfe/0x1d0
[  +0.000003]  device_release_driver+0x12/0x20
[  +0.000001]  pci_stop_bus_device+0x68/0x90
[  +0.000003]  pci_stop_and_remove_bus_device_locked+0x1a/0x30
[  +0.000002]  remove_store+0x7c/0x90
[  +0.000003]  dev_attr_store+0x17/0x30
[  +0.000003]  sysfs_kf_write+0x3c/0x50
[  +0.000003]  kernfs_fop_write_iter+0x13c/0x1b0
[  +0.000004]  new_sync_write+0x11a/0x1b0
[  +0.000003]  vfs_write+0x247/0x2a0
[  +0.000003]  ksys_write+0xa7/0xe0
[  +0.000003]  ? fpregs_assert_state_consistent+0x23/0x50
[  +0.000003]  __x64_sys_write+0x1a/0x20
[  +0.000003]  do_syscall_64+0x3a/0xb0
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000003] RIP: 0033:0x7fb8e76e8371
[  +0.000002] Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 69 8c 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 9a d0 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
[  +0.000002] RSP: 002b:00007fff89083f68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000002] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb8e76e8371
[  +0.000001] RDX: 0000000000000001 RSI: 0000562815835316 RDI: 0000000000000005
[  +0.000001] RBP: 0000000000000005 R08: 0000562815f732b0 R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000001] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000003]  </TASK>
[  +0.000001] ---[ end trace 7e6704984e7ed0f2 ]---
[  +0.000627] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000174] ------------[ cut here ]------------
[  +0.000001] WARNING: CPU: 16 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000110] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000046]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000025] CPU: 16 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000003] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000002] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000107] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000002] RSP: 0018:ffffafed0ed27b58 EFLAGS: 00010282
[  +0.000002] RAX: 00000000ffffffea RBX: ffff9e0c589d8c58 RCX: 0000000000000001
[  +0.000001] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000001] RBP: ffffafed0ed27b78 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: ffffafed0ed27b08 R11: ffffafed0ed277f0 R12: ffff9e0c589d8c00
[  +0.000001] R13: ffff9e0c589d8c58 R14: ffff9e0c589d8d58 R15: ffff9e0c5b785360
[  +0.000001] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fc00000(0000) knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000001] CR2: 00007fff89079ea8 CR3: 000000010e764004 CR4: 00000000001706e0
[  +0.000001] Call Trace:
[  +0.000001]  <TASK>
[  +0.000002]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000005]  ? __vunmap+0x1c9/0x210
[  +0.000007]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000004]  amdgpu_bo_free_kernel+0xc7/0x120 [amdgpu]
[  +0.000106]  psp_hw_fini+0xba/0x110 [amdgpu]
[  +0.000129]  amdgpu_device_fini_hw+0x1e5/0x3b0 [amdgpu]
[  +0.000101]  amdgpu_driver_unload_kms+0x4b/0x60 [amdgpu]
[  +0.000103]  amdgpu_pci_remove+0x46/0x60 [amdgpu]
[  +0.000100]  pci_device_remove+0x39/0xb0
[  +0.000003]  device_release_driver_internal+0xfe/0x1d0
[  +0.000002]  device_release_driver+0x12/0x20
[  +0.000002]  pci_stop_bus_device+0x68/0x90
[  +0.000003]  pci_stop_and_remove_bus_device_locked+0x1a/0x30
[  +0.000002]  remove_store+0x7c/0x90
[  +0.000003]  dev_attr_store+0x17/0x30
[  +0.000002]  sysfs_kf_write+0x3c/0x50
[  +0.000003]  kernfs_fop_write_iter+0x13c/0x1b0
[  +0.000004]  new_sync_write+0x11a/0x1b0
[  +0.000004]  vfs_write+0x247/0x2a0
[  +0.000003]  ksys_write+0xa7/0xe0
[  +0.000002]  ? fpregs_assert_state_consistent+0x23/0x50
[  +0.000003]  __x64_sys_write+0x1a/0x20
[  +0.000003]  do_syscall_64+0x3a/0xb0
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000003] RIP: 0033:0x7fb8e76e8371
[  +0.000002] Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 69 8c 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 9a d0 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
[  +0.000002] RSP: 002b:00007fff89083f68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000002] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb8e76e8371
[  +0.000001] RDX: 0000000000000001 RSI: 0000562815835316 RDI: 0000000000000005
[  +0.000001] RBP: 0000000000000005 R08: 0000562815f732b0 R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000002]  </TASK>
[  +0.000001] ---[ end trace 7e6704984e7ed0f3 ]---
[  +0.000022] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000420] ------------[ cut here ]------------
[  +0.000006] WARNING: CPU: 28 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000317] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000114]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000058] CPU: 28 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000005] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000003] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000309] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000005] RSP: 0018:ffffafed0ed27b58 EFLAGS: 00010282
[  +0.000005] RAX: 00000000ffffffea RBX: ffff9e0c589d8858 RCX: 0000000000000001
[  +0.000003] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000003] RBP: ffffafed0ed27b78 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: 0000000000000c32 R11: ffffafed0ed277f0 R12: ffff9e0c589d8800
[  +0.000003] R13: ffff9e0c589d8858 R14: ffff9e0c589d8958 R15: ffff9e0c5b785360
[  +0.000003] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fd80000(0000) knlGS:0000000000000000
[  +0.000004] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 00007f68a9ecb1d8 CR3: 000000010e764004 CR4: 00000000001706e0
[  +0.000003] Call Trace:
[  +0.000002]  <TASK>
[  +0.000005]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000012]  ? __cond_resched+0x1d/0x30
[  +0.000011]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000010]  amdgpu_bo_free_kernel+0xc7/0x120 [amdgpu]
[  +0.000234]  psp_hw_fini+0xd4/0x110 [amdgpu]
[  +0.000292]  amdgpu_device_fini_hw+0x1e5/0x3b0 [amdgpu]
[  +0.000230]  amdgpu_driver_unload_kms+0x4b/0x60 [amdgpu]
[  +0.000230]  amdgpu_pci_remove+0x46/0x60 [amdgpu]
[  +0.000225]  pci_device_remove+0x39/0xb0
[  +0.000007]  device_release_driver_internal+0xfe/0x1d0
[  +0.000006]  device_release_driver+0x12/0x20
[  +0.000004]  pci_stop_bus_device+0x68/0x90
[  +0.000006]  pci_stop_and_remove_bus_device_locked+0x1a/0x30
[  +0.000005]  remove_store+0x7c/0x90
[  +0.000007]  dev_attr_store+0x17/0x30
[  +0.000006]  sysfs_kf_write+0x3c/0x50
[  +0.000007]  kernfs_fop_write_iter+0x13c/0x1b0
[  +0.000007]  new_sync_write+0x11a/0x1b0
[  +0.000009]  vfs_write+0x247/0x2a0
[  +0.000006]  ksys_write+0xa7/0xe0
[  +0.000006]  ? fpregs_assert_state_consistent+0x23/0x50
[  +0.000008]  __x64_sys_write+0x1a/0x20
[  +0.000006]  do_syscall_64+0x3a/0xb0
[  +0.000008]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000006] RIP: 0033:0x7fb8e76e8371
[  +0.000005] Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 69 8c 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 9a d0 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
[  +0.000005] RSP: 002b:00007fff89083f68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000004] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb8e76e8371
[  +0.000003] RDX: 0000000000000001 RSI: 0000562815835316 RDI: 0000000000000005
[  +0.000003] RBP: 0000000000000005 R08: 0000562815f732b0 R09: 0000000000000000
[  +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000006]  </TASK>
[  +0.000002] ---[ end trace 7e6704984e7ed0f4 ]---
[  +0.130454] pci 0000:05:00.0: Removing from iommu group 24
[  +0.000164] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000339] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 28 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000186] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000088]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000045] CPU: 28 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000004] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000003] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000183] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000003] RSP: 0018:ffffafed0ed27cd0 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9e0c6cae0058 RCX: 0000000000000001
[  +0.000003] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000002] RBP: ffffafed0ed27cf0 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: ffffafed0ed27cf0 R11: ffffafed0ed27968 R12: ffff9e0c6cae0000
[  +0.000002] R13: ffff9e0c6cae0058 R14: ffff9e0c6cae0158 R15: ffff9e0c6cae0000
[  +0.000002] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fd80000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 00007f68a9ecb1d8 CR3: 000000010e764004 CR4: 00000000001706e0
[  +0.000002] Call Trace:
[  +0.000003]  <TASK>
[  +0.000005]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000013]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000008]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[  +0.000182]  amdgpu_driver_postclose_kms+0x173/0x300 [amdgpu]
[  +0.000174]  drm_file_free.part.10+0x275/0x2c0 [drm]
[  +0.000045]  drm_close_helper.isra.11+0x60/0x70 [drm]
[  +0.000029]  drm_release+0x6a/0xe0 [drm]
[  +0.000029]  __fput+0x99/0x260
[  +0.000009]  ____fput+0xe/0x10
[  +0.000005]  task_work_run+0x6c/0xa0
[  +0.000006]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000005]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000006]  do_syscall_64+0x46/0xb0
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000005] RIP: 0033:0x7fb8e76e8511
[  +0.000005] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000003] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000004] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000002] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000002] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000005]  </TASK>
[  +0.000002] ---[ end trace 7e6704984e7ed0f5 ]---
[  +0.001061] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000346] ------------[ cut here ]------------
[  +0.000003] WARNING: CPU: 14 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000230] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000103]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000049] CPU: 14 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000006] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000002] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000228] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000004] RSP: 0018:ffffafed0ed27cb8 EFLAGS: 00010282
[  +0.000005] RAX: 00000000ffffffea RBX: ffff9e0c589d8458 RCX: 0000000000000001
[  +0.000003] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000003] RBP: ffffafed0ed27cd8 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: dead000000000122 R11: ffffafed0ed27950 R12: ffff9e0c589d8400
[  +0.000003] R13: ffff9e0c589d8458 R14: ffff9e0c589d8558 R15: ffff9e0c5b785360
[  +0.000003] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fbc0000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 00007fce894ebc20 CR3: 000000010e764002 CR4: 00000000001706e0
[  +0.000004] Call Trace:
[  +0.000002]  <TASK>
[  +0.000004]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000014]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000010]  amdgpu_bo_free_kernel+0xc7/0x120 [amdgpu]
[  +0.000225]  amdgpu_vce_sw_fini+0x42/0x90 [amdgpu]
[  +0.000298]  vce_v4_0_sw_fini+0x2f/0x50 [amdgpu]
[  +0.000290]  amdgpu_device_fini_sw+0xcc/0x320 [amdgpu]
[  +0.000214]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.000215]  drm_dev_release+0x28/0x40 [drm]
[  +0.000046]  drm_minor_release+0x30/0x40 [drm]
[  +0.000041]  drm_release+0xa1/0xe0 [drm]
[  +0.000037]  __fput+0x99/0x260
[  +0.000008]  ____fput+0xe/0x10
[  +0.000006]  task_work_run+0x6c/0xa0
[  +0.000006]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000006]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000005]  do_syscall_64+0x46/0xb0
[  +0.000007]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000006] RIP: 0033:0x7fb8e76e8511
[  +0.000004] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000004] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000005] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000003] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000002] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000003] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000006]  </TASK>
[  +0.000001] ---[ end trace 7e6704984e7ed0f6 ]---
[  +0.000532] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000362] ------------[ cut here ]------------
[  +0.000003] WARNING: CPU: 30 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000233] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000102]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000048] CPU: 30 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000005] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000003] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000228] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000004] RSP: 0018:ffffafed0ed27c78 EFLAGS: 00010282
[  +0.000005] RAX: 00000000ffffffea RBX: ffff9e0c589db458 RCX: 0000000000000001
[  +0.000003] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000003] RBP: ffffafed0ed27c98 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: dead000000000122 R11: ffffafed0ed27910 R12: ffff9e0c589db400
[  +0.000003] R13: ffff9e0c589db458 R14: ffff9e0c589db558 R15: ffff9e0c5b785360
[  +0.000002] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fdc0000(0000) knlGS:0000000000000000
[  +0.000004] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 00007ffe1c1c1000 CR3: 000000010e764002 CR4: 00000000001706e0
[  +0.000004] Call Trace:
[  +0.000002]  <TASK>
[  +0.000005]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000012]  ? __free_pages+0x7e/0xa0
[  +0.000012]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000010]  amdgpu_bo_free_kernel+0xc7/0x120 [amdgpu]
[  +0.000226]  amdgpu_uvd_sw_fini+0x9b/0x120 [amdgpu]
[  +0.000295]  uvd_v7_0_sw_fini+0x9d/0xb0 [amdgpu]
[  +0.000290]  amdgpu_device_fini_sw+0xcc/0x320 [amdgpu]
[  +0.000214]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.000215]  drm_dev_release+0x28/0x40 [drm]
[  +0.000043]  drm_minor_release+0x30/0x40 [drm]
[  +0.000042]  drm_release+0xa1/0xe0 [drm]
[  +0.000037]  __fput+0x99/0x260
[  +0.000007]  ____fput+0xe/0x10
[  +0.000006]  task_work_run+0x6c/0xa0
[  +0.000006]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000006]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000005]  do_syscall_64+0x46/0xb0
[  +0.000007]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000005] RIP: 0033:0x7fb8e76e8511
[  +0.000004] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000004] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000005] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000003] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000002] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000003] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000006]  </TASK>
[  +0.000001] ---[ end trace 7e6704984e7ed0f7 ]---
[  +0.000659] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000311] ------------[ cut here ]------------
[  +0.000001] WARNING: CPU: 30 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000229] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000100]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000047] CPU: 30 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000004] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000002] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000247] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000006] RSP: 0018:ffffafed0ed27c78 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9e0c589df458 RCX: 0000000000000001
[  +0.000003] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000004] RBP: ffffafed0ed27c98 R08: 0000000000000000 R09: 0000000000000001
[  +0.000003] R10: dead000000000122 R11: ffffafed0ed27910 R12: ffff9e0c589df400
[  +0.000003] R13: ffff9e0c589df458 R14: ffff9e0c589df558 R15: ffff9e0c5b785360
[  +0.000003] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fdc0000(0000) knlGS:0000000000000000
[  +0.000005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 00007ffe1c1c1000 CR3: 000000010e764002 CR4: 00000000001706e0
[  +0.000004] Call Trace:
[  +0.000002]  <TASK>
[  +0.000004]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000013]  ? __free_pages+0x7e/0xa0
[  +0.000010]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000010]  amdgpu_bo_free_kernel+0xc7/0x120 [amdgpu]
[  +0.000225]  amdgpu_uvd_sw_fini+0x9b/0x120 [amdgpu]
[  +0.000303]  uvd_v7_0_sw_fini+0x9d/0xb0 [amdgpu]
[  +0.000305]  amdgpu_device_fini_sw+0xcc/0x320 [amdgpu]
[  +0.000229]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.000230]  drm_dev_release+0x28/0x40 [drm]
[  +0.000044]  drm_minor_release+0x30/0x40 [drm]
[  +0.000044]  drm_release+0xa1/0xe0 [drm]
[  +0.000039]  __fput+0x99/0x260
[  +0.000008]  ____fput+0xe/0x10
[  +0.000006]  task_work_run+0x6c/0xa0
[  +0.000007]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000006]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000005]  do_syscall_64+0x46/0xb0
[  +0.000007]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000005] RIP: 0033:0x7fb8e76e8511
[  +0.000005] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000004] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000005] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000002] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000003] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000003] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000006]  </TASK>
[  +0.000002] ---[ end trace 7e6704984e7ed0f8 ]---
[  +0.000745] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000330] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 30 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000243] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000107]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000050] CPU: 30 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000005] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000003] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000241] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000004] RSP: 0018:ffffafed0ed27ca8 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9e0c58fd2458 RCX: 0000000000000001
[  +0.000003] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000003] RBP: ffffafed0ed27cc8 R08: 0000000000000000 R09: 0000000000000001
[  +0.000003] R10: ffffafed0ed27d48 R11: ffffafed0ed27940 R12: ffff9e0c58fd2400
[  +0.000002] R13: ffff9e0c58fd2458 R14: ffff9e0c58fd2558 R15: ffff9e0c5b785360
[  +0.000003] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fdc0000(0000) knlGS:0000000000000000
[  +0.000004] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 00007ffe1c1c1000 CR3: 000000010e764002 CR4: 00000000001706e0
[  +0.000003] Call Trace:
[  +0.000002]  <TASK>
[  +0.000004]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000011]  ? ttm_bo_release+0x261/0x370 [ttm]
[  +0.000012]  ? __vunmap+0x1c9/0x210
[  +0.000007]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000011]  amdgpu_bo_free_kernel+0xc7/0x120 [amdgpu]
[  +0.000242]  amdgpu_gfx_mqd_sw_fini+0x96/0x110 [amdgpu]
[  +0.000297]  gfx_v9_0_sw_fini+0x7b/0x180 [amdgpu]
[  +0.000321]  amdgpu_device_fini_sw+0xcc/0x320 [amdgpu]
[  +0.000231]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.000232]  drm_dev_release+0x28/0x40 [drm]
[  +0.000044]  drm_minor_release+0x30/0x40 [drm]
[  +0.000044]  drm_release+0xa1/0xe0 [drm]
[  +0.000040]  __fput+0x99/0x260
[  +0.000007]  ____fput+0xe/0x10
[  +0.000007]  task_work_run+0x6c/0xa0
[  +0.000006]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000006]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000005]  do_syscall_64+0x46/0xb0
[  +0.000007]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000006] RIP: 0033:0x7fb8e76e8511
[  +0.000004] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000004] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000005] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000002] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000003] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000003] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000006]  </TASK>
[  +0.000002] ---[ end trace 7e6704984e7ed0f9 ]---
[  +0.000652] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000329] ------------[ cut here ]------------
[  +0.000001] WARNING: CPU: 30 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000208] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000044]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000020] CPU: 30 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000002] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000001] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000108] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000002] RSP: 0018:ffffafed0ed27cb8 EFLAGS: 00010282
[  +0.000001] RAX: 00000000ffffffea RBX: ffff9e0c58fd2058 RCX: 0000000000000001
[  +0.000002] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000001] RBP: ffffafed0ed27cd8 R08: 0000000000000000 R09: 0000000000000001
[  +0.000001] R10: ffffafed0ed27c70 R11: ffffafed0ed27950 R12: ffff9e0c58fd2000
[  +0.000001] R13: ffff9e0c58fd2058 R14: ffff9e0c58fd2158 R15: ffff9e0c5b785360
[  +0.000001] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fdc0000(0000) knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000001] CR2: 00007ffe1c1c1000 CR3: 000000010e764002 CR4: 00000000001706e0
[  +0.000001] Call Trace:
[  +0.000001]  <TASK>
[  +0.000002]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000005]  ? ttm_bo_release+0x261/0x370 [ttm]
[  +0.000005]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000003]  amdgpu_bo_free_kernel+0xc7/0x120 [amdgpu]
[  +0.000107]  gfx_v9_0_mec_fini+0x1d/0x30 [amdgpu]
[  +0.000132]  gfx_v9_0_sw_fini+0x97/0x180 [amdgpu]
[  +0.000129]  amdgpu_device_fini_sw+0xcc/0x320 [amdgpu]
[  +0.000104]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.000102]  drm_dev_release+0x28/0x40 [drm]
[  +0.000020]  drm_minor_release+0x30/0x40 [drm]
[  +0.000017]  drm_release+0xa1/0xe0 [drm]
[  +0.000016]  __fput+0x99/0x260
[  +0.000003]  ____fput+0xe/0x10
[  +0.000003]  task_work_run+0x6c/0xa0
[  +0.000003]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000002]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000003]  do_syscall_64+0x46/0xb0
[  +0.000002]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000003] RIP: 0033:0x7fb8e76e8511
[  +0.000001] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000002] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000002] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000001] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000001] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000001] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000002]  </TASK>
[  +0.000001] ---[ end trace 7e6704984e7ed0fa ]---
[  +0.000617] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000147] ------------[ cut here ]------------
[  +0.000000] WARNING: CPU: 30 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000109] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000044]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000020] CPU: 30 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000002] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000001] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000118] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000003] RSP: 0018:ffffafed0ed27cd0 EFLAGS: 00010282
[  +0.000002] RAX: 00000000ffffffea RBX: ffff9e0c58fd7458 RCX: 0000000000000001
[  +0.000002] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000002] RBP: ffffafed0ed27cf0 R08: 0000000000000000 R09: 0000000000000001
[  +0.000001] R10: ffffafed0ed27c68 R11: ffffafed0ed27968 R12: ffff9e0c58fd7400
[  +0.000002] R13: ffff9e0c58fd7458 R14: ffff9e0c58fd7558 R15: ffff9e0c5b785360
[  +0.000002] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fdc0000(0000) knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000001] CR2: 00007ffe1c1c1000 CR3: 000000010e764002 CR4: 00000000001706e0
[  +0.000002] Call Trace:
[  +0.000001]  <TASK>
[  +0.000003]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000006]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000005]  amdgpu_bo_free_kernel+0xc7/0x120 [amdgpu]
[  +0.000107]  gfx_v9_0_sw_fini+0xb1/0x180 [amdgpu]
[  +0.000130]  amdgpu_device_fini_sw+0xcc/0x320 [amdgpu]
[  +0.000103]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.000108]  drm_dev_release+0x28/0x40 [drm]
[  +0.000020]  drm_minor_release+0x30/0x40 [drm]
[  +0.000017]  drm_release+0xa1/0xe0 [drm]
[  +0.000016]  __fput+0x99/0x260
[  +0.000004]  ____fput+0xe/0x10
[  +0.000002]  task_work_run+0x6c/0xa0
[  +0.000003]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000003]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000002]  do_syscall_64+0x46/0xb0
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000002] RIP: 0033:0x7fb8e76e8511
[  +0.000002] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000001] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000002] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000001] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000001] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000001] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000003]  </TASK>
[  +0.000000] ---[ end trace 7e6704984e7ed0fb ]---
[  +0.000583] ------------[ cut here ]------------
[  +0.000004] sysfs group 'power' not found for kobject 'i2c-5'
[  +0.000008] WARNING: CPU: 22 PID: 2834 at fs/sysfs/group.c:280 sysfs_remove_group+0x80/0x90
[  +0.000012] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000089]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000042] CPU: 22 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000004] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000003] RIP: 0010:sysfs_remove_group+0x80/0x90
[  +0.000006] Code: e8 b5 b4 ff ff 5b 41 5c 41 5d 5d c3 48 89 df e8 b6 b0 ff ff eb c6 49 8b 55 00 49 8b 34 24 48 c7 c7 78 65 59 9b e8 40 4b cb ff <0f> 0b 5b 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 48
[  +0.000004] RSP: 0018:ffffafed0ed27b88 EFLAGS: 00010286
[  +0.000004] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[  +0.000002] RDX: 0000000080000001 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000003] RBP: ffffafed0ed27ba0 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: 0000000000000000 R11: ffffafed0ed27958 R12: ffffffff9b2b95a0
[  +0.000003] R13: ffff9e0c58fd1c18 R14: ffff9e0c5b7885c0 R15: 00000000ffffffff
[  +0.000002] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fcc0000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 00007fd86c031378 CR3: 000000010e764001 CR4: 00000000001706e0
[  +0.000003] Call Trace:
[  +0.000002]  <TASK>
[  +0.000003]  dpm_sysfs_remove+0x59/0x60
[  +0.000010]  device_del+0xb8/0x3f0
[  +0.000008]  cdev_device_del+0x1a/0x40
[  +0.000005]  i2cdev_detach_adapter+0x85/0xc0
[  +0.000008]  i2cdev_notifier_call+0x1f/0x40
[  +0.000006]  blocking_notifier_call_chain+0x69/0x90
[  +0.000006]  device_del+0xb0/0x3f0
[  +0.000006]  device_unregister+0x17/0x60
[  +0.000004]  i2c_del_adapter+0x251/0x310
[  +0.000008]  smu_v11_0_i2c_control_fini+0x19/0x40 [amdgpu]
[  +0.000229]  vega20_smu_fini+0x1e/0xe0 [amdgpu]
[  +0.000314]  hwmgr_sw_fini+0x28/0x30 [amdgpu]
[  +0.000318]  pp_sw_fini+0x19/0x40 [amdgpu]
[  +0.000340]  amdgpu_device_fini_sw+0xcc/0x320 [amdgpu]
[  +0.000196]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.000197]  drm_dev_release+0x28/0x40 [drm]
[  +0.000039]  drm_minor_release+0x30/0x40 [drm]
[  +0.000037]  drm_release+0xa1/0xe0 [drm]
[  +0.000034]  __fput+0x99/0x260
[  +0.000007]  ____fput+0xe/0x10
[  +0.000005]  task_work_run+0x6c/0xa0
[  +0.000006]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000005]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000004]  do_syscall_64+0x46/0xb0
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000005] RIP: 0033:0x7fb8e76e8511
[  +0.000004] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000004] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000004] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000003] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000002] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000005]  </TASK>
[  +0.000002] ---[ end trace 7e6704984e7ed0fc ]---
[  +0.000062] ------------[ cut here ]------------
[  +0.000002] sysfs group 'power' not found for kobject 'i2c-5'
[  +0.000007] WARNING: CPU: 22 PID: 2834 at fs/sysfs/group.c:280 sysfs_remove_group+0x80/0x90
[  +0.000009] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000085]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000040] CPU: 22 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000004] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000002] RIP: 0010:sysfs_remove_group+0x80/0x90
[  +0.000006] Code: e8 b5 b4 ff ff 5b 41 5c 41 5d 5d c3 48 89 df e8 b6 b0 ff ff eb c6 49 8b 55 00 49 8b 34 24 48 c7 c7 78 65 59 9b e8 40 4b cb ff <0f> 0b 5b 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 48
[  +0.000004] RSP: 0018:ffffafed0ed27c70 EFLAGS: 00010286
[  +0.000003] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[  +0.000002] RDX: 0000000080000001 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000003] RBP: ffffafed0ed27c88 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: ffffafed0ed27ba0 R11: ffffafed0ed27a40 R12: ffffffff9b2b95a0
[  +0.000002] R13: ffff9e0c5b7885c0 R14: ffff9e0c488650d0 R15: ffff9e0c530a6a80
[  +0.000003] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fcc0000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 00007fd86c031378 CR3: 000000010e764001 CR4: 00000000001706e0
[  +0.000002] Call Trace:
[  +0.000002]  <TASK>
[  +0.000002]  dpm_sysfs_remove+0x59/0x60
[  +0.000007]  device_del+0xb8/0x3f0
[  +0.000006]  device_unregister+0x17/0x60
[  +0.000005]  i2c_del_adapter+0x251/0x310
[  +0.000006]  smu_v11_0_i2c_control_fini+0x19/0x40 [amdgpu]
[  +0.000228]  vega20_smu_fini+0x1e/0xe0 [amdgpu]
[  +0.000330]  hwmgr_sw_fini+0x28/0x30 [amdgpu]
[  +0.000317]  pp_sw_fini+0x19/0x40 [amdgpu]
[  +0.000338]  amdgpu_device_fini_sw+0xcc/0x320 [amdgpu]
[  +0.000197]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.000197]  drm_dev_release+0x28/0x40 [drm]
[  +0.000037]  drm_minor_release+0x30/0x40 [drm]
[  +0.000038]  drm_release+0xa1/0xe0 [drm]
[  +0.000034]  __fput+0x99/0x260
[  +0.000007]  ____fput+0xe/0x10
[  +0.000006]  task_work_run+0x6c/0xa0
[  +0.000005]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000005]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000005]  do_syscall_64+0x46/0xb0
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7fb8e76e8511
[  +0.000004] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000003] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000004] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000002] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000002] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000003] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000004]  </TASK>
[  +0.000002] ---[ end trace 7e6704984e7ed0fd ]---
[  +0.000027] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000284] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 22 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000209] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000092]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000043] CPU: 22 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000005] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000002] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000207] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000004] RSP: 0018:ffffafed0ed27ca8 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9e0c58ec5058 RCX: 0000000000000001
[  +0.000002] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000003] RBP: ffffafed0ed27cc8 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: 0000000000000000 R11: ffffafed0ed27940 R12: ffff9e0c58ec5000
[  +0.000002] R13: ffff9e0c58ec5058 R14: ffff9e0c58ec5158 R15: ffff9e0c5b785360
[  +0.000003] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fcc0000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 00007fd86c031378 CR3: 000000010e764001 CR4: 00000000001706e0
[  +0.000003] Call Trace:
[  +0.000001]  <TASK>
[  +0.000003]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000011]  ? idr_remove+0x11/0x20
[  +0.000008]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000010]  amdgpu_bo_free_kernel+0xc7/0x120 [amdgpu]
[  +0.000206]  vega20_smu_fini+0x38/0xe0 [amdgpu]
[  +0.000320]  hwmgr_sw_fini+0x28/0x30 [amdgpu]
[  +0.000317]  pp_sw_fini+0x19/0x40 [amdgpu]
[  +0.000339]  amdgpu_device_fini_sw+0xcc/0x320 [amdgpu]
[  +0.000197]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.000198]  drm_dev_release+0x28/0x40 [drm]
[  +0.000038]  drm_minor_release+0x30/0x40 [drm]
[  +0.000037]  drm_release+0xa1/0xe0 [drm]
[  +0.000034]  __fput+0x99/0x260
[  +0.000007]  ____fput+0xe/0x10
[  +0.000006]  task_work_run+0x6c/0xa0
[  +0.000005]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000005]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000005]  do_syscall_64+0x46/0xb0
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7fb8e76e8511
[  +0.000004] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000003] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000004] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000002] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000003] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000005]  </TASK>
[  +0.000002] ---[ end trace 7e6704984e7ed0fe ]---
[  +0.000516] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000321] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 6 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000215] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000094]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000044] CPU: 6 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000005] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000003] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000210] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000004] RSP: 0018:ffffafed0ed27ca8 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9e0c58ec1c58 RCX: 0000000000000001
[  +0.000003] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000003] RBP: ffffafed0ed27cc8 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: 0000000000000c32 R11: ffffafed0ed27940 R12: ffff9e0c58ec1c00
[  +0.000003] R13: ffff9e0c58ec1c58 R14: ffff9e0c58ec1d58 R15: ffff9e0c5b785360
[  +0.000002] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fac0000(0000) knlGS:0000000000000000
[  +0.000004] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 00007f09d1154000 CR3: 000000010e764006 CR4: 00000000001706e0
[  +0.000003] Call Trace:
[  +0.000002]  <TASK>
[  +0.000004]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000011]  ? __cond_resched+0x1d/0x30
[  +0.000010]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000009]  amdgpu_bo_free_kernel+0xc7/0x120 [amdgpu]
[  +0.000208]  vega20_smu_fini+0x49/0xe0 [amdgpu]
[  +0.000339]  hwmgr_sw_fini+0x28/0x30 [amdgpu]
[  +0.000333]  pp_sw_fini+0x19/0x40 [amdgpu]
[  +0.000362]  amdgpu_device_fini_sw+0xcc/0x320 [amdgpu]
[  +0.000212]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.000213]  drm_dev_release+0x28/0x40 [drm]
[  +0.000041]  drm_minor_release+0x30/0x40 [drm]
[  +0.000041]  drm_release+0xa1/0xe0 [drm]
[  +0.000036]  __fput+0x99/0x260
[  +0.000007]  ____fput+0xe/0x10
[  +0.000006]  task_work_run+0x6c/0xa0
[  +0.000006]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000006]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000005]  do_syscall_64+0x46/0xb0
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000006] RIP: 0033:0x7fb8e76e8511
[  +0.000004] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000004] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000004] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000003] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000002] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000006]  </TASK>
[  +0.000001] ---[ end trace 7e6704984e7ed0ff ]---
[  +0.000648] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000323] ------------[ cut here ]------------
[  +0.000003] WARNING: CPU: 6 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000232] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000100]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000046] CPU: 6 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000005] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000002] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000224] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000004] RSP: 0018:ffffafed0ed27ca8 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9e0c58ec4458 RCX: 0000000000000001
[  +0.000002] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000003] RBP: ffffafed0ed27cc8 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: ffffafed0e81fd80 R11: ffffafed0ed27940 R12: ffff9e0c58ec4400
[  +0.000002] R13: ffff9e0c58ec4458 R14: ffff9e0c58ec4558 R15: ffff9e0c5b785360
[  +0.000003] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fac0000(0000) knlGS:0000000000000000
[  +0.000004] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 00007f09d1154000 CR3: 000000010e764006 CR4: 00000000001706e0
[  +0.000003] Call Trace:
[  +0.000002]  <TASK>
[  +0.000004]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000010]  ? __cond_resched+0x1d/0x30
[  +0.000009]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000010]  amdgpu_bo_free_kernel+0xc7/0x120 [amdgpu]
[  +0.000223]  vega20_smu_fini+0x63/0xe0 [amdgpu]
[  +0.000343]  hwmgr_sw_fini+0x28/0x30 [amdgpu]
[  +0.000343]  pp_sw_fini+0x19/0x40 [amdgpu]
[  +0.000365]  amdgpu_device_fini_sw+0xcc/0x320 [amdgpu]
[  +0.000211]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.000212]  drm_dev_release+0x28/0x40 [drm]
[  +0.000041]  drm_minor_release+0x30/0x40 [drm]
[  +0.000040]  drm_release+0xa1/0xe0 [drm]
[  +0.000037]  __fput+0x99/0x260
[  +0.000007]  ____fput+0xe/0x10
[  +0.000006]  task_work_run+0x6c/0xa0
[  +0.000005]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000006]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000005]  do_syscall_64+0x46/0xb0
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000005] RIP: 0033:0x7fb8e76e8511
[  +0.000004] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000004] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000004] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000002] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000003] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000005]  </TASK>
[  +0.000002] ---[ end trace 7e6704984e7ed100 ]---
[  +0.000648] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000176] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 6 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000107] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000044]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000021] CPU: 6 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000002] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000001] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000104] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000002] RSP: 0018:ffffafed0ed27ca8 EFLAGS: 00010282
[  +0.000002] RAX: 00000000ffffffea RBX: ffff9e0c58fd6858 RCX: 0000000000000001
[  +0.000001] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000001] RBP: ffffafed0ed27cc8 R08: 0000000000000000 R09: 0000000000000001
[  +0.000001] R10: ffffafed0e81fd80 R11: ffffafed0ed27940 R12: ffff9e0c58fd6800
[  +0.000002] R13: ffff9e0c58fd6858 R14: ffff9e0c58fd6958 R15: ffff9e0c5b785360
[  +0.000001] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fac0000(0000) knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000001] CR2: 00007f09d1154000 CR3: 000000010e764006 CR4: 00000000001706e0
[  +0.000001] Call Trace:
[  +0.000001]  <TASK>
[  +0.000002]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000005]  ? __cond_resched+0x1d/0x30
[  +0.000004]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000004]  amdgpu_bo_free_kernel+0xc7/0x120 [amdgpu]
[  +0.000115]  vega20_smu_fini+0x7d/0xe0 [amdgpu]
[  +0.000162]  hwmgr_sw_fini+0x28/0x30 [amdgpu]
[  +0.000158]  pp_sw_fini+0x19/0x40 [amdgpu]
[  +0.000169]  amdgpu_device_fini_sw+0xcc/0x320 [amdgpu]
[  +0.000101]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.000101]  drm_dev_release+0x28/0x40 [drm]
[  +0.000019]  drm_minor_release+0x30/0x40 [drm]
[  +0.000018]  drm_release+0xa1/0xe0 [drm]
[  +0.000016]  __fput+0x99/0x260
[  +0.000003]  ____fput+0xe/0x10
[  +0.000003]  task_work_run+0x6c/0xa0
[  +0.000002]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000003]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000002]  do_syscall_64+0x46/0xb0
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000003] RIP: 0033:0x7fb8e76e8511
[  +0.000001] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000002] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000002] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000001] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000001] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000001] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000003]  </TASK>
[  +0.000000] ---[ end trace 7e6704984e7ed101 ]---
[  +0.000513] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000220] ------------[ cut here ]------------
[  +0.000001] WARNING: CPU: 12 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000141] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000062]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000029] CPU: 12 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000004] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000001] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000138] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000003] RSP: 0018:ffffafed0ed27ca8 EFLAGS: 00010282
[  +0.000003] RAX: 00000000ffffffea RBX: ffff9e0c58fd5058 RCX: 0000000000000001
[  +0.000002] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000001] RBP: ffffafed0ed27cc8 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: 0000000000000c32 R11: ffffafed0ed27940 R12: ffff9e0c58fd5000
[  +0.000002] R13: ffff9e0c58fd5058 R14: ffff9e0c58fd5158 R15: ffff9e0c5b785360
[  +0.000001] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fb80000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000001] CR2: 00007fce896fdd10 CR3: 000000010e764003 CR4: 00000000001706e0
[  +0.000002] Call Trace:
[  +0.000002]  <TASK>
[  +0.000003]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000007]  ? __cond_resched+0x1d/0x30
[  +0.000006]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000006]  amdgpu_bo_free_kernel+0xc7/0x120 [amdgpu]
[  +0.000137]  vega20_smu_fini+0x97/0xe0 [amdgpu]
[  +0.000221]  hwmgr_sw_fini+0x28/0x30 [amdgpu]
[  +0.000218]  pp_sw_fini+0x19/0x40 [amdgpu]
[  +0.000237]  amdgpu_device_fini_sw+0xcc/0x320 [amdgpu]
[  +0.000138]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.000139]  drm_dev_release+0x28/0x40 [drm]
[  +0.000027]  drm_minor_release+0x30/0x40 [drm]
[  +0.000026]  drm_release+0xa1/0xe0 [drm]
[  +0.000024]  __fput+0x99/0x260
[  +0.000004]  ____fput+0xe/0x10
[  +0.000004]  task_work_run+0x6c/0xa0
[  +0.000004]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000004]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000003]  do_syscall_64+0x46/0xb0
[  +0.000004]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7fb8e76e8511
[  +0.000003] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000002] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000003] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000002] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000001] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000001] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000004]  </TASK>
[  +0.000001] ---[ end trace 7e6704984e7ed102 ]---
[  +0.000007] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000197] ------------[ cut here ]------------
[  +0.000001] WARNING: CPU: 12 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000148] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000064]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000031] CPU: 12 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000003] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000001] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000146] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000002] RSP: 0018:ffffafed0ed27ca8 EFLAGS: 00010282
[  +0.000003] RAX: 00000000ffffffea RBX: ffff9e0c58fd4058 RCX: 0000000000000001
[  +0.000001] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000002] RBP: ffffafed0ed27cc8 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: ffffafed0ed27c58 R11: ffffafed0ed27940 R12: ffff9e0c58fd4000
[  +0.000001] R13: ffff9e0c58fd4058 R14: ffff9e0c58fd4158 R15: ffff9e0c5b785360
[  +0.000002] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fb80000(0000) knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 00007fce896fdd10 CR3: 000000010e764003 CR4: 00000000001706e0
[  +0.000002] Call Trace:
[  +0.000001]  <TASK>
[  +0.000002]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000006]  ? __cond_resched+0x1d/0x30
[  +0.000006]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000007]  amdgpu_bo_free_kernel+0xc7/0x120 [amdgpu]
[  +0.000187]  vega20_smu_fini+0xb1/0xe0 [amdgpu]
[  +0.000225]  hwmgr_sw_fini+0x28/0x30 [amdgpu]
[  +0.000223]  pp_sw_fini+0x19/0x40 [amdgpu]
[  +0.000240]  amdgpu_device_fini_sw+0xcc/0x320 [amdgpu]
[  +0.000138]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.000140]  drm_dev_release+0x28/0x40 [drm]
[  +0.000027]  drm_minor_release+0x30/0x40 [drm]
[  +0.000026]  drm_release+0xa1/0xe0 [drm]
[  +0.000024]  __fput+0x99/0x260
[  +0.000005]  ____fput+0xe/0x10
[  +0.000003]  task_work_run+0x6c/0xa0
[  +0.000004]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000004]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000003]  do_syscall_64+0x46/0xb0
[  +0.000004]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000003] RIP: 0033:0x7fb8e76e8511
[  +0.000003] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000003] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000002] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000002] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000002] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000003]  </TASK>
[  +0.000001] ---[ end trace 7e6704984e7ed103 ]---
[  +0.000050] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000197] ------------[ cut here ]------------
[  +0.000001] WARNING: CPU: 12 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000148] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000064]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000031] CPU: 12 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000003] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000001] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000147] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000002] RSP: 0018:ffffafed0ed27cf0 EFLAGS: 00010282
[  +0.000003] RAX: 00000000ffffffea RBX: ffff9e0c58ec6458 RCX: 0000000000000001
[  +0.000001] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000002] RBP: ffffafed0ed27d10 R08: 0000000000000000 R09: 0000000000000001
[  +0.000001] R10: ffffafed0ed27ca0 R11: ffffafed0ed27988 R12: ffff9e0c58ec6400
[  +0.000002] R13: ffff9e0c58ec6458 R14: ffff9e0c58ec6558 R15: ffff9e0c5b785360
[  +0.000002] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fb80000(0000) knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 00007fce896fdd10 CR3: 000000010e764003 CR4: 00000000001706e0
[  +0.000002] Call Trace:
[  +0.000001]  <TASK>
[  +0.000002]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000007]  ? __vunmap+0x1c9/0x210
[  +0.000005]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000007]  amdgpu_bo_free_kernel+0xc7/0x120 [amdgpu]
[  +0.000146]  amdgpu_device_fini_sw+0x27d/0x320 [amdgpu]
[  +0.000138]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.000140]  drm_dev_release+0x28/0x40 [drm]
[  +0.000026]  drm_minor_release+0x30/0x40 [drm]
[  +0.000026]  drm_release+0xa1/0xe0 [drm]
[  +0.000024]  __fput+0x99/0x260
[  +0.000004]  ____fput+0xe/0x10
[  +0.000004]  task_work_run+0x6c/0xa0
[  +0.000004]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000003]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000004]  do_syscall_64+0x46/0xb0
[  +0.000004]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000003] RIP: 0033:0x7fb8e76e8511
[  +0.000002] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000003] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000002] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000002] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000002] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000001] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000004]  </TASK>
[  +0.000001] ---[ end trace 7e6704984e7ed104 ]---
[  +0.000050] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000196] ------------[ cut here ]------------
[  +0.000001] WARNING: CPU: 12 PID: 2834 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000147] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000064]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000030] CPU: 12 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000003] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000001] RIP: 0010:amdgpu_bo_release_notify+0x15f/0x170 [amdgpu]
[  +0.000146] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 61 ff ff ff 4c 89 e7 e8 d4 a8 15 00 e9 54 ff ff ff e8 3a af 11 da eb cf 0f 0b e9 f0 fe ff ff <0f> 0b eb c4 e8 a8 c4 4b da 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  +0.000002] RSP: 0018:ffffafed0ed27cc8 EFLAGS: 00010282
[  +0.000003] RAX: 00000000ffffffea RBX: ffff9e0c58ec7858 RCX: 0000000000000001
[  +0.000001] RDX: 0000000000000000 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000002] RBP: ffffafed0ed27ce8 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: 0000000000000002 R11: ffffafed0ed27960 R12: ffff9e0c58ec7800
[  +0.000001] R13: ffff9e0c58ec7858 R14: ffff9e0c58ec7958 R15: ffff9e0c5b785360
[  +0.000002] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fb80000(0000) knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 00007fce896fdd10 CR3: 000000010e764003 CR4: 00000000001706e0
[  +0.000002] Call Trace:
[  +0.000000]  <TASK>
[  +0.000002]  ttm_bo_release+0x2f2/0x370 [ttm]
[  +0.000008]  ? amdgpu_fence_release+0x19/0x20 [amdgpu]
[  +0.000142]  ? dma_fence_release+0x4f/0x140
[  +0.000008]  ttm_bo_put+0x30/0x40 [ttm]
[  +0.000007]  amdgpu_bo_free_kernel+0xc7/0x120 [amdgpu]
[  +0.000145]  amdgpu_gart_table_vram_free+0x1e/0x20 [amdgpu]
[  +0.000146]  gmc_v9_0_sw_fini+0x2a/0x50 [amdgpu]
[  +0.000173]  amdgpu_device_fini_sw+0xcc/0x320 [amdgpu]
[  +0.000138]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.000140]  drm_dev_release+0x28/0x40 [drm]
[  +0.000027]  drm_minor_release+0x30/0x40 [drm]
[  +0.000026]  drm_release+0xa1/0xe0 [drm]
[  +0.000040]  __fput+0x99/0x260
[  +0.000006]  ____fput+0xe/0x10
[  +0.000004]  task_work_run+0x6c/0xa0
[  +0.000004]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000005]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000003]  do_syscall_64+0x46/0xb0
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7fb8e76e8511
[  +0.000003] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000003] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000003] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000002] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000002] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000004]  </TASK>
[  +0.000002] ---[ end trace 7e6704984e7ed105 ]---
[  +0.000112] ------------[ cut here ]------------
[  +0.000004] kernfs: can not remove 'mem_info_preempt_used', no directory
[  +0.000011] WARNING: CPU: 20 PID: 2834 at fs/kernfs/dir.c:1536 kernfs_remove_by_name_ns+0x8d/0xa0
[  +0.000022] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000146]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000054] CPU: 20 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000005] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000003] RIP: 0010:kernfs_remove_by_name_ns+0x8d/0xa0
[  +0.000011] Code: 41 5c 41 5d 5d c3 48 c7 c7 00 81 a1 9b e8 7b 2d d2 ff b8 fe ff ff ff 5b 41 5c 41 5d 5d c3 48 c7 c7 50 63 59 9b e8 03 80 cb ff <0f> 0b b8 fe ff ff ff eb cc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
[  +0.000024] RSP: 0018:ffffafed0ed27d08 EFLAGS: 00010282
[  +0.000004] RAX: 0000000000000000 RBX: ffff9e0c5b780000 RCX: 0000000000000001
[  +0.000004] RDX: 0000000080000001 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000003] RBP: ffffafed0ed27d20 R08: 0000000000000000 R09: 0000000000000001
[  +0.000003] R10: ffffafed0ed27c30 R11: ffffafed0ed27ad8 R12: ffffffffc0e23107
[  +0.000003] R13: ffff9e0c5b7959b8 R14: ffff9e0c5b796918 R15: ffff9e0c530a6a80
[  +0.000004] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fc80000(0000) knlGS:0000000000000000
[  +0.000004] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 00007f6ecf7498a0 CR3: 000000010e764006 CR4: 00000000001706e0
[  +0.000004] Call Trace:
[  +0.000002]  <TASK>
[  +0.000004]  sysfs_remove_file_ns+0x15/0x20
[  +0.000007]  device_remove_file+0x15/0x20
[  +0.000015]  amdgpu_preempt_mgr_fini+0x70/0xc0 [amdgpu]
[  +0.000323]  amdgpu_ttm_fini+0x128/0x190 [amdgpu]
[  +0.000301]  amdgpu_bo_fini+0x25/0x90 [amdgpu]
[  +0.000295]  gmc_v9_0_sw_fini+0x3e/0x50 [amdgpu]
[  +0.000284]  amdgpu_device_fini_sw+0xcc/0x320 [amdgpu]
[  +0.000196]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.000204]  drm_dev_release+0x28/0x40 [drm]
[  +0.000038]  drm_minor_release+0x30/0x40 [drm]
[  +0.000038]  drm_release+0xa1/0xe0 [drm]
[  +0.000034]  __fput+0x99/0x260
[  +0.000007]  ____fput+0xe/0x10
[  +0.000006]  task_work_run+0x6c/0xa0
[  +0.000006]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000005]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000005]  do_syscall_64+0x46/0xb0
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000005] RIP: 0033:0x7fb8e76e8511
[  +0.000004] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000004] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000004] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000003] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000002] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000005]  </TASK>
[  +0.000002] ---[ end trace 7e6704984e7ed106 ]---
[  +0.000383] [drm] amdgpu: ttm finalized
[  +0.000004] ------------[ cut here ]------------
[  +0.000001] kernfs: can not remove 'df_cntr_avail', no directory
[  +0.000008] WARNING: CPU: 20 PID: 2834 at fs/kernfs/dir.c:1536 kernfs_remove_by_name_ns+0x8d/0xa0
[  +0.000008] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000094]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.000045] CPU: 20 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000004] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000003] RIP: 0010:kernfs_remove_by_name_ns+0x8d/0xa0
[  +0.000005] Code: 41 5c 41 5d 5d c3 48 c7 c7 00 81 a1 9b e8 7b 2d d2 ff b8 fe ff ff ff 5b 41 5c 41 5d 5d c3 48 c7 c7 50 63 59 9b e8 03 80 cb ff <0f> 0b b8 fe ff ff ff eb cc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
[  +0.000005] RSP: 0018:ffffafed0ed27d70 EFLAGS: 00010286
[  +0.000004] RAX: 0000000000000000 RBX: ffff9e0c5b780000 RCX: 0000000000000001
[  +0.000002] RDX: 0000000080000001 RSI: ffffffff9b5712d9 RDI: 00000000ffffffff
[  +0.000003] RBP: ffffafed0ed27d88 R08: 0000000000000000 R09: 0000000000000001
[  +0.000003] R10: 000000000000001f R11: ffffafed0ed27b40 R12: ffffffffc0e24a09
[  +0.000002] R13: ffff9e0c5b7959b8 R14: ffff9e0c5b796918 R15: ffff9e0c530a6a80
[  +0.000003] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fc80000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 00007f6ecf7498a0 CR3: 000000010e764006 CR4: 00000000001706e0
[  +0.000003] Call Trace:
[  +0.000001]  <TASK>
[  +0.000003]  sysfs_remove_file_ns+0x15/0x20
[  +0.000006]  device_remove_file+0x15/0x20
[  +0.000005]  df_v3_6_sw_fini+0x18/0x20 [amdgpu]
[  +0.000257]  soc15_common_sw_fini+0x23/0x30 [amdgpu]
[  +0.000247]  amdgpu_device_fini_sw+0xcc/0x320 [amdgpu]
[  +0.000197]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.000199]  drm_dev_release+0x28/0x40 [drm]
[  +0.000040]  drm_minor_release+0x30/0x40 [drm]
[  +0.000038]  drm_release+0xa1/0xe0 [drm]
[  +0.000034]  __fput+0x99/0x260
[  +0.000007]  ____fput+0xe/0x10
[  +0.000005]  task_work_run+0x6c/0xa0
[  +0.000006]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000005]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000005]  do_syscall_64+0x46/0xb0
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000005] RIP: 0033:0x7fb8e76e8511
[  +0.000003] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000004] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000004] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.000003] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.000002] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.000005]  </TASK>
[  +0.000002] ---[ end trace 7e6704984e7ed107 ]---
[  +0.000014] BUG: kernel NULL pointer dereference, address: 0000000000000070
[  +0.000053] #PF: supervisor read access in kernel mode
[  +0.000035] #PF: error_code(0x0000) - not-present page
[  +0.000033] PGD 0 P4D 0
[  +0.000023] Oops: 0000 [#1] PREEMPT SMP PTI
[  +0.000030] CPU: 20 PID: 2834 Comm: amdgpu_test Tainted: G        W         5.16.0+ #3
[  +0.000050] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.000054] RIP: 0010:kernfs_find_ns+0x19/0xc0
[  +0.000033] Code: 0f 85 ac fe ff ff e9 3c fe ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 85 d2 0f 95 c1 48 89 e5 41 56 41 55 41 54 53 49 89 d5 <0f> b7 47 70 49 89 f6 66 83 e0 20 0f 95 c2 38 d1 75 53 48 8b 5f 48
[  +0.000115] RSP: 0018:ffffafed0ed27cb0 EFLAGS: 00010246
[  +0.000036] RAX: ffff9e0c56eab201 RBX: 0000000000000000 RCX: ffff9e0c56488000
[  +0.000044] RDX: 0000000000000000 RSI: ffffffffc0e238c2 RDI: 0000000000000000
[  +0.000044] RBP: ffffafed0ed27cd0 R08: ffffffffc0e4c820 R09: 0000000000000001
[  +0.000044] R10: ffffafed0ed27d00 R11: ffffafed0ed27b00 R12: ffffffffc0e238c2
[  +0.000044] R13: 0000000000000000 R14: dead000000000100 R15: ffff9e0c52b49660
[  +0.000044] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fc80000(0000) knlGS:0000000000000000
[  +0.000051] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000036] CR2: 0000000000000070 CR3: 000000010e764006 CR4: 00000000001706e0
[  +0.000045] Call Trace:
[  +0.000018]  <TASK>
[  +0.000018]  kernfs_find_and_get_ns+0x31/0x60
[  +0.000034]  sysfs_remove_file_from_group+0x25/0x60
[  +0.000036]  amdgpu_ras_sysfs_remove+0x3f/0xd0 [amdgpu]
[  +0.000272]  amdgpu_ras_fini+0x105/0x360 [amdgpu]
[  +0.001157]  ? kfree+0x29b/0x2c0
[  +0.000915]  ? kfree+0x29b/0x2c0
[  +0.000900]  amdgpu_device_fini_sw+0x153/0x320 [amdgpu]
[  +0.001082]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  +0.001058]  drm_dev_release+0x28/0x40 [drm]
[  +0.000911]  drm_minor_release+0x30/0x40 [drm]
[  +0.000924]  drm_release+0xa1/0xe0 [drm]
[  +0.000929]  __fput+0x99/0x260
[  +0.000908]  ____fput+0xe/0x10
[  +0.000913]  task_work_run+0x6c/0xa0
[  +0.000926]  exit_to_user_mode_prepare+0x1af/0x1c0
[  +0.000949]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.000961]  do_syscall_64+0x46/0xb0
[  +0.000977]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000998] RIP: 0033:0x7fb8e76e8511
[  +0.001006] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.002212] RSP: 002b:00007fff89083f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.001160] RAX: 0000000000000000 RBX: 0000562815a3e6a0 RCX: 00007fb8e76e8511
[  +0.001183] RDX: 00007fb8e76d1ca0 RSI: 0000562815f75100 RDI: 0000000000000003
[  +0.001200] RBP: 0000562815a3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.001214] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.001224] R13: 0000000000000000 R14: 0000000000000000 R15: 0000562815a3e8a0
[  +0.001198]  </TASK>
[  +0.001171] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo ipmi_ssif intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common sb_edac x86_pkg_temp_thermal snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg kvm_intel snd_hda_codec snd_hda_core kvm snd_hwdep ftdi_sio snd_pcm snd_timer joydev input_leds usbserial snd irqbypass soundcore rapl iTCO_wdt iTCO_vendor_support ipmi_si mei_me intel_cstate acpi_power_meter ipmi_devintf lpc_ich mei ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy
[  +0.000096]  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea sysfillrect hid_generic ghash_clmulni_intel sysimgblt fb_sys_fops uas aesni_intel usbhid ahci crypto_simd igb usb_storage libahci cryptd hid drm dca megaraid_sas i2c_algo_bit wmi
[  +0.016580] CR2: 0000000000000070
[  +0.001512] ---[ end trace 7e6704984e7ed108 ]---
[  +0.013115] RIP: 0010:kernfs_find_ns+0x19/0xc0
[  +0.001477] Code: 0f 85 ac fe ff ff e9 3c fe ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 85 d2 0f 95 c1 48 89 e5 41 56 41 55 41 54 53 49 89 d5 <0f> b7 47 70 49 89 f6 66 83 e0 20 0f 95 c2 38 d1 75 53 48 8b 5f 48
[  +0.003030] RSP: 0018:ffffafed0ed27cb0 EFLAGS: 00010246
[  +0.001534] RAX: ffff9e0c56eab201 RBX: 0000000000000000 RCX: ffff9e0c56488000
[  +0.001550] RDX: 0000000000000000 RSI: ffffffffc0e238c2 RDI: 0000000000000000
[  +0.001559] RBP: ffffafed0ed27cd0 R08: ffffffffc0e4c820 R09: 0000000000000001
[  +0.001568] R10: ffffafed0ed27d00 R11: ffffafed0ed27b00 R12: ffffffffc0e238c2
[  +0.001556] R13: 0000000000000000 R14: dead000000000100 R15: ffff9e0c52b49660
[  +0.001579] FS:  00007fb8e81390c0(0000) GS:ffff9e2b3fc80000(0000) knlGS:0000000000000000
[  +0.001586] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.001602] CR2: 0000000000000070 CR3: 000000010e764006 CR4: 00000000001706e0

MI100:

[Apr20 18:16] amdgpu 0000:43:00.0: amdgpu: amdgpu: finishing device.
[  +0.005980] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000538] ------------[ cut here ]------------
[  +0.000003] WARNING: CPU: 29 PID: 3800 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000137] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000082] CPU: 29 PID: 3800 Comm: amdgpu_test Not tainted 5.16.0-kfd+ #1
[  +0.000004] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000003] RIP: 0010:amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000114] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 57 ff ff ff 48 89 ef e8 cf 14 15 00 e9 4a ff ff ff e8 f5 05 29 d1 eb c4 0f 0b e9 e7 fe ff ff <0f> 0b eb b9 e8 83 d7 7d d1 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48
[  +0.000004] RSP: 0018:ffffaa4c206dfc28 EFLAGS: 00010282
[  +0.000005] RAX: 00000000ffffffea RBX: ffff9478d9c4cc58 RCX: 0000000000000001
[  +0.000003] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
[  +0.000002] RBP: ffff9478d9c4cc00 R08: 0000000000000000 R09: 0000000000000001
[  +0.000003] R10: ffffaa4c206dfac8 R11: ffffaa4c206df8f8 R12: ffff9478d9c4cc58
[  +0.000003] R13: ffff9478d9c4cd90 R14: ffff9479156c5e08 R15: ffff949857c55020
[  +0.000002] FS:  00007f59f5b50180(0000) GS:ffff9497c0040000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 000056517f676b38 CR3: 0000000137bee001 CR4: 00000000007706e0
[  +0.000003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000003] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000003]  <TASK>
[  +0.000006]  ttm_bo_release+0x305/0x390 [ttm]
[  +0.000011]  ? __mutex_unlock_slowpath+0x41/0x280
[  +0.000014]  amdgpu_bo_free_kernel+0xd1/0x120 [amdgpu]
[  +0.000117]  psp_ras_terminate+0x5b/0x70 [amdgpu]
[  +0.000147]  psp_hw_fini+0x23/0x100 [amdgpu]
[  +0.000145]  amdgpu_device_fini_hw+0x1d5/0x3a0 [amdgpu]
[  +0.000109]  amdgpu_pci_remove+0x41/0x60 [amdgpu]
[  +0.000102]  pci_device_remove+0x31/0xb0
[  +0.000009]  device_release_driver_internal+0xf4/0x1d0
[  +0.000008]  pci_stop_bus_device+0x64/0x90
[  +0.000007]  pci_stop_and_remove_bus_device_locked+0x16/0x30
[  +0.000005]  remove_store+0x75/0x90
[  +0.000007]  kernfs_fop_write_iter+0x132/0x1b0
[  +0.000010]  new_sync_write+0x11f/0x1b0
[  +0.000015]  vfs_write+0x35b/0x3b0
[  +0.000008]  ksys_write+0xa7/0xe0
[  +0.000008]  do_syscall_64+0x34/0x80
[  +0.000007]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000005] RIP: 0033:0x7f59f5106371
[  +0.000004] Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 69 8c 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 9a d0 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
[  +0.000003] RSP: 002b:00007ffff02c6bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000006] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f59f5106371
[  +0.000003] RDX: 0000000000000001 RSI: 000056517d835316 RDI: 0000000000000005
[  +0.000002] RBP: 0000000000000005 R08: 000056517f6742b0 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000003] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000016]  </TASK>
[  +0.000002] irq event stamp: 20653
[  +0.000002] hardirqs last  enabled at (20659): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000006] hardirqs last disabled at (20664): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000003] softirqs last  enabled at (19878): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000005] softirqs last disabled at (19753): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000005] ---[ end trace 90fbe3f286a48d6c ]---
[  +0.000410] [drm] free PSP TMR buffer
[  +0.000009] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000209] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 29 PID: 3800 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000115] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000079] CPU: 29 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000004] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000003] RIP: 0010:amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000111] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 57 ff ff ff 48 89 ef e8 cf 14 15 00 e9 4a ff ff ff e8 f5 05 29 d1 eb c4 0f 0b e9 e7 fe ff ff <0f> 0b eb b9 e8 83 d7 7d d1 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48
[  +0.000004] RSP: 0018:ffffaa4c206dfc18 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9478d9c4fc58 RCX: 0000000000000001
[  +0.000003] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
[  +0.000041] RBP: ffff9478d9c4fc00 R08: 0000000000000000 R09: 0000000000000001
[  +0.000004] R10: ffffaa4c206dfab8 R11: ffffaa4c206df8e8 R12: ffff9478d9c4fc58
[  +0.000002] R13: ffff9478d9c4fd90 R14: ffff9479156c5e08 R15: ffff949857c55020
[  +0.000003] FS:  00007f59f5b50180(0000) GS:ffff9497c0040000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 000056517f676b38 CR3: 0000000137bee001 CR4: 00000000007706e0
[  +0.000003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000003] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000003] Call Trace:
[  +0.000002]  <TASK>
[  +0.000006]  ttm_bo_release+0x305/0x390 [ttm]
[  +0.000010]  ? __mutex_unlock_slowpath+0x41/0x280
[  +0.000013]  amdgpu_bo_free_kernel+0xd1/0x120 [amdgpu]
[  +0.000117]  psp_tmr_terminate+0x9b/0xc0 [amdgpu]
[  +0.000149]  psp_hw_fini+0x69/0x100 [amdgpu]
[  +0.000145]  amdgpu_device_fini_hw+0x1d5/0x3a0 [amdgpu]
[  +0.000109]  amdgpu_pci_remove+0x41/0x60 [amdgpu]
[  +0.000103]  pci_device_remove+0x31/0xb0
[  +0.000007]  device_release_driver_internal+0xf4/0x1d0
[  +0.000008]  pci_stop_bus_device+0x64/0x90
[  +0.000007]  pci_stop_and_remove_bus_device_locked+0x16/0x30
[  +0.000005]  remove_store+0x75/0x90
[  +0.000007]  kernfs_fop_write_iter+0x132/0x1b0
[  +0.000009]  new_sync_write+0x11f/0x1b0
[  +0.000015]  vfs_write+0x35b/0x3b0
[  +0.000007]  ksys_write+0xa7/0xe0
[  +0.000009]  do_syscall_64+0x34/0x80
[  +0.000007]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000006] RIP: 0033:0x7f59f5106371
[  +0.000003] Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 69 8c 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 9a d0 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
[  +0.000003] RSP: 002b:00007ffff02c6bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000005] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f59f5106371
[  +0.000003] RDX: 0000000000000001 RSI: 000056517d835316 RDI: 0000000000000005
[  +0.000002] RBP: 0000000000000005 R08: 000056517f6742b0 R09: 0000000000000000
[  +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000016]  </TASK>
[  +0.000002] irq event stamp: 21269
[  +0.000003] hardirqs last  enabled at (21275): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000004] hardirqs last disabled at (21280): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000003] softirqs last  enabled at (21032): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000005] softirqs last disabled at (21027): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000006] ---[ end trace 90fbe3f286a48d6d ]---
[  +0.020204] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000197] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 29 PID: 3800 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000116] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000076] CPU: 29 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000004] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000003] RIP: 0010:amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000112] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 57 ff ff ff 48 89 ef e8 cf 14 15 00 e9 4a ff ff ff e8 f5 05 29 d1 eb c4 0f 0b e9 e7 fe ff ff <0f> 0b eb b9 e8 83 d7 7d d1 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48
[  +0.000003] RSP: 0018:ffffaa4c206dfc20 EFLAGS: 00010282
[  +0.000005] RAX: 00000000ffffffea RBX: ffff9478d9c4f858 RCX: 0000000000000001
[  +0.000003] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
[  +0.000002] RBP: ffff9478d9c4f800 R08: 0000000000000000 R09: 0000000000000001
[  +0.000003] R10: ffffaa4c206dfac0 R11: ffffaa4c206df8f0 R12: ffff9478d9c4f858
[  +0.000002] R13: ffff9478d9c4f990 R14: ffff9479156c5e08 R15: ffff949857c55020
[  +0.000003] FS:  00007f59f5b50180(0000) GS:ffff9497c0040000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 000056517f676b38 CR3: 0000000137bee001 CR4: 00000000007706e0
[  +0.000002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000003] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000003]  <TASK>
[  +0.000005]  ttm_bo_release+0x305/0x390 [ttm]
[  +0.000010]  ? __mutex_unlock_slowpath+0x41/0x280
[  +0.000010]  amdgpu_bo_free_kernel+0xd1/0x120 [amdgpu]
[  +0.000116]  psp_v11_0_ring_destroy+0x3b/0x50 [amdgpu]
[  +0.000148]  psp_hw_fini+0x7b/0x100 [amdgpu]
[  +0.000145]  amdgpu_device_fini_hw+0x1d5/0x3a0 [amdgpu]
[  +0.000109]  amdgpu_pci_remove+0x41/0x60 [amdgpu]
[  +0.000102]  pci_device_remove+0x31/0xb0
[  +0.000006]  device_release_driver_internal+0xf4/0x1d0
[  +0.000008]  pci_stop_bus_device+0x64/0x90
[  +0.000006]  pci_stop_and_remove_bus_device_locked+0x16/0x30
[  +0.000005]  remove_store+0x75/0x90
[  +0.000007]  kernfs_fop_write_iter+0x132/0x1b0
[  +0.000008]  new_sync_write+0x11f/0x1b0
[  +0.000015]  vfs_write+0x35b/0x3b0
[  +0.000007]  ksys_write+0xa7/0xe0
[  +0.000009]  do_syscall_64+0x34/0x80
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7f59f5106371
[  +0.000003] Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 69 8c 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 9a d0 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
[  +0.000003] RSP: 002b:00007ffff02c6bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000005] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f59f5106371
[  +0.000002] RDX: 0000000000000001 RSI: 000056517d835316 RDI: 0000000000000005
[  +0.000003] RBP: 0000000000000005 R08: 000056517f6742b0 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000016]  </TASK>
[  +0.000002] irq event stamp: 21937
[  +0.000002] hardirqs last  enabled at (21943): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000004] hardirqs last disabled at (21948): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000003] softirqs last  enabled at (21366): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000005] softirqs last disabled at (21311): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000005] ---[ end trace 90fbe3f286a48d6e ]---
[  +0.000527] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000195] ------------[ cut here ]------------
[  +0.000003] WARNING: CPU: 29 PID: 3800 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000115] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000077] CPU: 29 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000003] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000003] RIP: 0010:amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000111] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 57 ff ff ff 48 89 ef e8 cf 14 15 00 e9 4a ff ff ff e8 f5 05 29 d1 eb c4 0f 0b e9 e7 fe ff ff <0f> 0b eb b9 e8 83 d7 7d d1 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48
[  +0.000004] RSP: 0018:ffffaa4c206dfc40 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9478d9c4f058 RCX: 0000000000000001
[  +0.000003] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
[  +0.000002] RBP: ffff9478d9c4f000 R08: 0000000000000000 R09: 0000000000000001
[  +0.000003] R10: ffffaa4c206dfae0 R11: ffffaa4c206df910 R12: ffff9478d9c4f058
[  +0.000002] R13: ffff9478d9c4f190 R14: ffff9479156c5e08 R15: ffff949857c55020
[  +0.000003] FS:  00007f59f5b50180(0000) GS:ffff9497c0040000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 000056517f676b38 CR3: 0000000137bee001 CR4: 00000000007706e0
[  +0.000002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000003] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000003]  <TASK>
[  +0.000005]  ttm_bo_release+0x305/0x390 [ttm]
[  +0.000009]  ? __mutex_unlock_slowpath+0x41/0x280
[  +0.000011]  amdgpu_bo_free_kernel+0xd1/0x120 [amdgpu]
[  +0.000115]  psp_hw_fini+0xaf/0x100 [amdgpu]
[  +0.000145]  amdgpu_device_fini_hw+0x1d5/0x3a0 [amdgpu]
[  +0.000109]  amdgpu_pci_remove+0x41/0x60 [amdgpu]
[  +0.000102]  pci_device_remove+0x31/0xb0
[  +0.000006]  device_release_driver_internal+0xf4/0x1d0
[  +0.000008]  pci_stop_bus_device+0x64/0x90
[  +0.000006]  pci_stop_and_remove_bus_device_locked+0x16/0x30
[  +0.000004]  remove_store+0x75/0x90
[  +0.000040]  kernfs_fop_write_iter+0x132/0x1b0
[  +0.000010]  new_sync_write+0x11f/0x1b0
[  +0.000014]  vfs_write+0x35b/0x3b0
[  +0.000008]  ksys_write+0xa7/0xe0
[  +0.000009]  do_syscall_64+0x34/0x80
[  +0.000008]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7f59f5106371
[  +0.000003] Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 69 8c 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 9a d0 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
[  +0.000003] RSP: 002b:00007ffff02c6bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000005] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f59f5106371
[  +0.000003] RDX: 0000000000000001 RSI: 000056517d835316 RDI: 0000000000000005
[  +0.000003] RBP: 0000000000000005 R08: 000056517f6742b0 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000003] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000016]  </TASK>
[  +0.000002] irq event stamp: 22539
[  +0.000002] hardirqs last  enabled at (22545): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000005] hardirqs last disabled at (22550): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000003] softirqs last  enabled at (21366): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000005] softirqs last disabled at (21311): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000004] ---[ end trace 90fbe3f286a48d6f ]---
[  +0.000296] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000194] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 29 PID: 3800 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000114] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000078] CPU: 29 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000003] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000003] RIP: 0010:amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000112] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 57 ff ff ff 48 89 ef e8 cf 14 15 00 e9 4a ff ff ff e8 f5 05 29 d1 eb c4 0f 0b e9 e7 fe ff ff <0f> 0b eb b9 e8 83 d7 7d d1 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48
[  +0.000003] RSP: 0018:ffffaa4c206dfc40 EFLAGS: 00010282
[  +0.000005] RAX: 00000000ffffffea RBX: ffff9478d9c4f458 RCX: 0000000000000001
[  +0.000003] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
[  +0.000002] RBP: ffff9478d9c4f400 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: ffffaa4c206dfae0 R11: ffffaa4c206df910 R12: ffff9478d9c4f458
[  +0.000003] R13: ffff9478d9c4f590 R14: ffff9479156c5e08 R15: ffff949857c55020
[  +0.000002] FS:  00007f59f5b50180(0000) GS:ffff9497c0040000(0000) knlGS:0000000000000000
[  +0.000004] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 000056517f676b38 CR3: 0000000137bee001 CR4: 00000000007706e0
[  +0.000003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000003]  <TASK>
[  +0.000005]  ttm_bo_release+0x305/0x390 [ttm]
[  +0.000009]  ? __mutex_unlock_slowpath+0x41/0x280
[  +0.000011]  amdgpu_bo_free_kernel+0xd1/0x120 [amdgpu]
[  +0.000115]  psp_hw_fini+0xc9/0x100 [amdgpu]
[  +0.000146]  amdgpu_device_fini_hw+0x1d5/0x3a0 [amdgpu]
[  +0.000110]  amdgpu_pci_remove+0x41/0x60 [amdgpu]
[  +0.000103]  pci_device_remove+0x31/0xb0
[  +0.000006]  device_release_driver_internal+0xf4/0x1d0
[  +0.000008]  pci_stop_bus_device+0x64/0x90
[  +0.000006]  pci_stop_and_remove_bus_device_locked+0x16/0x30
[  +0.000005]  remove_store+0x75/0x90
[  +0.000007]  kernfs_fop_write_iter+0x132/0x1b0
[  +0.000008]  new_sync_write+0x11f/0x1b0
[  +0.000015]  vfs_write+0x35b/0x3b0
[  +0.000007]  ksys_write+0xa7/0xe0
[  +0.000010]  do_syscall_64+0x34/0x80
[  +0.000007]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7f59f5106371
[  +0.000003] Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 69 8c 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 9a d0 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
[  +0.000003] RSP: 002b:00007ffff02c6bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000005] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f59f5106371
[  +0.000003] RDX: 0000000000000001 RSI: 000056517d835316 RDI: 0000000000000005
[  +0.000002] RBP: 0000000000000005 R08: 000056517f6742b0 R09: 0000000000000000
[  +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000016]  </TASK>
[  +0.000002] irq event stamp: 23135
[  +0.000002] hardirqs last  enabled at (23141): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000004] hardirqs last disabled at (23146): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000003] softirqs last  enabled at (21366): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000004] softirqs last disabled at (21311): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000005] ---[ end trace 90fbe3f286a48d70 ]---
[  +2.511285] pci 0000:43:00.0: Removing from iommu group 73
[  +0.001309] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000472] ------------[ cut here ]------------
[  +0.000003] WARNING: CPU: 29 PID: 3800 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000119] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000069] CPU: 29 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000004] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000002] RIP: 0010:amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000114] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 57 ff ff ff 48 89 ef e8 cf 14 15 00 e9 4a ff ff ff e8 f5 05 29 d1 eb c4 0f 0b e9 e7 fe ff ff <0f> 0b eb b9 e8 83 d7 7d d1 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48
[  +0.000003] RSP: 0018:ffffaa4c206dfd50 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9479154e8058 RCX: 0000000000000001
[  +0.000003] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
[  +0.000002] RBP: ffff9479154e8000 R08: 0000000000000000 R09: 0000000000000001
[  +0.000003] R10: ffffaa4c206dfbf0 R11: ffffaa4c206dfa20 R12: ffff9479154e8058
[  +0.000002] R13: ffff9479154e8190 R14: ffff9479154e8190 R15: 0000000000008000
[  +0.000003] FS:  00007f59f5b50180(0000) GS:ffff9497c0040000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 000056517f676b38 CR3: 0000000137bee001 CR4: 00000000007706e0
[  +0.000003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000002]  <TASK>
[  +0.000005]  ttm_bo_release+0x305/0x390 [ttm]
[  +0.000015]  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[  +0.000113]  amdgpu_driver_postclose_kms+0x17b/0x320 [amdgpu]
[  +0.000106]  ? __mutex_unlock_slowpath+0x41/0x280
[  +0.000010]  drm_file_free.part.16+0x1e3/0x230 [drm]
[  +0.000029]  drm_release+0x6e/0xf0 [drm]
[  +0.000020]  __fput+0xa1/0x260
[  +0.000007]  task_work_run+0x6d/0xb0
[  +0.000010]  exit_to_user_mode_prepare+0x1d3/0x1e0
[  +0.000005]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000004]  do_syscall_64+0x40/0x80
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7f59f5106511
[  +0.000003] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000003] RSP: 002b:00007ffff02c6be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000005] RAX: 0000000000000000 RBX: 000056517da3e6a0 RCX: 00007f59f5106511
[  +0.000002] RDX: 00007f59f50efca0 RSI: 000056517f676100 RDI: 0000000000000003
[  +0.000003] RBP: 000056517da3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000014]  </TASK>
[  +0.000002] irq event stamp: 32575
[  +0.000003] hardirqs last  enabled at (32581): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000004] hardirqs last disabled at (32586): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000003] softirqs last  enabled at (31702): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000004] softirqs last disabled at (31697): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000004] ---[ end trace 90fbe3f286a48d71 ]---
[  +0.002906] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000426] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 29 PID: 3800 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000119] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000064] CPU: 29 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000004] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000002] RIP: 0010:amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000114] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 57 ff ff ff 48 89 ef e8 cf 14 15 00 e9 4a ff ff ff e8 f5 05 29 d1 eb c4 0f 0b e9 e7 fe ff ff <0f> 0b eb b9 e8 83 d7 7d d1 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48
[  +0.000003] RSP: 0018:ffffaa4c206dfd18 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9478d615ec58 RCX: 0000000000000001
[  +0.000003] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
[  +0.000002] RBP: ffff9478d615ec00 R08: 0000000000000000 R09: 0000000000000001
[  +0.000003] R10: 0000000000000000 R11: ffffaa4c206df9e8 R12: ffff9478d615ec58
[  +0.000002] R13: ffff9478d615ed90 R14: ffff9479156c5e08 R15: ffffffff9345ea20
[  +0.000002] FS:  00007f59f5b50180(0000) GS:ffff9497c0040000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 000056517f676b38 CR3: 0000000137bee001 CR4: 00000000007706e0
[  +0.000002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000002]  <TASK>
[  +0.000005]  ttm_bo_release+0x305/0x390 [ttm]
[  +0.000009]  ? __mutex_unlock_slowpath+0x41/0x280
[  +0.000010]  amdgpu_bo_free_kernel+0xd1/0x120 [amdgpu]
[  +0.000116]  amdgpu_vcn_sw_fini+0x12b/0x130 [amdgpu]
[  +0.000159]  vcn_v2_5_sw_fini+0x97/0xc0 [amdgpu]
[  +0.000157]  amdgpu_device_fini_sw+0xbc/0x2e0 [amdgpu]
[  +0.000108]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000106]  drm_dev_release+0x20/0x40 [drm]
[  +0.000025]  drm_release+0xa8/0xf0 [drm]
[  +0.000020]  __fput+0xa1/0x260
[  +0.000007]  task_work_run+0x6d/0xb0
[  +0.000009]  exit_to_user_mode_prepare+0x1d3/0x1e0
[  +0.000005]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000004]  do_syscall_64+0x40/0x80
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7f59f5106511
[  +0.000003] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000003] RSP: 002b:00007ffff02c6be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000005] RAX: 0000000000000000 RBX: 000056517da3e6a0 RCX: 00007f59f5106511
[  +0.000002] RDX: 00007f59f50efca0 RSI: 000056517f676100 RDI: 0000000000000003
[  +0.000002] RBP: 000056517da3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000013]  </TASK>
[  +0.000002] irq event stamp: 33801
[  +0.000002] hardirqs last  enabled at (33807): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000004] hardirqs last disabled at (33812): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000002] softirqs last  enabled at (31702): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000004] softirqs last disabled at (31697): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000004] ---[ end trace 90fbe3f286a48d72 ]---
[  +0.000226] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000205] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 1 PID: 3800 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000117] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000065] CPU: 1 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000004] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000002] RIP: 0010:amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000114] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 57 ff ff ff 48 89 ef e8 cf 14 15 00 e9 4a ff ff ff e8 f5 05 29 d1 eb c4 0f 0b e9 e7 fe ff ff <0f> 0b eb b9 e8 83 d7 7d d1 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48
[  +0.000003] RSP: 0018:ffffaa4c206dfd18 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9478d615e858 RCX: 0000000000000001
[  +0.000003] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
[  +0.000002] RBP: ffff9478d615e800 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: 0000000000000000 R11: ffffaa4c206df9e8 R12: ffff9478d615e858
[  +0.000002] R13: ffff9478d615e990 R14: ffff9479156c5e08 R15: ffffffff9345ea20
[  +0.000003] FS:  00007f59f5b50180(0000) GS:ffff9497bfc40000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 000055881d667000 CR3: 0000000137bee003 CR4: 00000000007706e0
[  +0.000002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000003] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000003]  <TASK>
[  +0.000005]  ttm_bo_release+0x305/0x390 [ttm]
[  +0.000010]  ? __mutex_unlock_slowpath+0x41/0x280
[  +0.000010]  amdgpu_bo_free_kernel+0xd1/0x120 [amdgpu]
[  +0.000115]  amdgpu_vcn_sw_fini+0x7f/0x130 [amdgpu]
[  +0.000159]  vcn_v2_5_sw_fini+0x97/0xc0 [amdgpu]
[  +0.000157]  amdgpu_device_fini_sw+0xbc/0x2e0 [amdgpu]
[  +0.000108]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000105]  drm_dev_release+0x20/0x40 [drm]
[  +0.000026]  drm_release+0xa8/0xf0 [drm]
[  +0.000021]  __fput+0xa1/0x260
[  +0.000008]  task_work_run+0x6d/0xb0
[  +0.000008]  exit_to_user_mode_prepare+0x1d3/0x1e0
[  +0.000005]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000004]  do_syscall_64+0x40/0x80
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7f59f5106511
[  +0.000003] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000003] RSP: 002b:00007ffff02c6be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000005] RAX: 0000000000000000 RBX: 000056517da3e6a0 RCX: 00007f59f5106511
[  +0.000002] RDX: 00007f59f50efca0 RSI: 000056517f676100 RDI: 0000000000000003
[  +0.000002] RBP: 000056517da3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000013]  </TASK>
[  +0.000002] irq event stamp: 34393
[  +0.000002] hardirqs last  enabled at (34399): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000004] hardirqs last disabled at (34404): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000003] softirqs last  enabled at (31702): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000004] softirqs last disabled at (31697): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000004] ---[ end trace 90fbe3f286a48d73 ]---
[  +0.000319] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000188] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 1 PID: 3800 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000115] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000063] CPU: 1 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000004] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000002] RIP: 0010:amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000113] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 57 ff ff ff 48 89 ef e8 cf 14 15 00 e9 4a ff ff ff e8 f5 05 29 d1 eb c4 0f 0b e9 e7 fe ff ff <0f> 0b eb b9 e8 83 d7 7d d1 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48
[  +0.000003] RSP: 0018:ffffaa4c206dfd18 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9478d615f458 RCX: 0000000000000001
[  +0.000002] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
[  +0.000002] RBP: ffff9478d615f400 R08: 0000000000000000 R09: 0000000000000001
[  +0.000003] R10: 0000000000000000 R11: ffffaa4c206df9e8 R12: ffff9478d615f458
[  +0.000002] R13: ffff9478d615f590 R14: ffff9479156c5e08 R15: 0000000000000002
[  +0.000002] FS:  00007f59f5b50180(0000) GS:ffff9497bfc40000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 000055881d667000 CR3: 0000000137bee003 CR4: 00000000007706e0
[  +0.000002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000002]  <TASK>
[  +0.000005]  ttm_bo_release+0x305/0x390 [ttm]
[  +0.000009]  ? __mutex_unlock_slowpath+0x41/0x280
[  +0.000009]  amdgpu_bo_free_kernel+0xd1/0x120 [amdgpu]
[  +0.000115]  amdgpu_vcn_sw_fini+0x12b/0x130 [amdgpu]
[  +0.000158]  vcn_v2_5_sw_fini+0x97/0xc0 [amdgpu]
[  +0.000156]  amdgpu_device_fini_sw+0xbc/0x2e0 [amdgpu]
[  +0.000107]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000106]  drm_dev_release+0x20/0x40 [drm]
[  +0.000023]  drm_release+0xa8/0xf0 [drm]
[  +0.000021]  __fput+0xa1/0x260
[  +0.000007]  task_work_run+0x6d/0xb0
[  +0.000008]  exit_to_user_mode_prepare+0x1d3/0x1e0
[  +0.000004]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000004]  do_syscall_64+0x40/0x80
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7f59f5106511
[  +0.000003] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000003] RSP: 002b:00007ffff02c6be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000004] RAX: 0000000000000000 RBX: 000056517da3e6a0 RCX: 00007f59f5106511
[  +0.000002] RDX: 00007f59f50efca0 RSI: 000056517f676100 RDI: 0000000000000003
[  +0.000002] RBP: 000056517da3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000013]  </TASK>
[  +0.000002] irq event stamp: 34989
[  +0.000002] hardirqs last  enabled at (34995): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000003] hardirqs last disabled at (35000): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000003] softirqs last  enabled at (31702): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000004] softirqs last disabled at (31697): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000003] ---[ end trace 90fbe3f286a48d74 ]---
[  +0.000299] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000186] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 1 PID: 3800 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000114] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000064] CPU: 1 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000003] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000002] RIP: 0010:amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000111] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 57 ff ff ff 48 89 ef e8 cf 14 15 00 e9 4a ff ff ff e8 f5 05 29 d1 eb c4 0f 0b e9 e7 fe ff ff <0f> 0b eb b9 e8 83 d7 7d d1 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48
[  +0.000003] RSP: 0018:ffffaa4c206dfd18 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9478d615f058 RCX: 0000000000000001
[  +0.000002] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
[  +0.000002] RBP: ffff9478d615f000 R08: 0000000000000000 R09: 0000000000000001
[  +0.000003] R10: 0000000000000000 R11: ffffaa4c206df9e8 R12: ffff9478d615f058
[  +0.000002] R13: ffff9478d615f190 R14: ffff9479156c5e08 R15: 0000000000000002
[  +0.000002] FS:  00007f59f5b50180(0000) GS:ffff9497bfc40000(0000) knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 000055881d667000 CR3: 0000000137bee003 CR4: 00000000007706e0
[  +0.000002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000002]  <TASK>
[  +0.000004]  ttm_bo_release+0x305/0x390 [ttm]
[  +0.000008]  ? __mutex_unlock_slowpath+0x41/0x280
[  +0.000010]  amdgpu_bo_free_kernel+0xd1/0x120 [amdgpu]
[  +0.000114]  amdgpu_vcn_sw_fini+0x7f/0x130 [amdgpu]
[  +0.000195]  vcn_v2_5_sw_fini+0x97/0xc0 [amdgpu]
[  +0.000159]  amdgpu_device_fini_sw+0xbc/0x2e0 [amdgpu]
[  +0.000108]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000106]  drm_dev_release+0x20/0x40 [drm]
[  +0.000023]  drm_release+0xa8/0xf0 [drm]
[  +0.000021]  __fput+0xa1/0x260
[  +0.000007]  task_work_run+0x6d/0xb0
[  +0.000008]  exit_to_user_mode_prepare+0x1d3/0x1e0
[  +0.000004]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000003]  do_syscall_64+0x40/0x80
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7f59f5106511
[  +0.000003] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000002] RSP: 002b:00007ffff02c6be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000005] RAX: 0000000000000000 RBX: 000056517da3e6a0 RCX: 00007f59f5106511
[  +0.000002] RDX: 00007f59f50efca0 RSI: 000056517f676100 RDI: 0000000000000003
[  +0.000002] RBP: 000056517da3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000013]  </TASK>
[  +0.000002] irq event stamp: 35591
[  +0.000002] hardirqs last  enabled at (35597): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000003] hardirqs last disabled at (35602): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000003] softirqs last  enabled at (35460): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000004] softirqs last disabled at (35449): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000003] ---[ end trace 90fbe3f286a48d75 ]---
[  +0.000536] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000189] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 1 PID: 3800 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000115] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000064] CPU: 1 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000003] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000002] RIP: 0010:amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000113] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 57 ff ff ff 48 89 ef e8 cf 14 15 00 e9 4a ff ff ff e8 f5 05 29 d1 eb c4 0f 0b e9 e7 fe ff ff <0f> 0b eb b9 e8 83 d7 7d d1 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48
[  +0.000003] RSP: 0018:ffffaa4c206dfd58 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9478d5ce7c58 RCX: 0000000000000001
[  +0.000002] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
[  +0.000002] RBP: ffff9478d5ce7c00 R08: 0000000000000000 R09: 0000000000000001
[  +0.000003] R10: 0000000000000000 R11: ffffaa4c206dfa28 R12: ffff9478d5ce7c58
[  +0.000002] R13: ffff9478d5ce7d90 R14: ffff9479156c5e08 R15: ffffffff9345ea20
[  +0.000002] FS:  00007f59f5b50180(0000) GS:ffff9497bfc40000(0000) knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 000055881d667000 CR3: 0000000137bee003 CR4: 00000000007706e0
[  +0.000002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000002]  <TASK>
[  +0.000005]  ttm_bo_release+0x305/0x390 [ttm]
[  +0.000008]  ? __mutex_unlock_slowpath+0x41/0x280
[  +0.000010]  amdgpu_bo_free_kernel+0xd1/0x120 [amdgpu]
[  +0.000115]  gfx_v9_0_sw_fini+0x6e/0x170 [amdgpu]
[  +0.000150]  amdgpu_device_fini_sw+0xbc/0x2e0 [amdgpu]
[  +0.000107]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000106]  drm_dev_release+0x20/0x40 [drm]
[  +0.000023]  drm_release+0xa8/0xf0 [drm]
[  +0.000020]  __fput+0xa1/0x260
[  +0.000007]  task_work_run+0x6d/0xb0
[  +0.000008]  exit_to_user_mode_prepare+0x1d3/0x1e0
[  +0.000004]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000004]  do_syscall_64+0x40/0x80
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000003] RIP: 0033:0x7f59f5106511
[  +0.000003] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000003] RSP: 002b:00007ffff02c6be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000004] RAX: 0000000000000000 RBX: 000056517da3e6a0 RCX: 00007f59f5106511
[  +0.000002] RDX: 00007f59f50efca0 RSI: 000056517f676100 RDI: 0000000000000003
[  +0.000003] RBP: 000056517da3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000013]  </TASK>
[  +0.000002] irq event stamp: 36503
[  +0.000002] hardirqs last  enabled at (36509): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000003] hardirqs last disabled at (36514): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000003] softirqs last  enabled at (35460): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000004] softirqs last disabled at (35449): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000003] ---[ end trace 90fbe3f286a48d76 ]---
[  +0.000335] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000188] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 1 PID: 3800 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000115] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000062] CPU: 1 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000004] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000002] RIP: 0010:amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000112] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 57 ff ff ff 48 89 ef e8 cf 14 15 00 e9 4a ff ff ff e8 f5 05 29 d1 eb c4 0f 0b e9 e7 fe ff ff <0f> 0b eb b9 e8 83 d7 7d d1 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48
[  +0.000003] RSP: 0018:ffffaa4c206dfd48 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9478d5ce4c58 RCX: 0000000000000001
[  +0.000002] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
[  +0.000003] RBP: ffff9478d5ce4c00 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: 0000000000000000 R11: ffffaa4c206dfa18 R12: ffff9478d5ce4c58
[  +0.000002] R13: ffff9478d5ce4d90 R14: ffff9479156c5e08 R15: ffffffff9345ea20
[  +0.000002] FS:  00007f59f5b50180(0000) GS:ffff9497bfc40000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 000055881d667000 CR3: 0000000137bee003 CR4: 00000000007706e0
[  +0.000002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000002]  <TASK>
[  +0.000004]  ttm_bo_release+0x305/0x390 [ttm]
[  +0.000009]  ? __mutex_unlock_slowpath+0x41/0x280
[  +0.000009]  amdgpu_bo_free_kernel+0xd1/0x120 [amdgpu]
[  +0.000115]  gfx_v9_0_mec_fini+0x19/0x30 [amdgpu]
[  +0.000149]  gfx_v9_0_sw_fini+0x8a/0x170 [amdgpu]
[  +0.000148]  amdgpu_device_fini_sw+0xbc/0x2e0 [amdgpu]
[  +0.000108]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000106]  drm_dev_release+0x20/0x40 [drm]
[  +0.000023]  drm_release+0xa8/0xf0 [drm]
[  +0.000021]  __fput+0xa1/0x260
[  +0.000007]  task_work_run+0x6d/0xb0
[  +0.000007]  exit_to_user_mode_prepare+0x1d3/0x1e0
[  +0.000004]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000004]  do_syscall_64+0x40/0x80
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7f59f5106511
[  +0.000002] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000003] RSP: 002b:00007ffff02c6be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000004] RAX: 0000000000000000 RBX: 000056517da3e6a0 RCX: 00007f59f5106511
[  +0.000002] RDX: 00007f59f50efca0 RSI: 000056517f676100 RDI: 0000000000000003
[  +0.000002] RBP: 000056517da3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000014]  </TASK>
[  +0.000002] irq event stamp: 37097
[  +0.000001] hardirqs last  enabled at (37103): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000004] hardirqs last disabled at (37108): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000003] softirqs last  enabled at (35460): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000004] softirqs last disabled at (35449): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000003] ---[ end trace 90fbe3f286a48d77 ]---
[  +0.000314] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000187] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 1 PID: 3800 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000115] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000063] CPU: 1 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000003] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000002] RIP: 0010:amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000112] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 57 ff ff ff 48 89 ef e8 cf 14 15 00 e9 4a ff ff ff e8 f5 05 29 d1 eb c4 0f 0b e9 e7 fe ff ff <0f> 0b eb b9 e8 83 d7 7d d1 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48
[  +0.000003] RSP: 0018:ffffaa4c206dfd58 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9478d5ce4858 RCX: 0000000000000001
[  +0.000003] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
[  +0.000002] RBP: ffff9478d5ce4800 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: 0000000000000000 R11: ffffaa4c206dfa28 R12: ffff9478d5ce4858
[  +0.000002] R13: ffff9478d5ce4990 R14: ffff9479156c5e08 R15: ffffffff9345ea20
[  +0.000002] FS:  00007f59f5b50180(0000) GS:ffff9497bfc40000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 000055881d667000 CR3: 0000000137bee003 CR4: 00000000007706e0
[  +0.000002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000002]  <TASK>
[  +0.000004]  ttm_bo_release+0x305/0x390 [ttm]
[  +0.000009]  ? __mutex_unlock_slowpath+0x41/0x280
[  +0.000009]  amdgpu_bo_free_kernel+0xd1/0x120 [amdgpu]
[  +0.000115]  gfx_v9_0_sw_fini+0xa4/0x170 [amdgpu]
[  +0.000150]  amdgpu_device_fini_sw+0xbc/0x2e0 [amdgpu]
[  +0.000108]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000105]  drm_dev_release+0x20/0x40 [drm]
[  +0.000023]  drm_release+0xa8/0xf0 [drm]
[  +0.000021]  __fput+0xa1/0x260
[  +0.000008]  task_work_run+0x6d/0xb0
[  +0.000007]  exit_to_user_mode_prepare+0x1d3/0x1e0
[  +0.000005]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000003]  do_syscall_64+0x40/0x80
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7f59f5106511
[  +0.000002] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000003] RSP: 002b:00007ffff02c6be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000005] RAX: 0000000000000000 RBX: 000056517da3e6a0 RCX: 00007f59f5106511
[  +0.000002] RDX: 00007f59f50efca0 RSI: 000056517f676100 RDI: 0000000000000003
[  +0.000002] RBP: 000056517da3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000014]  </TASK>
[  +0.000001] irq event stamp: 37679
[  +0.000002] hardirqs last  enabled at (37685): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000004] hardirqs last disabled at (37690): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000002] softirqs last  enabled at (35460): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000004] softirqs last disabled at (35449): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000004] ---[ end trace 90fbe3f286a48d78 ]---
[  +0.000409] ------------[ cut here ]------------
[  +0.000003] sysfs group 'power' not found for kobject 'i2c-2'
[  +0.000007] WARNING: CPU: 1 PID: 3800 at fs/sysfs/group.c:280 sysfs_remove_group+0x76/0x80
[  +0.000005] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000057] CPU: 1 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000003] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000002] RIP: 0010:sysfs_remove_group+0x76/0x80
[  +0.000003] Code: 48 89 df 5b 5d 41 5c e9 d8 b3 ff ff 48 89 df e8 60 ae ff ff eb cb 49 8b 14 24 48 8b 75 00 48 c7 c7 a8 8f 82 92 e8 7a 28 c8 ff <0f> 0b 5b 5d 41 5c c3 0f 1f 00 0f 1f 44 00 00 48 85 f6 74 31 41 54
[  +0.000003] RSP: 0018:ffffaa4c206dfd28 EFLAGS: 00010286
[  +0.000004] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[  +0.000002] RDX: 0000000080000001 RSI: ffffffff92800179 RDI: 00000000ffffffff
[  +0.000002] RBP: ffffffff924c7f60 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: 0000000000000000 R11: ffffaa4c206dfb30 R12: ffff9479156c9e48
[  +0.000002] R13: ffff9479156c9e48 R14: ffff9478f8fcfe28 R15: ffff9479156c9f48
[  +0.000002] FS:  00007f59f5b50180(0000) GS:ffff9497bfc40000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 000055881d667000 CR3: 0000000137bee003 CR4: 00000000007706e0
[  +0.000002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000001] Call Trace:
[  +0.000002]  <TASK>
[  +0.000003]  device_del+0xc2/0x420
[  +0.000005]  ? __raw_spin_lock_init+0x3b/0x60
[  +0.000003]  ? lockdep_init_map_type+0x58/0x240
[  +0.000010]  device_unregister+0x13/0x60
[  +0.000004]  i2c_del_adapter+0x264/0x330
[  +0.000007]  ? lockdep_hardirqs_on+0x79/0x100
[  +0.000008]  arcturus_i2c_control_fini+0x15/0x40 [amdgpu]
[  +0.000184]  smu_sw_fini+0x31/0x210 [amdgpu]
[  +0.000188]  amdgpu_device_fini_sw+0xbc/0x2e0 [amdgpu]
[  +0.000107]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000105]  drm_dev_release+0x20/0x40 [drm]
[  +0.000023]  drm_release+0xa8/0xf0 [drm]
[  +0.000021]  __fput+0xa1/0x260
[  +0.000007]  task_work_run+0x6d/0xb0
[  +0.000008]  exit_to_user_mode_prepare+0x1d3/0x1e0
[  +0.000003]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000004]  do_syscall_64+0x40/0x80
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000003] RIP: 0033:0x7f59f5106511
[  +0.000003] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000002] RSP: 002b:00007ffff02c6be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000004] RAX: 0000000000000000 RBX: 000056517da3e6a0 RCX: 00007f59f5106511
[  +0.000002] RDX: 00007f59f50efca0 RSI: 000056517f676100 RDI: 0000000000000003
[  +0.000002] RBP: 000056517da3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000013]  </TASK>
[  +0.000002] irq event stamp: 38451
[  +0.000002] hardirqs last  enabled at (38457): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000003] hardirqs last disabled at (38462): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000002] softirqs last  enabled at (35460): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000004] softirqs last disabled at (35449): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000003] ---[ end trace 90fbe3f286a48d79 ]---
[  +0.000223] ------------[ cut here ]------------
[  +0.000002] sysfs group 'power' not found for kobject 'i2c-3'
[  +0.000006] WARNING: CPU: 1 PID: 3800 at fs/sysfs/group.c:280 sysfs_remove_group+0x76/0x80
[  +0.000005] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000058] CPU: 1 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000003] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000003] RIP: 0010:sysfs_remove_group+0x76/0x80
[  +0.000003] Code: 48 89 df 5b 5d 41 5c e9 d8 b3 ff ff 48 89 df e8 60 ae ff ff eb cb 49 8b 14 24 48 8b 75 00 48 c7 c7 a8 8f 82 92 e8 7a 28 c8 ff <0f> 0b 5b 5d 41 5c c3 0f 1f 00 0f 1f 44 00 00 48 85 f6 74 31 41 54
[  +0.000003] RSP: 0018:ffffaa4c206dfd28 EFLAGS: 00010286
[  +0.000004] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[  +0.000002] RDX: 0000000080000001 RSI: ffffffff92800179 RDI: 00000000ffffffff
[  +0.000003] RBP: ffffffff924c7f60 R08: 0000000000000000 R09: 0000000000000001
[  +0.000001] R10: 0000000000000000 R11: ffffaa4c206dfb30 R12: ffff9479156ca6a0
[  +0.000002] R13: ffff9479156ca6a0 R14: ffff9478f8fcfe28 R15: ffff9479156ca7a0
[  +0.000002] FS:  00007f59f5b50180(0000) GS:ffff9497bfc40000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 000055881d667000 CR3: 0000000137bee003 CR4: 00000000007706e0
[  +0.000002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000001]  <TASK>
[  +0.000004]  device_del+0xc2/0x420
[  +0.000003]  ? __raw_spin_lock_init+0x3b/0x60
[  +0.000004]  ? lockdep_init_map_type+0x58/0x240
[  +0.000010]  device_unregister+0x13/0x60
[  +0.000004]  i2c_del_adapter+0x264/0x330
[  +0.000007]  ? lockdep_hardirqs_on+0x79/0x100
[  +0.000008]  arcturus_i2c_control_fini+0x21/0x40 [amdgpu]
[  +0.000184]  smu_sw_fini+0x31/0x210 [amdgpu]
[  +0.000189]  amdgpu_device_fini_sw+0xbc/0x2e0 [amdgpu]
[  +0.000107]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000105]  drm_dev_release+0x20/0x40 [drm]
[  +0.000023]  drm_release+0xa8/0xf0 [drm]
[  +0.000020]  __fput+0xa1/0x260
[  +0.000007]  task_work_run+0x6d/0xb0
[  +0.000007]  exit_to_user_mode_prepare+0x1d3/0x1e0
[  +0.000004]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000003]  do_syscall_64+0x40/0x80
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7f59f5106511
[  +0.000002] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000003] RSP: 002b:00007ffff02c6be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000004] RAX: 0000000000000000 RBX: 000056517da3e6a0 RCX: 00007f59f5106511
[  +0.000002] RDX: 00007f59f50efca0 RSI: 000056517f676100 RDI: 0000000000000003
[  +0.000002] RBP: 000056517da3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000003] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000012]  </TASK>
[  +0.000001] irq event stamp: 39087
[  +0.000002] hardirqs last  enabled at (39093): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000003] hardirqs last disabled at (39098): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000003] softirqs last  enabled at (35460): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000004] softirqs last disabled at (35449): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000003] ---[ end trace 90fbe3f286a48d7a ]---
[  +0.000046] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000190] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 1 PID: 3800 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000163] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000067] CPU: 1 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000004] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000002] RIP: 0010:amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000114] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 57 ff ff ff 48 89 ef e8 cf 14 15 00 e9 4a ff ff ff e8 f5 05 29 d1 eb c4 0f 0b e9 e7 fe ff ff <0f> 0b eb b9 e8 83 d7 7d d1 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48
[  +0.000003] RSP: 0018:ffffaa4c206dfd48 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9478d5ce3858 RCX: 0000000000000001
[  +0.000002] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
[  +0.000002] RBP: ffff9478d5ce3800 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: 0000000000000000 R11: ffffaa4c206dfa18 R12: ffff9478d5ce3858
[  +0.000003] R13: ffff9478d5ce3990 R14: ffff9479156c5e08 R15: ffffffff9345ea20
[  +0.000002] FS:  00007f59f5b50180(0000) GS:ffff9497bfc40000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 000055881d667000 CR3: 0000000137bee003 CR4: 00000000007706e0
[  +0.000002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000002]  <TASK>
[  +0.000005]  ttm_bo_release+0x305/0x390 [ttm]
[  +0.000009]  ? __mutex_unlock_slowpath+0x41/0x280
[  +0.000010]  amdgpu_bo_free_kernel+0xd1/0x120 [amdgpu]
[  +0.000115]  smu_sw_fini+0x4b/0x210 [amdgpu]
[  +0.000188]  amdgpu_device_fini_sw+0xbc/0x2e0 [amdgpu]
[  +0.000108]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000106]  drm_dev_release+0x20/0x40 [drm]
[  +0.000023]  drm_release+0xa8/0xf0 [drm]
[  +0.000020]  __fput+0xa1/0x260
[  +0.000008]  task_work_run+0x6d/0xb0
[  +0.000007]  exit_to_user_mode_prepare+0x1d3/0x1e0
[  +0.000005]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000003]  do_syscall_64+0x40/0x80
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000003] RIP: 0033:0x7f59f5106511
[  +0.000003] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000002] RSP: 002b:00007ffff02c6be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000005] RAX: 0000000000000000 RBX: 000056517da3e6a0 RCX: 00007f59f5106511
[  +0.000002] RDX: 00007f59f50efca0 RSI: 000056517f676100 RDI: 0000000000000003
[  +0.000002] RBP: 000056517da3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000014]  </TASK>
[  +0.000002] irq event stamp: 39789
[  +0.000002] hardirqs last  enabled at (39795): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000003] hardirqs last disabled at (39800): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000002] softirqs last  enabled at (39268): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000004] softirqs last disabled at (39163): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000004] ---[ end trace 90fbe3f286a48d7b ]---
[  +0.000296] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000187] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 1 PID: 3800 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000114] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000062] CPU: 1 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000003] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000002] RIP: 0010:amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000112] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 57 ff ff ff 48 89 ef e8 cf 14 15 00 e9 4a ff ff ff e8 f5 05 29 d1 eb c4 0f 0b e9 e7 fe ff ff <0f> 0b eb b9 e8 83 d7 7d d1 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48
[  +0.000003] RSP: 0018:ffffaa4c206dfd48 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9478d5ce3058 RCX: 0000000000000001
[  +0.000002] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
[  +0.000002] RBP: ffff9478d5ce3000 R08: 0000000000000000 R09: 0000000000000001
[  +0.000003] R10: 0000000000000000 R11: ffffaa4c206dfa18 R12: ffff9478d5ce3058
[  +0.000002] R13: ffff9478d5ce3190 R14: ffff9479156c5e08 R15: ffffffff9345ea20
[  +0.000002] FS:  00007f59f5b50180(0000) GS:ffff9497bfc40000(0000) knlGS:0000000000000000
[  +0.000004] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 000055881d667000 CR3: 0000000137bee003 CR4: 00000000007706e0
[  +0.000002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000002]  <TASK>
[  +0.000004]  ttm_bo_release+0x305/0x390 [ttm]
[  +0.000009]  ? __mutex_unlock_slowpath+0x41/0x280
[  +0.000009]  amdgpu_bo_free_kernel+0xd1/0x120 [amdgpu]
[  +0.000115]  smu_sw_fini+0x188/0x210 [amdgpu]
[  +0.000189]  amdgpu_device_fini_sw+0xbc/0x2e0 [amdgpu]
[  +0.000107]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000106]  drm_dev_release+0x20/0x40 [drm]
[  +0.000023]  drm_release+0xa8/0xf0 [drm]
[  +0.000021]  __fput+0xa1/0x260
[  +0.000007]  task_work_run+0x6d/0xb0
[  +0.000008]  exit_to_user_mode_prepare+0x1d3/0x1e0
[  +0.000004]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000003]  do_syscall_64+0x40/0x80
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7f59f5106511
[  +0.000002] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000003] RSP: 002b:00007ffff02c6be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000004] RAX: 0000000000000000 RBX: 000056517da3e6a0 RCX: 00007f59f5106511
[  +0.000003] RDX: 00007f59f50efca0 RSI: 000056517f676100 RDI: 0000000000000003
[  +0.000002] RBP: 000056517da3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000013]  </TASK>
[  +0.000002] irq event stamp: 40367
[  +0.000002] hardirqs last  enabled at (40373): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000003] hardirqs last disabled at (40378): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000003] softirqs last  enabled at (39268): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000003] softirqs last disabled at (39163): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000004] ---[ end trace 90fbe3f286a48d7c ]---
[  +0.000086] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000197] ------------[ cut here ]------------
[  +0.000003] WARNING: CPU: 26 PID: 3800 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000117] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000065] CPU: 26 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000004] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000002] RIP: 0010:amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000114] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 57 ff ff ff 48 89 ef e8 cf 14 15 00 e9 4a ff ff ff e8 f5 05 29 d1 eb c4 0f 0b e9 e7 fe ff ff <0f> 0b eb b9 e8 83 d7 7d d1 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48
[  +0.000003] RSP: 0018:ffffaa4c206dfd48 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff9478d5ce3458 RCX: 0000000000000001
[  +0.000003] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
[  +0.000002] RBP: ffff9478d5ce3400 R08: 0000000000000000 R09: 0000000000000001
[  +0.000003] R10: 0000000000000000 R11: ffffaa4c206dfa18 R12: ffff9478d5ce3458
[  +0.000002] R13: ffff9478d5ce3590 R14: ffff9479156c5e08 R15: ffffffff9345ea20
[  +0.000002] FS:  00007f59f5b50180(0000) GS:ffff9497bff80000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: 0000558d8cbed6f0 CR3: 0000000137bee005 CR4: 00000000007706e0
[  +0.000002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000003]  <TASK>
[  +0.000005]  ttm_bo_release+0x305/0x390 [ttm]
[  +0.000009]  ? __mutex_unlock_slowpath+0x41/0x280
[  +0.000010]  amdgpu_bo_free_kernel+0xd1/0x120 [amdgpu]
[  +0.000115]  smu_sw_fini+0x105/0x210 [amdgpu]
[  +0.000191]  amdgpu_device_fini_sw+0xbc/0x2e0 [amdgpu]
[  +0.000109]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000106]  drm_dev_release+0x20/0x40 [drm]
[  +0.000024]  drm_release+0xa8/0xf0 [drm]
[  +0.000020]  __fput+0xa1/0x260
[  +0.000007]  task_work_run+0x6d/0xb0
[  +0.000008]  exit_to_user_mode_prepare+0x1d3/0x1e0
[  +0.000005]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000003]  do_syscall_64+0x40/0x80
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7f59f5106511
[  +0.000002] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000003] RSP: 002b:00007ffff02c6be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000005] RAX: 0000000000000000 RBX: 000056517da3e6a0 RCX: 00007f59f5106511
[  +0.000002] RDX: 00007f59f50efca0 RSI: 000056517f676100 RDI: 0000000000000003
[  +0.000003] RBP: 000056517da3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000014]  </TASK>
[  +0.000001] irq event stamp: 40947
[  +0.000002] hardirqs last  enabled at (40953): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000004] hardirqs last disabled at (40958): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000003] softirqs last  enabled at (39268): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000004] softirqs last disabled at (39163): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000004] ---[ end trace 90fbe3f286a48d7d ]---
[  +0.000195] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000244] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 26 PID: 3800 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000117] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000063] CPU: 26 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000004] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000003] RIP: 0010:amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000114] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 57 ff ff ff 48 89 ef e8 cf 14 15 00 e9 4a ff ff ff e8 f5 05 29 d1 eb c4 0f 0b e9 e7 fe ff ff <0f> 0b eb b9 e8 83 d7 7d d1 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48
[  +0.000003] RSP: 0018:ffffaa4c206dfd70 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff949847be5058 RCX: 0000000000000001
[  +0.000002] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
[  +0.000003] RBP: ffff949847be5000 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: 0000000000000000 R11: ffffaa4c206dfa40 R12: ffff949847be5058
[  +0.000002] R13: ffff949847be5190 R14: ffff9479156c5e08 R15: ffffffff9345ea20
[  +0.000002] FS:  00007f59f5b50180(0000) GS:ffff9497bff80000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 0000558d8cbed6f0 CR3: 0000000137bee005 CR4: 00000000007706e0
[  +0.000002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000003]  <TASK>
[  +0.000005]  ttm_bo_release+0x305/0x390 [ttm]
[  +0.000009]  ? __mutex_unlock_slowpath+0x41/0x280
[  +0.000009]  amdgpu_bo_free_kernel+0xd1/0x120 [amdgpu]
[  +0.000115]  amdgpu_device_fini_sw+0x259/0x2e0 [amdgpu]
[  +0.000107]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000105]  drm_dev_release+0x20/0x40 [drm]
[  +0.000023]  drm_release+0xa8/0xf0 [drm]
[  +0.000021]  __fput+0xa1/0x260
[  +0.000007]  task_work_run+0x6d/0xb0
[  +0.000007]  exit_to_user_mode_prepare+0x1d3/0x1e0
[  +0.000005]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000003]  do_syscall_64+0x40/0x80
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000004] RIP: 0033:0x7f59f5106511
[  +0.000003] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000002] RSP: 002b:00007ffff02c6be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000005] RAX: 0000000000000000 RBX: 000056517da3e6a0 RCX: 00007f59f5106511
[  +0.000002] RDX: 00007f59f50efca0 RSI: 000056517f676100 RDI: 0000000000000003
[  +0.000002] RBP: 000056517da3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000013]  </TASK>
[  +0.000002] irq event stamp: 41913
[  +0.000002] hardirqs last  enabled at (41919): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000004] hardirqs last disabled at (41924): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000002] softirqs last  enabled at (41384): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000004] softirqs last disabled at (41377): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000004] ---[ end trace 90fbe3f286a48d7e ]---
[  +0.000383] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  +0.000187] ------------[ cut here ]------------
[  +0.000002] WARNING: CPU: 26 PID: 3800 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1313 amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000114] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000062] CPU: 26 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000004] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000003] RIP: 0010:amdgpu_bo_release_notify+0x164/0x170 [amdgpu]
[  +0.000112] Code: ff ff ff 48 39 c2 74 07 0f 0b e9 57 ff ff ff 48 89 ef e8 cf 14 15 00 e9 4a ff ff ff e8 f5 05 29 d1 eb c4 0f 0b e9 e7 fe ff ff <0f> 0b eb b9 e8 83 d7 7d d1 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48
[  +0.000003] RSP: 0018:ffffaa4c206dfd60 EFLAGS: 00010282
[  +0.000004] RAX: 00000000ffffffea RBX: ffff949847be3858 RCX: 0000000000000001
[  +0.000002] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
[  +0.000002] RBP: ffff949847be3800 R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: 0000000000000000 R11: ffffaa4c206dfa30 R12: ffff949847be3858
[  +0.000002] R13: ffff949847be3990 R14: ffff9479156c5e08 R15: ffffffff9345ea20
[  +0.000002] FS:  00007f59f5b50180(0000) GS:ffff9497bff80000(0000) knlGS:0000000000000000
[  +0.000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 0000558d8cbed6f0 CR3: 0000000137bee005 CR4: 00000000007706e0
[  +0.000003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000002]  <TASK>
[  +0.000005]  ttm_bo_release+0x305/0x390 [ttm]
[  +0.000008]  ? __mutex_unlock_slowpath+0x41/0x280
[  +0.000009]  amdgpu_bo_free_kernel+0xd1/0x120 [amdgpu]
[  +0.000115]  gmc_v9_0_sw_fini+0x26/0x40 [amdgpu]
[  +0.000143]  amdgpu_device_fini_sw+0xbc/0x2e0 [amdgpu]
[  +0.000108]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000105]  drm_dev_release+0x20/0x40 [drm]
[  +0.000024]  drm_release+0xa8/0xf0 [drm]
[  +0.000020]  __fput+0xa1/0x260
[  +0.000007]  task_work_run+0x6d/0xb0
[  +0.000008]  exit_to_user_mode_prepare+0x1d3/0x1e0
[  +0.000005]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000003]  do_syscall_64+0x40/0x80
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000003] RIP: 0033:0x7f59f5106511
[  +0.000003] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000002] RSP: 002b:00007ffff02c6be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000004] RAX: 0000000000000000 RBX: 000056517da3e6a0 RCX: 00007f59f5106511
[  +0.000003] RDX: 00007f59f50efca0 RSI: 000056517f676100 RDI: 0000000000000003
[  +0.000002] RBP: 000056517da3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000013]  </TASK>
[  +0.000002] irq event stamp: 42555
[  +0.000002] hardirqs last  enabled at (42561): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000003] hardirqs last disabled at (42566): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000003] softirqs last  enabled at (41384): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000004] softirqs last disabled at (41377): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000004] ---[ end trace 90fbe3f286a48d7f ]---
[  +0.000304] ------------[ cut here ]------------
[  +0.000003] kernfs: can not remove 'mem_info_preempt_used', no directory
[  +0.000006] WARNING: CPU: 26 PID: 3800 at fs/kernfs/dir.c:1536 kernfs_remove_by_name_ns+0x73/0x80
[  +0.000005] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.000057] CPU: 26 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000003] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000002] RIP: 0010:kernfs_remove_by_name_ns+0x73/0x80
[  +0.000003] Code: ff 31 c0 5b 5d 41 5c c3 48 c7 c7 00 81 bb 92 e8 b3 a6 cf ff b8 fe ff ff ff 5b 5d 41 5c c3 48 c7 c7 e0 8c 82 92 e8 ed 5c c8 ff <0f> 0b b8 fe ff ff ff eb d0 0f 1f 40 00 0f 1f 44 00 00 41 57 41 56
[  +0.000002] RSP: 0018:ffffaa4c206dfda8 EFLAGS: 00010282
[  +0.000004] RAX: 0000000000000000 RBX: ffff9479156c0000 RCX: 0000000000000001
[  +0.000002] RDX: 0000000080000001 RSI: ffffffff92800179 RDI: 00000000ffffffff
[  +0.000003] RBP: ffffffffc0b6904d R08: 0000000000000000 R09: 0000000000000001
[  +0.000002] R10: 0000000000000000 R11: ffffaa4c206dfbb0 R12: ffff9479156defe8
[  +0.000002] R13: ffff9479156e0008 R14: ffff9478f8fcfe28 R15: ffffffff9345ea20
[  +0.000002] FS:  00007f59f5b50180(0000) GS:ffff9497bff80000(0000) knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 0000558d8cbed6f0 CR3: 0000000137bee005 CR4: 00000000007706e0
[  +0.000002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000002] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000002]  <TASK>
[  +0.000003]  amdgpu_preempt_mgr_fini+0x67/0xc0 [amdgpu]
[  +0.000130]  amdgpu_ttm_fini+0x125/0x190 [amdgpu]
[  +0.000113]  amdgpu_bo_fini+0x22/0x90 [amdgpu]
[  +0.000113]  gmc_v9_0_sw_fini+0x3a/0x40 [amdgpu]
[  +0.000141]  amdgpu_device_fini_sw+0xbc/0x2e0 [amdgpu]
[  +0.000107]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000105]  drm_dev_release+0x20/0x40 [drm]
[  +0.000023]  drm_release+0xa8/0xf0 [drm]
[  +0.000020]  __fput+0xa1/0x260
[  +0.000007]  task_work_run+0x6d/0xb0
[  +0.000007]  exit_to_user_mode_prepare+0x1d3/0x1e0
[  +0.000004]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000004]  do_syscall_64+0x40/0x80
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000003] RIP: 0033:0x7f59f5106511
[  +0.000003] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.000003] RSP: 002b:00007ffff02c6be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000003] RAX: 0000000000000000 RBX: 000056517da3e6a0 RCX: 00007f59f5106511
[  +0.000003] RDX: 00007f59f50efca0 RSI: 000056517f676100 RDI: 0000000000000003
[  +0.000002] RBP: 000056517da3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000002] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000002] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000013]  </TASK>
[  +0.000001] irq event stamp: 43135
[  +0.000002] hardirqs last  enabled at (43141): [<ffffffff91107d32>] __up_console_sem+0x52/0x60
[  +0.000003] hardirqs last disabled at (43146): [<ffffffff91107d17>] __up_console_sem+0x37/0x60
[  +0.000002] softirqs last  enabled at (41384): [<ffffffff9220034b>] __do_softirq+0x34b/0x492
[  +0.000004] softirqs last disabled at (41377): [<ffffffff91083167>] irq_exit_rcu+0xd7/0xf0
[  +0.000004] ---[ end trace 90fbe3f286a48d80 ]---
[  +0.000131] BUG: unable to handle page fault for address: ffffd3b803fc89b4
[  +0.000031] #PF: supervisor write access in kernel mode
[  +0.000021] #PF: error_code(0x0002) - not-present page
[  +0.000020] PGD 207ffea067 P4D 207ffea067 PUD 207ffe9067 PMD 0
[  +0.000026] Oops: 0002 [#1] PREEMPT SMP PTI
[  +0.000018] CPU: 26 PID: 3800 Comm: amdgpu_test Tainted: G        W         5.16.0-kfd+ #1
[  +0.000030] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  +0.000029] RIP: 0010:__free_pages+0xe/0x90
[  +0.000020] Code: 0c e9 76 fd ff ff 31 f6 e9 6f fd ff ff 31 d2 e9 c8 cd ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 49 89 fc 55 53 89 f3 <f0> ff 4f 34 74 50 48 8b 07 bd 01 00 00 00 a9 00 00 01 00 75 31 83
[  +0.000064] RSP: 0018:ffffaa4c206dfd30 EFLAGS: 00010246
[  +0.000021] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 00000000000ff226
[  +0.000544] RDX: ffff9478dfc5c000 RSI: 0000000000000000 RDI: ffffd3b803fc8980
[  +0.000533] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000246
[  +0.000536] R10: 0000000000000000 R11: 0000000000000000 R12: ffffd3b803fc8980
[  +0.000523] R13: ffff9479156c5e20 R14: ffff9479156c6d88 R15: ffffffffc0501398
[  +0.000524] FS:  00007f59f5b50180(0000) GS:ffff9497bff80000(0000) knlGS:0000000000000000
[  +0.000529] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000522] CR2: ffffd3b803fc89b4 CR3: 0000000137bee005 CR4: 00000000007706e0
[  +0.000532] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000528] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000518] PKRU: 55555554
[  +0.000521] Call Trace:
[  +0.000500]  <TASK>
[  +0.000494]  ttm_pool_free_page+0x68/0x90 [ttm]
[  +0.000506]  ttm_pool_type_fini+0x59/0x70 [ttm]
[  +0.000504]  ttm_pool_fini+0x2d/0x50 [ttm]
[  +0.000499]  ttm_device_fini+0xfc/0x1c0 [ttm]
[  +0.000496]  amdgpu_ttm_fini+0x154/0x190 [amdgpu]
[  +0.000595]  amdgpu_bo_fini+0x22/0x90 [amdgpu]
[  +0.000583]  gmc_v9_0_sw_fini+0x3a/0x40 [amdgpu]
[  +0.000613]  amdgpu_device_fini_sw+0xbc/0x2e0 [amdgpu]
[  +0.000592]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
[  +0.000602]  drm_dev_release+0x20/0x40 [drm]
[  +0.000516]  drm_release+0xa8/0xf0 [drm]
[  +0.000522]  __fput+0xa1/0x260
[  +0.000506]  task_work_run+0x6d/0xb0
[  +0.000516]  exit_to_user_mode_prepare+0x1d3/0x1e0
[  +0.000529]  syscall_exit_to_user_mode+0x19/0x50
[  +0.000537]  do_syscall_64+0x40/0x80
[  +0.000542]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000563] RIP: 0033:0x7f59f5106511
[  +0.000552] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 fa ce 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
[  +0.001237] RSP: 002b:00007ffff02c6be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000641] RAX: 0000000000000000 RBX: 000056517da3e6a0 RCX: 00007f59f5106511
[  +0.000656] RDX: 00007f59f50efca0 RSI: 000056517f676100 RDI: 0000000000000003
[  +0.000657] RBP: 000056517da3e8a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000661] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  +0.000643] R13: 0000000000000000 R14: 0000000000000000 R15: 000056517da3e8a0
[  +0.000642]  </TASK>
[  +0.000627] Modules linked in: amdgpu iommu_v2 gpu_sched nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_filter ip6_tables iptable_filter fuse x86_pkg_temp_thermal acpi_pad ip_tables x_tables ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm drm_panel_orientation_quirks
[  +0.003478] CR2: ffffd3b803fc89b4
[  +0.000671] ---[ end trace 90fbe3f286a48d81 ]---
[  +0.007599] RIP: 0010:__free_pages+0xe/0x90
[  +0.000667] Code: 0c e9 76 fd ff ff 31 f6 e9 6f fd ff ff 31 d2 e9 c8 cd ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 49 89 fc 55 53 89 f3 <f0> ff 4f 34 74 50 48 8b 07 bd 01 00 00 00 a9 00 00 01 00 75 31 83
[  +0.001401] RSP: 0018:ffffaa4c206dfd30 EFLAGS: 00010246
[  +0.000708] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 00000000000ff226
[  +0.000711] RDX: ffff9478dfc5c000 RSI: 0000000000000000 RDI: ffffd3b803fc8980
[  +0.000707] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000246
[  +0.000697] R10: 0000000000000000 R11: 0000000000000000 R12: ffffd3b803fc8980
[  +0.000687] R13: ffff9479156c5e20 R14: ffff9479156c6d88 R15: ffffffffc0501398
[  +0.000699] FS:  00007f59f5b50180(0000) GS:ffff9497bff80000(0000) knlGS:0000000000000000
[  +0.000709] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000700] CR2: ffffd3b803fc89b4 CR3: 0000000137bee005 CR4: 00000000007706e0
[  +0.000716] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000727] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000723] PKRU: 55555554



p.s. I cloned and built libdrm from source (https://gitlab.freedesktop.org/mesa/drm).

Thank you so much!


Andrey


+ }
+ }
+ srcu_read_unlock(&kfd_processes_srcu, idx);
+ }
+
  kfd->dqm->ops.stop(kfd->dqm);
  kfd_iommu_suspend(kfd);
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index 600ba2a728ea..7e3d1848eccc 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -669,11 +669,12 @@ static void kfd_remove_sysfs_node_entry(struct kfd_topology_device *dev)
 #ifdef HAVE_AMD_IOMMU_PC_SUPPORTED
  if (dev->kobj_perf) {
  list_for_each_entry(perf, &dev->perf_props, list) {
+ sysfs_remove_group(dev->kobj_perf, perf->attr_group);
  kfree(perf->attr_group);
  perf->attr_group = NULL;
  }
  kobject_del(dev->kobj_perf);
- kobject_put(dev->kobj_perf);
+ /* kobject_put(dev->kobj_perf); */
  dev->kobj_perf = NULL;
  }
 #endif

Thank you so much! Looking forward to your comments!

Regards,
Shuotao

Andrey


Thank you so much!

Best regards,
Shuotao

Andrey



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 8fa9b86ac9d2..c0b27f722281 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -188,6 +188,12 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
  kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
 }

+void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev)
+{
+ if (adev->kfd.dev)
+ kgd2kfd_kill_all_user_processes(adev->kfd.dev);
+}
+
 void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
 {
  if (adev->kfd.dev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 27c74fcec455..f4e485d60442 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -141,6 +141,7 @@ struct amdkfd_process_info {
 int amdgpu_amdkfd_init(void);
 void amdgpu_amdkfd_fini(void);

+void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev);
 void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
 int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
 int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm, bool sync);
@@ -405,6 +406,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
  const struct kgd2kfd_shared_resources *gpu_resources);
 void kgd2kfd_device_exit(struct kfd_dev *kfd);
 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force);
+void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd);
 int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
 int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm, bool sync);
 int kgd2kfd_pre_reset(struct kfd_dev *kfd);
@@ -443,6 +445,9 @@ static inline void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
 {
 }

+static inline void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd)
+{
+}
+
 static int __maybe_unused kgd2kfd_resume_iommu(struct kfd_dev *kfd)
 {
  return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 3d5fc0751829..af6fe5080cfa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2101,6 +2101,9 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 {
  struct drm_device *dev = pci_get_drvdata(pdev);

+ /* kill all kfd processes before drm_dev_unplug */
+ amdgpu_amdkfd_kill_all_processes(drm_to_adev(dev));
+
 #ifdef HAVE_DRM_DEV_UNPLUG
  drm_dev_unplug(dev);
 #else
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 5504a18b5a45..480c23bef5e2 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -691,6 +691,12 @@ bool kfd_is_locked(void)
  return  (atomic_read(&kfd_locked) > 0);
 }

+void kgd2kfd_kill_all_user_processes(struct kfd_dev *dev)
+{
+ kfd_kill_all_user_processes();
+}
+
 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
 {
  if (!kfd->init_complete)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 55c9e1922714..a35a2cb5bb9f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1064,6 +1064,7 @@ static inline struct kfd_process_device *kfd_process_device_from_gpuidx(
 void kfd_unref_process(struct kfd_process *p);
 int kfd_process_evict_queues(struct kfd_process *p, bool force);
 int kfd_process_restore_queues(struct kfd_process *p);
+void kfd_kill_all_user_processes(void);
 void kfd_suspend_all_processes(bool force);
 /*
  * kfd_resume_all_processes:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 6cdc855abb6d..17e769e6951d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -46,6 +46,9 @@ struct mm_struct;
 #include "kfd_trace.h"
 #include "kfd_debug.h"

+static atomic_t kfd_process_locked = ATOMIC_INIT(0);
+static atomic_t kfd_inflight_kills = ATOMIC_INIT(0);
+
 /*
  * List of struct kfd_process (field kfd_process).
  * Unique/indexed by mm_struct*
@@ -802,6 +805,9 @@ struct kfd_process *kfd_create_process(struct task_struct *thread)
  struct kfd_process *process;
  int ret;

+ if (atomic_read(&kfd_process_locked) > 0)
+ return ERR_PTR(-EINVAL);
+
  if (!(thread->mm && mmget_not_zero(thread->mm)))
  return ERR_PTR(-EINVAL);

@@ -1126,6 +1132,10 @@ static void kfd_process_wq_release(struct work_struct *work)
  put_task_struct(p->lead_thread);

  kfree(p);
+
+ if (atomic_read(&kfd_process_locked) > 0)
+ atomic_dec(&kfd_inflight_kills);
 }

 static void kfd_process_ref_release(struct kref *ref)
@@ -2186,6 +2196,35 @@ static void restore_process_worker(struct work_struct *work)
  pr_err("Failed to restore queues of pasid 0x%x\n", p->pasid);
 }

+void kfd_kill_all_user_processes(void)
+{
+ struct kfd_process *p;
+ unsigned int temp;
+ int idx;
+ atomic_inc(&kfd_process_locked);
+
+ idx = srcu_read_lock(&kfd_processes_srcu);
+ pr_info("Killing all processes\n");
+ hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+ dev_warn(kfd_device,
+ "Sending SIGBUS to process %d (pasid 0x%x)\n",
+ p->lead_thread->pid, p->pasid);
+ send_sig(SIGBUS, p->lead_thread, 0);
+ atomic_inc(&kfd_inflight_kills);
+ }
+ srcu_read_unlock(&kfd_processes_srcu, idx);
+
+ while (atomic_read(&kfd_inflight_kills) > 0) {
+ dev_warn(kfd_device,
+ "kfd_processes_table is not empty, going to sleep for 10ms\n");
+ msleep(10);
+ }
+
+ atomic_dec(&kfd_process_locked);
+ pr_info("All processes have been fully released\n");
+}
+
 void kfd_suspend_all_processes(bool force)
 {
  struct kfd_process *p;



Regards,
Shuotao



Andrey


+       }
+       srcu_read_unlock(&kfd_processes_srcu, idx);
+}
+
+
 int kfd_resume_all_processes(bool sync)
 {
        struct kfd_process *p;


Andrey



Really appreciate your help!

Best,
Shuotao

2. Remove redundant p2p/io links in sysfs when a device is hotplugged
out.

3. A new KFD node_id is not properly assigned after a device is added
following a GPU hot-unplug. libhsakmt detects this anomaly
(i.e. node_from != <dev node id> in iolinks) when taking a
topology snapshot, and returns a fault to the ROCm stack.

-- This patch fixes issue 1; another patch by Mukul fixes issues 2 & 3.
-- Tested on a 4-GPU MI100 node with kernel 5.13.0-kfd; kernel
5.16.0-kfd is unstable out of the box for MI100.
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 +++++
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 7 +++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 13 +++++++++++++
4 files changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index c18c4be1e4ac..d50011bdb5c4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
return r;
}

+int amdgpu_amdkfd_resume_processes(void)
+{
+ return kgd2kfd_resume_processes();
+}
+
int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
{
int r = 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index f8b9f27adcf5..803306e011c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
+int amdgpu_amdkfd_resume_processes(void);
void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
const void *ih_ring_entry);
void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
@@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
+int kgd2kfd_resume_processes(void);
int kgd2kfd_pre_reset(struct kfd_dev *kfd);
int kgd2kfd_post_reset(struct kfd_dev *kfd);
void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
@@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
return 0;
}

+static inline int kgd2kfd_resume_processes(void)
+{
+ return 0;
+}
+
static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
{
return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index fa4a9f13c922..5827b65b7489 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
if (drm_dev_is_unplugged(adev_to_drm(adev)))
amdgpu_device_unmap_mmio(adev);

+ amdgpu_amdkfd_resume_processes();
}

void amdgpu_device_fini_sw(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 62aa6c9d5123..ef05aae9255e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
return ret;
}

+/* for non-runtime resume only */
+int kgd2kfd_resume_processes(void)
+{
+ int count;
+
+ count = atomic_dec_return(&kfd_locked);
+ WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
+ if (count == 0)
+ return kfd_resume_all_processes();
+
+ return 0;
+}

It doesn't make sense to me to just increment kfd_locked in
kgd2kfd_suspend only to decrement it again a few functions down the
road.

I suggest this instead - you only increment if not during PCI remove

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 1c2cf3a33c1f..7754f77248a4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -952,11 +952,12 @@ bool kfd_is_locked(void)

void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
{
+
if (!kfd->init_complete)
return;

/* for runtime suspend, skip locking kfd */
- if (!run_pm) {
+ if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
/* For first KFD device suspend all the KFD processes */
if (atomic_inc_return(&kfd_locked) == 1)
kfd_suspend_all_processes();


Andrey



+
int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
{
int err = 0;







[-- Attachment #2: Type: text/html, Size: 522680 bytes --]

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-20  9:24                             ` Shuotao Xu
@ 2022-04-20 15:44                               ` Andrey Grodzovsky
  2022-04-20 18:41                                 ` Andrey Grodzovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Andrey Grodzovsky @ 2022-04-20 15:44 UTC (permalink / raw)
  To: Shuotao Xu
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 1530 bytes --]

The only one in Radeon 7 I see is the same sysfs crash we already fixed,
so you can use the same fix. The MI200 issue I haven't seen yet, but I
also haven't tested MI200, so never saw it before. Need to test when I
get the time.

So try that fix with Radeon 7 again to see if you pass the tests (the
warnings should all be minor issues).

Andrey


On 2022-04-20 05:24, Shuotao Xu wrote:
>>
>> That a problem, latest working baseline I tested and confirmed 
>> passing hotplug tests is this branch and commit 
>> https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6 
>> which is amd-staging-drm-next. 5.14 was the branch we upstreamed the 
>> hotplug code but it had a lot of regressions over time due to new 
>> changes (that's why I added the hotplug test to try and catch them 
>> early). It would be best to run this branch on mi-100 so we have a 
>> clean baseline and only after confirming  this particular branch from 
>> this commits passes libdrm tests only then start adding the KFD 
>> specific addons. Another option if you can't work with MI-100 and 
>> this branch is to try a different ASIC that does work with this 
>> branch (if possible).
>>
>> Andrey
>>
> OK I tried both this commit and the HEAD of amd-staging-drm-next on 
> two GPUs (MI100 and Radeon VII); both did not pass the hotplug-out libdrm 
> test. I might be able to gain access to MI200, but I suspect it would 
> not work.
>
> I copied the complete dmesgs as follows. I highlighted the OOPSES for you.
>
> Radeon VII:

[-- Attachment #2: Type: text/html, Size: 3373 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-20 15:44                               ` Andrey Grodzovsky
@ 2022-04-20 18:41                                 ` Andrey Grodzovsky
  2022-04-27  9:20                                   ` Shuotao Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Andrey Grodzovsky @ 2022-04-20 18:41 UTC (permalink / raw)
  To: Shuotao Xu
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 3983 bytes --]

I retested the hot plug tests at the commit I mentioned below - looks OK; 
my ASIC is Navi 10. I also tested using Vega 10 and older Polaris ASICs 
(whatever I had at home at the time). It's possible there are extra 
issues in ASICs like yours which I didn't cover during the tests.

andrey@andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test -s 13
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support UVD, suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support VCE, suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support UVD ENC, suite disabled.
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


Don't support TMZ (trust memory zone), security suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
Peer device is not opened or has ASIC not supported by the suite, skip 
all Peer to Peer tests.


      CUnit - A unit testing framework for C - Version 2.1-3
http://cunit.sourceforge.net/


Suite: Hotunplug Tests
  Test: Unplug card and rescan the bus to plug it back .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
  Test: Same as first test but with command submission .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
  Test: Unplug with exported bo .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed

Run Summary:    Type  Total    Ran Passed Failed Inactive
               suites     14      1    n/a      0        0
                tests     71      3      3      0        1
              asserts     21     21     21      0      n/a

Elapsed time =    9.195 seconds


Andrey

On 2022-04-20 11:44, Andrey Grodzovsky wrote:
>
> The only one in Radeon 7 I see is the same sysfs crash we already 
> fixed so you can use the same fix. The MI 200 issue i haven't seen yet 
> but I also haven't tested MI200 so never saw it before. Need to test 
> when i get the time.
>
> So try that fix with Radeon 7 again to see if you pass the tests (the 
> warnings should all be minor issues).
>
> Andrey
>
>
> On 2022-04-20 05:24, Shuotao Xu wrote:
>>>
>>> That a problem, latest working baseline I tested and confirmed 
>>> passing hotplug tests is this branch and commit 
>>> https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6 
>>> which is amd-staging-drm-next. 5.14 was the branch we upstreamed the 
>>> hotplug code but it had a lot of regressions over time due to new 
>>> changes (that's why I added the hotplug test to try and catch them 
>>> early). It would be best to run this branch on mi-100 so we have a 
>>> clean baseline and only after confirming  this particular branch 
>>> from this commits passes libdrm tests only then start adding the KFD 
>>> specific addons. Another option if you can't work with MI-100 and 
>>> this branch is to try a different ASIC that does work with this 
>>> branch (if possible).
>>>
>>> Andrey
>>>
>> OK I tried both this commit and the HEAD of amd-staging-drm-next on 
>> two GPUs (MI100 and Radeon VII); both did not pass the hotplug-out libdrm 
>> test. I might be able to gain access to MI200, but I suspect it would 
>> not work.
>>
>> I copied the complete dmesgs as follows. I highlighted the OOPSES for 
>> you.
>>
>> Radeon VII:

[-- Attachment #2: Type: text/html, Size: 7293 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-20 18:41                                 ` Andrey Grodzovsky
@ 2022-04-27  9:20                                   ` Shuotao Xu
  2022-04-27 16:04                                     ` Andrey Grodzovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Shuotao Xu @ 2022-04-27  9:20 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 15143 bytes --]

Hi Andrey,

Sorry that I did not have time to work on this for a few days.

I just tried the sysfs crash fix on Radeon VII and it seems that it worked. It did not pass the last hotplug test, but my version has 4 tests instead of 3 as in your case.

root@NETSYS26:/home/shuotaoxu/workspace/drm/build# ./tests/amdgpu/amdgpu_test -s 13
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support VCN, suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support JPEG, suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


Don't support TMZ (trust memory zone), security suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
Peer device is not opened or has ASIC not supported by the suite, skip all Peer to Peer tests.


     CUnit - A unit testing framework for C - Version 2.1-3
     http://cunit.sourceforge.net/


Suite: Hotunplug Tests
  Test: Unplug card and rescan the bus to plug it back .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
  Test: Same as first test but with command submission .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
  Test: Unplug with exported bo .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
  Test: Unplug with exported fence .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
FAILED
    1. ../tests/amdgpu/hotunplug_tests.c:368  - CU_ASSERT_EQUAL(r,0)
    2. ../tests/amdgpu/hotunplug_tests.c:411  - CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd, &sync_obj_handle2),0)
    3. ../tests/amdgpu/hotunplug_tests.c:423  - CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2, 1, 100000000, 0, NULL),0)
    4. ../tests/amdgpu/hotunplug_tests.c:425  - CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2, sync_obj_handle2),0)

Run Summary:    Type  Total    Ran Passed Failed Inactive
              suites     14      1    n/a      0        0
               tests     71      4      3      1        0
             asserts     39     39     35      4      n/a

Elapsed time =   17.321 seconds

For KFD compute, there is a problem which I did not see on MI100 after I killed the hung application after hot-plugout. I was using the rocm5.0.2 driver for the MI100 card, and am not sure if it is a regression from the newer driver.
After pkill, one of the children of the user process would be stuck in zombie mode (Z), understandably because of the bug, and future rocm applications after plug-back would be in uninterruptible sleep mode (D) because they would not return from the syscall to kfd.

Although the drm test for amdgpu would run just fine without issues after plug-back with the dangling kfd state.

I don’t know if there is a quick fix for it. I was thinking of adding drm_dev_enter/drm_dev_exit to amdgpu_device_rreg.
Also this has been a long-running attempt of mine to fix the hotplug issue for kfd applications.
I don’t know 1) if I would be able to get to MI100 (fixing Radeon VII would mean something, but MI100 is more important for us); 2) what direction the patch for this issue will take.

Regards,
Shuotao

[  +0.001645] BUG: unable to handle page fault for address: 0000000000058a68
[  +0.001298] #PF: supervisor read access in kernel mode
[  +0.001252] #PF: error_code(0x0000) - not-present page
[  +0.001248] PGD 8000000115806067 P4D 8000000115806067 PUD 109b2d067 PMD 0
[  +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI
[  +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: G        W   E     5.16.0+ #3
[  +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
[  +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
[  +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
[  +0.001388] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 00000000ffffffff
[  +0.001402] RDX: 0000000000000000 RSI: 000000000001629a RDI: ffff8b0c9c840000
[  +0.001418] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 0000000000000001
[  +0.001421] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 0000000000058a68
[  +0.001400] R13: 000000000001629a R14: 0000000000000000 R15: 000000000001629a
[  +0.001397] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) knlGS:0000000000000000
[  +0.001411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.001405] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 00000000001706e0
[  +0.001422] Call Trace:
[  +0.001407]  <TASK>
[  +0.001391]  amdgpu_device_rreg+0x17/0x20 [amdgpu]
[  +0.001614]  amdgpu_cgs_read_register+0x14/0x20 [amdgpu]
[  +0.001735]  phm_wait_for_register_unequal.part.1+0x58/0x90 [amdgpu]
[  +0.001790]  phm_wait_for_register_unequal+0x1a/0x30 [amdgpu]
[  +0.001800]  vega20_wait_for_response+0x28/0x80 [amdgpu]
[  +0.001757]  vega20_send_msg_to_smc_with_parameter+0x21/0x110 [amdgpu]
[  +0.001838]  smum_send_msg_to_smc_with_parameter+0xcd/0x100 [amdgpu]
[  +0.001829]  ? kvfree+0x1e/0x30
[  +0.001462]  vega20_set_power_profile_mode+0x58/0x330 [amdgpu]
[  +0.001868]  ? kvfree+0x1e/0x30
[  +0.001462]  ? ttm_bo_release+0x261/0x370 [ttm]
[  +0.001467]  pp_dpm_switch_power_profile+0xc2/0x170 [amdgpu]
[  +0.001863]  amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]
[  +0.001866]  amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]
[  +0.001784]  kfd_dec_compute_active+0x2c/0x50 [amdgpu]
[  +0.001744]  process_termination_cpsch+0x2f9/0x3a0 [amdgpu]
[  +0.001728]  kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu]
[  +0.001730]  kfd_process_notifier_release+0x91/0xe0 [amdgpu]
[  +0.001718]  __mmu_notifier_release+0x77/0x1f0
[  +0.001411]  exit_mmap+0x1b5/0x200
[  +0.001396]  ? __switch_to+0x12d/0x3e0
[  +0.001388]  ? __switch_to_asm+0x36/0x70
[  +0.001372]  ? preempt_count_add+0x74/0xc0
[  +0.001364]  mmput+0x57/0x110
[  +0.001349]  do_exit+0x33d/0xc20
[  +0.001337]  ? _raw_spin_unlock+0x1a/0x30
[  +0.001346]  do_group_exit+0x43/0xa0
[  +0.001341]  get_signal+0x131/0x920
[  +0.001295]  arch_do_signal_or_restart+0xb1/0x870
[  +0.001303]  ? do_futex+0x125/0x190
[  +0.001285]  exit_to_user_mode_prepare+0xb1/0x1c0
[  +0.001282]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.001264]  do_syscall_64+0x46/0xb0
[  +0.001236]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.001219] RIP: 0033:0x7f6aff1d2ad3
[  +0.001177] Code: Unable to access opcode bytes at RIP 0x7f6aff1d2aa9.
[  +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[  +0.001170] RAX: fffffffffffffe00 RBX: 0000000004f542b0 RCX: 00007f6aff1d2ad3
[  +0.001168] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000004f542d8
[  +0.001162] RBP: 0000000004f542d4 R08: 0000000000000000 R09: 0000000000000000
[  +0.001152] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000004f542d8
[  +0.001176] R13: 0000000000000000 R14: 0000000004f54288 R15: 0000000000000000
[  +0.001152]  </TASK>
[  +0.001113] Modules linked in: veth amdgpu(E) nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi snd_hda_intel ipmi_ssif snd_intel_dspcfg coretemp snd_hda_codec kvm_intel snd_hda_core snd_hwdep snd_pcm snd_timer snd kvm soundcore irqbypass ftdi_sio usbserial input_leds iTCO_wdt iTCO_vendor_support joydev mei_me rapl lpc_ich intel_cstate mei ipmi_si ipmi_devintf ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456
[  +0.000102]  async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul hid_generic crc32_pclmul ghash_clmulni_intel usbhid uas aesni_intel crypto_simd igb ahci hid drm usb_storage cryptd libahci dca megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
[  +0.016626] CR2: 0000000000058a68
[  +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---
[  +0.024953] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
[  +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
[  +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
[  +0.001641] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 00000000ffffffff
[  +0.001656] RDX: 0000000000000000 RSI: 000000000001629a RDI: ffff8b0c9c840000
[  +0.001681] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 0000000000000001
[  +0.001662] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 0000000000058a68
[  +0.001650] R13: 000000000001629a R14: 0000000000000000 R15: 000000000001629a
[  +0.001648] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) knlGS:0000000000000000
[  +0.001668] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.001673] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 00000000001706e0
[  +0.001740] Fixing recursive fault but reboot is needed!


On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>> wrote:


I retested hot plug tests at the commit I mentioned bellow - looks ok, my ASIC is Navi 10, I also tested using Vega 10 and older Polaris ASICs (whatever i had at home at the time). It's possible there are extra issues in ASICs like ur which I didn't cover during tests.

andrey@andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test -s 13
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support UVD, suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support VCE, suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support UVD ENC, suite disabled.
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


Don't support TMZ (trust memory zone), security suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
Peer device is not opened or has ASIC not supported by the suite, skip all Peer to Peer tests.


     CUnit - A unit testing framework for C - Version 2.1-3
     http://cunit.sourceforge.net/


Suite: Hotunplug Tests
  Test: Unplug card and rescan the bus to plug it back .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
  Test: Same as first test but with command submission .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
  Test: Unplug with exported bo .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed

Run Summary:    Type  Total    Ran Passed Failed Inactive
              suites     14      1    n/a      0        0
               tests     71      3      3      0        1
             asserts     21     21     21      0      n/a

Elapsed time =    9.195 seconds


Andrey

On 2022-04-20 11:44, Andrey Grodzovsky wrote:

The only one in Radeon 7 I see is the same sysfs crash we already fixed so you can use the same fix. The MI 200 issue i haven't seen yet but I also haven't tested MI200 so never saw it before. Need to test when i get the time.

So try that fix with Radeon 7 again to see if you pass the tests (the warnings should all be minor issues).

Andrey


On 2022-04-20 05:24, Shuotao Xu wrote:

That's a problem. The latest working baseline I tested and confirmed passing hotplug tests is this branch and commit https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6 which is amd-staging-drm-next. 5.14 was the branch we upstreamed the hotplug code, but it had a lot of regressions over time due to new changes (that's why I added the hotplug test, to try and catch them early). It would be best to run this branch on mi-100 so we have a clean baseline, and only after confirming this particular branch from this commit passes the libdrm tests should you start adding the KFD specific addons. Another option, if you can't work with MI-100 and this branch, is to try a different ASIC that does work with this branch (if possible).

Andrey

OK I tried both this commit and the HEAD of amd-staging-drm-next on two GPUs (MI100 and Radeon VII); both did not pass the hotplug-out libdrm test. I might be able to gain access to MI200, but I suspect it would not work.

I copied the complete dmesgs as follows. I highlighted the OOPSES for you.

Radeon VII:


[-- Attachment #2: Type: text/html, Size: 23138 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-27  9:20                                   ` Shuotao Xu
@ 2022-04-27 16:04                                     ` Andrey Grodzovsky
  2022-05-10 11:03                                       ` Shuotao Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Andrey Grodzovsky @ 2022-04-27 16:04 UTC (permalink / raw)
  To: Shuotao Xu
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 15087 bytes --]

On 2022-04-27 05:20, Shuotao Xu wrote:

> Hi Andrey,
>
> Sorry that I did not have time to work on this for a few days.
>
> I just tried the sysfs crash fix on Radeon VII and it seems that it 
> worked. It did not pass the last hotplug test, but my version has 4 
> tests instead of 3 as in your case.


That's because the 4th one is only enabled when there are 2 cards in the 
system - to test DRI_PRIME export. I tested this time with only one card.

>
>
> Suite: Hotunplug Tests
>   Test: Unplug card and rescan the bus to plug it back 
> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
> passed
>   Test: Same as first test but with command submission 
> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
> passed
>   Test: Unplug with exported bo .../usr/local/share/libdrm/amdgpu.ids: 
> No such file or directory
> passed
>   Test: Unplug with exported fence 
> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
> amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)


On the kernel side - the IOCTL returning this is drm_getclient - maybe 
take a look at why it can't find the client? I didn't have such an issue as 
far as I remember when testing.


> FAILED
>     1. ../tests/amdgpu/hotunplug_tests.c:368  - CU_ASSERT_EQUAL(r,0)
>     2. ../tests/amdgpu/hotunplug_tests.c:411  - 
> CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd, 
> &sync_obj_handle2),0)
>     3. ../tests/amdgpu/hotunplug_tests.c:423  - 
> CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2, 1, 
> 100000000, 0, NULL),0)
>     4. ../tests/amdgpu/hotunplug_tests.c:425  - 
> CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2, sync_obj_handle2),0)
>
> Run Summary:    Type  Total    Ran Passed Failed Inactive
>               suites     14      1    n/a      0      0
>                tests     71      4      3      1      0
>              asserts     39     39     35      4    n/a
>
> Elapsed time =   17.321 seconds
>
> For kfd compute, there is some problem which I did not see in MI100 
> after I killed the hung application after hot plugout. I was using 
> rocm5.0.2 driver for MI100 card, and not sure if it is a regression 
> from the newer driver.
> After pkill, one of child of user process would be stuck in Zombie 
> mode (Z) understandably because of the bug, and future rocm 
> application after plug-back would in uninterrupted sleep mode (D) 
> because it would not return from syscall to kfd.
>
> Although drm test for amdgpu would run just fine without issues after 
> plug-back with dangling kfd state.


I am not clear on when the crash below happens. Is it related to what you 
describe above?


>
> I don’t know if there is a quick fix for it. I was thinking of adding 
> drm_dev_enter/drm_dev_exit to amdgpu_device_rreg.


Try adding a drm_dev_enter/exit pair at the highest level of attempting to 
access HW - in this case it's amdgpu_amdkfd_set_compute_idle. We always 
try to avoid accessing any HW functions after the backing device is gone.


> Also this has been a long time in my attempt to fix hotplug issue for 
> kfd application.
> I don’t know 1) if I would be able to get to MI100 (fixing Radeon VII 
> would mean something but MI100 is more important for us); 2) what the 
> direct of the patch to this issue will move forward.


I will go to the office tomorrow to pick up an MI-100. With time and priorities 
permitting I will then try to test it and fix any bugs such that it 
will be passing all hot plug libdrm tests at the tip of the public 
amd-staging-drm-next - https://gitlab.freedesktop.org/agd5f/linux; after 
that you can try to continue working on ROCm enabling on top of that.

For now I suggest you move on with Radeon 7 as your development 
ASIC and use the fix I mentioned above.

Andrey


>
> Regards,
> Shuotao
>
> [  +0.001645] BUG: unable to handle page fault for address: 
> 0000000000058a68
> [  +0.001298] #PF: supervisor read access in kernel mode
> [  +0.001252] #PF: error_code(0x0000) - not-present page
> [  +0.001248] PGD 8000000115806067 P4D 8000000115806067 PUD 109b2d067 
> PMD 0
> [  +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI
> [  +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: G       
>  W   E     5.16.0+ #3
> [  +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 
> 1.5.4 [FPGA Test BIOS] 10/002/2015
> [  +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
> [  +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 
> 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 
> 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
> [  +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
> [  +0.001388] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 
> 00000000ffffffff
> [  +0.001402] RDX: 0000000000000000 RSI: 000000000001629a RDI: 
> ffff8b0c9c840000
> [  +0.001418] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 
> 0000000000000001
> [  +0.001421] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 
> 0000000000058a68
> [  +0.001400] R13: 000000000001629a R14: 0000000000000000 R15: 
> 000000000001629a
> [  +0.001397] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) 
> knlGS:0000000000000000
> [  +0.001411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  +0.001405] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 
> 00000000001706e0
> [  +0.001422] Call Trace:
> [  +0.001407]  <TASK>
> [  +0.001391]  amdgpu_device_rreg+0x17/0x20 [amdgpu]
> [  +0.001614]  amdgpu_cgs_read_register+0x14/0x20 [amdgpu]
> [  +0.001735]  phm_wait_for_register_unequal.part.1+0x58/0x90 [amdgpu]
> [  +0.001790]  phm_wait_for_register_unequal+0x1a/0x30 [amdgpu]
> [  +0.001800]  vega20_wait_for_response+0x28/0x80 [amdgpu]
> [  +0.001757]  vega20_send_msg_to_smc_with_parameter+0x21/0x110 [amdgpu]
> [  +0.001838]  smum_send_msg_to_smc_with_parameter+0xcd/0x100 [amdgpu]
> [  +0.001829]  ? kvfree+0x1e/0x30
> [  +0.001462]  vega20_set_power_profile_mode+0x58/0x330 [amdgpu]
> [  +0.001868]  ? kvfree+0x1e/0x30
> [  +0.001462]  ? ttm_bo_release+0x261/0x370 [ttm]
> [  +0.001467]  pp_dpm_switch_power_profile+0xc2/0x170 [amdgpu]
> [  +0.001863]  amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]
> [  +0.001866]  amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]
> [  +0.001784]  kfd_dec_compute_active+0x2c/0x50 [amdgpu]
> [  +0.001744]  process_termination_cpsch+0x2f9/0x3a0 [amdgpu]
> [  +0.001728]  kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu]
> [  +0.001730]  kfd_process_notifier_release+0x91/0xe0 [amdgpu]
> [  +0.001718]  __mmu_notifier_release+0x77/0x1f0
> [  +0.001411]  exit_mmap+0x1b5/0x200
> [  +0.001396]  ? __switch_to+0x12d/0x3e0
> [  +0.001388]  ? __switch_to_asm+0x36/0x70
> [  +0.001372]  ? preempt_count_add+0x74/0xc0
> [  +0.001364]  mmput+0x57/0x110
> [  +0.001349]  do_exit+0x33d/0xc20
> [  +0.001337]  ? _raw_spin_unlock+0x1a/0x30
> [  +0.001346]  do_group_exit+0x43/0xa0
> [  +0.001341]  get_signal+0x131/0x920
> [  +0.001295]  arch_do_signal_or_restart+0xb1/0x870
> [  +0.001303]  ? do_futex+0x125/0x190
> [  +0.001285]  exit_to_user_mode_prepare+0xb1/0x1c0
> [  +0.001282]  syscall_exit_to_user_mode+0x2a/0x40
> [  +0.001264]  do_syscall_64+0x46/0xb0
> [  +0.001236]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  +0.001219] RIP: 0033:0x7f6aff1d2ad3
> [  +0.001177] Code: Unable to access opcode bytes at RIP 0x7f6aff1d2aa9.
> [  +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS: 00000246 ORIG_RAX: 
> 00000000000000ca
> [  +0.001170] RAX: fffffffffffffe00 RBX: 0000000004f542b0 RCX: 
> 00007f6aff1d2ad3
> [  +0.001168] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 
> 0000000004f542d8
> [  +0.001162] RBP: 0000000004f542d4 R08: 0000000000000000 R09: 
> 0000000000000000
> [  +0.001152] R10: 0000000000000000 R11: 0000000000000246 R12: 
> 0000000004f542d8
> [  +0.001176] R13: 0000000000000000 R14: 0000000004f54288 R15: 
> 0000000000000000
> [  +0.001152]  </TASK>
> [  +0.001113] Modules linked in: veth amdgpu(E) nf_conntrack_netlink 
> nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM 
> iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack 
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 
> xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter 
> ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 
> xfrm_algo intel_rapl_msr intel_rapl_common sb_edac 
> x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi snd_hda_intel 
> ipmi_ssif snd_intel_dspcfg coretemp snd_hda_codec kvm_intel 
> snd_hda_core snd_hwdep snd_pcm snd_timer snd kvm soundcore irqbypass 
> ftdi_sio usbserial input_leds iTCO_wdt iTCO_vendor_support joydev 
> mei_me rapl lpc_ich intel_cstate mei ipmi_si ipmi_devintf 
> ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel ib_iser rdma_cm 
> iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi 
> scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic 
> zstd_compress raid10 raid456
> [  +0.000102]  async_raid6_recov async_memcpy async_pq async_xor 
> async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear iommu_v2 
> gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper drm_kms_helper 
> syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul 
> hid_generic crc32_pclmul ghash_clmulni_intel usbhid uas aesni_intel 
> crypto_simd igb ahci hid drm usb_storage cryptd libahci dca 
> megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
> [  +0.016626] CR2: 0000000000058a68
> [  +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---
> [  +0.024953] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
> [  +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 
> 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 
> 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
> [  +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
> [  +0.001641] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 
> 00000000ffffffff
> [  +0.001656] RDX: 0000000000000000 RSI: 000000000001629a RDI: 
> ffff8b0c9c840000
> [  +0.001681] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 
> 0000000000000001
> [  +0.001662] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 
> 0000000000058a68
> [  +0.001650] R13: 000000000001629a R14: 0000000000000000 R15: 
> 000000000001629a
> [  +0.001648] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) 
> knlGS:0000000000000000
> [  +0.001668] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  +0.001673] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 
> 00000000001706e0
> [  +0.001740] Fixing recursive fault but reboot is needed!
>
>
>> On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky 
>> <andrey.grodzovsky@amd.com> wrote:
>>
>> I retested the hot plug tests at the commit I mentioned below - looks 
>> ok. My ASIC is Navi 10; I also tested using Vega 10 and older Polaris 
>> ASICs (whatever I had at home at the time). It's possible there are 
>> extra issues in ASICs like yours which I didn't cover during tests.
>>
>> andrey@andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test -s 13
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>>
>> The ASIC NOT support UVD, suite disabled
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>>
>> The ASIC NOT support VCE, suite disabled
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>>
>> The ASIC NOT support UVD ENC, suite disabled.
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>>
>> Don't support TMZ (trust memory zone), security suite disabled
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> Peer device is not opened or has ASIC not supported by the suite, 
>> skip all Peer to Peer tests.
>>
>>
>>      CUnit - A unit testing framework for C - Version 2.1-3
>> http://cunit.sourceforge.net/
>>
>>
>> Suite: Hotunplug Tests
>>   Test: Unplug card and rescan the bus to plug it back 
>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> passed
>>   Test: Same as first test but with command submission 
>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> passed
>>   Test: Unplug with exported bo 
>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> passed
>>
>> Run Summary:    Type  Total    Ran Passed Failed Inactive
>>               suites     14      1    n/a 0        0
>>                tests     71      3      3 0        1
>>              asserts     21     21     21      0 n/a
>>
>> Elapsed time =    9.195 seconds
>>
>>
>> Andrey
>>
>> On 2022-04-20 11:44, Andrey Grodzovsky wrote:
>>>
>>> The only one on Radeon 7 I see is the same sysfs crash we already 
>>> fixed, so you can use the same fix. The MI200 issue I haven't seen 
>>> yet, but I also haven't tested MI200, so I never saw it before. I 
>>> need to test when I get the time.
>>>
>>> So try that fix with Radeon 7 again to see if you pass the tests 
>>> (the warnings should all be minor issues).
>>>
>>> Andrey
>>>
>>>
>>> On 2022-04-20 05:24, Shuotao Xu wrote:
>>>>>
>>>>> That's a problem; the latest working baseline I tested and confirmed 
>>>>> passing the hotplug tests is this branch and commit 
>>>>> https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6 
>>>>> which is amd-staging-drm-next. 5.14 was the branch we upstreamed 
>>>>> the hotplug code on, but it had a lot of regressions over time due 
>>>>> to new changes (that is why I added the hotplug test, to try and 
>>>>> catch them early). It would be best to run this branch on the 
>>>>> MI-100 so we have a clean baseline, and only after confirming this 
>>>>> particular branch at this commit passes the libdrm tests should you 
>>>>> start adding the KFD-specific add-ons. Another option, if you can't 
>>>>> work with the MI-100 and this branch, is to try a different ASIC 
>>>>> that does work with this branch (if possible).
>>>>>
>>>>> Andrey
>>>>>
>>>> OK, I tried both this commit and the HEAD of amd-staging-drm-next on 
>>>> two GPUs (MI100 and Radeon VII); both did not pass the hot plug-out 
>>>> libdrm test. I might be able to gain access to an MI200, but I 
>>>> suspect it would work.
>>>>
>>>> I copied the complete dmesgs as follows. I highlighted the OOPSES 
>>>> for you.
>>>>
>>>> Radeon VII:
>

[-- Attachment #2: Type: text/html, Size: 28514 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-04-27 16:04                                     ` Andrey Grodzovsky
@ 2022-05-10 11:03                                       ` Shuotao Xu
  2022-05-10 16:34                                         ` Andrey Grodzovsky
  2022-05-10 20:31                                         ` Felix Kuehling
  0 siblings, 2 replies; 31+ messages in thread
From: Shuotao Xu @ 2022-05-10 11:03 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 21636 bytes --]



On Apr 28, 2022, at 12:04 AM, Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>> wrote:


On 2022-04-27 05:20, Shuotao Xu wrote:

Hi Andrey,

Sorry that I did not have time to work on this for a few days.

I just tried the sysfs crash fix on Radeon VII and it seems that it worked. It did not pass the last hotplug test, but my version has 4 tests instead of the 3 in your case.


That's because the 4th one is only enabled when there are 2 cards in the system - to test DRI_PRIME export. I tested this time with only one card.

Yes, I only had one Radeon VII in my system, so this 4th test should have been skipped. I am ignoring this issue.



Suite: Hotunplug Tests
  Test: Unplug card and rescan the bus to plug it back .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
  Test: Same as first test but with command submission .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
  Test: Unplug with exported bo .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
  Test: Unplug with exported fence .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)

On the kernel side, the IOCTL returning this is drm_getclient - maybe take a look at why it can't find the client? I didn't have such an issue as far as I remember when testing.


FAILED
    1. ../tests/amdgpu/hotunplug_tests.c:368  - CU_ASSERT_EQUAL(r,0)
    2. ../tests/amdgpu/hotunplug_tests.c:411  - CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd, &sync_obj_handle2),0)
    3. ../tests/amdgpu/hotunplug_tests.c:423  - CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2, 1, 100000000, 0, NULL),0)
    4. ../tests/amdgpu/hotunplug_tests.c:425  - CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2, sync_obj_handle2),0)

Run Summary:    Type  Total    Ran Passed Failed Inactive
              suites     14      1    n/a      0        0
               tests     71      4      3      1        0
             asserts     39     39     35      4      n/a

Elapsed time =   17.321 seconds

For kfd compute, there is a problem which I did not see on MI100 after killing the hung application after hot plug-out. I was using the rocm5.0.2 driver for the MI100 card, and am not sure if it is a regression from the newer driver.
After pkill, one of the children of the user process would be stuck in zombie mode (Z), understandably because of the bug, and a future rocm application after plug-back would be in uninterruptible sleep mode (D) because it would not return from a syscall into kfd.

The drm tests for amdgpu, though, run just fine without issues after plug-back despite the dangling kfd state.


I am not clear on when the crash below happens. Is it related to what you describe above?


I don’t know if there is a quick fix for it. I was thinking of adding drm_dev_enter/drm_dev_exit to amdgpu_device_rreg.


Try adding a drm_dev_enter/drm_dev_exit pair at the highest level of attempting to access the HW - in this case it's amdgpu_amdkfd_set_compute_idle. We always try to avoid accessing any HW functions after the backing device is gone.


Also, this has been a long-running attempt of mine to fix the hotplug issue for kfd applications.
I don’t know 1) if I will be able to get access to an MI100 (fixing Radeon VII would mean something, but MI100 is more important for us); 2) in what direction the patch for this issue will move forward.


I will go to the office tomorrow to pick up the MI-100. With time and priorities permitting, I will then try to test it and fix any bugs so that it passes all hot plug libdrm tests at the tip of the public amd-staging-drm-next - https://gitlab.freedesktop.org/agd5f/linux. After that you can try to continue working on ROCm enabling on top of that.

For now I suggest you move on with the Radeon 7 as your development ASIC and use the fix I mentioned above.

I finally got some time to continue on the kfd hotplug patch attempt.
The following patch seems to work for kfd hotplug on Radeon VII. After hot plug-out, the tf process exits because of a vm fault.
A new tf process runs without issues after plug-back.

It has the following fixes.

  1.  fixes the ras sysfs regression;
  2.  skips setting compute idle after the device is unplugged, otherwise it would try to write the PCI BAR and thus fault in the driver;
  3.  stops the actual work of the memory-map invalidation triggered by userptrs (returning false would trigger a warning, so I returned true; not sure if that is correct);
  4.  sends exceptions to all the events/signals that a “zombie” process is waiting on (not sure if the hw_exception is worthwhile; it did not do anything in my case since there is no such event type associated with that process).

Please take a look and let me know if it is acceptable.

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 1f8161cd507f..2f7858692067 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -33,6 +33,7 @@
 #include <uapi/linux/kfd_ioctl.h>
 #include "amdgpu_ras.h"
 #include "amdgpu_umc.h"
+#include <drm/drm_drv.h>

 /* Total memory size in system memory and all GPU VRAM. Used to
  * estimate worst case amount of memory to reserve for page tables
@@ -681,9 +682,10 @@ int amdgpu_amdkfd_submit_ib(struct amdgpu_device *adev,

 void amdgpu_amdkfd_set_compute_idle(struct amdgpu_device *adev, bool idle)
 {
-       amdgpu_dpm_switch_power_profile(adev,
-                                       PP_SMC_POWER_PROFILE_COMPUTE,
-                                       !idle);
+       if (!drm_dev_is_unplugged(adev_to_drm(adev)))
+               amdgpu_dpm_switch_power_profile(adev,
+                                               PP_SMC_POWER_PROFILE_COMPUTE,
+                                               !idle);
 }

 bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 4b153daf283d..fb4c9e55eace 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -46,6 +46,7 @@
 #include <linux/firmware.h>
 #include <linux/module.h>
 #include <drm/drm.h>
+#include <drm/drm_drv.h>

 #include "amdgpu.h"
 #include "amdgpu_amdkfd.h"
@@ -104,6 +105,9 @@ static bool amdgpu_mn_invalidate_hsa(struct mmu_interval_notifier *mni,
        struct amdgpu_bo *bo = container_of(mni, struct amdgpu_bo, notifier);
        struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);

+       if (drm_dev_is_unplugged(adev_to_drm(adev)))
+               return true;
+
        if (!mmu_notifier_range_blockable(range))
                return false;

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index cac56f830aed..fbbaaabf3a67 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1509,7 +1509,6 @@ static int amdgpu_ras_fs_fini(struct amdgpu_device *adev)
                }
        }

-       amdgpu_ras_sysfs_remove_all(adev);
        return 0;
 }
 /* ras fs end */
@@ -2557,8 +2556,6 @@ void amdgpu_ras_block_late_fini(struct amdgpu_device *adev,
        if (!ras_block)
                return;

-       amdgpu_ras_sysfs_remove(adev, ras_block);
-
        ras_obj = container_of(ras_block, struct amdgpu_ras_block_object, ras_comm);
        if (ras_obj->ras_cb)
                amdgpu_ras_interrupt_remove_handler(adev, ras_block);
@@ -2659,6 +2656,7 @@ int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
        /* Need disable ras on all IPs here before ip [hw/sw]fini */
        amdgpu_ras_disable_all_features(adev, 0);
        amdgpu_ras_recovery_fini(adev);
+       amdgpu_ras_sysfs_remove_all(adev);
        return 0;
 }

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index f1a225a20719..4b789bec9670 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -714,16 +714,37 @@ bool kfd_is_locked(void)

 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
 {
+       struct kfd_process *p;
+       struct amdkfd_process_info *p_info;
+       unsigned int temp;
+
        if (!kfd->init_complete)
                return;

        /* for runtime suspend, skip locking kfd */
-       if (!run_pm) {
+       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
                /* For first KFD device suspend all the KFD processes */
                if (atomic_inc_return(&kfd_locked) == 1)
                        kfd_suspend_all_processes();
        }

+       if (drm_dev_is_unplugged(kfd->ddev)){
+               int idx = srcu_read_lock(&kfd_processes_srcu);
+               pr_debug("cancel restore_userptr_work\n");
+               hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+                       if (kfd_process_gpuidx_from_gpuid(p, kfd->id) >= 0) {
+                               p_info = p->kgd_process_info;
+                               pr_debug("cancel processes, pid = %d for gpu_id = %d", pid_nr(p_info->pid), kfd->id);
+                               cancel_delayed_work_sync(&p_info->restore_userptr_work);
+
+                               /* send exception signals to the kfd events waiting in user space */
+                               kfd_signal_hw_exception_event(p->pasid);
+                               kfd_signal_vm_fault_event(kfd, p->pasid, NULL);
+                       }
+               }
+               srcu_read_unlock(&kfd_processes_srcu, idx);
+       }
+
        kfd->dqm->ops.stop(kfd->dqm);
        kfd_iommu_suspend(kfd);
 }

Regards,
Shuotao

Andrey


Regards,
Shuotao

[  +0.001645] BUG: unable to handle page fault for address: 0000000000058a68
[  +0.001298] #PF: supervisor read access in kernel mode
[  +0.001252] #PF: error_code(0x0000) - not-present page
[  +0.001248] PGD 8000000115806067 P4D 8000000115806067 PUD 109b2d067 PMD 0
[  +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI
[  +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: G        W   E     5.16.0+ #3
[  +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
[  +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
[  +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
[  +0.001388] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 00000000ffffffff
[  +0.001402] RDX: 0000000000000000 RSI: 000000000001629a RDI: ffff8b0c9c840000
[  +0.001418] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 0000000000000001
[  +0.001421] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 0000000000058a68
[  +0.001400] R13: 000000000001629a R14: 0000000000000000 R15: 000000000001629a
[  +0.001397] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) knlGS:0000000000000000
[  +0.001411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.001405] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 00000000001706e0
[  +0.001422] Call Trace:
[  +0.001407]  <TASK>
[  +0.001391]  amdgpu_device_rreg+0x17/0x20 [amdgpu]
[  +0.001614]  amdgpu_cgs_read_register+0x14/0x20 [amdgpu]
[  +0.001735]  phm_wait_for_register_unequal.part.1+0x58/0x90 [amdgpu]
[  +0.001790]  phm_wait_for_register_unequal+0x1a/0x30 [amdgpu]
[  +0.001800]  vega20_wait_for_response+0x28/0x80 [amdgpu]
[  +0.001757]  vega20_send_msg_to_smc_with_parameter+0x21/0x110 [amdgpu]
[  +0.001838]  smum_send_msg_to_smc_with_parameter+0xcd/0x100 [amdgpu]
[  +0.001829]  ? kvfree+0x1e/0x30
[  +0.001462]  vega20_set_power_profile_mode+0x58/0x330 [amdgpu]
[  +0.001868]  ? kvfree+0x1e/0x30
[  +0.001462]  ? ttm_bo_release+0x261/0x370 [ttm]
[  +0.001467]  pp_dpm_switch_power_profile+0xc2/0x170 [amdgpu]
[  +0.001863]  amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]
[  +0.001866]  amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]
[  +0.001784]  kfd_dec_compute_active+0x2c/0x50 [amdgpu]
[  +0.001744]  process_termination_cpsch+0x2f9/0x3a0 [amdgpu]
[  +0.001728]  kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu]
[  +0.001730]  kfd_process_notifier_release+0x91/0xe0 [amdgpu]
[  +0.001718]  __mmu_notifier_release+0x77/0x1f0
[  +0.001411]  exit_mmap+0x1b5/0x200
[  +0.001396]  ? __switch_to+0x12d/0x3e0
[  +0.001388]  ? __switch_to_asm+0x36/0x70
[  +0.001372]  ? preempt_count_add+0x74/0xc0
[  +0.001364]  mmput+0x57/0x110
[  +0.001349]  do_exit+0x33d/0xc20
[  +0.001337]  ? _raw_spin_unlock+0x1a/0x30
[  +0.001346]  do_group_exit+0x43/0xa0
[  +0.001341]  get_signal+0x131/0x920
[  +0.001295]  arch_do_signal_or_restart+0xb1/0x870
[  +0.001303]  ? do_futex+0x125/0x190
[  +0.001285]  exit_to_user_mode_prepare+0xb1/0x1c0
[  +0.001282]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.001264]  do_syscall_64+0x46/0xb0
[  +0.001236]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.001219] RIP: 0033:0x7f6aff1d2ad3
[  +0.001177] Code: Unable to access opcode bytes at RIP 0x7f6aff1d2aa9.
[  +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[  +0.001170] RAX: fffffffffffffe00 RBX: 0000000004f542b0 RCX: 00007f6aff1d2ad3
[  +0.001168] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000004f542d8
[  +0.001162] RBP: 0000000004f542d4 R08: 0000000000000000 R09: 0000000000000000
[  +0.001152] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000004f542d8
[  +0.001176] R13: 0000000000000000 R14: 0000000004f54288 R15: 0000000000000000
[  +0.001152]  </TASK>
[  +0.001113] Modules linked in: veth amdgpu(E) nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi snd_hda_intel ipmi_ssif snd_intel_dspcfg coretemp snd_hda_codec kvm_intel snd_hda_core snd_hwdep snd_pcm snd_timer snd kvm soundcore irqbypass ftdi_sio usbserial input_leds iTCO_wdt iTCO_vendor_support joydev mei_me rapl lpc_ich intel_cstate mei ipmi_si ipmi_devintf ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456
[  +0.000102]  async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul hid_generic crc32_pclmul ghash_clmulni_intel usbhid uas aesni_intel crypto_simd igb ahci hid drm usb_storage cryptd libahci dca megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
[  +0.016626] CR2: 0000000000058a68
[  +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---
[  +0.024953] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
[  +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
[  +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
[  +0.001641] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 00000000ffffffff
[  +0.001656] RDX: 0000000000000000 RSI: 000000000001629a RDI: ffff8b0c9c840000
[  +0.001681] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 0000000000000001
[  +0.001662] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 0000000000058a68
[  +0.001650] R13: 000000000001629a R14: 0000000000000000 R15: 000000000001629a
[  +0.001648] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) knlGS:0000000000000000
[  +0.001668] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.001673] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 00000000001706e0
[  +0.001740] Fixing recursive fault but reboot is needed!


On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>> wrote:


I retested hot plug tests at the commit I mentioned bellow - looks ok, my ASIC is Navi 10, I also tested using Vega 10 and older Polaris ASICs (whatever i had at home at the time). It's possible there are extra issues in ASICs like ur which I didn't cover during tests.

andrey@andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test -s 13
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support UVD, suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support VCE, suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support UVD ENC, suite disabled.
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


Don't support TMZ (trust memory zone), security suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
Peer device is not opened or has ASIC not supported by the suite, skip all Peer to Peer tests.


     CUnit - A unit testing framework for C - Version 2.1-3
     http://cunit.sourceforge.net/


Suite: Hotunplug Tests
  Test: Unplug card and rescan the bus to plug it back .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
  Test: Same as first test but with command submission .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
  Test: Unplug with exported bo .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed

Run Summary:    Type  Total    Ran Passed Failed Inactive
              suites     14      1    n/a      0        0
               tests     71      3      3      0        1
             asserts     21     21     21      0      n/a

Elapsed time =    9.195 seconds


Andrey

On 2022-04-20 11:44, Andrey Grodzovsky wrote:

The only one on Radeon 7 I see is the same sysfs crash we already fixed, so you can use the same fix. The MI200 issue I haven't seen yet, but I also haven't tested MI200, so I never saw it before. I need to test when I get the time.

So try that fix with Radeon 7 again to see if you pass the tests (the warnings should all be minor issues).

Andrey


On 2022-04-20 05:24, Shuotao Xu wrote:

That's a problem; the latest working baseline I tested and confirmed passing the hotplug tests is this branch and commit https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6 which is amd-staging-drm-next. 5.14 was the branch we upstreamed the hotplug code on, but it had a lot of regressions over time due to new changes (that is why I added the hotplug test, to try and catch them early). It would be best to run this branch on the MI-100 so we have a clean baseline, and only after confirming this particular branch at this commit passes the libdrm tests should you start adding the KFD-specific add-ons. Another option, if you can't work with the MI-100 and this branch, is to try a different ASIC that does work with this branch (if possible).

Andrey

OK, I tried both this commit and the HEAD of amd-staging-drm-next on two GPUs (MI100 and Radeon VII); both did not pass the hot plug-out libdrm test. I might be able to gain access to an MI200, but I suspect it would work.

I copied the complete dmesgs as follows. I highlighted the OOPSES for you.

Radeon VII:


[-- Attachment #2: Type: text/html, Size: 42455 bytes --]

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-05-10 11:03                                       ` Shuotao Xu
@ 2022-05-10 16:34                                         ` Andrey Grodzovsky
  2022-05-10 20:31                                         ` Felix Kuehling
  1 sibling, 0 replies; 31+ messages in thread
From: Andrey Grodzovsky @ 2022-05-10 16:34 UTC (permalink / raw)
  To: Shuotao Xu
  Cc: Mukul.Joshi, Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 22941 bytes --]


On 2022-05-10 07:03, Shuotao Xu wrote:
>
>
>> On Apr 28, 2022, at 12:04 AM, Andrey Grodzovsky 
>> <andrey.grodzovsky@amd.com> wrote:
>>
>> On 2022-04-27 05:20, Shuotao Xu wrote:
>>
>>> Hi Andrey,
>>>
>>> Sorry that I did not have time to work on this for a few days.
>>>
>>> I just tried the sysfs crash fix on Radeon VII and it seems that it 
>>> worked. It did not pass the last hotplug test, but my version has 4 
>>> tests instead of the 3 in your case.
>>
>>
>> That's because the 4th one is only enabled when there are 2 cards in the 
>> system - to test DRI_PRIME export. I tested this time with only one card.
>>
> Yes, I only had one Radeon VII in my system, so this 4th test should 
> have been skipped. I am ignoring this issue.
>
>>>
>>>
>>> Suite: Hotunplug Tests
>>>   Test: Unplug card and rescan the bus to plug it back 
>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>> passed
>>>   Test: Same as first test but with command submission 
>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>> passed
>>>   Test: Unplug with exported bo 
>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>> passed
>>>   Test: Unplug with exported fence 
>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>> amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
>>
>>
>> On the kernel side, the IOCTL returning this is drm_getclient - 
>> maybe take a look at why it can't find the client? I didn't have such 
>> an issue as far as I remember when testing.
>>
>>
>>> FAILED
>>>     1. ../tests/amdgpu/hotunplug_tests.c:368  - CU_ASSERT_EQUAL(r,0)
>>>     2. ../tests/amdgpu/hotunplug_tests.c:411  - 
>>> CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd, 
>>> &sync_obj_handle2),0)
>>>     3. ../tests/amdgpu/hotunplug_tests.c:423  - 
>>> CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2, 
>>> 1, 100000000, 0, NULL),0)
>>>     4. ../tests/amdgpu/hotunplug_tests.c:425  - 
>>> CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2, sync_obj_handle2),0)
>>>
>>> Run Summary:    Type  Total    Ran Passed Failed Inactive
>>>               suites     14      1    n/a      0        0
>>>                tests     71      4      3      1        0
>>>              asserts     39     39     35      4      n/a
>>>
>>> Elapsed time =   17.321 seconds
>>>
>>> For kfd compute, there is some problem which I did not see on MI100 
>>> after I killed the hung application after hot plugout. I was using 
>>> the rocm5.0.2 driver for the MI100 card, and I am not sure if it is a 
>>> regression from the newer driver.
>>> After pkill, one child of the user process would be stuck in Zombie 
>>> state (Z), understandably because of the bug, and future rocm 
>>> applications after plug-back would be in uninterruptible sleep state (D) 
>>> because they would not return from the syscall into kfd.
>>>
>>> Although drm test for amdgpu would run just fine without issues 
>>> after plug-back with dangling kfd state.
>>
>>
>> I am not clear on when the crash below happens. Is it related to what 
>> you describe above?
>>
>>
>>>
>>> I don’t know if there is a quick fix for it. I was thinking of adding 
>>> drm_enter/drm_exit to amdgpu_device_rreg.
>>
>>
>> Try adding a drm_dev_enter/exit pair at the highest level of attempting 
>> to access HW - in this case it's amdgpu_amdkfd_set_compute_idle. We 
>> always try to avoid accessing any HW functions after the backing device 
>> is gone.
>>
>>
>>> Also, I have been attempting to fix this hotplug issue 
>>> for kfd applications for a long time.
>>> I don’t know 1) if I would be able to get access to an MI100 (fixing Radeon 
>>> VII would mean something, but MI100 is more important for us); 2) 
>>> in what direction the patch for this issue will move forward.
>>
>>
>> I will go to the office tomorrow to pick up an MI-100. With time and 
>> priorities permitting, I will then try to test it and fix any 
>> bugs such that it passes all hot plug libdrm tests at the 
>> tip of public amd-staging-drm-next 
>> - https://gitlab.freedesktop.org/agd5f/linux; after that you can try 
>> to continue working on ROCm enabling on top of that.
>>
>> For now I suggest you move on with Radeon 7 as your development 
>> ASIC and use the fix I mentioned above.
>>
> I finally got some time to continue on the kfd hotplug patch attempt.
> The following patch seems to work for kfd hotplug on Radeon VII. After 
> hot plugout, the tf process exits because of a vm fault.
> A new tf process runs without issues after plugback.
>
> It has the following fixes.
>
>  1. ras sysfs regression;
>  2. skip setting compute idle after dev is plugged, otherwise it will
>     try to write the pci bar thus driver fault
>

1 + 2 look good to me.


>  3. stops the actual work of invalidate memory map triggered by
>     userptrs; (return false will trigger a warning, so I returned true.
>     Not sure if it is correct)
>

For this you need an ack from the Compute team (e.g. Felix) - to me it 
looks ok, as you have to stop generating new restore-userptr work at 
some point - but maybe we can let them all run their course and just 
restrict HW access more at the lower level.


>  4. It sends exceptions to all the events/signals that a “zombie”
>     process is waiting for. (Not sure if the hw_exception is
>     worthwhile; it did not do anything in my case since there is no such
>     event type associated with that process)
>

About this one I am in doubt: as I told you, our policy for Graphics was 
to not force-kill any user-side clients who monitor those processes, and 
to just let them live until their natural death. Is it different for KFD? 
Does it create a specific issue if you don't generate the signals?

Andrey


> Please take a look and let me know if it acceptable.
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 1f8161cd507f..2f7858692067 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -33,6 +33,7 @@
>  #include <uapi/linux/kfd_ioctl.h>
>  #include "amdgpu_ras.h"
>  #include "amdgpu_umc.h"
> +#include <drm/drm_drv.h>
>
>  /* Total memory size in system memory and all GPU VRAM. Used to
>   * estimate worst case amount of memory to reserve for page tables
> @@ -681,9 +682,10 @@ int amdgpu_amdkfd_submit_ib(struct amdgpu_device *adev,
>
>  void amdgpu_amdkfd_set_compute_idle(struct amdgpu_device *adev, bool idle)
>  {
> -       amdgpu_dpm_switch_power_profile(adev,
> -                                       PP_SMC_POWER_PROFILE_COMPUTE,
> -                                       !idle);
> +       if (!drm_dev_is_unplugged(adev_to_drm(adev)))
> +               amdgpu_dpm_switch_power_profile(adev,
> +                                               PP_SMC_POWER_PROFILE_COMPUTE,
> +                                               !idle);
>  }
>
>  bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 4b153daf283d..fb4c9e55eace 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -46,6 +46,7 @@
>  #include <linux/firmware.h>
>  #include <linux/module.h>
>  #include <drm/drm.h>
> +#include <drm/drm_drv.h>
>
>  #include "amdgpu.h"
>  #include "amdgpu_amdkfd.h"
> @@ -104,6 +105,9 @@ static bool amdgpu_mn_invalidate_hsa(struct 
> mmu_interval_notifier *mni,
>         struct amdgpu_bo *bo = container_of(mni, struct amdgpu_bo, 
> notifier);
>         struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
>
> +       if (drm_dev_is_unplugged(adev_to_drm(adev)))
> +               return true;
> +
>         if (!mmu_notifier_range_blockable(range))
>                 return false;
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index cac56f830aed..fbbaaabf3a67 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1509,7 +1509,6 @@ static int amdgpu_ras_fs_fini(struct 
> amdgpu_device *adev)
>                 }
>         }
>
> -       amdgpu_ras_sysfs_remove_all(adev);
>         return 0;
>  }
>  /* ras fs end */
> @@ -2557,8 +2556,6 @@ void amdgpu_ras_block_late_fini(struct 
> amdgpu_device *adev,
>         if (!ras_block)
>                 return;
>
> -       amdgpu_ras_sysfs_remove(adev, ras_block);
> -
>         ras_obj = container_of(ras_block, struct 
> amdgpu_ras_block_object, ras_comm);
>         if (ras_obj->ras_cb)
>                 amdgpu_ras_interrupt_remove_handler(adev, ras_block);
> @@ -2659,6 +2656,7 @@ int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
>         /* Need disable ras on all IPs here before ip [hw/sw]fini */
>         amdgpu_ras_disable_all_features(adev, 0);
>         amdgpu_ras_recovery_fini(adev);
> +       amdgpu_ras_sysfs_remove_all(adev);
>         return 0;
>  }
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index f1a225a20719..4b789bec9670 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -714,16 +714,37 @@ bool kfd_is_locked(void)
>
>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>  {
> +       struct kfd_process *p;
> +       struct amdkfd_process_info *p_info;
> +       unsigned int temp;
> +
>         if (!kfd->init_complete)
>                 return;
>
>         /* for runtime suspend, skip locking kfd */
> -       if (!run_pm) {
> +       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>                 /* For first KFD device suspend all the KFD processes */
>                 if (atomic_inc_return(&kfd_locked) == 1)
>                         kfd_suspend_all_processes();
>         }
>
> +       if (drm_dev_is_unplugged(kfd->ddev)) {
> +               int idx = srcu_read_lock(&kfd_processes_srcu);
> +               pr_debug("cancel restore_userptr_work\n");
> +               hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
> +                       if (kfd_process_gpuidx_from_gpuid(p, kfd->id) >= 0) {
> +                               p_info = p->kgd_process_info;
> +                               pr_debug("cancel processes, pid = %d for gpu_id = %d", pid_nr(p_info->pid), kfd->id);
> +                               cancel_delayed_work_sync(&p_info->restore_userptr_work);
> +
> +                               /* send exception signals to the kfd events waiting in user space */
> +                               kfd_signal_hw_exception_event(p->pasid);
> +                               kfd_signal_vm_fault_event(kfd, p->pasid, NULL);
> +                       }
> +               }
> +               srcu_read_unlock(&kfd_processes_srcu, idx);
> +       }
> +
>         kfd->dqm->ops.stop(kfd->dqm);
>         kfd_iommu_suspend(kfd);
>  }
>
> Regards,
> Shuotao
>>
>> Andrey
>>
>>
>>>
>>> Regards,
>>> Shuotao
>>>
>>> [  +0.001645] BUG: unable to handle page fault for address: 
>>> 0000000000058a68
>>> [  +0.001298] #PF: supervisor read access in kernel mode
>>> [  +0.001252] #PF: error_code(0x0000) - not-present page
>>> [  +0.001248] PGD 8000000115806067 P4D 8000000115806067 PUD 
>>> 109b2d067 PMD 0
>>> [  +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI
>>> [  +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: G     
>>>    W   E     5.16.0+ #3
>>> [  +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 
>>> 1.5.4 [FPGA Test BIOS] 10/002/2015
>>> [  +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
>>> [  +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 
>>> 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 
>>> 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 
>>> 2e ca 85
>>> [  +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
>>> [  +0.001388] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 
>>> 00000000ffffffff
>>> [  +0.001402] RDX: 0000000000000000 RSI: 000000000001629a RDI: 
>>> ffff8b0c9c840000
>>> [  +0.001418] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 
>>> 0000000000000001
>>> [  +0.001421] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 
>>> 0000000000058a68
>>> [  +0.001400] R13: 000000000001629a R14: 0000000000000000 R15: 
>>> 000000000001629a
>>> [  +0.001397] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) 
>>> knlGS:0000000000000000
>>> [  +0.001411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  +0.001405] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 
>>> 00000000001706e0
>>> [  +0.001422] Call Trace:
>>> [  +0.001407]  <TASK>
>>> [  +0.001391]  amdgpu_device_rreg+0x17/0x20 [amdgpu]
>>> [  +0.001614]  amdgpu_cgs_read_register+0x14/0x20 [amdgpu]
>>> [  +0.001735]  phm_wait_for_register_unequal.part.1+0x58/0x90 [amdgpu]
>>> [  +0.001790]  phm_wait_for_register_unequal+0x1a/0x30 [amdgpu]
>>> [  +0.001800]  vega20_wait_for_response+0x28/0x80 [amdgpu]
>>> [  +0.001757]  vega20_send_msg_to_smc_with_parameter+0x21/0x110 [amdgpu]
>>> [  +0.001838]  smum_send_msg_to_smc_with_parameter+0xcd/0x100 [amdgpu]
>>> [  +0.001829]  ? kvfree+0x1e/0x30
>>> [  +0.001462]  vega20_set_power_profile_mode+0x58/0x330 [amdgpu]
>>> [  +0.001868]  ? kvfree+0x1e/0x30
>>> [  +0.001462]  ? ttm_bo_release+0x261/0x370 [ttm]
>>> [  +0.001467]  pp_dpm_switch_power_profile+0xc2/0x170 [amdgpu]
>>> [  +0.001863]  amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]
>>> [  +0.001866]  amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]
>>> [  +0.001784]  kfd_dec_compute_active+0x2c/0x50 [amdgpu]
>>> [  +0.001744]  process_termination_cpsch+0x2f9/0x3a0 [amdgpu]
>>> [  +0.001728]  kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu]
>>> [  +0.001730]  kfd_process_notifier_release+0x91/0xe0 [amdgpu]
>>> [  +0.001718]  __mmu_notifier_release+0x77/0x1f0
>>> [  +0.001411]  exit_mmap+0x1b5/0x200
>>> [  +0.001396]  ? __switch_to+0x12d/0x3e0
>>> [  +0.001388]  ? __switch_to_asm+0x36/0x70
>>> [  +0.001372]  ? preempt_count_add+0x74/0xc0
>>> [  +0.001364]  mmput+0x57/0x110
>>> [  +0.001349]  do_exit+0x33d/0xc20
>>> [  +0.001337]  ? _raw_spin_unlock+0x1a/0x30
>>> [  +0.001346]  do_group_exit+0x43/0xa0
>>> [  +0.001341]  get_signal+0x131/0x920
>>> [  +0.001295]  arch_do_signal_or_restart+0xb1/0x870
>>> [  +0.001303]  ? do_futex+0x125/0x190
>>> [  +0.001285]  exit_to_user_mode_prepare+0xb1/0x1c0
>>> [  +0.001282]  syscall_exit_to_user_mode+0x2a/0x40
>>> [  +0.001264]  do_syscall_64+0x46/0xb0
>>> [  +0.001236]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> [  +0.001219] RIP: 0033:0x7f6aff1d2ad3
>>> [  +0.001177] Code: Unable to access opcode bytes at RIP 0x7f6aff1d2aa9.
>>> [  +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS: 00000246 ORIG_RAX: 
>>> 00000000000000ca
>>> [  +0.001170] RAX: fffffffffffffe00 RBX: 0000000004f542b0 RCX: 
>>> 00007f6aff1d2ad3
>>> [  +0.001168] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 
>>> 0000000004f542d8
>>> [  +0.001162] RBP: 0000000004f542d4 R08: 0000000000000000 R09: 
>>> 0000000000000000
>>> [  +0.001152] R10: 0000000000000000 R11: 0000000000000246 R12: 
>>> 0000000004f542d8
>>> [  +0.001176] R13: 0000000000000000 R14: 0000000004f54288 R15: 
>>> 0000000000000000
>>> [  +0.001152]  </TASK>
>>> [  +0.001113] Modules linked in: veth amdgpu(E) nf_conntrack_netlink 
>>> nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM 
>>> iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack 
>>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 
>>> xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter 
>>> ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload 
>>> esp4 xfrm_algo intel_rapl_msr intel_rapl_common sb_edac 
>>> x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi 
>>> snd_hda_intel ipmi_ssif snd_intel_dspcfg coretemp snd_hda_codec 
>>> kvm_intel snd_hda_core snd_hwdep snd_pcm snd_timer snd kvm soundcore 
>>> irqbypass ftdi_sio usbserial input_leds iTCO_wdt iTCO_vendor_support 
>>> joydev mei_me rapl lpc_ich intel_cstate mei ipmi_si ipmi_devintf 
>>> ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel ib_iser 
>>> rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi 
>>> scsi_transport_iscsi ip_tables x_tables autofs4 btrfs 
>>> blake2b_generic zstd_compress raid10 raid456
>>> [  +0.000102]  async_raid6_recov async_memcpy async_pq async_xor 
>>> async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 
>>> iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper 
>>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops 
>>> crct10dif_pclmul hid_generic crc32_pclmul ghash_clmulni_intel usbhid 
>>> uas aesni_intel crypto_simd igb ahci hid drm usb_storage cryptd 
>>> libahci dca megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
>>> [  +0.016626] CR2: 0000000000058a68
>>> [  +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---
>>> [  +0.024953] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
>>> [  +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 
>>> 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 
>>> 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 
>>> 2e ca 85
>>> [  +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
>>> [  +0.001641] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 
>>> 00000000ffffffff
>>> [  +0.001656] RDX: 0000000000000000 RSI: 000000000001629a RDI: 
>>> ffff8b0c9c840000
>>> [  +0.001681] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 
>>> 0000000000000001
>>> [  +0.001662] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 
>>> 0000000000058a68
>>> [  +0.001650] R13: 000000000001629a R14: 0000000000000000 R15: 
>>> 000000000001629a
>>> [  +0.001648] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) 
>>> knlGS:0000000000000000
>>> [  +0.001668] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  +0.001673] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 
>>> 00000000001706e0
>>> [  +0.001740] Fixing recursive fault but reboot is needed!
>>>
>>>
>>>> On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky 
>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>
>>>> I retested the hot plug tests at the commit I mentioned below - looks 
>>>> ok. My ASIC is Navi 10; I also tested using Vega 10 and older 
>>>> Polaris ASICs (whatever I had at home at the time). It's possible 
>>>> there are extra issues in ASICs like yours which I didn't cover during 
>>>> the tests.
>>>>
>>>> andrey@andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test -s 13
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>
>>>>
>>>> The ASIC NOT support UVD, suite disabled
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>
>>>>
>>>> The ASIC NOT support VCE, suite disabled
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>
>>>>
>>>> The ASIC NOT support UVD ENC, suite disabled.
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>
>>>>
>>>> Don't support TMZ (trust memory zone), security suite disabled
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> Peer device is not opened or has ASIC not supported by the suite, 
>>>> skip all Peer to Peer tests.
>>>>
>>>>
>>>>      CUnit - A unit testing framework for C - Version 2.1-3
>>>> http://cunit.sourceforge.net/
>>>>
>>>>
>>>> *Suite: Hotunplug Tests**
>>>> **  Test: Unplug card and rescan the bus to plug it back 
>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory**
>>>> **passed**
>>>> **  Test: Same as first test but with command submission 
>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory**
>>>> **passed**
>>>> **  Test: Unplug with exported bo 
>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory**
>>>> **passed*
>>>>
>>>> Run Summary:    Type  Total    Ran Passed Failed Inactive
>>>>               suites     14      1    n/a 0        0
>>>>                tests     71      3      3 0        1
>>>>              asserts     21     21     21 0      n/a
>>>>
>>>> Elapsed time =    9.195 seconds
>>>>
>>>>
>>>> Andrey
>>>>
>>>> On 2022-04-20 11:44, Andrey Grodzovsky wrote:
>>>>>
>>>>> The only one on Radeon 7 I see is the same sysfs crash we already 
>>>>> fixed, so you can use the same fix. The MI200 issue I haven't seen 
>>>>> yet, but I also haven't tested MI200 before. I need 
>>>>> to test when I get the time.
>>>>>
>>>>> So try that fix with Radeon 7 again to see if you pass the tests 
>>>>> (the warnings should all be minor issues).
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>> On 2022-04-20 05:24, Shuotao Xu wrote:
>>>>>>>
>>>>>>> That's a problem; the latest working baseline I tested and confirmed 
>>>>>>> passing hotplug tests is this branch and 
>>>>>>> commit https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6 
>>>>>>> which is amd-staging-drm-next. 5.14 was the branch we upstreamed the 
>>>>>>> hotplug code on, but it had a lot of regressions over time due to 
>>>>>>> new changes (that's why I added the hotplug test, to try and catch 
>>>>>>> them early). It would be best to run this branch on MI-100 so we 
>>>>>>> have a clean baseline, and only after confirming this particular 
>>>>>>> branch at this commit passes the libdrm tests should you start 
>>>>>>> adding the KFD-specific addons. Another option, if you can't work 
>>>>>>> with MI-100 and this branch, is to try a different ASIC that does 
>>>>>>> work with this branch (if possible).
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>> OK, I tried both this commit and the HEAD of amd-staging-drm-next 
>>>>>> on two GPUs (MI100 and Radeon VII); both did not pass the hotplug-out 
>>>>>> libdrm test. I might be able to gain access to an MI200, but I 
>>>>>> suspect it would not work either.
>>>>>>
>>>>>> I copied the complete dmesgs as follows. I highlighted the OOPSES 
>>>>>> for you.
>>>>>>
>>>>>> Radeon VII:
>

[-- Attachment #2: Type: text/html, Size: 57383 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-05-10 11:03                                       ` Shuotao Xu
  2022-05-10 16:34                                         ` Andrey Grodzovsky
@ 2022-05-10 20:31                                         ` Felix Kuehling
  2022-05-11  3:35                                           ` Shuotao Xu
  1 sibling, 1 reply; 31+ messages in thread
From: Felix Kuehling @ 2022-05-10 20:31 UTC (permalink / raw)
  To: Shuotao Xu, Andrey Grodzovsky
  Cc: Mukul.Joshi, Peng Cheng, amd-gfx, Lei Qu, Ran Shu, Ziyue Yang


On 2022-05-10 at 07:03, Shuotao Xu wrote:
>
>
>> On Apr 28, 2022, at 12:04 AM, Andrey Grodzovsky 
>> <andrey.grodzovsky@amd.com> wrote:
>>
>> On 2022-04-27 05:20, Shuotao Xu wrote:
>>
>>> Hi Andrey,
>>>
>>> Sorry that I did not have time to work on this for a few days.
>>>
>>> I just tried the sysfs crash fix on Radeon VII and it seems that it 
>>> worked. It did not pass the last hotplug test, but my version has 4 
>>> tests instead of the 3 in your case.
>>
>>
>> That's because the 4th one is only enabled when there are 2 cards in the 
>> system - to test DRI_PRIME export. I tested this time with only one card.
>>
> Yes, I only had one Radeon VII in my system, so this 4th test should 
> have been skipped. I am ignoring this issue.
>
>>>
>>>
>>> Suite: Hotunplug Tests
>>>   Test: Unplug card and rescan the bus to plug it back 
>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>> passed
>>>   Test: Same as first test but with command submission 
>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>> passed
>>>   Test: Unplug with exported bo 
>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>> passed
>>>   Test: Unplug with exported fence 
>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>> amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
>>
>>
>> On the kernel side, the IOCTL returning this is drm_getclient - 
>> maybe take a look at why it can't find the client? I didn't have such 
>> an issue as far as I remember when testing.
>>
>>
>>> FAILED
>>>     1. ../tests/amdgpu/hotunplug_tests.c:368  - CU_ASSERT_EQUAL(r,0)
>>>     2. ../tests/amdgpu/hotunplug_tests.c:411  - 
>>> CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd, 
>>> &sync_obj_handle2),0)
>>>     3. ../tests/amdgpu/hotunplug_tests.c:423  - 
>>> CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2, 
>>> 1, 100000000, 0, NULL),0)
>>>     4. ../tests/amdgpu/hotunplug_tests.c:425  - 
>>> CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2, sync_obj_handle2),0)
>>>
>>> Run Summary:    Type  Total    Ran Passed Failed Inactive
>>>               suites     14      1    n/a      0        0
>>>                tests     71      4      3      1        0
>>>              asserts     39     39     35      4      n/a
>>>
>>> Elapsed time =   17.321 seconds
>>>
>>> For kfd compute, there is some problem which I did not see on MI100 
>>> after I killed the hung application after hot plugout. I was using 
>>> the rocm5.0.2 driver for the MI100 card, and I am not sure if it is a 
>>> regression from the newer driver.
>>> After pkill, one child of the user process would be stuck in Zombie 
>>> state (Z), understandably because of the bug, and future rocm 
>>> applications after plug-back would be in uninterruptible sleep state (D) 
>>> because they would not return from the syscall into kfd.
>>>
>>> Although drm test for amdgpu would run just fine without issues 
>>> after plug-back with dangling kfd state.
>>
>>
>> I am not clear on when the crash below happens. Is it related to what 
>> you describe above?
>>
>>
>>>
>>> I don’t know if there is a quick fix for it. I was thinking of adding 
>>> drm_enter/drm_exit to amdgpu_device_rreg.
>>
>>
>> Try adding a drm_dev_enter/exit pair at the highest level of attempting 
>> to access HW - in this case it's amdgpu_amdkfd_set_compute_idle. We 
>> always try to avoid accessing any HW functions after the backing device 
>> is gone.
>>
>>
>>> Also, I have been attempting to fix this hotplug issue 
>>> for kfd applications for a long time.
>>> I don’t know 1) if I would be able to get access to an MI100 (fixing Radeon 
>>> VII would mean something, but MI100 is more important for us); 2) 
>>> in what direction the patch for this issue will move forward.
>>
>>
>> I will go to the office tomorrow to pick up an MI-100. With time and 
>> priorities permitting, I will then try to test it and fix any 
>> bugs such that it passes all hot plug libdrm tests at the 
>> tip of public amd-staging-drm-next 
>> - https://gitlab.freedesktop.org/agd5f/linux; after that you can try 
>> to continue working on ROCm enabling on top of that.
>>
>> For now I suggest you move on with Radeon 7 as your development 
>> ASIC and use the fix I mentioned above.
>>
> I finally got some time to continue on the kfd hotplug patch attempt.
> The following patch seems to work for kfd hotplug on Radeon VII. After 
> hot plugout, the tf process exits because of a vm fault.
> A new tf process runs without issues after plugback.
>
> It has the following fixes.
>
>  1. ras sysfs regression;
>  2. skip setting compute idle after dev is plugged, otherwise it will
>     try to write the pci bar thus driver fault
>  3. stops the actual work of invalidate memory map triggered by
>     userptrs; (return false will trigger a warning, so I returned true.
>     Not sure if it is correct)
>  4. It sends exceptions to all the events/signals that a “zombie”
>     process is waiting for. (Not sure if the hw_exception is
>     worthwhile; it did not do anything in my case since there is no such
>     event type associated with that process)
>
> Please take a look and let me know if it acceptable.
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 1f8161cd507f..2f7858692067 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -33,6 +33,7 @@
>  #include <uapi/linux/kfd_ioctl.h>
>  #include "amdgpu_ras.h"
>  #include "amdgpu_umc.h"
> +#include <drm/drm_drv.h>
>
>  /* Total memory size in system memory and all GPU VRAM. Used to
>   * estimate worst case amount of memory to reserve for page tables
> @@ -681,9 +682,10 @@ int amdgpu_amdkfd_submit_ib(struct amdgpu_device *adev,
>
>  void amdgpu_amdkfd_set_compute_idle(struct amdgpu_device *adev, bool idle)
>  {
> -       amdgpu_dpm_switch_power_profile(adev,
> -                                       PP_SMC_POWER_PROFILE_COMPUTE,
> -                                       !idle);
> +       if (!drm_dev_is_unplugged(adev_to_drm(adev)))
> +               amdgpu_dpm_switch_power_profile(adev,
> +                                               PP_SMC_POWER_PROFILE_COMPUTE,
> +                                               !idle);
>  }
>
>  bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 4b153daf283d..fb4c9e55eace 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -46,6 +46,7 @@
>  #include <linux/firmware.h>
>  #include <linux/module.h>
>  #include <drm/drm.h>
> +#include <drm/drm_drv.h>
>
>  #include "amdgpu.h"
>  #include "amdgpu_amdkfd.h"
> @@ -104,6 +105,9 @@ static bool amdgpu_mn_invalidate_hsa(struct 
> mmu_interval_notifier *mni,
>         struct amdgpu_bo *bo = container_of(mni, struct amdgpu_bo, 
> notifier);
>         struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
>
> +       if (drm_dev_is_unplugged(adev_to_drm(adev)))
> +               return true;
> +
>         if (!mmu_notifier_range_blockable(range))
>                 return false;
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index cac56f830aed..fbbaaabf3a67 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1509,7 +1509,6 @@ static int amdgpu_ras_fs_fini(struct 
> amdgpu_device *adev)
>                 }
>         }
>
> -       amdgpu_ras_sysfs_remove_all(adev);
>         return 0;
>  }
>  /* ras fs end */
> @@ -2557,8 +2556,6 @@ void amdgpu_ras_block_late_fini(struct 
> amdgpu_device *adev,
>         if (!ras_block)
>                 return;
>
> -       amdgpu_ras_sysfs_remove(adev, ras_block);
> -
>         ras_obj = container_of(ras_block, struct 
> amdgpu_ras_block_object, ras_comm);
>         if (ras_obj->ras_cb)
>                 amdgpu_ras_interrupt_remove_handler(adev, ras_block);
> @@ -2659,6 +2656,7 @@ int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
>         /* Need disable ras on all IPs here before ip [hw/sw]fini */
>         amdgpu_ras_disable_all_features(adev, 0);
>         amdgpu_ras_recovery_fini(adev);
> +       amdgpu_ras_sysfs_remove_all(adev);
>         return 0;
>  }
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index f1a225a20719..4b789bec9670 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -714,16 +714,37 @@ bool kfd_is_locked(void)
>
>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>  {
> +       struct kfd_process *p;
> +       struct amdkfd_process_info *p_info;
> +       unsigned int temp;
> +
>         if (!kfd->init_complete)
>                 return;
>
>         /* for runtime suspend, skip locking kfd */
> -       if (!run_pm) {
> +       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>                 /* For first KFD device suspend all the KFD processes */
>                 if (atomic_inc_return(&kfd_locked) == 1)
>                         kfd_suspend_all_processes();
>         }
>
> +       if (drm_dev_is_unplugged(kfd->ddev)){
> +               int idx = srcu_read_lock(&kfd_processes_srcu);
> +               pr_debug("cancel restore_userptr_work\n");
> +               hash_for_each_rcu(kfd_processes_table, temp, p, 
> kfd_processes) {
> +                       if (kfd_process_gpuidx_from_gpuid(p, kfd->id) 
> >= 0) {
> +                               p_info = p->kgd_process_info;
> +                               pr_debug("cancel processes, pid = %d 
> for gpu_id = %d", pid_nr(p_info->pid), kfd->id);
> +                               cancel_delayed_work_sync(&p_info->restore_userptr_work);

Is this really necessary? If it is, there are probably other workers, 
e.g. related to our SVM code, that would need to be canceled as well.


> +
> +                               /* send exception signals to the kfd
> +                                * events waiting in user space */
> +                               kfd_signal_hw_exception_event(p->pasid);

This makes sense. It basically tells user mode that the application's 
GPU state is lost due to a RAS error or a GPU reset, or now a GPU 
hot-unplug.


> +                               kfd_signal_vm_fault_event(kfd, p->pasid, NULL);

This does not make sense. A VM fault indicates an access to a bad 
virtual address by the GPU. If a debugger is attached to the process, it 
notifies the debugger to investigate what went wrong. If the GPU is 
gone, that doesn't make any sense. There is no GPU that could have 
issued a bad memory request. And the debugger won't be happy either to 
find a VM fault from a GPU that doesn't exist any more.

If the HW-exception event doesn't terminate your process, we may need to 
look into how ROCr handles the HW-exception events.


> +                       }
> +               }
> +               srcu_read_unlock(&kfd_processes_srcu, idx);
> +       }
> +
>         kfd->dqm->ops.stop(kfd->dqm);
>         kfd_iommu_suspend(kfd);

Should DQM stop and IOMMU suspend still be executed? Or should the 
hot-unplug case short-circuit them?

Regards,
   Felix


>  }
>
> Regards,
> Shuotao
>>
>> Andrey
>>
>>
>>>
>>> Regards,
>>> Shuotao
>>>
>>> [  +0.001645] BUG: unable to handle page fault for address: 
>>> 0000000000058a68
>>> [  +0.001298] #PF: supervisor read access in kernel mode
>>> [  +0.001252] #PF: error_code(0x0000) - not-present page
>>> [  +0.001248] PGD 8000000115806067 P4D 8000000115806067 PUD 
>>> 109b2d067 PMD 0
>>> [  +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI
>>> [  +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: G     
>>>    W   E     5.16.0+ #3
>>> [  +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 
>>> 1.5.4 [FPGA Test BIOS] 10/002/2015
>>> [  +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
>>> [  +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 
>>> 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 
>>> 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 
>>> 2e ca 85
>>> [  +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
>>> [  +0.001388] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 
>>> 00000000ffffffff
>>> [  +0.001402] RDX: 0000000000000000 RSI: 000000000001629a RDI: 
>>> ffff8b0c9c840000
>>> [  +0.001418] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 
>>> 0000000000000001
>>> [  +0.001421] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 
>>> 0000000000058a68
>>> [  +0.001400] R13: 000000000001629a R14: 0000000000000000 R15: 
>>> 000000000001629a
>>> [  +0.001397] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) 
>>> knlGS:0000000000000000
>>> [  +0.001411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  +0.001405] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 
>>> 00000000001706e0
>>> [  +0.001422] Call Trace:
>>> [  +0.001407]  <TASK>
>>> [  +0.001391]  amdgpu_device_rreg+0x17/0x20 [amdgpu]
>>> [  +0.001614]  amdgpu_cgs_read_register+0x14/0x20 [amdgpu]
>>> [  +0.001735]  phm_wait_for_register_unequal.part.1+0x58/0x90 [amdgpu]
>>> [  +0.001790]  phm_wait_for_register_unequal+0x1a/0x30 [amdgpu]
>>> [  +0.001800]  vega20_wait_for_response+0x28/0x80 [amdgpu]
>>> [  +0.001757]  vega20_send_msg_to_smc_with_parameter+0x21/0x110 [amdgpu]
>>> [  +0.001838]  smum_send_msg_to_smc_with_parameter+0xcd/0x100 [amdgpu]
>>> [  +0.001829]  ? kvfree+0x1e/0x30
>>> [  +0.001462]  vega20_set_power_profile_mode+0x58/0x330 [amdgpu]
>>> [  +0.001868]  ? kvfree+0x1e/0x30
>>> [  +0.001462]  ? ttm_bo_release+0x261/0x370 [ttm]
>>> [  +0.001467]  pp_dpm_switch_power_profile+0xc2/0x170 [amdgpu]
>>> [  +0.001863]  amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]
>>> [  +0.001866]  amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]
>>> [  +0.001784]  kfd_dec_compute_active+0x2c/0x50 [amdgpu]
>>> [  +0.001744]  process_termination_cpsch+0x2f9/0x3a0 [amdgpu]
>>> [  +0.001728]  kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu]
>>> [  +0.001730]  kfd_process_notifier_release+0x91/0xe0 [amdgpu]
>>> [  +0.001718]  __mmu_notifier_release+0x77/0x1f0
>>> [  +0.001411]  exit_mmap+0x1b5/0x200
>>> [  +0.001396]  ? __switch_to+0x12d/0x3e0
>>> [  +0.001388]  ? __switch_to_asm+0x36/0x70
>>> [  +0.001372]  ? preempt_count_add+0x74/0xc0
>>> [  +0.001364]  mmput+0x57/0x110
>>> [  +0.001349]  do_exit+0x33d/0xc20
>>> [  +0.001337]  ? _raw_spin_unlock+0x1a/0x30
>>> [  +0.001346]  do_group_exit+0x43/0xa0
>>> [  +0.001341]  get_signal+0x131/0x920
>>> [  +0.001295]  arch_do_signal_or_restart+0xb1/0x870
>>> [  +0.001303]  ? do_futex+0x125/0x190
>>> [  +0.001285]  exit_to_user_mode_prepare+0xb1/0x1c0
>>> [  +0.001282]  syscall_exit_to_user_mode+0x2a/0x40
>>> [  +0.001264]  do_syscall_64+0x46/0xb0
>>> [  +0.001236]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> [  +0.001219] RIP: 0033:0x7f6aff1d2ad3
>>> [  +0.001177] Code: Unable to access opcode bytes at RIP 0x7f6aff1d2aa9.
>>> [  +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS: 00000246 ORIG_RAX: 
>>> 00000000000000ca
>>> [  +0.001170] RAX: fffffffffffffe00 RBX: 0000000004f542b0 RCX: 
>>> 00007f6aff1d2ad3
>>> [  +0.001168] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 
>>> 0000000004f542d8
>>> [  +0.001162] RBP: 0000000004f542d4 R08: 0000000000000000 R09: 
>>> 0000000000000000
>>> [  +0.001152] R10: 0000000000000000 R11: 0000000000000246 R12: 
>>> 0000000004f542d8
>>> [  +0.001176] R13: 0000000000000000 R14: 0000000004f54288 R15: 
>>> 0000000000000000
>>> [  +0.001152]  </TASK>
>>> [  +0.001113] Modules linked in: veth amdgpu(E) nf_conntrack_netlink 
>>> nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM 
>>> iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack 
>>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 
>>> xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter 
>>> ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload 
>>> esp4 xfrm_algo intel_rapl_msr intel_rapl_common sb_edac 
>>> x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi 
>>> snd_hda_intel ipmi_ssif snd_intel_dspcfg coretemp snd_hda_codec 
>>> kvm_intel snd_hda_core snd_hwdep snd_pcm snd_timer snd kvm soundcore 
>>> irqbypass ftdi_sio usbserial input_leds iTCO_wdt iTCO_vendor_support 
>>> joydev mei_me rapl lpc_ich intel_cstate mei ipmi_si ipmi_devintf 
>>> ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel ib_iser 
>>> rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi 
>>> scsi_transport_iscsi ip_tables x_tables autofs4 btrfs 
>>> blake2b_generic zstd_compress raid10 raid456
>>> [  +0.000102]  async_raid6_recov async_memcpy async_pq async_xor 
>>> async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 
>>> iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper 
>>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops 
>>> crct10dif_pclmul hid_generic crc32_pclmul ghash_clmulni_intel usbhid 
>>> uas aesni_intel crypto_simd igb ahci hid drm usb_storage cryptd 
>>> libahci dca megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
>>> [  +0.016626] CR2: 0000000000058a68
>>> [  +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---
>>> [  +0.024953] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
>>> [  +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 
>>> 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 
>>> 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 
>>> 2e ca 85
>>> [  +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
>>> [  +0.001641] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 
>>> 00000000ffffffff
>>> [  +0.001656] RDX: 0000000000000000 RSI: 000000000001629a RDI: 
>>> ffff8b0c9c840000
>>> [  +0.001681] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 
>>> 0000000000000001
>>> [  +0.001662] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 
>>> 0000000000058a68
>>> [  +0.001650] R13: 000000000001629a R14: 0000000000000000 R15: 
>>> 000000000001629a
>>> [  +0.001648] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) 
>>> knlGS:0000000000000000
>>> [  +0.001668] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  +0.001673] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 
>>> 00000000001706e0
>>> [  +0.001740] Fixing recursive fault but reboot is needed!
>>>
>>>
>>>> On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky 
>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>
>>>> I retested the hotplug tests at the commit I mentioned below - they 
>>>> look OK. My ASIC is Navi 10; I also tested using Vega 10 and older 
>>>> Polaris ASICs (whatever I had at home at the time). It's possible 
>>>> there are extra issues in ASICs like yours which I didn't cover during 
>>>> my tests.
>>>>
>>>> andrey@andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test -s 13
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>
>>>>
>>>> The ASIC NOT support UVD, suite disabled
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>
>>>>
>>>> The ASIC NOT support VCE, suite disabled
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>
>>>>
>>>> The ASIC NOT support UVD ENC, suite disabled.
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>
>>>>
>>>> Don't support TMZ (trust memory zone), security suite disabled
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> Peer device is not opened or has ASIC not supported by the suite, 
>>>> skip all Peer to Peer tests.
>>>>
>>>>
>>>>      CUnit - A unit testing framework for C - Version 2.1-3
>>>> http://cunit.sourceforge.net/
>>>>
>>>>
>>>> Suite: Hotunplug Tests
>>>>   Test: Unplug card and rescan the bus to plug it back 
>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> passed
>>>>   Test: Same as first test but with command submission 
>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> passed
>>>>   Test: Unplug with exported bo 
>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>> passed
>>>>
>>>> Run Summary:    Type  Total    Ran Passed Failed Inactive
>>>>               suites     14      1    n/a 0        0
>>>>                tests     71      3      3 0        1
>>>>              asserts     21     21     21 0      n/a
>>>>
>>>> Elapsed time =    9.195 seconds
>>>>
>>>>
>>>> Andrey
>>>>
>>>> On 2022-04-20 11:44, Andrey Grodzovsky wrote:
>>>>>
>>>>> The only issue on the Radeon 7 I see is the same sysfs crash we 
>>>>> already fixed, so you can use the same fix. The MI200 issue I 
>>>>> haven't seen yet, but I also haven't tested MI200. I need to test 
>>>>> it when I get the time.
>>>>>
>>>>> So try that fix with Radeon 7 again to see if you pass the tests 
>>>>> (the warnings should all be minor issues).
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>> On 2022-04-20 05:24, Shuotao Xu wrote:
>>>>>>>
>>>>>>> That's a problem. The latest working baseline I tested and 
>>>>>>> confirmed passing the hotplug tests is this branch and commit 
>>>>>>> https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6 
>>>>>>> which is amd-staging-drm-next. 5.14 was the branch we upstreamed 
>>>>>>> the hotplug code to, but it accumulated a lot of regressions over 
>>>>>>> time due to new changes (that's why I added the hotplug test, to 
>>>>>>> try and catch them early). It would be best to run this branch on 
>>>>>>> MI-100 so we have a clean baseline, and only after confirming that 
>>>>>>> this particular branch at this commit passes the libdrm tests 
>>>>>>> should you start adding the KFD-specific addons. Another option, 
>>>>>>> if you can't work with MI-100 and this branch, is to try a 
>>>>>>> different ASIC that does work with this branch (if possible).
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>> OK, I tried both this commit and the HEAD of amd-staging-drm-next 
>>>>>> on two GPUs (MI100 and Radeon VII); neither passed the hot-plugout 
>>>>>> libdrm test. I might be able to gain access to an MI200, but I 
>>>>>> doubt it would work.
>>>>>>
>>>>>> I copied the complete dmesgs as follows. I highlighted the OOPSES 
>>>>>> for you.
>>>>>>
>>>>>> Radeon VII:
>


* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-05-10 20:31                                         ` Felix Kuehling
@ 2022-05-11  3:35                                           ` Shuotao Xu
  2022-05-11 13:49                                             ` Andrey Grodzovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Shuotao Xu @ 2022-05-11  3:35 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Andrey Grodzovsky, Mukul.Joshi, Peng Cheng, amd-gfx, Lei Qu,
	Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 31704 bytes --]



On May 11, 2022, at 4:31 AM, Felix Kuehling <felix.kuehling@amd.com> wrote:


On 2022-05-10 at 07:03, Shuotao Xu wrote:


On Apr 28, 2022, at 12:04 AM, Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:

On 2022-04-27 05:20, Shuotao Xu wrote:

Hi Andrey,

Sorry that I did not have time to work on this for a few days.

I just tried the sysfs crash fix on Radeon VII and it seems that it
worked. It did not pass the last hotplug test, but my version has 4
tests instead of the 3 in your case.


That's because the 4th one is only enabled when there are 2 cards in the
system - to test DRI_PRIME export. I tested this time with only one card.

Yes, I only had one Radeon VII in my system, so this 4th test should
have been skipped. I am ignoring this issue.



Suite: Hotunplug Tests
Test: Unplug card and rescan the bus to plug it back
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
Test: Same as first test but with command submission
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
Test: Unplug with exported bo
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
Test: Unplug with exported fence
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)


On the kernel side, the IOCTL returning this is drm_getclient - maybe
take a look at why it can't find the client? I didn't have such an
issue when testing, as far as I remember.


FAILED
1. ../tests/amdgpu/hotunplug_tests.c:368 - CU_ASSERT_EQUAL(r,0)
2. ../tests/amdgpu/hotunplug_tests.c:411 -
CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd,
&sync_obj_handle2),0)
3. ../tests/amdgpu/hotunplug_tests.c:423 -
CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2,
1, 100000000, 0, NULL),0)
4. ../tests/amdgpu/hotunplug_tests.c:425 -
CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2, sync_obj_handle2),0)

Run Summary:    Type  Total    Ran Passed Failed Inactive
              suites     14      1    n/a      0        0
               tests     71      4      3      1        0
             asserts     39     39     35      4      n/a

Elapsed time = 17.321 seconds

For KFD compute, there is a problem which I did not see on MI100 after
I killed the hung application after hot-plugout. I was using the ROCm
5.0.2 driver for the MI100 card, and I am not sure if it is a
regression from the newer driver.
After pkill, one child of the user process would be stuck in zombie
mode (Z), understandably because of the bug, and a future ROCm
application launched after plug-back would be in uninterruptible sleep
mode (D) because it would not return from its syscall into KFD.

Although the drm test for amdgpu would run just fine without issues
after plug-back with the dangling KFD state.


I am not clear on when the crash below happens. Is it related to what
you describe above?



I don’t know if there is a quick fix to it. I was thinking of adding
drm_dev_enter/drm_dev_exit to amdgpu_device_rreg.


Try adding a drm_dev_enter/exit pair at the highest level of attempting
to access the HW - in this case it's amdgpu_amdkfd_set_compute_idle. We
always try to avoid accessing any HW functions after the backing device
is gone.


Also, I have been attempting to fix the hotplug issue for KFD
applications for a long time.
I don’t know 1) if I would be able to get access to an MI100 (fixing
Radeon VII would mean something, but MI100 is more important for us);
2) in what direction the patch for this issue will move forward.


I will go to the office tomorrow to pick up an MI-100. With time and
priorities permitting, I will then try to test it and fix any bugs such
that it passes all hotplug libdrm tests at the tip of the public
amd-staging-drm-next - https://gitlab.freedesktop.org/agd5f/linux -
after that you can try to continue working on the ROCm enabling on top
of that.

For now I suggest you move on with the Radeon 7 as your development
ASIC and use the fix I mentioned above.

I finally got some time to continue on the KFD hotplug patch attempt.
The following patch seems to work for KFD hotplug on Radeon VII. After
hot-plugout, the tf process exits because of a VM fault.
A new tf process runs without issues after plug-back.

It has the following fixes:

1. Fixes the RAS sysfs regression.
2. Skips setting compute idle after the device is unplugged; otherwise
   it would try to write the PCI BAR and thus fault in the driver.
3. Stops the actual work of invalidating memory maps triggered by
   userptrs. (Returning false would trigger a warning, so I returned
   true. Not sure if it is correct.)
4. Sends exceptions to all the events/signals that a “zombie”
   process is waiting for. (Not sure if the hw_exception is
   worthwhile; it did not do anything in my case since there is no
   such event type associated with that process.)

Please take a look and let me know if it is acceptable.

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 1f8161cd507f..2f7858692067 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -33,6 +33,7 @@
#include <uapi/linux/kfd_ioctl.h>
#include "amdgpu_ras.h"
#include "amdgpu_umc.h"
+#include <drm/drm_drv.h>

/* Total memory size in system memory and all GPU VRAM. Used to
 * estimate worst case amount of memory to reserve for page tables
@@ -681,9 +682,10 @@ int amdgpu_amdkfd_submit_ib(struct amdgpu_device
*adev,

void amdgpu_amdkfd_set_compute_idle(struct amdgpu_device *adev, bool
idle)
{
-       amdgpu_dpm_switch_power_profile(adev,
- PP_SMC_POWER_PROFILE_COMPUTE,
-                                       !idle);
+       if (!drm_dev_is_unplugged(adev_to_drm(adev)))
+               amdgpu_dpm_switch_power_profile(adev,
+ PP_SMC_POWER_PROFILE_COMPUTE,
+                                               !idle);
}

bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 4b153daf283d..fb4c9e55eace 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -46,6 +46,7 @@
#include <linux/firmware.h>
#include <linux/module.h>
#include <drm/drm.h>
+#include <drm/drm_drv.h>

#include "amdgpu.h"
#include "amdgpu_amdkfd.h"
@@ -104,6 +105,9 @@ static bool amdgpu_mn_invalidate_hsa(struct
mmu_interval_notifier *mni,
       struct amdgpu_bo *bo = container_of(mni, struct amdgpu_bo,
notifier);
       struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);

+       if (drm_dev_is_unplugged(adev_to_drm(adev)))
+               return true;
+
Label: Fix 3
       if (!mmu_notifier_range_blockable(range))
               return false;

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index cac56f830aed..fbbaaabf3a67 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1509,7 +1509,6 @@ static int amdgpu_ras_fs_fini(struct
amdgpu_device *adev)
               }
       }

-       amdgpu_ras_sysfs_remove_all(adev);
       return 0;
}
/* ras fs end */
@@ -2557,8 +2556,6 @@ void amdgpu_ras_block_late_fini(struct
amdgpu_device *adev,
       if (!ras_block)
               return;

-       amdgpu_ras_sysfs_remove(adev, ras_block);
-
       ras_obj = container_of(ras_block, struct
amdgpu_ras_block_object, ras_comm);
       if (ras_obj->ras_cb)
               amdgpu_ras_interrupt_remove_handler(adev, ras_block);
@@ -2659,6 +2656,7 @@ int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
       /* Need disable ras on all IPs here before ip [hw/sw]fini */
       amdgpu_ras_disable_all_features(adev, 0);
       amdgpu_ras_recovery_fini(adev);
+       amdgpu_ras_sysfs_remove_all(adev);
       return 0;
}

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index f1a225a20719..4b789bec9670 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -714,16 +714,37 @@ bool kfd_is_locked(void)

void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
{
+       struct kfd_process *p;
+       struct amdkfd_process_info *p_info;
+       unsigned int temp;
+
       if (!kfd->init_complete)
               return;

       /* for runtime suspend, skip locking kfd */
-       if (!run_pm) {
+       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
               /* For first KFD device suspend all the KFD processes */
               if (atomic_inc_return(&kfd_locked) == 1)
                       kfd_suspend_all_processes();
       }

+       if (drm_dev_is_unplugged(kfd->ddev)){
+               int idx = srcu_read_lock(&kfd_processes_srcu);
+               pr_debug("cancel restore_userptr_work\n");
+               hash_for_each_rcu(kfd_processes_table, temp, p,
kfd_processes) {
+                       if (kfd_process_gpuidx_from_gpuid(p, kfd->id)
>= 0) {
+                               p_info = p->kgd_process_info;
+                               pr_debug("cancel processes, pid = %d
for gpu_id = %d", pid_nr(p_info->pid), kfd->id);
+                               cancel_delayed_work_sync(&p_info->restore_userptr_work);

Is this really necessary? If it is, there are probably other workers,
e.g. related to our SVM code, that would need to be canceled as well.


I deleted this and it seems to be OK. It was previously added to suppress restore_userptr_work, which keeps updating the PTEs.
Now that work is stopped by Fix 3. Please let us know if it is OK :) @Felix


+
+                               /* send exception signals to the kfd
+                                * events waiting in user space */
+                               kfd_signal_hw_exception_event(p->pasid);

This makes sense. It basically tells user mode that the application's
GPU state is lost due to a RAS error or a GPU reset, or now a GPU
hot-unplug.

The problem is that it cannot find an event with a type that matches HW_EXCEPTION_TYPE, so the driver does nothing with the default parameter value of send_sigterm = false.
After all, if a “zombie” process (zombie in the sense that it no longer has a GPU device) does not exit, KFD resources seem not to be released properly, and a new KFD process cannot run after plug-back.
(I still need to look hard into the ROCr/hsakmt/KFD driver code to understand the reason. At least I am seeing that the KFD topology won’t be cleaned up without the process exiting, so there would be a “zombie” KFD node in the topology, which may or may not cause issues in hsakmt.)
@Felix Do you have any suggestions/insight on this “zombie” process issue? @Andrey suggests it should be OK to have a “zombie” KFD process and a “zombie” KFD device, and that a new KFD process should be able to run on the new KFD device after plug-back.

May 11 09:52:07 NETSYS26 kernel: [52604.845400] amdgpu: cancel restore_userptr_work
May 11 09:52:07 NETSYS26 kernel: [52604.845405] amdgpu: sending hw exception to pasid = 0x800
May 11 09:52:07 NETSYS26 kernel: [52604.845414] kfd kfd: amdgpu: Process 25894 (pasid 0x8001) got unhandled exception



+                               kfd_signal_vm_fault_event(kfd, p->pasid, NULL);

This does not make sense. A VM fault indicates an access to a bad
virtual address by the GPU. If a debugger is attached to the process, it
notifies the debugger to investigate what went wrong. If the GPU is
gone, that doesn't make any sense. There is no GPU that could have
issued a bad memory request. And the debugger won't be happy either to
find a VM fault from a GPU that doesn't exist any more.

OK understood.


If the HW-exception event doesn't terminate your process, we may need to
look into how ROCr handles the HW-exception events.


+                       }
+               }
+               srcu_read_unlock(&kfd_processes_srcu, idx);
+       }
+
       kfd->dqm->ops.stop(kfd->dqm);
       kfd_iommu_suspend(kfd);

Should DQM stop and IOMMU suspend still be executed? Or should the
hot-unplug case short-circuit them?

I tried short-circuiting them, but that later caused a BUG related to GPU reset. I added the following, which solves the issue on plugout.

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b583026dc893..d78a06d74759 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5317,7 +5317,8 @@ static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
 {
        struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);

-       recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
+       if (!drm_dev_is_unplugged(adev_to_drm(recover_work->adev)))
+               recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
 }
 /*
  * Serialize gpu recover into reset domain single threaded wq

However, after killing the zombie process, it failed to evict the queues of the process.

[  +0.000002] amdgpu: writing 263 to doorbell address 00000000c86e63f2
[  +9.002503] amdgpu: qcm fence wait loop timeout expired
[  +0.001364] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[  +0.001343] amdgpu: Failed to evict process queues
[  +0.001355] amdgpu: Failed to evict queues of pasid 0x8001


This would cause a driver BUG triggered by a new KFD process after plug-back. I am pasting the errors from dmesg after plug-back below.



May 11 10:25:16 NETSYS26 kernel: [  688.445332] amdgpu: Evicting PASID 0x8001 queues
May 11 10:25:16 NETSYS26 kernel: [  688.445359] BUG: unable to handle page fault for address: 000000020000006e
May 11 10:25:16 NETSYS26 kernel: [  688.447516] #PF: supervisor read access in kernel mode
May 11 10:25:16 NETSYS26 kernel: [  688.449627] #PF: error_code(0x0000) - not-present page
May 11 10:25:16 NETSYS26 kernel: [  688.451661] PGD 80000020892a8067 P4D 80000020892a8067 PUD 0
May 11 10:25:16 NETSYS26 kernel: [  688.453741] Oops: 0000 [#1] PREEMPT SMP PTI
May 11 10:25:16 NETSYS26 kernel: [  688.455904] CPU: 25 PID: 9236 Comm: tf_cnn_benchmar Tainted: G        W  OE     5.16.0+ #3
May 11 10:25:16 NETSYS26 kernel: [  688.457406] amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
May 11 10:25:16 NETSYS26 kernel: [  688.457798] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
May 11 10:25:16 NETSYS26 kernel: [  688.461458] RIP: 0010:evict_process_queues_cpsch+0x99/0x1b0 [amdgpu]
May 11 10:25:16 NETSYS26 kernel: [  688.465238] Code: bd 13 8a dd 85 c0 0f 85 13 01 00 00 49 8b 5f 10 4d 8d 77 10 49 39 de 75 11 e9 8d 00 00 00 48 8b 1b 4c 39 f3 0f 84 81 00 00 00 <80> 7b 6e 00 c6 43 6d 01 74 ea c6 43 6e 00 41 83 ac 24 70 01 00 00
May 11 10:25:16 NETSYS26 kernel: [  688.470516] RSP: 0018:ffffb2674c8afbf0 EFLAGS: 00010203
May 11 10:25:16 NETSYS26 kernel: [  688.473255] RAX: ffff91c65cca3800 RBX: 0000000200000000 RCX: 0000000000000001
May 11 10:25:16 NETSYS26 kernel: [  688.475691] RDX: 0000000000000000 RSI: ffffffff9fb712d9 RDI: 00000000ffffffff
May 11 10:25:16 NETSYS26 kernel: [  688.478564] RBP: ffffb2674c8afc20 R08: 0000000000000000 R09: 000000000006ba18
May 11 10:25:16 NETSYS26 kernel: [  688.481409] R10: 00007fe5a0000000 R11: ffffb2674c8af918 R12: ffff91c66d6f5800
May 11 10:25:16 NETSYS26 kernel: [  688.484254] R13: ffff91c66d6f5938 R14: ffff91e5c71ac820 R15: ffff91e5c71ac810
May 11 10:25:16 NETSYS26 kernel: [  688.487184] FS:  00007fe62124a700(0000) GS:ffff92053fd00000(0000) knlGS:0000000000000000
May 11 10:25:16 NETSYS26 kernel: [  688.490308] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 11 10:25:16 NETSYS26 kernel: [  688.493122] CR2: 000000020000006e CR3: 0000002095284004 CR4: 00000000001706e0
May 11 10:25:16 NETSYS26 kernel: [  688.496142] Call Trace:
May 11 10:25:16 NETSYS26 kernel: [  688.499199]  <TASK>
May 11 10:25:16 NETSYS26 kernel: [  688.502261]  kfd_process_evict_queues+0x43/0xf0 [amdgpu]
May 11 10:25:16 NETSYS26 kernel: [  688.506378]  kgd2kfd_quiesce_mm+0x2a/0x60 [amdgpu]
May 11 10:25:16 NETSYS26 kernel: [  688.510539]  amdgpu_amdkfd_evict_userptr+0x46/0x80 [amdgpu]
May 11 10:25:16 NETSYS26 kernel: [  688.514110]  amdgpu_mn_invalidate_hsa+0x9c/0xb0 [amdgpu]
May 11 10:25:16 NETSYS26 kernel: [  688.518247]  __mmu_notifier_invalidate_range_start+0x136/0x1e0
May 11 10:25:16 NETSYS26 kernel: [  688.521252]  change_protection+0x41d/0xcd0
May 11 10:25:16 NETSYS26 kernel: [  688.524310]  change_prot_numa+0x19/0x30
May 11 10:25:16 NETSYS26 kernel: [  688.527366]  task_numa_work+0x1ca/0x330
May 11 10:25:16 NETSYS26 kernel: [  688.530157]  task_work_run+0x6c/0xa0
May 11 10:25:16 NETSYS26 kernel: [  688.533124]  exit_to_user_mode_prepare+0x1af/0x1c0
May 11 10:25:16 NETSYS26 kernel: [  688.536058]  syscall_exit_to_user_mode+0x2a/0x40
May 11 10:25:16 NETSYS26 kernel: [  688.538989]  do_syscall_64+0x46/0xb0
May 11 10:25:16 NETSYS26 kernel: [  688.541830]  entry_SYSCALL_64_after_hwframe+0x44/0xae
May 11 10:25:16 NETSYS26 kernel: [  688.544701] RIP: 0033:0x7fe6585ec317
May 11 10:25:16 NETSYS26 kernel: [  688.547297] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
May 11 10:25:16 NETSYS26 kernel: [  688.553183] RSP: 002b:00007fe6212494c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
May 11 10:25:16 NETSYS26 kernel: [  688.556105] RAX: ffffffffffffffc2 RBX: 0000000000000000 RCX: 00007fe6585ec317
May 11 10:25:16 NETSYS26 kernel: [  688.558970] RDX: 00007fe621249540 RSI: 00000000c0584b02 RDI: 0000000000000003
May 11 10:25:16 NETSYS26 kernel: [  688.561950] RBP: 00007fe621249540 R08: 0000000000000000 R09: 0000000000040000
May 11 10:25:16 NETSYS26 kernel: [  688.564563] R10: 00007fe617480000 R11: 0000000000000246 R12: 00000000c0584b02
May 11 10:25:16 NETSYS26 kernel: [  688.567494] R13: 0000000000000003 R14: 0000000000000064 R15: 00007fe621249920
May 11 10:25:16 NETSYS26 kernel: [  688.570470]  </TASK>
May 11 10:25:16 NETSYS26 kernel: [  688.573380] Modules linked in: amdgpu(OE) veth nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo intel_rapl_msr intel_rapl_common sb_edac snd_hda_codec_hdmi x86_pkg_temp_thermal snd_hda_intel intel_powerclamp snd_intel_dspcfg ipmi_ssif coretemp snd_hda_codec kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_timer snd soundcore ftdi_sio irqbypass rapl intel_cstate usbserial joydev mei_me input_leds mei iTCO_wdt iTCO_vendor_support lpc_ich ipmi_si ipmi_devintf mac_hid acpi_power_meter ipmi_msghandler sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456
May 11 10:25:16 NETSYS26 kernel: [  688.573543]  async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper drm_kms_helper syscopyarea hid_generic crct10dif_pclmul crc32_pclmul sysfillrect ghash_clmulni_intel sysimgblt fb_sys_fops uas usbhid aesni_intel crypto_simd igb ahci hid drm usb_storage cryptd libahci dca megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
May 11 10:25:16 NETSYS26 kernel: [  688.611083] CR2: 000000020000006e
May 11 10:25:16 NETSYS26 kernel: [  688.614454] ---[ end trace 349cf28efb6268bc ]---

Looking forward to the comments.

Regards,
Shuotao



[  +0.001645] BUG: unable to handle page fault for address:
0000000000058a68
[  +0.001298] #PF: supervisor read access in kernel mode
[  +0.001252] #PF: error_code(0x0000) - not-present page
[  +0.001248] PGD 8000000115806067 P4D 8000000115806067 PUD
109b2d067 PMD 0
[  +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI
[  +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: G
  W   E     5.16.0+ #3
[  +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS
1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
[  +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f
44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0
09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c
2e ca 85
[  +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
[  +0.001388] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX:
00000000ffffffff
[  +0.001402] RDX: 0000000000000000 RSI: 000000000001629a RDI:
ffff8b0c9c840000
[  +0.001418] RBP: ffffb58fac313948 R08: 0000000000000021 R09:
0000000000000001
[  +0.001421] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12:
0000000000058a68
[  +0.001400] R13: 000000000001629a R14: 0000000000000000 R15:
000000000001629a
[  +0.001397] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000)
knlGS:0000000000000000
[  +0.001411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.001405] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4:
00000000001706e0
[  +0.001422] Call Trace:
[  +0.001407]  <TASK>
[  +0.001391]  amdgpu_device_rreg+0x17/0x20 [amdgpu]
[  +0.001614]  amdgpu_cgs_read_register+0x14/0x20 [amdgpu]
[  +0.001735]  phm_wait_for_register_unequal.part.1+0x58/0x90 [amdgpu]
[  +0.001790]  phm_wait_for_register_unequal+0x1a/0x30 [amdgpu]
[  +0.001800]  vega20_wait_for_response+0x28/0x80 [amdgpu]
[  +0.001757]  vega20_send_msg_to_smc_with_parameter+0x21/0x110 [amdgpu]
[  +0.001838]  smum_send_msg_to_smc_with_parameter+0xcd/0x100 [amdgpu]
[  +0.001829]  ? kvfree+0x1e/0x30
[  +0.001462]  vega20_set_power_profile_mode+0x58/0x330 [amdgpu]
[  +0.001868]  ? kvfree+0x1e/0x30
[  +0.001462]  ? ttm_bo_release+0x261/0x370 [ttm]
[  +0.001467]  pp_dpm_switch_power_profile+0xc2/0x170 [amdgpu]
[  +0.001863]  amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]
[  +0.001866]  amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]
[  +0.001784]  kfd_dec_compute_active+0x2c/0x50 [amdgpu]
[  +0.001744]  process_termination_cpsch+0x2f9/0x3a0 [amdgpu]
[  +0.001728]  kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu]
[  +0.001730]  kfd_process_notifier_release+0x91/0xe0 [amdgpu]
[  +0.001718]  __mmu_notifier_release+0x77/0x1f0
[  +0.001411]  exit_mmap+0x1b5/0x200
[  +0.001396]  ? __switch_to+0x12d/0x3e0
[  +0.001388]  ? __switch_to_asm+0x36/0x70
[  +0.001372]  ? preempt_count_add+0x74/0xc0
[  +0.001364]  mmput+0x57/0x110
[  +0.001349]  do_exit+0x33d/0xc20
[  +0.001337]  ? _raw_spin_unlock+0x1a/0x30
[  +0.001346]  do_group_exit+0x43/0xa0
[  +0.001341]  get_signal+0x131/0x920
[  +0.001295]  arch_do_signal_or_restart+0xb1/0x870
[  +0.001303]  ? do_futex+0x125/0x190
[  +0.001285]  exit_to_user_mode_prepare+0xb1/0x1c0
[  +0.001282]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.001264]  do_syscall_64+0x46/0xb0
[  +0.001236]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.001219] RIP: 0033:0x7f6aff1d2ad3
[  +0.001177] Code: Unable to access opcode bytes at RIP 0x7f6aff1d2aa9.
[  +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS: 00000246 ORIG_RAX:
00000000000000ca
[  +0.001170] RAX: fffffffffffffe00 RBX: 0000000004f542b0 RCX:
00007f6aff1d2ad3
[  +0.001168] RDX: 0000000000000000 RSI: 0000000000000080 RDI:
0000000004f542d8
[  +0.001162] RBP: 0000000004f542d4 R08: 0000000000000000 R09:
0000000000000000
[  +0.001152] R10: 0000000000000000 R11: 0000000000000246 R12:
0000000004f542d8
[  +0.001176] R13: 0000000000000000 R14: 0000000004f54288 R15:
0000000000000000
[  +0.001152]  </TASK>
[  +0.001113] Modules linked in: veth amdgpu(E) nf_conntrack_netlink
nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM
iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4
xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter
ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload
esp4 xfrm_algo intel_rapl_msr intel_rapl_common sb_edac
x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi
snd_hda_intel ipmi_ssif snd_intel_dspcfg coretemp snd_hda_codec
kvm_intel snd_hda_core snd_hwdep snd_pcm snd_timer snd kvm soundcore
irqbypass ftdi_sio usbserial input_leds iTCO_wdt iTCO_vendor_support
joydev mei_me rapl lpc_ich intel_cstate mei ipmi_si ipmi_devintf
ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel ib_iser
rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi
scsi_transport_iscsi ip_tables x_tables autofs4 btrfs
blake2b_generic zstd_compress raid10 raid456
[  +0.000102]  async_raid6_recov async_memcpy async_pq async_xor
async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear
iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
crct10dif_pclmul hid_generic crc32_pclmul ghash_clmulni_intel usbhid
uas aesni_intel crypto_simd igb ahci hid drm usb_storage cryptd
libahci dca megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
[  +0.016626] CR2: 0000000000058a68
[  +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---
[  +0.024953] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
[  +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f
44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0
09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c
2e ca 85
[  +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
[  +0.001641] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX:
00000000ffffffff
[  +0.001656] RDX: 0000000000000000 RSI: 000000000001629a RDI:
ffff8b0c9c840000
[  +0.001681] RBP: ffffb58fac313948 R08: 0000000000000021 R09:
0000000000000001
[  +0.001662] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12:
0000000000058a68
[  +0.001650] R13: 000000000001629a R14: 0000000000000000 R15:
000000000001629a
[  +0.001648] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000)
knlGS:0000000000000000
[  +0.001668] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.001673] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4:
00000000001706e0
[  +0.001740] Fixing recursive fault but reboot is needed!


On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:

I retested the hot plug tests at the commit I mentioned below - looks
ok. My ASIC is Navi 10; I also tested using Vega 10 and older
Polaris ASICs (whatever I had at home at the time). It's possible
there are extra issues in ASICs like yours which I didn't cover during
tests.

andrey@andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test -s 13
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support UVD, suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support VCE, suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support UVD ENC, suite disabled.
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


Don't support TMZ (trust memory zone), security suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
Peer device is not opened or has ASIC not supported by the suite,
skip all Peer to Peer tests.


CUnit - A unit testing framework for C - Version 2.1-3
http://cunit.sourceforge.net/


Suite: Hotunplug Tests
  Test: Unplug card and rescan the bus to plug it back
  .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
  passed
  Test: Same as first test but with command submission
  .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
  passed
  Test: Unplug with exported bo
  .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
  passed

Run Summary:    Type  Total    Ran  Passed  Failed  Inactive
              suites     14      1     n/a       0         0
               tests     71      3       3       0         1
             asserts     21     21      21       0       n/a

Elapsed time = 9.195 seconds


Andrey

On 2022-04-20 11:44, Andrey Grodzovsky wrote:

The only issue on the Radeon 7 I see is the same sysfs crash we already
fixed, so you can use the same fix. The MI200 issue I haven't seen
yet, but I also haven't tested MI200, so I never saw it before. I need
to test it when I get the time.

So try that fix with the Radeon 7 again to see if you pass the tests
(the warnings should all be minor issues).

Andrey


On 2022-04-20 05:24, Shuotao Xu wrote:

That's a problem. The latest working baseline I tested and confirmed
passing the hotplug tests is this branch and commit
https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6,
which is amd-staging-drm-next. 5.14 was the branch we upstreamed the
hotplug code on, but it accumulated a lot of regressions over time due
to new changes (that's why I added the hotplug test, to try and catch
them early). It would be best to run this branch on MI-100 so we
have a clean baseline, and only after confirming that this particular
branch at this commit passes the libdrm tests, start
adding the KFD-specific addons. Another option, if you can't work
with MI-100 and this branch, is to try a different ASIC that does
work with this branch (if possible).

Andrey

OK, I tried both this commit and the HEAD of amd-staging-drm-next
on two GPUs (MI100 and Radeon VII); neither passed the hot-unplug
libdrm test. I might be able to gain access to an MI200, but I
suspect it would not work either.

I have copied the complete dmesg output below and highlighted the
oopses for you.

Radeon VII:


[-- Attachment #2: Type: text/html, Size: 69321 bytes --]

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-05-11  3:35                                           ` Shuotao Xu
@ 2022-05-11 13:49                                             ` Andrey Grodzovsky
  2022-05-11 16:49                                               ` Felix Kuehling
  0 siblings, 1 reply; 31+ messages in thread
From: Andrey Grodzovsky @ 2022-05-11 13:49 UTC (permalink / raw)
  To: Shuotao Xu, Felix Kuehling
  Cc: Mukul.Joshi, Peng Cheng, amd-gfx, Lei Qu, Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 36850 bytes --]


On 2022-05-10 23:35, Shuotao Xu wrote:
>
>
>> On May 11, 2022, at 4:31 AM, Felix Kuehling <felix.kuehling@amd.com> 
>> wrote:
>>
>>
>> Am 2022-05-10 um 07:03 schrieb Shuotao Xu:
>>>
>>>
>>>> On Apr 28, 2022, at 12:04 AM, Andrey Grodzovsky
>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>
>>>> On 2022-04-27 05:20, Shuotao Xu wrote:
>>>>
>>>>> Hi Andrey,
>>>>>
>>>>> Sorry that I did not have time to work on this for a few days.
>>>>>
>>>>> I just tried the sysfs crash fix on Radeon VII and it seems that it
>>>>> worked. It did not pass the last hotplug test, but my version has 4
>>>>> tests instead of 3 in your case.
>>>>
>>>>
>>>> That's because the 4th one is only enabled when there are 2 cards in
>>>> the system - to test DRI_PRIME export. I tested this time with only one
>>>> card.
>>>>
>>> Yes, I only had one Radeon VII in my system, so this 4th test should
>>> have been skipped. I am ignoring this issue.
>>>
>>>>>
>>>>>
>>>>> Suite: Hotunplug Tests
>>>>> Test: Unplug card and rescan the bus to plug it back
>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>> passed
>>>>> Test: Same as first test but with command submission
>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>> passed
>>>>> Test: Unplug with exported bo
>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>> passed
>>>>> Test: Unplug with exported fence
>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>> amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
>>>>
>>>>
>>>> on the kernel side - the IOCTL returning this is drm_getclient -
>>>> maybe take a look at why it can't find the client? I didn't have such
>>>> an issue as far as I remember when testing.
>>>>
>>>>
>>>>> FAILED
>>>>> 1. ../tests/amdgpu/hotunplug_tests.c:368 - CU_ASSERT_EQUAL(r,0)
>>>>> 2. ../tests/amdgpu/hotunplug_tests.c:411 -
>>>>> CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd,
>>>>> &sync_obj_handle2),0)
>>>>> 3. ../tests/amdgpu/hotunplug_tests.c:423 -
>>>>> CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2,
>>>>> 1, 100000000, 0, NULL),0)
>>>>> 4. ../tests/amdgpu/hotunplug_tests.c:425 -
>>>>> CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2, 
>>>>> sync_obj_handle2),0)
>>>>>
>>>>> Run Summary:    Type  Total    Ran  Passed  Failed  Inactive
>>>>>               suites     14      1     n/a       0         0
>>>>>                tests     71      4       3       1         0
>>>>>              asserts     39     39      35       4       n/a
>>>>>
>>>>> Elapsed time = 17.321 seconds
>>>>>
>>>>> For kfd compute, there is a problem which I did not see on MI100
>>>>> after I killed the hung application after hot plug-out. I was using
>>>>> the rocm5.0.2 driver for the MI100 card, and I am not sure if it is
>>>>> a regression from the newer driver.
>>>>> After pkill, one of the children of the user process would be stuck
>>>>> in zombie mode (Z), understandably because of the bug, and future
>>>>> rocm applications after plug-back would be in uninterruptible sleep
>>>>> mode (D) because they would not return from the syscall to kfd.
>>>>>
>>>>> The drm tests for amdgpu, however, run just fine without issues
>>>>> after plug-back with the dangling kfd state.
>>>>
>>>>
>>>> I am not clear on when the crash below happens. Is it related to what
>>>> you describe above?
>>>>
>>>>
>>>>>
>>>>> I don't know if there is a quick fix to it. I was thinking of adding
>>>>> drm_dev_enter/drm_dev_exit to amdgpu_device_rreg.
>>>>
>>>>
>>>> Try adding a drm_dev_enter/exit pair at the highest level of
>>>> attempting to access HW - in this case it's
>>>> amdgpu_amdkfd_set_compute_idle. We always try to avoid accessing any
>>>> HW functions after the backing device is gone.
>>>>
>>>>
>>>>> Also, I have been trying for a long time to fix the hotplug issue
>>>>> for kfd applications.
>>>>> I don't know 1) if I will be able to get access to an MI100 (fixing
>>>>> Radeon VII would mean something, but MI100 is more important for us);
>>>>> 2) in what direction the patch for this issue will move forward.
>>>>
>>>>
>>>> I will go to the office tomorrow to pick up an MI-100. With time and
>>>> priorities permitting, I will then try to test it and fix any
>>>> bugs such that it will pass all hotplug libdrm tests at the
>>>> tip of the public amd-staging-drm-next -
>>>> https://gitlab.freedesktop.org/agd5f/linux - and after that you can
>>>> try to continue working on the ROCm enabling on top of that.
>>>>
>>>> For now I suggest you move on with the Radeon 7 as your development
>>>> ASIC and use the fix I mentioned above.
>>>>
>>> I finally got some time to continue on the kfd hotplug patch attempt.
>>> The following patch seems to work for kfd hotplug on Radeon VII. After
>>> hot plug-out, the tf process exits because of a vm fault.
>>> A new tf process runs without issues after plug-back.
>>>
>>> It has the following fixes.
>>>
>>> 1. ras sysfs regression;
>>> 2. skip setting compute idle after the dev is unplugged, otherwise it
>>>    will try to write the pci bar and thus fault in the driver;
>>> 3. stop the actual work of invalidating memory maps triggered by
>>>    userptrs (returning false would trigger a warning, so I returned
>>>    true; not sure if that is correct);
>>> 4. send exceptions to all the events/signals that a “zombie”
>>>    process is waiting for. (Not sure if the hw_exception is
>>>    worthwhile; it did not do anything in my case since there is no
>>>    such event type associated with that process.)
>>>
>>> Please take a look and let me know if it acceptable.
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> index 1f8161cd507f..2f7858692067 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> @@ -33,6 +33,7 @@
>>> #include <uapi/linux/kfd_ioctl.h>
>>> #include "amdgpu_ras.h"
>>> #include "amdgpu_umc.h"
>>> +#include <drm/drm_drv.h>
>>>
>>> /* Total memory size in system memory and all GPU VRAM. Used to
>>>  * estimate worst case amount of memory to reserve for page tables
>>> @@ -681,9 +682,10 @@ int amdgpu_amdkfd_submit_ib(struct amdgpu_device
>>> *adev,
>>>
>>> void amdgpu_amdkfd_set_compute_idle(struct amdgpu_device *adev, bool
>>> idle)
>>> {
>>> -       amdgpu_dpm_switch_power_profile(adev,
>>> - PP_SMC_POWER_PROFILE_COMPUTE,
>>> -                                       !idle);
>>> +       if (!drm_dev_is_unplugged(adev_to_drm(adev)))
>>> +               amdgpu_dpm_switch_power_profile(adev,
>>> + PP_SMC_POWER_PROFILE_COMPUTE,
>>> +                                               !idle);
>>> }
>>>
>>> bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, u32 vmid)
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> index 4b153daf283d..fb4c9e55eace 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> @@ -46,6 +46,7 @@
>>> #include <linux/firmware.h>
>>> #include <linux/module.h>
>>> #include <drm/drm.h>
>>> +#include <drm/drm_drv.h>
>>>
>>> #include "amdgpu.h"
>>> #include "amdgpu_amdkfd.h"
>>> @@ -104,6 +105,9 @@ static bool amdgpu_mn_invalidate_hsa(struct
>>> mmu_interval_notifier *mni,
>>>        struct amdgpu_bo *bo = container_of(mni, struct amdgpu_bo,
>>> notifier);
>>>        struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
>>>
>>> +       if (drm_dev_is_unplugged(adev_to_drm(adev)))
>>> +               return true;
>>> +
> Label: Fix 3
>>>        if (!mmu_notifier_range_blockable(range))
>>>                return false;
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> index cac56f830aed..fbbaaabf3a67 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> @@ -1509,7 +1509,6 @@ static int amdgpu_ras_fs_fini(struct
>>> amdgpu_device *adev)
>>>                }
>>>        }
>>>
>>> -       amdgpu_ras_sysfs_remove_all(adev);
>>>        return 0;
>>> }
>>> /* ras fs end */
>>> @@ -2557,8 +2556,6 @@ void amdgpu_ras_block_late_fini(struct
>>> amdgpu_device *adev,
>>>        if (!ras_block)
>>>                return;
>>>
>>> -       amdgpu_ras_sysfs_remove(adev, ras_block);
>>> -
>>>        ras_obj = container_of(ras_block, struct
>>> amdgpu_ras_block_object, ras_comm);
>>>        if (ras_obj->ras_cb)
>>>                amdgpu_ras_interrupt_remove_handler(adev, ras_block);
>>> @@ -2659,6 +2656,7 @@ int amdgpu_ras_pre_fini(struct amdgpu_device 
>>> *adev)
>>>        /* Need disable ras on all IPs here before ip [hw/sw]fini */
>>>        amdgpu_ras_disable_all_features(adev, 0);
>>>        amdgpu_ras_recovery_fini(adev);
>>> +       amdgpu_ras_sysfs_remove_all(adev);
>>>        return 0;
>>> }
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index f1a225a20719..4b789bec9670 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -714,16 +714,37 @@ bool kfd_is_locked(void)
>>>
>>> void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>> {
>>> +       struct kfd_process *p;
>>> +       struct amdkfd_process_info *p_info;
>>> +       unsigned int temp;
>>> +
>>>        if (!kfd->init_complete)
>>>                return;
>>>
>>>        /* for runtime suspend, skip locking kfd */
>>> -       if (!run_pm) {
>>> +       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>                /* For first KFD device suspend all the KFD processes */
>>>                if (atomic_inc_return(&kfd_locked) == 1)
>>>                        kfd_suspend_all_processes();
>>>        }
>>>
>>> +       if (drm_dev_is_unplugged(kfd->ddev)){
>>> +               int idx = srcu_read_lock(&kfd_processes_srcu);
>>> +               pr_debug("cancel restore_userptr_work\n");
>>> +               hash_for_each_rcu(kfd_processes_table, temp, p,
>>> kfd_processes) {
>>> +                       if (kfd_process_gpuidx_from_gpuid(p, kfd->id)
>>> >= 0) {
>>> +                               p_info = p->kgd_process_info;
>>> +                               pr_debug("cancel processes, pid = %d
>>> for gpu_id = %d", pid_nr(p_info->pid), kfd->id);
>>> + cancel_delayed_work_sync(&p_info->restore_userptr_work);
>>
>> Is this really necessary? If it is, there are probably other workers,
>> e.g. related to our SVM code, that would need to be canceled as well.
>>
>
> I deleted this and it seems to be OK. It was previously added to 
> suppress restore_userptr_work, which keeps updating PTEs.
> Now this is handled by Fix 3. Please let us know if it is OK :) @Felix
>
>>
>>> +
>>> + /* send exception signals to the kfd
>>> events waiting in user space */
>>> + kfd_signal_hw_exception_event(p->pasid);
>>
>> This makes sense. It basically tells user mode that the application's
>> GPU state is lost due to a RAS error or a GPU reset, or now a GPU
>> hot-unplug.
>
> The problem is that it cannot find an event with a type that matches 
> HW_EXCEPTION_TYPE, so the driver does **nothing** with the 
> default parameter value of send_sigterm = false.
> After all, if a “zombie” process (zombie in the sense that it does not 
> have a GPU dev) does not exit, kfd resources do not seem to be released 
> properly, and a new kfd process cannot run after plug-back.
> (I still need to look hard into the rocr/hsakmt/kfd driver code to 
> understand the reason. At least I am seeing that the kfd topology 
> won’t be cleaned up without the process exiting, so there would be a 
> “zombie" kfd node in the topology, which may or may not cause issues 
> in hsakmt.)
> @Felix Do you have any suggestion/insight on this “zombie" process 
> issue? @Andrey suggests it should be OK to have a “zombie” kfd process 
> and a “zombie” kfd dev, and the new kfd process should be OK to run on 
> the new kfd dev after plug-back.


My experience with the graphics stack, at least, showed that. In a 
setup with 2 GPUs, if I removed a secondary GPU which had a rendering 
process on it, I could plug the secondary GPU back and start a new 
rendering process while the old zombie process was still present. It 
could be that in the KFD case there are some obstacles to this that 
need to be resolved.

Andrey


>
> May 11 09:52:07 NETSYS26 kernel: [52604.845400] amdgpu: cancel 
> restore_userptr_work
> May 11 09:52:07 NETSYS26 kernel: [52604.845405] amdgpu: sending hw 
> exception to pasid = 0x800
> May 11 09:52:07 NETSYS26 kernel: [52604.845414] kfd kfd: amdgpu: 
> Process 25894 (pasid 0x8001) got unhandled exception
>
>>
>>
>>> + kfd_signal_vm_fault_event(kfd, p->pasid, NULL);
>>
>> This does not make sense. A VM fault indicates an access to a bad
>> virtual address by the GPU. If a debugger is attached to the process, it
>> notifies the debugger to investigate what went wrong. If the GPU is
>> gone, that doesn't make any sense. There is no GPU that could have
>> issued a bad memory request. And the debugger won't be happy either to
>> find a VM fault from a GPU that doesn't exist any more.
>
> OK understood.
>
>>
>> If the HW-exception event doesn't terminate your process, we may need to
>> look into how ROCr handles the HW-exception events.
>>
>>
>>> + }
>>> + }
>>> + srcu_read_unlock(&kfd_processes_srcu, idx);
>>> + }
>>> +
>>> kfd->dqm->ops.stop(kfd->dqm);
>>> kfd_iommu_suspend(kfd);
>>
>> Should DQM stop and IOMMU suspend still be executed? Or should the
>> hot-unplug case short-circuit them?
>
> I tried short-circuiting them, but that later caused a BUG related to 
> GPU reset. I added the following, which solves the issue on plug-out.
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b583026dc893..d78a06d74759 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5317,7 +5317,8 @@ static void 
> amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
>  {
>         struct amdgpu_recover_work_struct *recover_work = 
> container_of(work, struct amdgpu_recover_work_struct, base);
>
> -       recover_work->ret = 
> amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
> +       if (!drm_dev_is_unplugged(adev_to_drm(recover_work->adev)))
> +               recover_work->ret = 
> amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
>  }
>  /*
>   * Serialize gpu recover into reset domain single threaded wq
>
> However, after killing the zombie process, it failed to evict the 
> queues of the process.
>
> [  +0.000002] amdgpu: writing 263 to doorbell address 00000000c86e63f2
> [  +9.002503] amdgpu: qcm fence wait loop timeout expired
> [  +0.001364] amdgpu: The cp might be in an unrecoverable state due to 
> an unsuccessful queues preemption
> [  +0.001343] amdgpu: Failed to evict process queues
> [  +0.001355] amdgpu: Failed to evict queues of pasid 0x8001
>
>
> This then causes a driver BUG triggered by a new kfd process after 
> plug-back. I am pasting the dmesg errors after plug-back below.
>
>
>
> May 11 10:25:16 NETSYS26 kernel: [  688.445332] amdgpu: Evicting PASID 
> 0x8001 queues
> May 11 10:25:16 NETSYS26 kernel: [  688.445359] BUG: unable to handle 
> page fault for address: 000000020000006e
> May 11 10:25:16 NETSYS26 kernel: [  688.447516] #PF: supervisor read 
> access in kernel mode
> May 11 10:25:16 NETSYS26 kernel: [  688.449627] #PF: 
> error_code(0x0000) - not-present page
> May 11 10:25:16 NETSYS26 kernel: [  688.451661] PGD 80000020892a8067 
> P4D 80000020892a8067 PUD 0
> May 11 10:25:16 NETSYS26 kernel: [  688.453741] Oops: 0000 [#1] 
> PREEMPT SMP PTI
> May 11 10:25:16 NETSYS26 kernel: [  688.455904] CPU: 25 PID: 9236 
> Comm: tf_cnn_benchmar Tainted: G        W  OE 5.16.0+ #3
> May 11 10:25:16 NETSYS26 kernel: [  688.457406] amdgpu 0000:05:00.0: 
> amdgpu: GPU reset begin!
> May 11 10:25:16 NETSYS26 kernel: [  688.457798] Hardware name: Dell 
> Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
> May 11 10:25:16 NETSYS26 kernel: [  688.461458] RIP: 
> 0010:evict_process_queues_cpsch+0x99/0x1b0 [amdgpu]
> May 11 10:25:16 NETSYS26 kernel: [  688.465238] Code: bd 13 8a dd 85 
> c0 0f 85 13 01 00 00 49 8b 5f 10 4d 8d 77 10 49 39 de 75 11 e9 8d 00 
> 00 00 48 8b 1b 4c 39 f3 0f 84 81 00 00 00 <80> 7b 6e 00 c6 43 6d 01 74 
> ea c6 43 6e 00 41 83 ac 24 70 01 00 00
> May 11 10:25:16 NETSYS26 kernel: [  688.470516] RSP: 
> 0018:ffffb2674c8afbf0 EFLAGS: 00010203
> May 11 10:25:16 NETSYS26 kernel: [  688.473255] RAX: ffff91c65cca3800 
> RBX: 0000000200000000 RCX: 0000000000000001
> May 11 10:25:16 NETSYS26 kernel: [  688.475691] RDX: 0000000000000000 
> RSI: ffffffff9fb712d9 RDI: 00000000ffffffff
> May 11 10:25:16 NETSYS26 kernel: [  688.478564] RBP: ffffb2674c8afc20 
> R08: 0000000000000000 R09: 000000000006ba18
> May 11 10:25:16 NETSYS26 kernel: [  688.481409] R10: 00007fe5a0000000 
> R11: ffffb2674c8af918 R12: ffff91c66d6f5800
> May 11 10:25:16 NETSYS26 kernel: [  688.484254] R13: ffff91c66d6f5938 
> R14: ffff91e5c71ac820 R15: ffff91e5c71ac810
> May 11 10:25:16 NETSYS26 kernel: [  688.487184] FS: 
>  00007fe62124a700(0000) GS:ffff92053fd00000(0000) knlGS:0000000000000000
> May 11 10:25:16 NETSYS26 kernel: [  688.490308] CS:  0010 DS: 0000 ES: 
> 0000 CR0: 0000000080050033
> May 11 10:25:16 NETSYS26 kernel: [  688.493122] CR2: 000000020000006e 
> CR3: 0000002095284004 CR4: 00000000001706e0
> May 11 10:25:16 NETSYS26 kernel: [  688.496142] Call Trace:
> May 11 10:25:16 NETSYS26 kernel: [  688.499199]  <TASK>
> May 11 10:25:16 NETSYS26 kernel: [  688.502261] 
>  kfd_process_evict_queues+0x43/0xf0 [amdgpu]
> May 11 10:25:16 NETSYS26 kernel: [  688.506378] 
>  kgd2kfd_quiesce_mm+0x2a/0x60 [amdgpu]
> May 11 10:25:16 NETSYS26 kernel: [  688.510539] 
>  amdgpu_amdkfd_evict_userptr+0x46/0x80 [amdgpu]
> May 11 10:25:16 NETSYS26 kernel: [  688.514110] 
>  amdgpu_mn_invalidate_hsa+0x9c/0xb0 [amdgpu]
> May 11 10:25:16 NETSYS26 kernel: [  688.518247] 
>  __mmu_notifier_invalidate_range_start+0x136/0x1e0
> May 11 10:25:16 NETSYS26 kernel: [  688.521252] 
>  change_protection+0x41d/0xcd0
> May 11 10:25:16 NETSYS26 kernel: [  688.524310] 
>  change_prot_numa+0x19/0x30
> May 11 10:25:16 NETSYS26 kernel: [  688.527366] 
>  task_numa_work+0x1ca/0x330
> May 11 10:25:16 NETSYS26 kernel: [  688.530157]  task_work_run+0x6c/0xa0
> May 11 10:25:16 NETSYS26 kernel: [  688.533124] 
>  exit_to_user_mode_prepare+0x1af/0x1c0
> May 11 10:25:16 NETSYS26 kernel: [  688.536058] 
>  syscall_exit_to_user_mode+0x2a/0x40
> May 11 10:25:16 NETSYS26 kernel: [  688.538989]  do_syscall_64+0x46/0xb0
> May 11 10:25:16 NETSYS26 kernel: [  688.541830] 
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> May 11 10:25:16 NETSYS26 kernel: [  688.544701] RIP: 0033:0x7fe6585ec317
> May 11 10:25:16 NETSYS26 kernel: [  688.547297] Code: b3 66 90 48 8b 
> 05 71 4b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 
> 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 
> 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
> May 11 10:25:16 NETSYS26 kernel: [  688.553183] RSP: 
> 002b:00007fe6212494c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> May 11 10:25:16 NETSYS26 kernel: [  688.556105] RAX: ffffffffffffffc2 
> RBX: 0000000000000000 RCX: 00007fe6585ec317
> May 11 10:25:16 NETSYS26 kernel: [  688.558970] RDX: 00007fe621249540 
> RSI: 00000000c0584b02 RDI: 0000000000000003
> May 11 10:25:16 NETSYS26 kernel: [  688.561950] RBP: 00007fe621249540 
> R08: 0000000000000000 R09: 0000000000040000
> May 11 10:25:16 NETSYS26 kernel: [  688.564563] R10: 00007fe617480000 
> R11: 0000000000000246 R12: 00000000c0584b02
> May 11 10:25:16 NETSYS26 kernel: [  688.567494] R13: 0000000000000003 
> R14: 0000000000000064 R15: 00007fe621249920
> May 11 10:25:16 NETSYS26 kernel: [  688.570470]  </TASK>
> May 11 10:25:16 NETSYS26 kernel: [  688.573380] Modules linked in: 
> amdgpu(OE) veth nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype 
> br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat 
> nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 
> ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter 
> ebtables ip6table_filter ip6_tables iptable_filter overlay 
> esp6_offload esp6 esp4_offload esp4 xfrm_algo intel_rapl_msr 
> intel_rapl_common sb_edac snd_hda_codec_hdmi x86_pkg_temp_thermal 
> snd_hda_intel intel_powerclamp snd_intel_dspcfg ipmi_ssif coretemp 
> snd_hda_codec kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_timer 
> snd soundcore ftdi_sio irqbypass rapl intel_cstate usbserial joydev 
> mei_me input_leds mei iTCO_wdt iTCO_vendor_support lpc_ich ipmi_si 
> ipmi_devintf mac_hid acpi_power_meter ipmi_msghandler sch_fq_codel 
> ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi 
> scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic 
> zstd_compress raid10 raid456
> May 11 10:25:16 NETSYS26 kernel: [  688.573543]  async_raid6_recov 
> async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 
> raid0 multipath linear iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm 
> drm_shmem_helper drm_kms_helper syscopyarea hid_generic 
> crct10dif_pclmul crc32_pclmul sysfillrect ghash_clmulni_intel 
> sysimgblt fb_sys_fops uas usbhid aesni_intel crypto_simd igb ahci hid 
> drm usb_storage cryptd libahci dca megaraid_sas i2c_algo_bit wmi [last 
> unloaded: amdgpu]
> May 11 10:25:16 NETSYS26 kernel: [  688.611083] CR2: 000000020000006e
> May 11 10:25:16 NETSYS26 kernel: [  688.614454] ---[ end trace 
> 349cf28efb6268bc ]---
>
> Looking forward to the comments.
>
> Regards,
> Shuotao
>
>>
>> Regards,
>> Felix
>>
>>
>>> }
>>>
>>> Regards,
>>> Shuotao
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>> Regards,
>>>>> Shuotao
>>>>>
>>>>> [  +0.001645] BUG: unable to handle page fault for address:
>>>>> 0000000000058a68
>>>>> [  +0.001298] #PF: supervisor read access in kernel mode
>>>>> [  +0.001252] #PF: error_code(0x0000) - not-present page
>>>>> [  +0.001248] PGD 8000000115806067 P4D 8000000115806067 PUD
>>>>> 109b2d067 PMD 0
>>>>> [  +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI
>>>>> [  +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: G
>>>>>   W   E     5.16.0+ #3
>>>>> [  +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS
>>>>> 1.5.4 [FPGA Test BIOS] 10/002/2015
>>>>> [  +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
>>>>> [  +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f
>>>>> 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0
>>>>> 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c
>>>>> 2e ca 85
>>>>> [  +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
>>>>> [  +0.001388] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX:
>>>>> 00000000ffffffff
>>>>> [  +0.001402] RDX: 0000000000000000 RSI: 000000000001629a RDI:
>>>>> ffff8b0c9c840000
>>>>> [  +0.001418] RBP: ffffb58fac313948 R08: 0000000000000021 R09:
>>>>> 0000000000000001
>>>>> [  +0.001421] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12:
>>>>> 0000000000058a68
>>>>> [  +0.001400] R13: 000000000001629a R14: 0000000000000000 R15:
>>>>> 000000000001629a
>>>>> [  +0.001397] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000)
>>>>> knlGS:0000000000000000
>>>>> [  +0.001411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [  +0.001405] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4:
>>>>> 00000000001706e0
>>>>> [  +0.001422] Call Trace:
>>>>> [  +0.001407]  <TASK>
>>>>> [  +0.001391]  amdgpu_device_rreg+0x17/0x20 [amdgpu]
>>>>> [  +0.001614]  amdgpu_cgs_read_register+0x14/0x20 [amdgpu]
>>>>> [  +0.001735]  phm_wait_for_register_unequal.part.1+0x58/0x90 [amdgpu]
>>>>> [  +0.001790]  phm_wait_for_register_unequal+0x1a/0x30 [amdgpu]
>>>>> [  +0.001800]  vega20_wait_for_response+0x28/0x80 [amdgpu]
>>>>> [  +0.001757]  vega20_send_msg_to_smc_with_parameter+0x21/0x110 
>>>>> [amdgpu]
>>>>> [  +0.001838]  smum_send_msg_to_smc_with_parameter+0xcd/0x100 [amdgpu]
>>>>> [  +0.001829]  ? kvfree+0x1e/0x30
>>>>> [  +0.001462]  vega20_set_power_profile_mode+0x58/0x330 [amdgpu]
>>>>> [  +0.001868]  ? kvfree+0x1e/0x30
>>>>> [  +0.001462]  ? ttm_bo_release+0x261/0x370 [ttm]
>>>>> [  +0.001467]  pp_dpm_switch_power_profile+0xc2/0x170 [amdgpu]
>>>>> [  +0.001863]  amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]
>>>>> [  +0.001866]  amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]
>>>>> [  +0.001784]  kfd_dec_compute_active+0x2c/0x50 [amdgpu]
>>>>> [  +0.001744]  process_termination_cpsch+0x2f9/0x3a0 [amdgpu]
>>>>> [  +0.001728]  kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu]
>>>>> [  +0.001730]  kfd_process_notifier_release+0x91/0xe0 [amdgpu]
>>>>> [  +0.001718]  __mmu_notifier_release+0x77/0x1f0
>>>>> [  +0.001411]  exit_mmap+0x1b5/0x200
>>>>> [  +0.001396]  ? __switch_to+0x12d/0x3e0
>>>>> [  +0.001388]  ? __switch_to_asm+0x36/0x70
>>>>> [  +0.001372]  ? preempt_count_add+0x74/0xc0
>>>>> [  +0.001364]  mmput+0x57/0x110
>>>>> [  +0.001349]  do_exit+0x33d/0xc20
>>>>> [  +0.001337]  ? _raw_spin_unlock+0x1a/0x30
>>>>> [  +0.001346]  do_group_exit+0x43/0xa0
>>>>> [  +0.001341]  get_signal+0x131/0x920
>>>>> [  +0.001295]  arch_do_signal_or_restart+0xb1/0x870
>>>>> [  +0.001303]  ? do_futex+0x125/0x190
>>>>> [  +0.001285]  exit_to_user_mode_prepare+0xb1/0x1c0
>>>>> [  +0.001282]  syscall_exit_to_user_mode+0x2a/0x40
>>>>> [  +0.001264]  do_syscall_64+0x46/0xb0
>>>>> [  +0.001236]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>> [  +0.001219] RIP: 0033:0x7f6aff1d2ad3
>>>>> [  +0.001177] Code: Unable to access opcode bytes at RIP 
>>>>> 0x7f6aff1d2aa9.
>>>>> [  +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS: 00000246 ORIG_RAX:
>>>>> 00000000000000ca
>>>>> [  +0.001170] RAX: fffffffffffffe00 RBX: 0000000004f542b0 RCX:
>>>>> 00007f6aff1d2ad3
>>>>> [  +0.001168] RDX: 0000000000000000 RSI: 0000000000000080 RDI:
>>>>> 0000000004f542d8
>>>>> [  +0.001162] RBP: 0000000004f542d4 R08: 0000000000000000 R09:
>>>>> 0000000000000000
>>>>> [  +0.001152] R10: 0000000000000000 R11: 0000000000000246 R12:
>>>>> 0000000004f542d8
>>>>> [  +0.001176] R13: 0000000000000000 R14: 0000000004f54288 R15:
>>>>> 0000000000000000
>>>>> [  +0.001152]  </TASK>
>>>>> [  +0.001113] Modules linked in: veth amdgpu(E) nf_conntrack_netlink
>>>>> nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM
>>>>> iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack
>>>>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4
>>>>> xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter
>>>>> ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload
>>>>> esp4 xfrm_algo intel_rapl_msr intel_rapl_common sb_edac
>>>>> x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi
>>>>> snd_hda_intel ipmi_ssif snd_intel_dspcfg coretemp snd_hda_codec
>>>>> kvm_intel snd_hda_core snd_hwdep snd_pcm snd_timer snd kvm soundcore
>>>>> irqbypass ftdi_sio usbserial input_leds iTCO_wdt iTCO_vendor_support
>>>>> joydev mei_me rapl lpc_ich intel_cstate mei ipmi_si ipmi_devintf
>>>>> ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel ib_iser
>>>>> rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi
>>>>> scsi_transport_iscsi ip_tables x_tables autofs4 btrfs
>>>>> blake2b_generic zstd_compress raid10 raid456
>>>>> [  +0.000102]  async_raid6_recov async_memcpy async_pq async_xor
>>>>> async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear
>>>>> iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper
>>>>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
>>>>> crct10dif_pclmul hid_generic crc32_pclmul ghash_clmulni_intel usbhid
>>>>> uas aesni_intel crypto_simd igb ahci hid drm usb_storage cryptd
>>>>> libahci dca megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
>>>>> [  +0.016626] CR2: 0000000000058a68
>>>>> [  +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---
>>>>> [  +0.024953] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
>>>>> [  +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f
>>>>> 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0
>>>>> 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c
>>>>> 2e ca 85
>>>>> [  +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
>>>>> [  +0.001641] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX:
>>>>> 00000000ffffffff
>>>>> [  +0.001656] RDX: 0000000000000000 RSI: 000000000001629a RDI:
>>>>> ffff8b0c9c840000
>>>>> [  +0.001681] RBP: ffffb58fac313948 R08: 0000000000000021 R09:
>>>>> 0000000000000001
>>>>> [  +0.001662] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12:
>>>>> 0000000000058a68
>>>>> [  +0.001650] R13: 000000000001629a R14: 0000000000000000 R15:
>>>>> 000000000001629a
>>>>> [  +0.001648] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000)
>>>>> knlGS:0000000000000000
>>>>> [  +0.001668] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [  +0.001673] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4:
>>>>> 00000000001706e0
>>>>> [  +0.001740] Fixing recursive fault but reboot is needed!
>>>>>
>>>>>
>>>>>> On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky
>>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>>
>>>>>> I retested the hotplug tests at the commit I mentioned below - looks
>>>>>> OK. My ASIC is Navi 10; I also tested using Vega 10 and older
>>>>>> Polaris ASICs (whatever I had at home at the time). It's possible
>>>>>> there are extra issues in ASICs like yours which I didn't cover
>>>>>> during my tests.
>>>>>>
>>>>>> andrey@andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test -s 13
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>
>>>>>>
>>>>>> The ASIC NOT support UVD, suite disabled
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>
>>>>>>
>>>>>> The ASIC NOT support VCE, suite disabled
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>
>>>>>>
>>>>>> The ASIC NOT support UVD ENC, suite disabled.
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>
>>>>>>
>>>>>> Don't support TMZ (trust memory zone), security suite disabled
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>> Peer device is not opened or has ASIC not supported by the suite,
>>>>>> skip all Peer to Peer tests.
>>>>>>
>>>>>>
>>>>>> CUnit - A unit testing framework for C - Version 2.1-3
>>>>>> http://cunit.sourceforge.net/
>>>>>>
>>>>>>
>>>>>> *Suite: Hotunplug Tests**
>>>>>> ** Test: Unplug card and rescan the bus to plug it back
>>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory**
>>>>>> **passed**
>>>>>> ** Test: Same as first test but with command submission
>>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory**
>>>>>> **passed**
>>>>>> ** Test: Unplug with exported bo
>>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory**
>>>>>> **passed*
>>>>>>
>>>>>> Run Summary: Type Total Ran Passed Failed Inactive
>>>>>> suites 14 1 n/a 0 0
>>>>>> tests 71 3 3 0 1
>>>>>> asserts 21 21 21 0 n/a
>>>>>>
>>>>>> Elapsed time = 9.195 seconds
>>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>> On 2022-04-20 11:44, Andrey Grodzovsky wrote:
>>>>>>>
>>>>>>> The only issue on the Radeon 7 I see is the same sysfs crash we
>>>>>>> already fixed, so you can use the same fix. The MI200 issue I
>>>>>>> haven't seen yet, but I also haven't tested MI200, so I never saw
>>>>>>> it before. I need to test it when I get the time.
>>>>>>>
>>>>>>> So try that fix with the Radeon 7 again to see if you pass the
>>>>>>> tests (the warnings should all be minor issues).
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>> On 2022-04-20 05:24, Shuotao Xu wrote:
>>>>>>>>>
>>>>>>>>> That's a problem. The latest working baseline I tested and
>>>>>>>>> confirmed passing the hotplug tests is this branch and commit
>>>>>>>>> https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6
>>>>>>>>> which is amd-staging-drm-next. 5.14 was the branch we upstreamed the
>>>>>>>>> hotplug code but it had a lot of regressions over time due to
>>>>>>>>> new changes (that's why I added the hotplug test to try and catch
>>>>>>>>> them early). It would be best to run this branch on the MI100 so we
>>>>>>>>> have a clean baseline, and only after confirming that this
>>>>>>>>> particular branch at this commit passes the libdrm tests should you
>>>>>>>>> start adding the KFD-specific add-ons. Another option, if you can't
>>>>>>>>> work with the MI100 and this branch, is to try a different ASIC that
>>>>>>>>> does work with this branch (if possible).
>>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>> OK, I tried both this commit and the HEAD of amd-staging-drm-next
>>>>>>>> on two GPUs (MI100 and Radeon VII); neither passed the hot-unplug
>>>>>>>> libdrm test. I might be able to gain access to an MI200, but I
>>>>>>>> suspect it would not work either.
>>>>>>>>
>>>>>>>> I copied the complete dmesgs as follows. I highlighted the OOPSES
>>>>>>>> for you.
>>>>>>>>
>>>>>>>> Radeon VII:
>

[-- Attachment #2: Type: text/html, Size: 91745 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-05-11 13:49                                             ` Andrey Grodzovsky
@ 2022-05-11 16:49                                               ` Felix Kuehling
  2022-05-11 17:02                                                 ` Andrey Grodzovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Felix Kuehling @ 2022-05-11 16:49 UTC (permalink / raw)
  To: Andrey Grodzovsky, Shuotao Xu
  Cc: Mukul.Joshi, Peng Cheng, amd-gfx, Lei Qu, Ran Shu, Ziyue Yang

On 2022-05-11 at 09:49, Andrey Grodzovsky wrote:
>
>
>
[snip]
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> index f1a225a20719..4b789bec9670 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> @@ -714,16 +714,37 @@ bool kfd_is_locked(void)
>>>>
>>>> void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>> {
>>>> +       struct kfd_process *p;
>>>> +       struct amdkfd_process_info *p_info;
>>>> +       unsigned int temp;
>>>> +
>>>>        if (!kfd->init_complete)
>>>>                return;
>>>>
>>>>        /* for runtime suspend, skip locking kfd */
>>>> -       if (!run_pm) {
>>>> +       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>>                /* For first KFD device suspend all the KFD processes */
>>>>                if (atomic_inc_return(&kfd_locked) == 1)
>>>>                        kfd_suspend_all_processes();
>>>>        }
>>>>
>>>> +       if (drm_dev_is_unplugged(kfd->ddev)){
>>>> +               int idx = srcu_read_lock(&kfd_processes_srcu);
>>>> +               pr_debug("cancel restore_userptr_work\n");
>>>> +               hash_for_each_rcu(kfd_processes_table, temp, p,
>>>> kfd_processes) {
>>>> +                       if (kfd_process_gpuidx_from_gpuid(p, kfd->id)
>>>> >= 0) {
>>>> +                               p_info = p->kgd_process_info;
>>>> +                               pr_debug("cancel processes, pid = %d
>>>> for gpu_id = %d", pid_nr(p_info->pid), kfd->id);
>>>> + cancel_delayed_work_sync(&p_info->restore_userptr_work);
>>>
>>> Is this really necessary? If it is, there are probably other workers,
>>> e.g. related to our SVM code, that would need to be canceled as well.
>>>
>>
>> I deleted this and it seems to be OK. It was previously added to 
>> suppress restore_userptr_work, which kept updating the PTEs.
>> That is now handled by Fix 3. Please let us know if it is OK :) @Felix

Sounds good to me.


>>
>>>
>>>> +
>>>> + /* send exception signals to the kfd
>>>> events waiting in user space */
>>>> + kfd_signal_hw_exception_event(p->pasid);
>>>
>>> This makes sense. It basically tells user mode that the application's
>>> GPU state is lost due to a RAS error or a GPU reset, or now a GPU
>>> hot-unplug.
>>
>> The problem is that it cannot find an event with a type that matches 
>> HW_EXCEPTION_TYPE, so the driver does **nothing** with the default 
>> parameter value of send_sigterm = false.
>> After all, if a “zombie” process (zombie in the sense that it no 
>> longer has a GPU device) does not exit, KFD resources do not seem to 
>> be released properly, and a new KFD process cannot run after plug-back.
>> (I still need to look hard into the rocr/hsakmt/kfd driver code to 
>> understand the reason. At least I am seeing that the KFD topology 
>> won’t be cleaned up without the process exiting, so there would be a 
>> “zombie” KFD node in the topology, which may or may not cause issues 
>> in hsakmt.)
>> @Felix Do you have a suggestion/insight on this “zombie” process issue? 
>> @Andrey suggests it should be OK to have a “zombie” KFD process and a 
>> “zombie” KFD device, and that a new KFD process should be able to run 
>> on the new KFD device after plug-back.
>
>
> My experience with the graphics stack at least showed that. In a 
> setup with 2 GPUs, if I removed a secondary GPU that had a rendering 
> process on it, I could plug the secondary GPU back and start a new 
> rendering process while the old zombie process was still present. It 
> could be that in the KFD case there are some obstacles to this that 
> need to be resolved.
>
I think this may be related to how KFD is tracking GPU resources. Do we 
actually destroy the KFD device structure when the GPU is unplugged? If 
not, it's still tracking process resource usage of the hanging process. 
This may be a bigger issue here and the solution is probably quite 
involved because of how all the process and device structures are 
related to each other.

Normally the KFD process cleanup is triggered by an MMU notifier when 
the process address space is destroyed. The kfd_process structure is 
also reference counted. I'll need to check if there is a way to 
force-delete the KFD process structure when a GPU is unplugged. That's 
going to be tricky, because of how the KFD process struct ties together 
several GPUs.

Regards,
   Felix


> Andrey
>
>
>>
>> May 11 09:52:07 NETSYS26 kernel: [52604.845400] amdgpu: cancel 
>> restore_userptr_work
>> May 11 09:52:07 NETSYS26 kernel: [52604.845405] amdgpu: sending hw 
>> exception to pasid = 0x800
>> May 11 09:52:07 NETSYS26 kernel: [52604.845414] kfd kfd: amdgpu: 
>> Process 25894 (pasid 0x8001) got unhandled exception
>>
>>>
>>>
>>>> + kfd_signal_vm_fault_event(kfd, p->pasid, NULL);
>>>
>>> This does not make sense. A VM fault indicates an access to a bad
>>> virtual address by the GPU. If a debugger is attached to the process, it
>>> notifies the debugger to investigate what went wrong. If the GPU is
>>> gone, that doesn't make any sense. There is no GPU that could have
>>> issued a bad memory request. And the debugger won't be happy either to
>>> find a VM fault from a GPU that doesn't exist any more.
>>
>> OK understood.
>>
>>>
>>> If the HW-exception event doesn't terminate your process, we may need to
>>> look into how ROCr handles the HW-exception events.
>>>
>>>
>>>> + }
>>>> + }
>>>> + srcu_read_unlock(&kfd_processes_srcu, idx);
>>>> + }
>>>> +
>>>> kfd->dqm->ops.stop(kfd->dqm);
>>>> kfd_iommu_suspend(kfd);
>>>
>>> Should DQM stop and IOMMU suspend still be executed? Or should the
>>> hot-unplug case short-circuit them?
>>
>> I tried short-circuiting them, but that later caused a BUG related to 
>> GPU reset. I added the following, which solves the issue on plug-out.
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index b583026dc893..d78a06d74759 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -5317,7 +5317,8 @@ static void 
>> amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
>>  {
>>         struct amdgpu_recover_work_struct *recover_work = 
>> container_of(work, struct amdgpu_recover_work_struct, base);
>>
>> -       recover_work->ret = 
>> amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
>> +       if (!drm_dev_is_unplugged(adev_to_drm(recover_work->adev)))
>> +               recover_work->ret = 
>> amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
>>  }
>>  /*
>>   * Serialize gpu recover into reset domain single threaded wq
>>
>> However, after killing the zombie process, it failed to evict the 
>> process's queues.
>>
>> [  +0.000002] amdgpu: writing 263 to doorbell address 00000000c86e63f2
>> [  +9.002503] amdgpu: qcm fence wait loop timeout expired
>> [  +0.001364] amdgpu: The cp might be in an unrecoverable state due 
>> to an unsuccessful queues preemption
>> [  +0.001343] amdgpu: Failed to evict process queues
>> [  +0.001355] amdgpu: Failed to evict queues of pasid 0x8001
>>
>>
>> This then causes a driver BUG triggered by a new KFD process after 
>> plug-back. I am pasting the dmesg errors from after plug-back below.
>>
>>
>>
>> May 11 10:25:16 NETSYS26 kernel: [  688.445332] amdgpu: Evicting 
>> PASID 0x8001 queues
>> May 11 10:25:16 NETSYS26 kernel: [  688.445359] BUG: unable to handle 
>> page fault for address: 000000020000006e
>> May 11 10:25:16 NETSYS26 kernel: [  688.447516] #PF: supervisor read 
>> access in kernel mode
>> May 11 10:25:16 NETSYS26 kernel: [  688.449627] #PF: 
>> error_code(0x0000) - not-present page
>> May 11 10:25:16 NETSYS26 kernel: [  688.451661] PGD 80000020892a8067 
>> P4D 80000020892a8067 PUD 0
>> May 11 10:25:16 NETSYS26 kernel: [  688.453741] Oops: 0000 [#1] 
>> PREEMPT SMP PTI
>> May 11 10:25:16 NETSYS26 kernel: [  688.455904] CPU: 25 PID: 9236 
>> Comm: tf_cnn_benchmar Tainted: G        W  OE   5.16.0+ #3
>> May 11 10:25:16 NETSYS26 kernel: [  688.457406] amdgpu 0000:05:00.0: 
>> amdgpu: GPU reset begin!
>> May 11 10:25:16 NETSYS26 kernel: [  688.457798] Hardware name: Dell 
>> Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
>> May 11 10:25:16 NETSYS26 kernel: [  688.461458] RIP: 
>> 0010:evict_process_queues_cpsch+0x99/0x1b0 [amdgpu]
>> May 11 10:25:16 NETSYS26 kernel: [  688.465238] Code: bd 13 8a dd 85 
>> c0 0f 85 13 01 00 00 49 8b 5f 10 4d 8d 77 10 49 39 de 75 11 e9 8d 00 
>> 00 00 48 8b 1b 4c 39 f3 0f 84 81 00 00 00 <80> 7b 6e 00 c6 43 6d 01 
>> 74 ea c6 43 6e 00 41 83 ac 24 70 01 00 00
>> May 11 10:25:16 NETSYS26 kernel: [  688.470516] RSP: 
>> 0018:ffffb2674c8afbf0 EFLAGS: 00010203
>> May 11 10:25:16 NETSYS26 kernel: [  688.473255] RAX: ffff91c65cca3800 
>> RBX: 0000000200000000 RCX: 0000000000000001
>> May 11 10:25:16 NETSYS26 kernel: [  688.475691] RDX: 0000000000000000 
>> RSI: ffffffff9fb712d9 RDI: 00000000ffffffff
>> May 11 10:25:16 NETSYS26 kernel: [  688.478564] RBP: ffffb2674c8afc20 
>> R08: 0000000000000000 R09: 000000000006ba18
>> May 11 10:25:16 NETSYS26 kernel: [  688.481409] R10: 00007fe5a0000000 
>> R11: ffffb2674c8af918 R12: ffff91c66d6f5800
>> May 11 10:25:16 NETSYS26 kernel: [  688.484254] R13: ffff91c66d6f5938 
>> R14: ffff91e5c71ac820 R15: ffff91e5c71ac810
>> May 11 10:25:16 NETSYS26 kernel: [  688.487184] FS: 
>>  00007fe62124a700(0000) GS:ffff92053fd00000(0000) knlGS:0000000000000000
>> May 11 10:25:16 NETSYS26 kernel: [  688.490308] CS:  0010 DS: 0000 
>> ES: 0000 CR0: 0000000080050033
>> May 11 10:25:16 NETSYS26 kernel: [  688.493122] CR2: 000000020000006e 
>> CR3: 0000002095284004 CR4: 00000000001706e0
>> May 11 10:25:16 NETSYS26 kernel: [  688.496142] Call Trace:
>> May 11 10:25:16 NETSYS26 kernel: [  688.499199]  <TASK>
>> May 11 10:25:16 NETSYS26 kernel: [  688.502261] 
>>  kfd_process_evict_queues+0x43/0xf0 [amdgpu]
>> May 11 10:25:16 NETSYS26 kernel: [  688.506378] 
>>  kgd2kfd_quiesce_mm+0x2a/0x60 [amdgpu]
>> May 11 10:25:16 NETSYS26 kernel: [  688.510539] 
>>  amdgpu_amdkfd_evict_userptr+0x46/0x80 [amdgpu]
>> May 11 10:25:16 NETSYS26 kernel: [  688.514110] 
>>  amdgpu_mn_invalidate_hsa+0x9c/0xb0 [amdgpu]
>> May 11 10:25:16 NETSYS26 kernel: [  688.518247] 
>>  __mmu_notifier_invalidate_range_start+0x136/0x1e0
>> May 11 10:25:16 NETSYS26 kernel: [  688.521252] 
>>  change_protection+0x41d/0xcd0
>> May 11 10:25:16 NETSYS26 kernel: [  688.524310] 
>>  change_prot_numa+0x19/0x30
>> May 11 10:25:16 NETSYS26 kernel: [  688.527366] 
>>  task_numa_work+0x1ca/0x330
>> May 11 10:25:16 NETSYS26 kernel: [  688.530157]  task_work_run+0x6c/0xa0
>> May 11 10:25:16 NETSYS26 kernel: [  688.533124] 
>>  exit_to_user_mode_prepare+0x1af/0x1c0
>> May 11 10:25:16 NETSYS26 kernel: [  688.536058] 
>>  syscall_exit_to_user_mode+0x2a/0x40
>> May 11 10:25:16 NETSYS26 kernel: [  688.538989]  do_syscall_64+0x46/0xb0
>> May 11 10:25:16 NETSYS26 kernel: [  688.541830] 
>>  entry_SYSCALL_64_after_hwframe+0x44/0xae
>> May 11 10:25:16 NETSYS26 kernel: [  688.544701] RIP: 0033:0x7fe6585ec317
>> May 11 10:25:16 NETSYS26 kernel: [  688.547297] Code: b3 66 90 48 8b 
>> 05 71 4b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 
>> 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 
>> c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
>> May 11 10:25:16 NETSYS26 kernel: [  688.553183] RSP: 
>> 002b:00007fe6212494c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>> May 11 10:25:16 NETSYS26 kernel: [  688.556105] RAX: ffffffffffffffc2 
>> RBX: 0000000000000000 RCX: 00007fe6585ec317
>> May 11 10:25:16 NETSYS26 kernel: [  688.558970] RDX: 00007fe621249540 
>> RSI: 00000000c0584b02 RDI: 0000000000000003
>> May 11 10:25:16 NETSYS26 kernel: [  688.561950] RBP: 00007fe621249540 
>> R08: 0000000000000000 R09: 0000000000040000
>> May 11 10:25:16 NETSYS26 kernel: [  688.564563] R10: 00007fe617480000 
>> R11: 0000000000000246 R12: 00000000c0584b02
>> May 11 10:25:16 NETSYS26 kernel: [  688.567494] R13: 0000000000000003 
>> R14: 0000000000000064 R15: 00007fe621249920
>> May 11 10:25:16 NETSYS26 kernel: [  688.570470]  </TASK>
>> May 11 10:25:16 NETSYS26 kernel: [  688.573380] Modules linked in: 
>> amdgpu(OE) veth nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype 
>> br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat 
>> nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 
>> ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter 
>> ebtables ip6table_filter ip6_tables iptable_filter overlay 
>> esp6_offload esp6 esp4_offload esp4 xfrm_algo intel_rapl_msr 
>> intel_rapl_common sb_edac snd_hda_codec_hdmi x86_pkg_temp_thermal 
>> snd_hda_intel intel_powerclamp snd_intel_dspcfg ipmi_ssif coretemp 
>> snd_hda_codec kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_timer 
>> snd soundcore ftdi_sio irqbypass rapl intel_cstate usbserial joydev 
>> mei_me input_leds mei iTCO_wdt iTCO_vendor_support lpc_ich ipmi_si 
>> ipmi_devintf mac_hid acpi_power_meter ipmi_msghandler sch_fq_codel 
>> ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi 
>> scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic 
>> zstd_compress raid10 raid456
>> May 11 10:25:16 NETSYS26 kernel: [  688.573543]  async_raid6_recov 
>> async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 
>> raid0 multipath linear iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm 
>> drm_shmem_helper drm_kms_helper syscopyarea hid_generic 
>> crct10dif_pclmul crc32_pclmul sysfillrect ghash_clmulni_intel 
>> sysimgblt fb_sys_fops uas usbhid aesni_intel crypto_simd igb ahci hid 
>> drm usb_storage cryptd libahci dca megaraid_sas i2c_algo_bit wmi 
>> [last unloaded: amdgpu]
>> May 11 10:25:16 NETSYS26 kernel: [  688.611083] CR2: 000000020000006e
>> May 11 10:25:16 NETSYS26 kernel: [  688.614454] ---[ end trace 
>> 349cf28efb6268bc ]---
>>
>> Looking forward to the comments.
>>
>> Regards,
>> Shuotao
>>
>>>
>>> Regards,
>>> Felix
>>>
>>>
>>>> }
>>>>
>>>> Regards,
>>>> Shuotao
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Shuotao
>>>>>>
>>>>>> [  +0.001645] BUG: unable to handle page fault for address:
>>>>>> 0000000000058a68
>>>>>> [  +0.001298] #PF: supervisor read access in kernel mode
>>>>>> [  +0.001252] #PF: error_code(0x0000) - not-present page
>>>>>> [  +0.001248] PGD 8000000115806067 P4D 8000000115806067 PUD
>>>>>> 109b2d067 PMD 0
>>>>>> [  +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI
>>>>>> [  +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: G
>>>>>>   W   E     5.16.0+ #3
>>>>>> [  +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS
>>>>>> 1.5.4 [FPGA Test BIOS] 10/002/2015
>>>>>> [  +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
>>>>>> [  +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f
>>>>>> 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0
>>>>>> 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c
>>>>>> 2e ca 85
>>>>>> [  +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
>>>>>> [  +0.001388] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX:
>>>>>> 00000000ffffffff
>>>>>> [  +0.001402] RDX: 0000000000000000 RSI: 000000000001629a RDI:
>>>>>> ffff8b0c9c840000
>>>>>> [  +0.001418] RBP: ffffb58fac313948 R08: 0000000000000021 R09:
>>>>>> 0000000000000001
>>>>>> [  +0.001421] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12:
>>>>>> 0000000000058a68
>>>>>> [  +0.001400] R13: 000000000001629a R14: 0000000000000000 R15:
>>>>>> 000000000001629a
>>>>>> [  +0.001397] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000)
>>>>>> knlGS:0000000000000000
>>>>>> [  +0.001411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [  +0.001405] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4:
>>>>>> 00000000001706e0
>>>>>> [  +0.001422] Call Trace:
>>>>>> [  +0.001407]  <TASK>
>>>>>> [  +0.001391]  amdgpu_device_rreg+0x17/0x20 [amdgpu]
>>>>>> [  +0.001614]  amdgpu_cgs_read_register+0x14/0x20 [amdgpu]
>>>>>> [  +0.001735]  phm_wait_for_register_unequal.part.1+0x58/0x90 
>>>>>> [amdgpu]
>>>>>> [  +0.001790]  phm_wait_for_register_unequal+0x1a/0x30 [amdgpu]
>>>>>> [  +0.001800]  vega20_wait_for_response+0x28/0x80 [amdgpu]
>>>>>> [  +0.001757]  vega20_send_msg_to_smc_with_parameter+0x21/0x110 
>>>>>> [amdgpu]
>>>>>> [  +0.001838]  smum_send_msg_to_smc_with_parameter+0xcd/0x100 
>>>>>> [amdgpu]
>>>>>> [  +0.001829]  ? kvfree+0x1e/0x30
>>>>>> [  +0.001462]  vega20_set_power_profile_mode+0x58/0x330 [amdgpu]
>>>>>> [  +0.001868]  ? kvfree+0x1e/0x30
>>>>>> [  +0.001462]  ? ttm_bo_release+0x261/0x370 [ttm]
>>>>>> [  +0.001467]  pp_dpm_switch_power_profile+0xc2/0x170 [amdgpu]
>>>>>> [  +0.001863]  amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]
>>>>>> [  +0.001866]  amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]
>>>>>> [  +0.001784]  kfd_dec_compute_active+0x2c/0x50 [amdgpu]
>>>>>> [  +0.001744]  process_termination_cpsch+0x2f9/0x3a0 [amdgpu]
>>>>>> [  +0.001728]  kfd_process_dequeue_from_all_devices+0x49/0x70 
>>>>>> [amdgpu]
>>>>>> [  +0.001730]  kfd_process_notifier_release+0x91/0xe0 [amdgpu]
>>>>>> [  +0.001718]  __mmu_notifier_release+0x77/0x1f0
>>>>>> [  +0.001411]  exit_mmap+0x1b5/0x200
>>>>>> [  +0.001396]  ? __switch_to+0x12d/0x3e0
>>>>>> [  +0.001388]  ? __switch_to_asm+0x36/0x70
>>>>>> [  +0.001372]  ? preempt_count_add+0x74/0xc0
>>>>>> [  +0.001364]  mmput+0x57/0x110
>>>>>> [  +0.001349]  do_exit+0x33d/0xc20
>>>>>> [  +0.001337]  ? _raw_spin_unlock+0x1a/0x30
>>>>>> [  +0.001346]  do_group_exit+0x43/0xa0
>>>>>> [  +0.001341]  get_signal+0x131/0x920
>>>>>> [  +0.001295]  arch_do_signal_or_restart+0xb1/0x870
>>>>>> [  +0.001303]  ? do_futex+0x125/0x190
>>>>>> [  +0.001285]  exit_to_user_mode_prepare+0xb1/0x1c0
>>>>>> [  +0.001282]  syscall_exit_to_user_mode+0x2a/0x40
>>>>>> [  +0.001264]  do_syscall_64+0x46/0xb0
>>>>>> [  +0.001236]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>> [  +0.001219] RIP: 0033:0x7f6aff1d2ad3
>>>>>> [  +0.001177] Code: Unable to access opcode bytes at RIP 
>>>>>> 0x7f6aff1d2aa9.
>>>>>> [  +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS: 00000246 ORIG_RAX:
>>>>>> 00000000000000ca
>>>>>> [  +0.001170] RAX: fffffffffffffe00 RBX: 0000000004f542b0 RCX:
>>>>>> 00007f6aff1d2ad3
>>>>>> [  +0.001168] RDX: 0000000000000000 RSI: 0000000000000080 RDI:
>>>>>> 0000000004f542d8
>>>>>> [  +0.001162] RBP: 0000000004f542d4 R08: 0000000000000000 R09:
>>>>>> 0000000000000000
>>>>>> [  +0.001152] R10: 0000000000000000 R11: 0000000000000246 R12:
>>>>>> 0000000004f542d8
>>>>>> [  +0.001176] R13: 0000000000000000 R14: 0000000004f54288 R15:
>>>>>> 0000000000000000
>>>>>> [  +0.001152]  </TASK>
>>>>>> [  +0.001113] Modules linked in: veth amdgpu(E) nf_conntrack_netlink
>>>>>> nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM
>>>>>> iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack
>>>>>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4
>>>>>> xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter
>>>>>> ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload
>>>>>> esp4 xfrm_algo intel_rapl_msr intel_rapl_common sb_edac
>>>>>> x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi
>>>>>> snd_hda_intel ipmi_ssif snd_intel_dspcfg coretemp snd_hda_codec
>>>>>> kvm_intel snd_hda_core snd_hwdep snd_pcm snd_timer snd kvm soundcore
>>>>>> irqbypass ftdi_sio usbserial input_leds iTCO_wdt iTCO_vendor_support
>>>>>> joydev mei_me rapl lpc_ich intel_cstate mei ipmi_si ipmi_devintf
>>>>>> ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel ib_iser
>>>>>> rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi
>>>>>> scsi_transport_iscsi ip_tables x_tables autofs4 btrfs
>>>>>> blake2b_generic zstd_compress raid10 raid456
>>>>>> [  +0.000102]  async_raid6_recov async_memcpy async_pq async_xor
>>>>>> async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear
>>>>>> iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper
>>>>>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
>>>>>> crct10dif_pclmul hid_generic crc32_pclmul ghash_clmulni_intel usbhid
>>>>>> uas aesni_intel crypto_simd igb ahci hid drm usb_storage cryptd
>>>>>> libahci dca megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
>>>>>> [  +0.016626] CR2: 0000000000058a68
>>>>>> [  +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---
>>>>>> [  +0.024953] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
>>>>>> [  +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f
>>>>>> 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0
>>>>>> 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c
>>>>>> 2e ca 85
>>>>>> [  +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
>>>>>> [  +0.001641] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX:
>>>>>> 00000000ffffffff
>>>>>> [  +0.001656] RDX: 0000000000000000 RSI: 000000000001629a RDI:
>>>>>> ffff8b0c9c840000
>>>>>> [  +0.001681] RBP: ffffb58fac313948 R08: 0000000000000021 R09:
>>>>>> 0000000000000001
>>>>>> [  +0.001662] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12:
>>>>>> 0000000000058a68
>>>>>> [  +0.001650] R13: 000000000001629a R14: 0000000000000000 R15:
>>>>>> 000000000001629a
>>>>>> [  +0.001648] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000)
>>>>>> knlGS:0000000000000000
>>>>>> [  +0.001668] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [  +0.001673] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4:
>>>>>> 00000000001706e0
>>>>>> [  +0.001740] Fixing recursive fault but reboot is needed!
>>>>>>
>>>>>>
>>>>>>> On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky
>>>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>>>
>>>>>>> I retested hot plug tests at the commit I mentioned below - looks
>>>>>>> OK. My ASIC is Navi 10; I also tested using Vega 10 and older
>>>>>>> Polaris ASICs (whatever I had at home at the time). It's possible
>>>>>>> there are extra issues in ASICs like yours which I didn't cover
>>>>>>> during tests.
>>>>>>>
>>>>>>> andrey@andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test 
>>>>>>> -s 13
>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>
>>>>>>>
>>>>>>> The ASIC NOT support UVD, suite disabled
>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>
>>>>>>>
>>>>>>> The ASIC NOT support VCE, suite disabled
>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>
>>>>>>>
>>>>>>> The ASIC NOT support UVD ENC, suite disabled.
>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>
>>>>>>>
>>>>>>> Don't support TMZ (trust memory zone), security suite disabled
>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>> Peer device is not opened or has ASIC not supported by the suite,
>>>>>>> skip all Peer to Peer tests.
>>>>>>>
>>>>>>>
>>>>>>> CUnit - A unit testing framework for C - Version 2.1-3
>>>>>>> http://cunit.sourceforge.net/
>>>>>>>
>>>>>>>
>>>>>>> Suite: Hotunplug Tests
>>>>>>>   Test: Unplug card and rescan the bus to plug it back
>>>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>> passed
>>>>>>>   Test: Same as first test but with command submission
>>>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>> passed
>>>>>>>   Test: Unplug with exported bo
>>>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>> passed
>>>>>>>
>>>>>>> Run Summary:  Type     Total  Ran  Passed  Failed  Inactive
>>>>>>>               suites      14    1     n/a       0         0
>>>>>>>               tests       71    3       3       0         1
>>>>>>>               asserts     21   21      21       0       n/a
>>>>>>>
>>>>>>> Elapsed time = 9.195 seconds
>>>>>>>
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>> On 2022-04-20 11:44, Andrey Grodzovsky wrote:
>>>>>>>>
>>>>>>>> The only one on Radeon 7 I see is the same sysfs crash we already
>>>>>>>> fixed, so you can use the same fix. The MI200 issue I haven't seen
>>>>>>>> yet, but I also haven't tested MI200, so I never saw it before.
>>>>>>>> Need to test when I get the time.
>>>>>>>>
>>>>>>>> So try that fix with Radeon 7 again to see if you pass the tests
>>>>>>>> (the warnings should all be minor issues).
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2022-04-20 05:24, Shuotao Xu wrote:
>>>>>>>>>>
>>>>>>>>>> That's a problem. The latest working baseline I tested and
>>>>>>>>>> confirmed passing hotplug tests is this branch and commit
>>>>>>>>>> https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6
>>>>>>>>>> which is amd-staging-drm-next. 5.14 was the branch we upstreamed the
>>>>>>>>>> hotplug code but it had a lot of regressions over time due to
>>>>>>>>>> new changes (that why I added the hotplug test to try and catch
>>>>>>>>>> them early). It would be best to run this branch on mi-100 so we
>>>>>>>>>> have a clean baseline and only after confirming this particular
>>>>>>>>>> branch from this commits passes libdrm tests only then start
>>>>>>>>>> adding the KFD specific addons. Another option if you can't work
>>>>>>>>>> with MI-100 and this branch is to try a different ASIC that does
>>>>>>>>>> work with this branch (if possible).
>>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>> OK, I tried both this commit and the HEAD of amd-staging-drm-next
>>>>>>>>> on two GPUs (MI100 and Radeon VII); both did not pass the
>>>>>>>>> hot-unplug libdrm test. I might be able to gain access to an
>>>>>>>>> MI200, but I suspect it would work.
>>>>>>>>>
>>>>>>>>> I copied the complete dmesgs as follows. I highlighted the OOPSES
>>>>>>>>> for you.
>>>>>>>>>
>>>>>>>>> Radeon VII:


* Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
  2022-05-11 16:49                                               ` Felix Kuehling
@ 2022-05-11 17:02                                                 ` Andrey Grodzovsky
  0 siblings, 0 replies; 31+ messages in thread
From: Andrey Grodzovsky @ 2022-05-11 17:02 UTC (permalink / raw)
  To: Felix Kuehling, Shuotao Xu
  Cc: Mukul.Joshi, Peng Cheng, amd-gfx, Lei Qu, Ran Shu, Ziyue Yang


On 2022-05-11 12:49, Felix Kuehling wrote:
> Am 2022-05-11 um 09:49 schrieb Andrey Grodzovsky:
>>
>>
>>
> [snip]
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> index f1a225a20719..4b789bec9670 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> @@ -714,16 +714,37 @@ bool kfd_is_locked(void)
>>>>>
>>>>> void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>>> {
>>>>> +       struct kfd_process *p;
>>>>> +       struct amdkfd_process_info *p_info;
>>>>> +       unsigned int temp;
>>>>> +
>>>>>        if (!kfd->init_complete)
>>>>>                return;
>>>>>
>>>>>        /* for runtime suspend, skip locking kfd */
>>>>> -       if (!run_pm) {
>>>>> +       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>>>                /* For first KFD device suspend all the KFD processes */
>>>>>                if (atomic_inc_return(&kfd_locked) == 1)
>>>>>                        kfd_suspend_all_processes();
>>>>>        }
>>>>>
>>>>> +       if (drm_dev_is_unplugged(kfd->ddev)) {
>>>>> +               int idx = srcu_read_lock(&kfd_processes_srcu);
>>>>> +
>>>>> +               pr_debug("cancel restore_userptr_work\n");
>>>>> +               hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>>>>> +                       if (kfd_process_gpuidx_from_gpuid(p, kfd->id) >= 0) {
>>>>> +                               p_info = p->kgd_process_info;
>>>>> +                               pr_debug("cancel processes, pid = %d for gpu_id = %d", pid_nr(p_info->pid), kfd->id);
>>>>> +                               cancel_delayed_work_sync(&p_info->restore_userptr_work);
>>>>
>>>> Is this really necessary? If it is, there are probably other workers,
>>>> e.g. related to our SVM code, that would need to be canceled as well.
>>>>
>>>
>>> I deleted this and it seems to be OK. It was previously added to 
>>> suppress restore_userptr_work, which keeps updating PTEs.
>>> Now this is gone with Fix 3. Please let us know if it is OK :) @Felix
>
> Sounds good to me.
>
>
>>>
>>>>
>>>>> +
>>>>> +                               /* send exception signals to the kfd
>>>>> +                                * events waiting in user space */
>>>>> +                               kfd_signal_hw_exception_event(p->pasid);
>>>>
>>>> This makes sense. It basically tells user mode that the application's
>>>> GPU state is lost due to a RAS error or a GPU reset, or now a GPU
>>>> hot-unplug.
>>>
>>> The problem is that it cannot find an event with a type that matches 
>>> HW_EXCEPTION_TYPE, so the driver does **nothing** with the default 
>>> parameter value of send_sigterm = false.
>>> After all, if a “zombie” process (zombie in the sense that it no 
>>> longer has a GPU dev) does not exit, kfd resources do not seem to be 
>>> released properly, and a new kfd process cannot run after plug-back.
>>> (I still need to look hard into the rocr/hsakmt/kfd driver code to 
>>> understand the reason. At least I am seeing that the kfd topology 
>>> won’t be cleaned up without the process exiting, so there would be a 
>>> “zombie” kfd node in the topology, which may or may not cause issues 
>>> in hsakmt.)
>>> @Felix Do you have suggestions/insight on this “zombie” process 
>>> issue? @Andrey suggests it should be OK to have a “zombie” kfd 
>>> process and a “zombie” kfd dev, and the new kfd process should be OK 
>>> to run on the new kfd dev after plug-back.
>>
>>
>> My experience with the graphics stack at least showed that this works: 
>> in a setup with 2 GPUs, if I removed a secondary GPU which had a 
>> rendering process on it, I could plug back the secondary GPU and 
>> start a new rendering process while the old zombie process was still 
>> present. It could be that in the KFD case there are some obstacles to 
>> this that need to be resolved.
>>
> I think this may be related to how KFD is tracking GPU resources. Do 
> we actually destroy the KFD device structure when the GPU is unplugged?


No, all the device hierarchy (drm_device, amdgpu_device and hence, I 
assume, kfd_dev) is kept around until the last drm_put drops the 
refcount to 0 - which happens when the process dies and drops its drm 
file descriptor.


> If not, it's still tracking process resource usage of the hanging 
> process. This may be a bigger issue here and the solution is probably 
> quite involved because of how all the process and device structures 
> are related to each other.
>
> Normally the KFD process cleanup is triggered by an MMU notifier when 
> the process address space is destroyed. 


Note that the only thing we do is invalidate all MMIO mappings within 
all the processes that have the GPU mapped into their address space 
(amdgpu_pci_remove->...->amdgpu_device_unmap_mmio) - this prevents the 
zombie process from subsequently writing into physical addresses that 
are no longer assigned to the removed GPU.

Andrey


> The kfd_process structure is also reference counted. I'll need to 
> check if there is a way to force-delete the KFD process structure when 
> a GPU is unplugged. That's going to be tricky, because of how the KFD 
> process struct ties together several GPUs.
>
> Regards,
>   Felix
>
>
>> Andrey
>>
>>
>>>
>>> May 11 09:52:07 NETSYS26 kernel: [52604.845400] amdgpu: cancel 
>>> restore_userptr_work
>>> May 11 09:52:07 NETSYS26 kernel: [52604.845405] amdgpu: sending hw 
>>> exception to pasid = 0x800
>>> May 11 09:52:07 NETSYS26 kernel: [52604.845414] kfd kfd: amdgpu: 
>>> Process 25894 (pasid 0x8001) got unhandled exception
>>>
>>>>
>>>>
>>>>> +                               kfd_signal_vm_fault_event(kfd, p->pasid, NULL);
>>>>
>>>> This does not make sense. A VM fault indicates an access to a bad
>>>> virtual address by the GPU. If a debugger is attached to the 
>>>> process, it
>>>> notifies the debugger to investigate what went wrong. If the GPU is
>>>> gone, that doesn't make any sense. There is no GPU that could have
>>>> issued a bad memory request. And the debugger won't be happy either to
>>>> find a VM fault from a GPU that doesn't exist any more.
>>>
>>> OK understood.
>>>
>>>>
>>>> If the HW-exception event doesn't terminate your process, we may 
>>>> need to
>>>> look into how ROCr handles the HW-exception events.
>>>>
>>>>
>>>>> +                       }
>>>>> +               }
>>>>> +               srcu_read_unlock(&kfd_processes_srcu, idx);
>>>>> +       }
>>>>> +
>>>>>        kfd->dqm->ops.stop(kfd->dqm);
>>>>>        kfd_iommu_suspend(kfd);
>>>>
>>>> Should DQM stop and IOMMU suspend still be executed? Or should the
>>>> hot-unplug case short-circuit them?
>>>
>>> I tried short-circuiting them, but that later caused a BUG related to 
>>> GPU reset. I added the following, which solves the issue on plug-out.
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index b583026dc893..d78a06d74759 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -5317,7 +5317,8 @@ static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
>>>  {
>>>         struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
>>>
>>> -       recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
>>> +       if (!drm_dev_is_unplugged(adev_to_drm(recover_work->adev)))
>>> +               recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
>>>  }
>>>  /*
>>>   * Serialize gpu recover into reset domain single threaded wq
>>>
>>> However, after killing the zombie process, it failed to evict the 
>>> process's queues.
>>>
>>> [  +0.000002] amdgpu: writing 263 to doorbell address 00000000c86e63f2
>>> [  +9.002503] amdgpu: qcm fence wait loop timeout expired
>>> [  +0.001364] amdgpu: The cp might be in an unrecoverable state due 
>>> to an unsuccessful queues preemption
>>> [  +0.001343] amdgpu: Failed to evict process queues
>>> [  +0.001355] amdgpu: Failed to evict queues of pasid 0x8001
>>>
>>>
>>> This would cause a driver BUG triggered by a new kfd process after 
>>> plug-back. I am pasting the dmesg errors after plug-back below.
>>>
>>>
>>>
>>> May 11 10:25:16 NETSYS26 kernel: [  688.445332] amdgpu: Evicting 
>>> PASID 0x8001 queues
>>> May 11 10:25:16 NETSYS26 kernel: [  688.445359] BUG: unable to 
>>> handle page fault for address: 000000020000006e
>>> May 11 10:25:16 NETSYS26 kernel: [  688.447516] #PF: supervisor read 
>>> access in kernel mode
>>> May 11 10:25:16 NETSYS26 kernel: [  688.449627] #PF: 
>>> error_code(0x0000) - not-present page
>>> May 11 10:25:16 NETSYS26 kernel: [  688.451661] PGD 80000020892a8067 
>>> P4D 80000020892a8067 PUD 0
>>> May 11 10:25:16 NETSYS26 kernel: [  688.453741] Oops: 0000 [#1] 
>>> PREEMPT SMP PTI
>>> May 11 10:25:16 NETSYS26 kernel: [  688.455904] CPU: 25 PID: 9236 
>>> Comm: tf_cnn_benchmar Tainted: G        W  OE   5.16.0+ #3
>>> May 11 10:25:16 NETSYS26 kernel: [  688.457406] amdgpu 0000:05:00.0: 
>>> amdgpu: GPU reset begin!
>>> May 11 10:25:16 NETSYS26 kernel: [  688.457798] Hardware name: Dell 
>>> Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
>>> May 11 10:25:16 NETSYS26 kernel: [  688.461458] RIP: 
>>> 0010:evict_process_queues_cpsch+0x99/0x1b0 [amdgpu]
>>> May 11 10:25:16 NETSYS26 kernel: [  688.465238] Code: bd 13 8a dd 85 
>>> c0 0f 85 13 01 00 00 49 8b 5f 10 4d 8d 77 10 49 39 de 75 11 e9 8d 00 
>>> 00 00 48 8b 1b 4c 39 f3 0f 84 81 00 00 00 <80> 7b 6e 00 c6 43 6d 01 
>>> 74 ea c6 43 6e 00 41 83 ac 24 70 01 00 00
>>> May 11 10:25:16 NETSYS26 kernel: [  688.470516] RSP: 
>>> 0018:ffffb2674c8afbf0 EFLAGS: 00010203
>>> May 11 10:25:16 NETSYS26 kernel: [  688.473255] RAX: 
>>> ffff91c65cca3800 RBX: 0000000200000000 RCX: 0000000000000001
>>> May 11 10:25:16 NETSYS26 kernel: [  688.475691] RDX: 
>>> 0000000000000000 RSI: ffffffff9fb712d9 RDI: 00000000ffffffff
>>> May 11 10:25:16 NETSYS26 kernel: [  688.478564] RBP: 
>>> ffffb2674c8afc20 R08: 0000000000000000 R09: 000000000006ba18
>>> May 11 10:25:16 NETSYS26 kernel: [  688.481409] R10: 
>>> 00007fe5a0000000 R11: ffffb2674c8af918 R12: ffff91c66d6f5800
>>> May 11 10:25:16 NETSYS26 kernel: [  688.484254] R13: 
>>> ffff91c66d6f5938 R14: ffff91e5c71ac820 R15: ffff91e5c71ac810
>>> May 11 10:25:16 NETSYS26 kernel: [  688.487184] FS: 
>>>  00007fe62124a700(0000) GS:ffff92053fd00000(0000) 
>>> knlGS:0000000000000000
>>> May 11 10:25:16 NETSYS26 kernel: [  688.490308] CS:  0010 DS: 0000 
>>> ES: 0000 CR0: 0000000080050033
>>> May 11 10:25:16 NETSYS26 kernel: [  688.493122] CR2: 
>>> 000000020000006e CR3: 0000002095284004 CR4: 00000000001706e0
>>> May 11 10:25:16 NETSYS26 kernel: [  688.496142] Call Trace:
>>> May 11 10:25:16 NETSYS26 kernel: [  688.499199]  <TASK>
>>> May 11 10:25:16 NETSYS26 kernel: [  688.502261] 
>>>  kfd_process_evict_queues+0x43/0xf0 [amdgpu]
>>> May 11 10:25:16 NETSYS26 kernel: [  688.506378] 
>>>  kgd2kfd_quiesce_mm+0x2a/0x60 [amdgpu]
>>> May 11 10:25:16 NETSYS26 kernel: [  688.510539] 
>>>  amdgpu_amdkfd_evict_userptr+0x46/0x80 [amdgpu]
>>> May 11 10:25:16 NETSYS26 kernel: [  688.514110] 
>>>  amdgpu_mn_invalidate_hsa+0x9c/0xb0 [amdgpu]
>>> May 11 10:25:16 NETSYS26 kernel: [  688.518247] 
>>>  __mmu_notifier_invalidate_range_start+0x136/0x1e0
>>> May 11 10:25:16 NETSYS26 kernel: [  688.521252] 
>>>  change_protection+0x41d/0xcd0
>>> May 11 10:25:16 NETSYS26 kernel: [  688.524310] 
>>>  change_prot_numa+0x19/0x30
>>> May 11 10:25:16 NETSYS26 kernel: [  688.527366] 
>>>  task_numa_work+0x1ca/0x330
>>> May 11 10:25:16 NETSYS26 kernel: [  688.530157] 
>>>  task_work_run+0x6c/0xa0
>>> May 11 10:25:16 NETSYS26 kernel: [  688.533124] 
>>>  exit_to_user_mode_prepare+0x1af/0x1c0
>>> May 11 10:25:16 NETSYS26 kernel: [  688.536058] 
>>>  syscall_exit_to_user_mode+0x2a/0x40
>>> May 11 10:25:16 NETSYS26 kernel: [  688.538989] 
>>>  do_syscall_64+0x46/0xb0
>>> May 11 10:25:16 NETSYS26 kernel: [  688.541830] 
>>>  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> May 11 10:25:16 NETSYS26 kernel: [  688.544701] RIP: 
>>> 0033:0x7fe6585ec317
>>> May 11 10:25:16 NETSYS26 kernel: [  688.547297] Code: b3 66 90 48 8b 
>>> 05 71 4b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 
>>> 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 
>>> c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
>>> May 11 10:25:16 NETSYS26 kernel: [  688.553183] RSP: 
>>> 002b:00007fe6212494c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>> May 11 10:25:16 NETSYS26 kernel: [  688.556105] RAX: 
>>> ffffffffffffffc2 RBX: 0000000000000000 RCX: 00007fe6585ec317
>>> May 11 10:25:16 NETSYS26 kernel: [  688.558970] RDX: 
>>> 00007fe621249540 RSI: 00000000c0584b02 RDI: 0000000000000003
>>> May 11 10:25:16 NETSYS26 kernel: [  688.561950] RBP: 
>>> 00007fe621249540 R08: 0000000000000000 R09: 0000000000040000
>>> May 11 10:25:16 NETSYS26 kernel: [  688.564563] R10: 
>>> 00007fe617480000 R11: 0000000000000246 R12: 00000000c0584b02
>>> May 11 10:25:16 NETSYS26 kernel: [  688.567494] R13: 
>>> 0000000000000003 R14: 0000000000000064 R15: 00007fe621249920
>>> May 11 10:25:16 NETSYS26 kernel: [  688.570470]  </TASK>
>>> May 11 10:25:16 NETSYS26 kernel: [  688.573380] Modules linked in: 
>>> amdgpu(OE) veth nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype 
>>> br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat 
>>> nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 
>>> ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter 
>>> ebtables ip6table_filter ip6_tables iptable_filter overlay 
>>> esp6_offload esp6 esp4_offload esp4 xfrm_algo intel_rapl_msr 
>>> intel_rapl_common sb_edac snd_hda_codec_hdmi x86_pkg_temp_thermal 
>>> snd_hda_intel intel_powerclamp snd_intel_dspcfg ipmi_ssif coretemp 
>>> snd_hda_codec kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_timer 
>>> snd soundcore ftdi_sio irqbypass rapl intel_cstate usbserial joydev 
>>> mei_me input_leds mei iTCO_wdt iTCO_vendor_support lpc_ich ipmi_si 
>>> ipmi_devintf mac_hid acpi_power_meter ipmi_msghandler sch_fq_codel 
>>> ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi 
>>> scsi_transport_iscsi ip_tables x_tables autofs4 btrfs 
>>> blake2b_generic zstd_compress raid10 raid456
>>> May 11 10:25:16 NETSYS26 kernel: [  688.573543]  async_raid6_recov 
>>> async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c 
>>> raid1 raid0 multipath linear iommu_v2 gpu_sched drm_ttm_helper 
>>> mgag200 ttm drm_shmem_helper drm_kms_helper syscopyarea hid_generic 
>>> crct10dif_pclmul crc32_pclmul sysfillrect ghash_clmulni_intel 
>>> sysimgblt fb_sys_fops uas usbhid aesni_intel crypto_simd igb ahci 
>>> hid drm usb_storage cryptd libahci dca megaraid_sas i2c_algo_bit wmi 
>>> [last unloaded: amdgpu]
>>> May 11 10:25:16 NETSYS26 kernel: [  688.611083] CR2: 000000020000006e
>>> May 11 10:25:16 NETSYS26 kernel: [  688.614454] ---[ end trace 
>>> 349cf28efb6268bc ]---
>>>
>>> Looking forward to the comments.
>>>
>>> Regards,
>>> Shuotao
>>>
>>>>
>>>> Regards,
>>>> Felix
>>>>
>>>>
>>>>> }
>>>>>
>>>>> Regards,
>>>>> Shuotao
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Shuotao
>>>>>>>
>>>>>>> [  +0.001645] BUG: unable to handle page fault for address: 0000000000058a68
>>>>>>> [  +0.001298] #PF: supervisor read access in kernel mode
>>>>>>> [  +0.001252] #PF: error_code(0x0000) - not-present page
>>>>>>> [  +0.001248] PGD 8000000115806067 P4D 8000000115806067 PUD 109b2d067 PMD 0
>>>>>>> [  +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI
>>>>>>> [  +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: G   W   E     5.16.0+ #3
>>>>>>> [  +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
>>>>>>> [  +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
>>>>>>> [  +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
>>>>>>> [  +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
>>>>>>> [  +0.001388] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 00000000ffffffff
>>>>>>> [  +0.001402] RDX: 0000000000000000 RSI: 000000000001629a RDI: ffff8b0c9c840000
>>>>>>> [  +0.001418] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 0000000000000001
>>>>>>> [  +0.001421] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 0000000000058a68
>>>>>>> [  +0.001400] R13: 000000000001629a R14: 0000000000000000 R15: 000000000001629a
>>>>>>> [  +0.001397] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) knlGS:0000000000000000
>>>>>>> [  +0.001411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>> [  +0.001405] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 00000000001706e0
>>>>>>> [  +0.001422] Call Trace:
>>>>>>> [  +0.001407]  <TASK>
>>>>>>> [  +0.001391]  amdgpu_device_rreg+0x17/0x20 [amdgpu]
>>>>>>> [  +0.001614]  amdgpu_cgs_read_register+0x14/0x20 [amdgpu]
>>>>>>> [  +0.001735]  phm_wait_for_register_unequal.part.1+0x58/0x90 [amdgpu]
>>>>>>> [  +0.001790]  phm_wait_for_register_unequal+0x1a/0x30 [amdgpu]
>>>>>>> [  +0.001800]  vega20_wait_for_response+0x28/0x80 [amdgpu]
>>>>>>> [  +0.001757]  vega20_send_msg_to_smc_with_parameter+0x21/0x110 [amdgpu]
>>>>>>> [  +0.001838]  smum_send_msg_to_smc_with_parameter+0xcd/0x100 [amdgpu]
>>>>>>> [  +0.001829]  ? kvfree+0x1e/0x30
>>>>>>> [  +0.001462]  vega20_set_power_profile_mode+0x58/0x330 [amdgpu]
>>>>>>> [  +0.001868]  ? kvfree+0x1e/0x30
>>>>>>> [  +0.001462]  ? ttm_bo_release+0x261/0x370 [ttm]
>>>>>>> [  +0.001467]  pp_dpm_switch_power_profile+0xc2/0x170 [amdgpu]
>>>>>>> [  +0.001863]  amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]
>>>>>>> [  +0.001866]  amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]
>>>>>>> [  +0.001784]  kfd_dec_compute_active+0x2c/0x50 [amdgpu]
>>>>>>> [  +0.001744]  process_termination_cpsch+0x2f9/0x3a0 [amdgpu]
>>>>>>> [  +0.001728]  kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu]
>>>>>>> [  +0.001730]  kfd_process_notifier_release+0x91/0xe0 [amdgpu]
>>>>>>> [  +0.001718]  __mmu_notifier_release+0x77/0x1f0
>>>>>>> [  +0.001411]  exit_mmap+0x1b5/0x200
>>>>>>> [  +0.001396]  ? __switch_to+0x12d/0x3e0
>>>>>>> [  +0.001388]  ? __switch_to_asm+0x36/0x70
>>>>>>> [  +0.001372]  ? preempt_count_add+0x74/0xc0
>>>>>>> [  +0.001364]  mmput+0x57/0x110
>>>>>>> [  +0.001349]  do_exit+0x33d/0xc20
>>>>>>> [  +0.001337]  ? _raw_spin_unlock+0x1a/0x30
>>>>>>> [  +0.001346]  do_group_exit+0x43/0xa0
>>>>>>> [  +0.001341]  get_signal+0x131/0x920
>>>>>>> [  +0.001295]  arch_do_signal_or_restart+0xb1/0x870
>>>>>>> [  +0.001303]  ? do_futex+0x125/0x190
>>>>>>> [  +0.001285]  exit_to_user_mode_prepare+0xb1/0x1c0
>>>>>>> [  +0.001282]  syscall_exit_to_user_mode+0x2a/0x40
>>>>>>> [  +0.001264]  do_syscall_64+0x46/0xb0
>>>>>>> [  +0.001236]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>>> [  +0.001219] RIP: 0033:0x7f6aff1d2ad3
>>>>>>> [  +0.001177] Code: Unable to access opcode bytes at RIP 0x7f6aff1d2aa9.
>>>>>>> [  +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
>>>>>>> [  +0.001170] RAX: fffffffffffffe00 RBX: 0000000004f542b0 RCX: 00007f6aff1d2ad3
>>>>>>> [  +0.001168] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000004f542d8
>>>>>>> [  +0.001162] RBP: 0000000004f542d4 R08: 0000000000000000 R09: 0000000000000000
>>>>>>> [  +0.001152] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000004f542d8
>>>>>>> [  +0.001176] R13: 0000000000000000 R14: 0000000004f54288 R15: 0000000000000000
>>>>>>> [  +0.001152]  </TASK>
>>>>>>> [  +0.001113] Modules linked in: veth amdgpu(E) nf_conntrack_netlink
>>>>>>> nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM
>>>>>>> iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack
>>>>>>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4
>>>>>>> xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter
>>>>>>> ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload
>>>>>>> esp4 xfrm_algo intel_rapl_msr intel_rapl_common sb_edac
>>>>>>> x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi
>>>>>>> snd_hda_intel ipmi_ssif snd_intel_dspcfg coretemp snd_hda_codec
>>>>>>> kvm_intel snd_hda_core snd_hwdep snd_pcm snd_timer snd kvm soundcore
>>>>>>> irqbypass ftdi_sio usbserial input_leds iTCO_wdt iTCO_vendor_support
>>>>>>> joydev mei_me rapl lpc_ich intel_cstate mei ipmi_si ipmi_devintf
>>>>>>> ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel ib_iser
>>>>>>> rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi
>>>>>>> scsi_transport_iscsi ip_tables x_tables autofs4 btrfs
>>>>>>> blake2b_generic zstd_compress raid10 raid456
>>>>>>> [  +0.000102]  async_raid6_recov async_memcpy async_pq async_xor
>>>>>>> async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear
>>>>>>> iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper
>>>>>>> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
>>>>>>> crct10dif_pclmul hid_generic crc32_pclmul ghash_clmulni_intel usbhid
>>>>>>> uas aesni_intel crypto_simd igb ahci hid drm usb_storage cryptd
>>>>>>> libahci dca megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
>>>>>>> [  +0.016626] CR2: 0000000000058a68
>>>>>>> [  +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---
>>>>>>> [  +0.024953] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
>>>>>>> [  +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
>>>>>>> [  +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
>>>>>>> [  +0.001641] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 00000000ffffffff
>>>>>>> [  +0.001656] RDX: 0000000000000000 RSI: 000000000001629a RDI: ffff8b0c9c840000
>>>>>>> [  +0.001681] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 0000000000000001
>>>>>>> [  +0.001662] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 0000000000058a68
>>>>>>> [  +0.001650] R13: 000000000001629a R14: 0000000000000000 R15: 000000000001629a
>>>>>>> [  +0.001648] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) knlGS:0000000000000000
>>>>>>> [  +0.001668] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>> [  +0.001673] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 00000000001706e0
>>>>>>> [  +0.001740] Fixing recursive fault but reboot is needed!
>>>>>>>
>>>>>>>
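[Editor's note: the fault above dereferences freed device state — amdgpu_device_rreg() is reached from kfd_process_notifier_release() after the GPU has already been hot-unplugged. One guard pattern amdgpu uses for exactly this class of bug (a sketch for illustration, not necessarily the fix this thread converges on; the wrapper name below is hypothetical) is bracketing MMIO access with drm_dev_enter()/drm_dev_exit(), which fails once drm_dev_unplug() has run:]

```c
/* Sketch: skip MMIO once the device is unplugged instead of faulting.
 * drm_dev_enter()/drm_dev_exit() come from <drm/drm_drv.h>; the wrapper
 * name and the early-return value of 0 are illustrative assumptions. */
static uint32_t amdgpu_rreg_guarded(struct amdgpu_device *adev, uint32_t reg)
{
	uint32_t val = 0;
	int idx;

	/* Returns false after drm_dev_unplug(): device is gone. */
	if (!drm_dev_enter(adev_to_drm(adev), &idx))
		return 0;

	val = readl(((void __iomem *)adev->rmmio) + (reg * 4));

	drm_dev_exit(idx);
	return val;
}
```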
>>>>>>>> On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky
>>>>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>>>>
>>>>>>>> I retested hot plug tests at the commit I mentioned below - looks
>>>>>>>> ok. My ASIC is Navi 10; I also tested using Vega 10 and older
>>>>>>>> Polaris ASICs (whatever I had at home at the time). It's possible
>>>>>>>> there are extra issues in ASICs like yours which I didn't cover
>>>>>>>> during tests.
>>>>>>>>
>>>>>>>> andrey@andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test -s 13
>>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>>
>>>>>>>>
>>>>>>>> The ASIC NOT support UVD, suite disabled
>>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>>
>>>>>>>>
>>>>>>>> The ASIC NOT support VCE, suite disabled
>>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>>
>>>>>>>>
>>>>>>>> The ASIC NOT support UVD ENC, suite disabled.
>>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>>
>>>>>>>>
>>>>>>>> Don't support TMZ (trust memory zone), security suite disabled
>>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>> Peer device is not opened or has ASIC not supported by the suite,
>>>>>>>> skip all Peer to Peer tests.
>>>>>>>>
>>>>>>>>
>>>>>>>> CUnit - A unit testing framework for C - Version 2.1-3
>>>>>>>> http://cunit.sourceforge.net/
>>>>>>>>
>>>>>>>>
>>>>>>>> Suite: Hotunplug Tests
>>>>>>>>   Test: Unplug card and rescan the bus to plug it back
>>>>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>> passed
>>>>>>>>   Test: Same as first test but with command submission
>>>>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>> passed
>>>>>>>>   Test: Unplug with exported bo
>>>>>>>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>>>>>>> passed
>>>>>>>>
>>>>>>>> Run Summary:    Type  Total    Ran Passed Failed Inactive
>>>>>>>>               suites     14      1    n/a      0        0
>>>>>>>>                tests     71      3      3      0        1
>>>>>>>>              asserts     21     21     21      0      n/a
>>>>>>>>
>>>>>>>> Elapsed time = 9.195 seconds
>>>>>>>>
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>> On 2022-04-20 11:44, Andrey Grodzovsky wrote:
>>>>>>>>>
>>>>>>>>> The only one in Radeon 7 I see is the same sysfs crash we already
>>>>>>>>> fixed, so you can use the same fix. The MI200 issue I haven't seen
>>>>>>>>> yet, but I also haven't tested MI200, so I never saw it before. I
>>>>>>>>> need to test when I get the time.
>>>>>>>>>
>>>>>>>>> So try that fix with Radeon 7 again to see if you pass the tests
>>>>>>>>> (the warnings should all be minor issues).
>>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2022-04-20 05:24, Shuotao Xu wrote:
>>>>>>>>>>>
>>>>>>>>>>> That's a problem. The latest working baseline I tested and confirmed
>>>>>>>>>>> passing hotplug tests is this branch and commit
>>>>>>>>>>> https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6
>>>>>>>>>>> which is amd-staging-drm-next. 5.14 was the branch where we upstreamed
>>>>>>>>>>> the hotplug code, but it had a lot of regressions over time due to
>>>>>>>>>>> new changes (that's why I added the hotplug test, to try and catch
>>>>>>>>>>> them early). It would be best to run this branch on MI-100 so we
>>>>>>>>>>> have a clean baseline, and only after confirming this particular
>>>>>>>>>>> branch at this commit passes the libdrm tests should you start
>>>>>>>>>>> adding the KFD-specific addons. Another option, if you can't work
>>>>>>>>>>> with MI-100 and this branch, is to try a different ASIC that does
>>>>>>>>>>> work with this branch (if possible).
>>>>>>>>>>>
>>>>>>>>>>> Andrey
>>>>>>>>>>>
>>>>>>>>>> OK, I tried both this commit and the HEAD of amd-staging-drm-next
>>>>>>>>>> on two GPUs (MI100 and Radeon VII); both did not pass the hotplug-out
>>>>>>>>>> libdrm test. I might be able to gain access to MI200, but I
>>>>>>>>>> suspect it would not work either.
>>>>>>>>>>
>>>>>>>>>> I copied the complete dmesgs as follows. I highlighted the oopses
>>>>>>>>>> for you.
>>>>>>>>>>
>>>>>>>>>> Radeon VII:

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2022-05-11 17:02 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-08  8:45 [PATCH 1/2] drm/amdkfd: Cleanup IO links during KFD device removal Shuotao Xu
2022-04-08  8:45 ` [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD Shuotao Xu
2022-04-08 15:28   ` Andrey Grodzovsky
2022-04-09  1:28     ` [EXTERNAL] " Shuotao Xu
2022-04-11 15:52       ` Andrey Grodzovsky
2022-04-13 16:03         ` Shuotao Xu
2022-04-13 17:31           ` Andrey Grodzovsky
2022-04-14 14:00             ` Shuotao Xu
2022-04-14 14:24               ` Shuotao Xu
2022-04-14 15:13               ` Andrey Grodzovsky
2022-04-15 10:12                 ` Shuotao Xu
2022-04-15 16:43                   ` Andrey Grodzovsky
2022-04-18 13:22                     ` Shuotao Xu
2022-04-18 15:23                       ` Andrey Grodzovsky
2022-04-19  7:41                         ` Shuotao Xu
2022-04-19 16:01                           ` Andrey Grodzovsky
2022-04-19 16:18                             ` Felix Kuehling
2022-04-20  9:24                             ` Shuotao Xu
2022-04-20 15:44                               ` Andrey Grodzovsky
2022-04-20 18:41                                 ` Andrey Grodzovsky
2022-04-27  9:20                                   ` Shuotao Xu
2022-04-27 16:04                                     ` Andrey Grodzovsky
2022-05-10 11:03                                       ` Shuotao Xu
2022-05-10 16:34                                         ` Andrey Grodzovsky
2022-05-10 20:31                                         ` Felix Kuehling
2022-05-11  3:35                                           ` Shuotao Xu
2022-05-11 13:49                                             ` Andrey Grodzovsky
2022-05-11 16:49                                               ` Felix Kuehling
2022-05-11 17:02                                                 ` Andrey Grodzovsky
2022-04-12  0:07 ` [PATCH 1/2] drm/amdkfd: Cleanup IO links during KFD device removal Felix Kuehling
2022-04-12  1:38   ` [EXTERNAL] " Shuotao Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).