* Code Review Request for AMDGPU Hotplug Support
@ 2022-04-06  2:45 Shuotao Xu
  2022-04-06 14:13 ` Andrey Grodzovsky
  0 siblings, 1 reply; 13+ messages in thread
From: Shuotao Xu @ 2022-04-06  2:45 UTC (permalink / raw)
  To: amd-gfx; +Cc: Ziyue Yang, Lei Qu, Peng Cheng, Ran Shu

[-- Attachment #1: Type: text/plain, Size: 1128 bytes --]

Dear AMD Colleagues,

We are from Microsoft Research, and are working on GPU disaggregation technology.

We have created a new pull request, Add PCIe hotplug support for amdgpu by xushuotao · Pull Request #131 · RadeonOpenCompute/ROCK-Kernel-Driver (github.com) <https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131>, in ROCK-Kernel-Driver, which will enable PCIe hot-plug support for amdgpu.

We believe that hot-plug support for GPU devices can open doors for many advanced applications in data centers in the next few years, and we would like to have some reviewers on this PR so we can continue further technical discussions around this feature.

Would you please help review this PR?

Thank you very much!

Best regards,
Shuotao Xu

[-- Attachment #2: Type: text/html, Size: 3813 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Code Review Request for AMDGPU Hotplug Support
  2022-04-06  2:45 Code Review Request for AMDGPU Hotplug Support Shuotao Xu
@ 2022-04-06 14:13 ` Andrey Grodzovsky
  2022-04-06 14:25   ` [EXTERNAL] " Shuotao Xu
  0 siblings, 1 reply; 13+ messages in thread
From: Andrey Grodzovsky @ 2022-04-06 14:13 UTC (permalink / raw)
  To: Shuotao Xu, amd-gfx; +Cc: Ziyue Yang, Lei Qu, Peng Cheng, Ran Shu

Looks like you are using the 5.13 kernel for this work. FYI, we added
hot-plug support for the graphics stack in the 5.14 kernel (see
https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.14-AMDGPU-Hot-Unplug)


I am not sure about the code part since it all touches the KFD driver (the KFD
team can comment on that), but I was just wondering: if you try the 5.14
kernel, would things just work for you out of the box?

Andrey

On 2022-04-05 22:45, Shuotao Xu wrote:
> Dear AMD Colleagues,
> 
> We are from Microsoft Research, and are working on GPU disaggregation 
> technology.
> 
> We have created a new pull request, Add PCIe hotplug support for amdgpu by 
> xushuotao · Pull Request #131 · RadeonOpenCompute/ROCK-Kernel-Driver 
> (github.com) 
> <https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131>, in 
> ROCK-Kernel-Driver, which will enable PCIe hot-plug support for amdgpu.
> 
> We believe that hot-plug support for GPU devices can open doors for 
> many advanced applications in data centers in the next few years, and we 
> would like to have some reviewers on this PR so we can continue further 
> technical discussions around this feature.
> 
> Would you please help review this PR?
> 
> Thank you very much!
> 
> Best regards,
> 
> Shuotao Xu
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
  2022-04-06 14:13 ` Andrey Grodzovsky
@ 2022-04-06 14:25   ` Shuotao Xu
  2022-04-06 14:36     ` Andrey Grodzovsky
  0 siblings, 1 reply; 13+ messages in thread
From: Shuotao Xu @ 2022-04-06 14:25 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: Ziyue Yang, Lei Qu, Peng Cheng, Ran Shu


[-- Attachment #1.1: Type: text/plain, Size: 3241 bytes --]

Hi Andrey,

We just tried a 5.16 kernel based on the amd-staging-drm-next branch of https://gitlab.freedesktop.org/agd5f/linux.git, and found that hotplug did not work out of the box for the ROCm compute stack.
We did not try the rendering stack since we currently are more focused on AI workloads.

We have also created a patch against the amd-staging-drm-next branch to enable hotplug for the ROCm stack, which was sent in another, later email with the same subject. I am attaching the patch to this email, in case you would want to delete that later email.

Best regards,
Shuotao

From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Date: Wednesday, April 6, 2022 at 10:13 PM
To: Shuotao Xu <shuotaoxu@microsoft.com>, amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com>, Lei Qu <Lei.Qu@microsoft.com>, Peng Cheng <pengc@microsoft.com>, Ran Shu <Ran.Shu@microsoft.com>
Subject: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

Looks like you are using the 5.13 kernel for this work. FYI, we added
hot-plug support for the graphics stack in the 5.14 kernel (see
https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.14-AMDGPU-Hot-Unplug)


I am not sure about the code part since it all touches the KFD driver (the KFD
team can comment on that), but I was just wondering: if you try the 5.14
kernel, would things just work for you out of the box?

Andrey

On 2022-04-05 22:45, Shuotao Xu wrote:
> Dear AMD Colleagues,
>
> We are from Microsoft Research, and are working on GPU disaggregation
> technology.
>
> We have created a new pull request, Add PCIe hotplug support for amdgpu by
> xushuotao · Pull Request #131 · RadeonOpenCompute/ROCK-Kernel-Driver
> (github.com)
> <https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131>, in
> ROCK-Kernel-Driver, which will enable PCIe hot-plug support for amdgpu.
>
> We believe that hot-plug support for GPU devices can open doors for
> many advanced applications in data centers in the next few years, and we
> would like to have some reviewers on this PR so we can continue further
> technical discussions around this feature.
>
> Would you please help review this PR?
>
> Thank you very much!
>
> Best regards,
>
> Shuotao Xu
>

[-- Attachment #1.2: Type: text/html, Size: 7128 bytes --]

[-- Attachment #2: 0001-drm-amdkfd-Add-PCIe-Hotplug-Support-for-AMDGPU.patch --]
[-- Type: application/octet-stream, Size: 4008 bytes --]

From a4e53bda6f65b72b1f6a344c19677574d7842cd3 Mon Sep 17 00:00:00 2001
From: Shuotao Xu <shuotaoxu@microsoft.com>
Date: Wed, 6 Apr 2022 12:42:10 +0900
Subject: [PATCH] drm/amdkfd: Add PCIe Hotplug Support for AMDGPU

1. During PCIe probing, decrement the KFD lock which was incremented when
   the PCIe device was removed; otherwise kfd_open is going to fail.
2. Remove p2p links in sysfs when the device is hotplugged out.

Signed-off-by: Shuotao Xu <shuotaoxu@microsoft.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c   |  4 ++
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 50 +++++++++++++++++++++--
 2 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 62aa6c9d5123..c9638bc299dd 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -575,6 +575,10 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
 	if (kfd_resume(kfd))
 		goto kfd_resume_error;
 
+	/* release kfd lock b/o pcie hotplug out  */
+	if (kfd_is_locked())
+		atomic_dec(&kfd_locked);
+
 	if (kfd_topology_add_device(kfd)) {
 		dev_err(kfd_device, "Error adding device to topology\n");
 		goto kfd_topology_add_device_error;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index 3bdcae239bc0..cfa3b16f6939 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -132,6 +132,21 @@ struct kfd_dev *kfd_device_by_adev(const struct amdgpu_device *adev)
 	return device;
 }
 
+/* Called with write topology_lock acquired */
+static void kfd_release_link_prop(struct kfd_topology_device *dev, uint32_t node_id)
+{
+	struct kfd_iolink_properties *iolink, *tmp;
+
+	list_for_each_entry_safe(iolink, tmp, &dev->io_link_props, list) {
+		if (iolink->node_to == node_id) {
+			pr_debug("%s, io_link from_node = %d, to_node = %d", __func__, iolink->node_from, iolink->node_to);
+			list_del(&iolink->list);
+			kfree(iolink);
+			dev->node_props.io_links_count--;
+		}
+	}
+}
+
 /* Called with write topology_lock acquired */
 static void kfd_release_topology_device(struct kfd_topology_device *dev)
 {
@@ -556,6 +571,21 @@ static void kfd_remove_sysfs_file(struct kobject *kobj, struct attribute *attr)
 	kobject_put(kobj);
 }
 
+static void kfd_remove_sysfs_link_to(struct kfd_topology_device *dev, uint32_t node_id)
+{
+	struct kfd_iolink_properties *iolink;
+
+	if (dev->kobj_iolink) {
+		list_for_each_entry(iolink, &dev->io_link_props, list)
+			if (iolink->kobj && iolink->node_to == node_id) {
+				pr_debug("%s, io_link from_node = %d, to_node = %d", __func__, iolink->node_from, iolink->node_to);
+				kfd_remove_sysfs_file(iolink->kobj,
+									  &iolink->attr);
+				iolink->kobj = NULL;
+			}
+	}
+}
+
 static void kfd_remove_sysfs_node_entry(struct kfd_topology_device *dev)
 {
 	struct kfd_iolink_properties *iolink;
@@ -1490,20 +1520,34 @@ int kfd_topology_remove_device(struct kfd_dev *gpu)
 	struct kfd_topology_device *dev, *tmp;
 	uint32_t gpu_id;
 	int res = -ENODEV;
+	uint32_t node_id = 0;
+	bool found = false;
 
 	down_write(&topology_lock);
 
-	list_for_each_entry_safe(dev, tmp, &topology_device_list, list)
+	list_for_each_entry_safe(dev, tmp, &topology_device_list, list) {
 		if (dev->gpu == gpu) {
 			gpu_id = dev->gpu_id;
 			kfd_remove_sysfs_node_entry(dev);
 			kfd_release_topology_device(dev);
 			sys_props.num_devices--;
 			res = 0;
-			if (kfd_topology_update_sysfs() < 0)
-				kfd_topology_release_sysfs();
+			pr_debug("kfd_topology: removing gpu node, node id = %d", node_id);
+			found = true;
 			break;
 		}
+		node_id++;
+	}
+
+	if (found) {
+		list_for_each_entry(dev, &topology_device_list, list) {
+			kfd_remove_sysfs_link_to(dev, node_id);
+			kfd_release_link_prop(dev, node_id);
+		}
+		atomic_dec(&topology_crat_proximity_domain);
+		if (kfd_topology_update_sysfs() < 0)
+			kfd_topology_release_sysfs();
+	}
 
 	up_write(&topology_lock);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
  2022-04-06 14:25   ` [EXTERNAL] " Shuotao Xu
@ 2022-04-06 14:36     ` Andrey Grodzovsky
  2022-04-06 15:11       ` Shuotao Xu
  0 siblings, 1 reply; 13+ messages in thread
From: Andrey Grodzovsky @ 2022-04-06 14:36 UTC (permalink / raw)
  To: Shuotao Xu, amd-gfx; +Cc: Ziyue Yang, Lei Qu, Peng Cheng, Ran Shu

Can you attach dmesg for the failure without your patch against
amd-staging-drm-next?

Also, in general, patches for amdgpu upstream branches should be
submitted to the amd-gfx mailing list inline using git send-email, which
makes it easy to comment on and review them inline.

Andrey

On 2022-04-06 10:25, Shuotao Xu wrote:
> Hi Andrey,
> 
> We just tried a 5.16 kernel based on the amd-staging-drm-next branch of 
> https://gitlab.freedesktop.org/agd5f/linux.git, and found that hotplug 
> did not work out of the box for the ROCm compute stack.
> 
> We did not try the rendering stack since we currently are more focused 
> on AI workloads.
> 
> We have also created a patch against the amd-staging-drm-next branch to 
> enable hotplug for the ROCm stack, which was sent in another, later email 
> with the same subject. I am attaching the patch to this email, in case 
> you would want to delete that later email.
> 
> Best regards,
> 
> Shuotao
> 
> *From: *Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> *Date: *Wednesday, April 6, 2022 at 10:13 PM
> *To: *Shuotao Xu <shuotaoxu@microsoft.com>, 
> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
> *Cc: *Ziyue Yang <Ziyue.Yang@microsoft.com>, Lei Qu 
> <Lei.Qu@microsoft.com>, Peng Cheng <pengc@microsoft.com>, Ran Shu 
> <Ran.Shu@microsoft.com>
> *Subject: *[EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
> 
> Looks like you are using the 5.13 kernel for this work. FYI, we added
> hot-plug support for the graphics stack in the 5.14 kernel (see
> https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.14-AMDGPU-Hot-Unplug)
> 
> 
> I am not sure about the code part since it all touches the KFD driver (the KFD
> team can comment on that), but I was just wondering: if you try the 5.14
> kernel, would things just work for you out of the box?
> 
> Andrey
> 
> On 2022-04-05 22:45, Shuotao Xu wrote:
>> Dear AMD Colleagues,
>>
>> We are from Microsoft Research, and are working on GPU disaggregation
>> technology.
>>
>> We have created a new pull request, Add PCIe hotplug support for amdgpu by
>> xushuotao · Pull Request #131 · RadeonOpenCompute/ROCK-Kernel-Driver
>> (github.com)
>> <https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131>, in
>> ROCK-Kernel-Driver, which will enable PCIe hot-plug support for amdgpu.
>>
>> We believe that hot-plug support for GPU devices can open doors for
>> many advanced applications in data centers in the next few years, and we
>> would like to have some reviewers on this PR so we can continue further
>> technical discussions around this feature.
>>
>> Would you please help review this PR?
>>
>> Thank you very much!
>>
>> Best regards,
>>
>> Shuotao Xu
>>
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
  2022-04-06 14:36     ` Andrey Grodzovsky
@ 2022-04-06 15:11       ` Shuotao Xu
  2022-04-06 15:39         ` Andrey Grodzovsky
  2022-04-07  2:22         ` [EXTERNAL] " Joshi, Mukul
  0 siblings, 2 replies; 13+ messages in thread
From: Shuotao Xu @ 2022-04-06 15:11 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: Ziyue Yang, Lei Qu, Peng Cheng, Ran Shu

[-- Attachment #1: Type: text/plain, Size: 22701 bytes --]

Hi Andrey,

Thanks for your kind comment on the Linux patch submission protocol; please let me know if there is any way to rectify it.

dmesg is fine except for some warnings during PCI rescan after PCI removal of an AMD MI100.

The issue is that, after this, a ROCm application will segfault with the amdgpu driver unless the entire amdgpu kernel module is unloaded and reloaded, which does not meet our hotplug requirement. The issues upon investigation are:
1) kfd_lock is locked after hotplug, and kfd_open will return a fault right away to libhsakmt.
2) iolink/p2plink has anomalies after hotplug, and libhsakmt will find such anomalies and return an error.

Our patch has been tested with a single-instance AMD MI100 GPU, and it worked.

I am attaching the dmesg after rescan anyway, which will show the warning and the segfault.

[  132.054822] pci 0000:43:00.0: [1002:738c] type 00 class 0x038000
[  132.054856] pci 0000:43:00.0: reg 0x10: [mem 0x38b000000000-0x38b7ffffffff 64bit pref]
[  132.054877] pci 0000:43:00.0: reg 0x18: [mem 0x38b800000000-0x38b8001fffff 64bit pref]
[  132.054890] pci 0000:43:00.0: reg 0x20: [io  0xa000-0xa0ff]
[  132.054904] pci 0000:43:00.0: reg 0x24: [mem 0xb8400000-0xb847ffff]
[  132.054918] pci 0000:43:00.0: reg 0x30: [mem 0xb8480000-0xb849ffff pref]
[  132.055134] pci 0000:43:00.0: PME# supported from D1 D2 D3hot D3cold
[  132.055217] pci 0000:43:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:3c:14.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[  132.056001] pci 0000:43:00.0: Adding to iommu group 73
[  132.057943] pci 0000:43:00.0: BAR 0: assigned [mem 0x38b000000000-0x38b7ffffffff 64bit pref]
[  132.057960] pci 0000:43:00.0: BAR 2: assigned [mem 0x38b800000000-0x38b8001fffff 64bit pref]
[  132.057974] pci 0000:43:00.0: BAR 5: assigned [mem 0xb8400000-0xb847ffff]
[  132.057981] pci 0000:43:00.0: BAR 6: assigned [mem 0xb8480000-0xb849ffff pref]
[  132.057984] pci 0000:43:00.0: BAR 4: assigned [io  0xa000-0xa0ff]

[  132.058429] ======================================================
[  132.058453] WARNING: possible circular locking dependency detected
[  132.058477] 5.16.0-kfd+ #1 Not tainted
[  132.058492] ------------------------------------------------------
[  132.058515] bash/3632 is trying to acquire lock:
[  132.058534] ffffadee20adfb50 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: __flush_work+0x2f5/0x470
[  132.058554] [drm] initializing kernel modesetting (ARCTURUS 0x1002:0x738C 0x1002:0x0C34 0x01).
[  132.058577]
               but task is already holding lock:
[  132.058580] ffffffffa3c62308
[  132.058630] amdgpu 0000:43:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[  132.058638]  (
[  132.058678] [drm] register mmio base: 0xB8400000
[  132.058683] pci_rescan_remove_lock
[  132.058694] [drm] register mmio size: 524288
[  132.058713] ){+.+.}-{3:3}, at: rescan_store+0x55/0x90
[  132.058773]
               which lock already depends on the new lock.

[  132.058804]
               the existing dependency chain (in reverse order) is:
[  132.058819] [drm] add ip block number 0 <soc15_common>
[  132.058831]
               -> #1 (
[  132.058854] [drm] add ip block number 1 <gmc_v9_0>
[  132.058858] [drm] add ip block number 2 <vega20_ih>
[  132.058874] pci_rescan_remove_lock
[  132.058894] [drm] add ip block number 3 <psp>
[  132.058915] ){+.+.}-{3:3}
[  132.058931] [drm] add ip block number 4 <smu>
[  132.058951] :
[  132.058965] [drm] add ip block number 5 <gfx_v9_0>
[  132.058986]        __mutex_lock+0xa4/0x990
[  132.058996] [drm] add ip block number 6 <sdma_v4_0>
[  132.059016]        i801_add_tco_spt.isra.20+0x2a/0x1a0
[  132.059033] [drm] add ip block number 7 <vcn_v2_5>
[  132.059054]        i801_add_tco+0xf6/0x110
[  132.059075] [drm] add ip block number 8 <jpeg_v2_5>
[  132.059096]        i801_probe+0x402/0x860
[  132.059151]        local_pci_probe+0x40/0x90
[  132.059170]        work_for_cpu_fn+0x10/0x20
[  132.059189]        process_one_work+0x2a4/0x640
[  132.059208]        worker_thread+0x228/0x3f0
[  132.059227]        kthread+0x16d/0x1a0
[  132.059795]        ret_from_fork+0x1f/0x30
[  132.060337]
               -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
[  132.061405]        __lock_acquire+0x1552/0x1ac0
[  132.061950]        lock_acquire+0x26c/0x300
[  132.062484]        __flush_work+0x315/0x470
[  132.063009]        work_on_cpu+0x98/0xc0
[  132.063526]        pci_device_probe+0x1bc/0x1d0
[  132.064036]        really_probe+0x102/0x450
[  132.064532]        __driver_probe_device+0x100/0x170
[  132.065020]        driver_probe_device+0x1f/0xa0
[  132.065497]        __device_attach_driver+0x6b/0xe0
[  132.065975]        bus_for_each_drv+0x6a/0xb0
[  132.066449]        __device_attach+0xe2/0x160
[  132.066912]        pci_bus_add_device+0x4a/0x80
[  132.067365]        pci_bus_add_devices+0x2c/0x70
[  132.067812]        pci_bus_add_devices+0x65/0x70
[  132.068253]        pci_bus_add_devices+0x65/0x70
[  132.068688]        pci_bus_add_devices+0x65/0x70
[  132.068936] amdgpu 0000:43:00.0: amdgpu: Fetched VBIOS from ROM BAR
[  132.069109]        pci_bus_add_devices+0x65/0x70
[  132.069602] amdgpu: ATOM BIOS: 113-D3431401-X00
[  132.070058]        pci_bus_add_devices+0x65/0x70
[  132.070572] [drm] VCN(0) decode is enabled in VM mode
[  132.070997]        pci_rescan_bus+0x23/0x30
[  132.071000]        rescan_store+0x61/0x90
[  132.071003]        kernfs_fop_write_iter+0x132/0x1b0
[  132.071501] [drm] VCN(1) decode is enabled in VM mode
[  132.071964]        new_sync_write+0x11f/0x1b0
[  132.072432] [drm] VCN(0) encode is enabled in VM mode
[  132.072900]        vfs_write+0x35b/0x3b0
[  132.073376] [drm] VCN(1) encode is enabled in VM mode
[  132.073847]        ksys_write+0xa7/0xe0
[  132.074335] [drm] JPEG(0) JPEG decode is enabled in VM mode
[  132.074803]        do_syscall_64+0x34/0x80
[  132.074808]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  132.074811]
               other info that might help us debug this:

[  132.074813]  Possible unsafe locking scenario:

[  132.075302] [drm] JPEG(1) JPEG decode is enabled in VM mode
[  132.075779]        CPU0                    CPU1
[  132.076361] amdgpu 0000:43:00.0: amdgpu: MEM ECC is active.
[  132.076765]        ----                    ----
[  132.077265] amdgpu 0000:43:00.0: amdgpu: SRAM ECC is active.
[  132.078649]   lock(pci_rescan_remove_lock);
[  132.078652]                                lock((work_completion)(&wfc.work));
[  132.078653]                                lock(pci_rescan_remove_lock);
[  132.078655]   lock((work_completion)(&wfc.work));
[  132.078656]
                *** DEADLOCK ***

[  132.078656] 5 locks held by bash/3632:
[  132.078658]  #0: ffff9c39c7b89438
[  132.079612] amdgpu 0000:43:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
[  132.080089]  (
[  132.080602] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[  132.081082] sb_writers
[  132.081601] amdgpu 0000:43:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
[  132.082102] #6
[  132.082630] amdgpu 0000:43:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[  132.083152] ){.+.+}-{0:0}
[  132.083687] amdgpu 0000:43:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
[  132.084210] , at: ksys_write+0xa7/0xe0
[  132.085708] [drm] Detected VRAM RAM=32752M, BAR=32768M
[  132.086205]  #1:
[  132.086733] [drm] RAM width 4096bits HBM
[  132.087269] ffff9c5959011088
[  132.087890] [drm] amdgpu: 32752M of VRAM memory ready
[  132.088389]  (
[  132.088972] [drm] amdgpu: 32752M of GTT memory ready.
[  132.089572] &of->mutex
[  132.090206] [drm] GART: num cpu pages 131072, num gpu pages 131072
[  132.090804] ){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x103/0x1b0
[  132.090808]  #2: ffff9c39c882a9e0 (kn->active#423){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x10c/0x1b0
[  132.091639] [drm] PCIE GART of 512M enabled.
[  132.092117]  #3:
[  132.092801] [drm] PTB located at 0x0000008000000000
[  132.093480] ffffffffa3c62308
[  132.094566] amdgpu 0000:43:00.0: amdgpu: PSP runtime database doesn't exist
[  132.094822]  (pci_rescan_remove_lock){+.+.}-{3:3}, at: rescan_store+0x55/0x90
[  132.094827]  #4: ffff9c597392b248 (&dev->mutex){....}-{3:3}, at: __device_attach+0x39/0x160
[  132.094835]
               stack backtrace:
[  132.097098] [drm] Found VCN firmware Version ENC: 1.1 DEC: 1 VEP: 0 Revision: 21
[  132.097467] CPU: 47 PID: 3632 Comm: bash Not tainted 5.16.0-kfd+ #1
[  132.098169] amdgpu 0000:43:00.0: amdgpu: Will use PSP to load VCN firmware
[  132.098839] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  132.098841] Call Trace:
[  132.098842]  <TASK>
[  132.098843]  dump_stack_lvl+0x44/0x57
[  132.098848]  check_noncircular+0x105/0x120
[  132.098853]  ? unwind_get_return_address+0x1b/0x30
[  132.112924]  ? register_lock_class+0x46/0x780
[  132.113630]  ? __lock_acquire+0x1552/0x1ac0
[  132.114342]  __lock_acquire+0x1552/0x1ac0
[  132.115050]  lock_acquire+0x26c/0x300
[  132.115755]  ? __flush_work+0x2f5/0x470
[  132.116460]  ? lock_is_held_type+0xdf/0x130
[  132.117177]  __flush_work+0x315/0x470
[  132.117890]  ? __flush_work+0x2f5/0x470
[  132.118604]  ? lock_is_held_type+0xdf/0x130
[  132.119305]  ? mark_held_locks+0x49/0x70
[  132.119981]  ? queue_work_on+0x2f/0x70
[  132.120645]  ? lockdep_hardirqs_on+0x79/0x100
[  132.121300]  work_on_cpu+0x98/0xc0
[  132.121702] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR
[  132.121947]  ? __traceiter_workqueue_execute_end+0x40/0x40
[  132.123270]  ? pci_device_shutdown+0x60/0x60
[  132.123880]  pci_device_probe+0x1bc/0x1d0
[  132.124475]  really_probe+0x102/0x450
[  132.125060]  __driver_probe_device+0x100/0x170
[  132.125641]  driver_probe_device+0x1f/0xa0
[  132.126215]  __device_attach_driver+0x6b/0xe0
[  132.126797]  ? driver_allows_async_probing+0x50/0x50
[  132.127383]  ? driver_allows_async_probing+0x50/0x50
[  132.127960]  bus_for_each_drv+0x6a/0xb0
[  132.128528]  __device_attach+0xe2/0x160
[  132.129095]  pci_bus_add_device+0x4a/0x80
[  132.129659]  pci_bus_add_devices+0x2c/0x70
[  132.130213]  pci_bus_add_devices+0x65/0x70
[  132.130753]  pci_bus_add_devices+0x65/0x70
[  132.131283]  pci_bus_add_devices+0x65/0x70
[  132.131780]  pci_bus_add_devices+0x65/0x70
[  132.132270]  pci_bus_add_devices+0x65/0x70
[  132.132757]  pci_rescan_bus+0x23/0x30
[  132.133233]  rescan_store+0x61/0x90
[  132.133701]  kernfs_fop_write_iter+0x132/0x1b0
[  132.134167]  new_sync_write+0x11f/0x1b0
[  132.134627]  vfs_write+0x35b/0x3b0
[  132.135062]  ksys_write+0xa7/0xe0
[  132.135503]  do_syscall_64+0x34/0x80
[  132.135933]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  132.136358] RIP: 0033:0x7f0058a73224
[  132.136775] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 8d 05 c1 07 2e 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
[  132.137663] RSP: 002b:00007ffc4f4c71a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  132.138121] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f0058a73224
[  132.138590] RDX: 0000000000000002 RSI: 000055d466c24450 RDI: 0000000000000001
[  132.139064] RBP: 000055d466c24450 R08: 000000000000000a R09: 0000000000000001
[  132.139532] R10: 000000000000000a R11: 0000000000000246 R12: 00007f0058d4f760
[  132.140003] R13: 0000000000000002 R14: 00007f0058d4b2a0 R15: 00007f0058d4a760
[  132.140485]  </TASK>
[  132.183669] amdgpu 0000:43:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
[  132.184214] amdgpu 0000:43:00.0: amdgpu: DTM: optional dtm ta ucode is not available
[  132.184735] amdgpu 0000:43:00.0: amdgpu: RAP: optional rap ta ucode is not available
[  132.185234] amdgpu 0000:43:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[  132.185823] amdgpu 0000:43:00.0: amdgpu: use vbios provided pptable
[  132.186327] amdgpu 0000:43:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.6
[  132.188783] amdgpu 0000:43:00.0: amdgpu: SMU is initialized successfully!
[  132.190039] [drm] kiq ring mec 2 pipe 1 q 0
[  132.203608] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[  132.204178] [drm] JPEG decode initialized successfully.
[  132.246079] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[  132.327589] memmap_init_zone_device initialised 8388608 pages in 64ms
[  132.328139] amdgpu: HMM registered 32752MB device memory
[  132.328784] amdgpu: Virtual CRAT table created for GPU
[  132.329844] amdgpu: Topology: Add dGPU node [0x738c:0x1002]
[  132.330387] kfd kfd: amdgpu: added device 1002:738c
[  132.330965] amdgpu 0000:43:00.0: amdgpu: SE 8, SH per SE 1, CU per SH 16, active_cu_number 72
[  132.331725] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 0 on hub 0
[  132.332296] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 1 on hub 0
[  132.332856] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 4 on hub 0
[  132.333414] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 5 on hub 0
[  132.333965] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 6 on hub 0
[  132.334507] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 7 on hub 0
[  132.335057] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 8 on hub 0
[  132.335594] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 9 on hub 0
[  132.336137] amdgpu 0000:43:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 10 on hub 0
[  132.336679] amdgpu 0000:43:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
[  132.337234] amdgpu 0000:43:00.0: amdgpu: ring sdma1 uses VM inv eng 1 on hub 1
[  132.337790] amdgpu 0000:43:00.0: amdgpu: ring sdma2 uses VM inv eng 4 on hub 1
[  132.338343] amdgpu 0000:43:00.0: amdgpu: ring sdma3 uses VM inv eng 5 on hub 1
[  132.338906] amdgpu 0000:43:00.0: amdgpu: ring sdma4 uses VM inv eng 6 on hub 1
[  132.339448] amdgpu 0000:43:00.0: amdgpu: ring sdma5 uses VM inv eng 0 on hub 2
[  132.339987] amdgpu 0000:43:00.0: amdgpu: ring sdma6 uses VM inv eng 1 on hub 2
[  132.340519] amdgpu 0000:43:00.0: amdgpu: ring sdma7 uses VM inv eng 4 on hub 2
[  132.341041] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 5 on hub 2
[  132.341570] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 6 on hub 2
[  132.342101] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 7 on hub 2
[  132.342630] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_1 uses VM inv eng 8 on hub 2
[  132.343152] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 9 on hub 2
[  132.343657] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv eng 10 on hub 2
[  132.344136] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_0 uses VM inv eng 11 on hub 2
[  132.344610] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_1 uses VM inv eng 12 on hub 2
[  132.378213] amdgpu: Detected AMDGPU 6 Perf Events.
[  132.387349] [drm] Initialized amdgpu 3.46.0 20150101 for 0000:43:00.0 on minor 1
[  132.388530] pcieport 0000:d7:00.0: bridge window [io  0x1000-0x0fff] to [bus d8] add_size 1000
[  132.389078] pcieport 0000:d7:00.0: BAR 13: no space for [io  size 0x1000]
[  132.389600] pcieport 0000:d7:00.0: BAR 13: failed to assign [io  size 0x1000]
[  132.390091] pcieport 0000:d7:00.0: BAR 13: no space for [io  size 0x1000]
[  132.390568] pcieport 0000:d7:00.0: BAR 13: failed to assign [io  size 0x1000]
[  155.359200] HelloWorld[3824]: segfault at 68 ip 00007f4c979f764e sp 00007ffc9b3bb610 error 4 in libamdhip64.so.4.4.21432-f9dccde4[7f4c979b3000+2eb000]
[  155.360268] Code: 48 8b 45 e8 64 48 33 04 25 28 00 00 00 74 05 e8 b8 c7 fb ff 48 8b 5d f8 c9 c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45 f8 <48> 8b 40 68 5d c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45 f8

Best regards,
Shuotao

From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Date: Wednesday, April 6, 2022 at 10:36 PM
To: Shuotao Xu <shuotaoxu@microsoft.com>, amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com>, Lei Qu <Lei.Qu@microsoft.com>, Peng Cheng <pengc@microsoft.com>, Ran Shu <Ran.Shu@microsoft.com>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

Can you attach dmesg for the failure without your patch against
amd-staging-drm-next?

Also, in general, patches for amdgpu upstream branches should be
submitted to the amd-gfx mailing list inline using git send-email, which
makes it easy to comment on and review them inline.

Andrey

On 2022-04-06 10:25, Shuotao Xu wrote:
> Hi Andrey,
>
> We just tried a 5.16 kernel based on the amd-staging-drm-next branch of
> https://gitlab.freedesktop.org/agd5f/linux.git, and found that hotplug
> did not work out of the box for the ROCm compute stack.
>
> We did not try the rendering stack since we currently are more focused
> on AI workloads.
>
> We have also created a patch against the amd-staging-drm-next branch to
> enable hotplug for the ROCm stack, which was sent in another, later email
> with the same subject. I am attaching the patch to this email, in case
> you would want to delete that later email.
>
> Best regards,
>
> Shuotao
>
> *From: *Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> *Date: *Wednesday, April 6, 2022 at 10:13 PM
> *To: *Shuotao Xu <shuotaoxu@microsoft.com>,
> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
> *Cc: *Ziyue Yang <Ziyue.Yang@microsoft.com>, Lei Qu
> <Lei.Qu@microsoft.com>, Peng Cheng <pengc@microsoft.com>, Ran Shu
> <Ran.Shu@microsoft.com>
> *Subject: *[EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
>
> Looks like you are using the 5.13 kernel for this work. FYI, we added
> hot-plug support for the graphics stack in the 5.14 kernel (see
> https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.14-AMDGPU-Hot-Unplug)
>
>
> I am not sure about the code part since it all touches the KFD driver (the KFD
> team can comment on that), but I was just wondering: if you try the 5.14
> kernel, would things just work for you out of the box?
>
> Andrey
>
> On 2022-04-05 22:45, Shuotao Xu wrote:
>> Dear AMD Colleagues,
>>
>> We are from Microsoft Research, and are working on GPU disaggregation
>> technology.
>>
>> We have created a new pull request, Add PCIe hotplug support for amdgpu by
>> xushuotao · Pull Request #131 · RadeonOpenCompute/ROCK-Kernel-Driver
>> (github.com)
>> <https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131>, in
>> ROCK-Kernel-Driver, which will enable PCIe hot-plug support for amdgpu.
>>
>> We believe that hot-plug support for GPU devices can open doors for
>> many advanced applications in data centers in the next few years, and we
>> would like to have some reviewers on this PR so we can continue further
>> technical discussions around this feature.
>>
>> Would you please help review this PR?
>>
>> Thank you very much!
>>
>> Best regards,
>>
>> Shuotao Xu
>>
>

[-- Attachment #2: Type: text/html, Size: 52376 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
  2022-04-06 15:11       ` Shuotao Xu
@ 2022-04-06 15:39         ` Andrey Grodzovsky
  2022-04-07  2:55           ` [EXTERNAL] " Shuotao Xu
  2022-04-07  2:22         ` [EXTERNAL] " Joshi, Mukul
  1 sibling, 1 reply; 13+ messages in thread
From: Andrey Grodzovsky @ 2022-04-06 15:39 UTC (permalink / raw)
  To: Shuotao Xu, amd-gfx, Kuehling, Felix
  Cc: Ziyue Yang, Lei Qu, Peng Cheng, Ran Shu

+ Felix

On 2022-04-06 11:11, Shuotao Xu wrote:
> Hi Andrey,
> 
> Thanks for your kind comment on the Linux patch submission protocol; please 
> let me know if there is any way to rectify it.

Just resend your patch to the amd-gfx mailing list using
git send-email (see here how to use it -
https://burzalodowa.wordpress.com/2013/10/05/how-to-send-patches-with-git-send-email/)

I suggest adding --cover-letter so you will be able to explain the
story behind the patch.

> 
> dmesg is fine except for some warnings during PCI rescan after PCI 
> removal of an AMD MI100.
> 
> The issue is that, after this, a ROCm application will segfault with 
> the amdgpu driver unless the entire amdgpu kernel module is unloaded and 
> reloaded, which does not meet our hotplug requirement. The issues upon 
> investigation are:
> 
> 1) kfd_lock is locked after hotplug, and kfd_open will return a fault 
> right away to libhsakmt.

I see now: kfd_lock is static, and so a single instance across all devices,
and so it does not go away after device removal but only after driver unload.
In this case I am not sure it's the best idea to just decrement kfd_lock
on device init, since in a multi-GPU system it might be locked on
purpose because another device is going through reset, for example, right
at this moment.

Felix, kgd2kfd_suspend is also called during device PCI remove, meaning an
unbalanced decrement of the lock. Maybe we should not decrement it; instead,
adding a drm_dev_enter guard in kgd2kfd_suspend would avoid touching
kfd_locked if we are during PCI remove.
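
Roughly, a sketch of that idea (purely illustrative, not a tested change;
the exact kgd2kfd_suspend signature on amd-staging-drm-next and the
kfd->ddev back-pointer to the drm_device are assumptions here) could look
like this:

void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
{
	int idx;

	if (!kfd->init_complete)
		return;

	if (!run_pm) {
		/*
		 * drm_dev_enter() fails once drm_dev_unplug() has run for
		 * this device, so during PCI remove we skip the kfd_locked
		 * increment and the counter stays balanced without any
		 * extra decrement at the next device init.
		 */
		if (drm_dev_enter(kfd->ddev, &idx)) {
			if (atomic_inc_return(&kfd_locked) == 1)
				kfd_suspend_all_processes(); /* as in the existing code */
			drm_dev_exit(idx);
		}
	}

	/* ... the rest of the suspend path stays unchanged ... */
}

That way a live device still takes the kfd_locked reference as before, and
only a device that is being unplugged skips it.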


> 
> 2) iolink/p2plink has anomalies after hotplug, and libhsakmt will find 
> such anomalies and return an error.

Can you point to what abnormalities? As part of PCI hot unplug we clean up
all sysfs files, and this looks like part of it. Do you see a sysfs file
already exists error on the next pci_rescan?


> 
> Our patch has been tested with a single-instance AMD MI100 GPU, and 
> it worked.

Exactly; for a multi-GPU system, arbitrarily decrementing kfd_lock on device load
can be problematic.

Andrey

> 
> I am attaching the dmesg after rescan anyway, which will show the 
> warning and the segfault.
> 
> [  132.054822] pci 0000:43:00.0: [1002:738c] type 00 class 0x038000
> 
> [  132.054856] pci 0000:43:00.0: reg 0x10: [mem 
> 0x38b000000000-0x38b7ffffffff 64bit pref]
> 
> [  132.054877] pci 0000:43:00.0: reg 0x18: [mem 
> 0x38b800000000-0x38b8001fffff 64bit pref]
> 
> [  132.054890] pci 0000:43:00.0: reg 0x20: [io  0xa000-0xa0ff]
> 
> [  132.054904] pci 0000:43:00.0: reg 0x24: [mem 0xb8400000-0xb847ffff]
> 
> [  132.054918] pci 0000:43:00.0: reg 0x30: [mem 0xb8480000-0xb849ffff pref]
> 
> [  132.055134] pci 0000:43:00.0: PME# supported from D1 D2 D3hot D3cold
> 
> [  132.055217] pci 0000:43:00.0: 63.008 Gb/s available PCIe bandwidth, 
> limited by 8.0 GT/s PCIe x8 link at 0000:3c:14.0 (capable of 252.048 
> Gb/s with 16.0 GT/s PCIe x16 link)
> 
> [  132.056001] pci 0000:43:00.0: Adding to iommu group 73
> 
> [  132.057943] pci 0000:43:00.0: BAR 0: assigned [mem 
> 0x38b000000000-0x38b7ffffffff 64bit pref]
> 
> [  132.057960] pci 0000:43:00.0: BAR 2: assigned [mem 
> 0x38b800000000-0x38b8001fffff 64bit pref]
> 
> [  132.057974] pci 0000:43:00.0: BAR 5: assigned [mem 0xb8400000-0xb847ffff]
> 
> [  132.057981] pci 0000:43:00.0: BAR 6: assigned [mem 
> 0xb8480000-0xb849ffff pref]
> 
> [  132.057984] pci 0000:43:00.0: BAR 4: assigned [io  0xa000-0xa0ff]
> 
> [  132.058429] ======================================================
> 
> [  132.058453] WARNING: possible circular locking dependency detected
> 
> [  132.058477] 5.16.0-kfd+ #1 Not tainted
> 
> [  132.058492] ------------------------------------------------------
> 
> [  132.058515] bash/3632 is trying to acquire lock:
> 
> [  132.058534] ffffadee20adfb50 
> ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: __flush_work+0x2f5/0x470
> 
> [  132.058554] [drm] initializing kernel modesetting (ARCTURUS 
> 0x1002:0x738C 0x1002:0x0C34 0x01).
> 
> [  132.058577]
> 
>                 but task is already holding lock:
> 
> [  132.058580] ffffffffa3c62308
> 
> [  132.058630] amdgpu 0000:43:00.0: amdgpu: Trusted Memory Zone (TMZ) 
> feature not supported
> 
> [  132.058638]  (
> 
> [  132.058678] [drm] register mmio base: 0xB8400000
> 
> [  132.058683] pci_rescan_remove_lock
> 
> [  132.058694] [drm] register mmio size: 524288
> 
> [  132.058713] ){+.+.}-{3:3}, at: rescan_store+0x55/0x90
> 
> [  132.058773]
> 
>                 which lock already depends on the new lock.
> 
> [  132.058804]
> 
>                 the existing dependency chain (in reverse order) is:
> 
> [  132.058819] [drm] add ip block number 0 <soc15_common>
> 
> [  132.058831]
> 
>                 -> #1 (
> 
> [  132.058854] [drm] add ip block number 1 <gmc_v9_0>
> 
> [  132.058858] [drm] add ip block number 2 <vega20_ih>
> 
> [  132.058874] pci_rescan_remove_lock
> 
> [  132.058894] [drm] add ip block number 3 <psp>
> 
> [  132.058915] ){+.+.}-{3:3}
> 
> [  132.058931] [drm] add ip block number 4 <smu>
> 
> [  132.058951] :
> 
> [  132.058965] [drm] add ip block number 5 <gfx_v9_0>
> 
> [  132.058986]        __mutex_lock+0xa4/0x990
> 
> [  132.058996] [drm] add ip block number 6 <sdma_v4_0>
> 
> [  132.059016]        i801_add_tco_spt.isra.20+0x2a/0x1a0
> 
> [  132.059033] [drm] add ip block number 7 <vcn_v2_5>
> 
> [  132.059054]        i801_add_tco+0xf6/0x110
> 
> [  132.059075] [drm] add ip block number 8 <jpeg_v2_5>
> 
> [  132.059096]        i801_probe+0x402/0x860
> 
> [  132.059151]        local_pci_probe+0x40/0x90
> 
> [  132.059170]        work_for_cpu_fn+0x10/0x20
> 
> [  132.059189]        process_one_work+0x2a4/0x640
> 
> [  132.059208]        worker_thread+0x228/0x3f0
> 
> [  132.059227]        kthread+0x16d/0x1a0
> 
> [  132.059795]        ret_from_fork+0x1f/0x30
> 
> [  132.060337]
> 
>                 -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
> 
> [  132.061405]        __lock_acquire+0x1552/0x1ac0
> 
> [  132.061950]        lock_acquire+0x26c/0x300
> 
> [  132.062484]        __flush_work+0x315/0x470
> 
> [  132.063009]        work_on_cpu+0x98/0xc0
> 
> [  132.063526]        pci_device_probe+0x1bc/0x1d0
> 
> [  132.064036]        really_probe+0x102/0x450
> 
> [  132.064532]        __driver_probe_device+0x100/0x170
> 
> [  132.065020]        driver_probe_device+0x1f/0xa0
> 
> [  132.065497]        __device_attach_driver+0x6b/0xe0
> 
> [  132.065975]        bus_for_each_drv+0x6a/0xb0
> 
> [  132.066449]        __device_attach+0xe2/0x160
> 
> [  132.066912]        pci_bus_add_device+0x4a/0x80
> 
> [  132.067365]        pci_bus_add_devices+0x2c/0x70
> 
> [  132.067812]        pci_bus_add_devices+0x65/0x70
> 
> [  132.068253]        pci_bus_add_devices+0x65/0x70
> 
> [  132.068688]        pci_bus_add_devices+0x65/0x70
> 
> [  132.068936] amdgpu 0000:43:00.0: amdgpu: Fetched VBIOS from ROM BAR
> 
> [  132.069109]        pci_bus_add_devices+0x65/0x70
> 
> [  132.069602] amdgpu: ATOM BIOS: 113-D3431401-X00
> 
> [  132.070058]        pci_bus_add_devices+0x65/0x70
> 
> [  132.070572] [drm] VCN(0) decode is enabled in VM mode
> 
> [  132.070997]        pci_rescan_bus+0x23/0x30
> 
> [  132.071000]        rescan_store+0x61/0x90
> 
> [  132.071003]        kernfs_fop_write_iter+0x132/0x1b0
> 
> [  132.071501] [drm] VCN(1) decode is enabled in VM mode
> 
> [  132.071964]        new_sync_write+0x11f/0x1b0
> 
> [  132.072432] [drm] VCN(0) encode is enabled in VM mode
> 
> [  132.072900]        vfs_write+0x35b/0x3b0
> 
> [  132.073376] [drm] VCN(1) encode is enabled in VM mode
> 
> [  132.073847]        ksys_write+0xa7/0xe0
> 
> [  132.074335] [drm] JPEG(0) JPEG decode is enabled in VM mode
> 
> [  132.074803]        do_syscall_64+0x34/0x80
> 
> [  132.074808]        entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> [  132.074811]
> 
>                 other info that might help us debug this:
> 
> [  132.074813]  Possible unsafe locking scenario:
> 
> [  132.075302] [drm] JPEG(1) JPEG decode is enabled in VM mode
> 
> [  132.075779]        CPU0                    CPU1
> 
> [  132.076361] amdgpu 0000:43:00.0: amdgpu: MEM ECC is active.
> 
> [  132.076765]        ----                    ----
> 
> [  132.077265] amdgpu 0000:43:00.0: amdgpu: SRAM ECC is active.
> 
> [  132.078649]   lock(pci_rescan_remove_lock);
> 
> [  132.078652]                                
> lock((work_completion)(&wfc.work));
> 
> [  132.078653]                                lock(pci_rescan_remove_lock);
> 
> [  132.078655]   lock((work_completion)(&wfc.work));
> 
> [  132.078656]
> 
>                  *** DEADLOCK ***
> 
> [  132.078656] 5 locks held by bash/3632:
> 
> [  132.078658]  #0: ffff9c39c7b89438
> 
> [  132.079612] amdgpu 0000:43:00.0: amdgpu: RAS INFO: ras initialized 
> successfully, hardware ability[7fff] ras_mask[7fff]
> 
> [  132.080089]  (
> 
> [  132.080602] [drm] vm size is 262144 GB, 4 levels, block size is 
> 9-bit, fragment size is 9-bit
> 
> [  132.081082] sb_writers
> 
> [  132.081601] amdgpu 0000:43:00.0: amdgpu: VRAM: 32752M 
> 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
> 
> [  132.082102] #6
> 
> [  132.082630] amdgpu 0000:43:00.0: amdgpu: GART: 512M 
> 0x0000000000000000 - 0x000000001FFFFFFF
> 
> [  132.083152] ){.+.+}-{0:0}
> 
> [  132.083687] amdgpu 0000:43:00.0: amdgpu: AGP: 267878400M 
> 0x0000008800000000 - 0x0000FFFFFFFFFFFF
> 
> [  132.084210] , at: ksys_write+0xa7/0xe0
> 
> [  132.085708] [drm] Detected VRAM RAM=32752M, BAR=32768M
> 
> [  132.086205]  #1:
> 
> [  132.086733] [drm] RAM width 4096bits HBM
> 
> [  132.087269] ffff9c5959011088
> 
> [  132.087890] [drm] amdgpu: 32752M of VRAM memory ready
> 
> [  132.088389]  (
> 
> [  132.088972] [drm] amdgpu: 32752M of GTT memory ready.
> 
> [  132.089572] &of->mutex
> 
> [  132.090206] [drm] GART: num cpu pages 131072, num gpu pages 131072
> 
> [  132.090804] ){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x103/0x1b0
> 
> [  132.090808]  #2: ffff9c39c882a9e0 (kn->active#423){.+.+}-{0:0}, at: 
> kernfs_fop_write_iter+0x10c/0x1b0
> 
> [  132.091639] [drm] PCIE GART of 512M enabled.
> 
> [  132.092117]  #3:
> 
> [  132.092801] [drm] PTB located at 0x0000008000000000
> 
> [  132.093480] ffffffffa3c62308
> 
> [  132.094566] amdgpu 0000:43:00.0: amdgpu: PSP runtime database doesn't 
> exist
> 
> [  132.094822]  (pci_rescan_remove_lock){+.+.}-{3:3}, at: 
> rescan_store+0x55/0x90
> 
> [  132.094827]  #4: ffff9c597392b248 (&dev->mutex){....}-{3:3}, at: 
> __device_attach+0x39/0x160
> 
> [  132.094835]
> 
>                 stack backtrace:
> 
> [  132.097098] [drm] Found VCN firmware Version ENC: 1.1 DEC: 1 VEP: 0 
> Revision: 21
> 
> [  132.097467] CPU: 47 PID: 3632 Comm: bash Not tainted 5.16.0-kfd+ #1
> 
> [  132.098169] amdgpu 0000:43:00.0: amdgpu: Will use PSP to load VCN 
> firmware
> 
> [  132.098839] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, 
> BIOS 2.1 08/14/2018
> 
> [  132.098841] Call Trace:
> 
> [  132.098842]  <TASK>
> 
> [  132.098843]  dump_stack_lvl+0x44/0x57
> 
> [  132.098848]  check_noncircular+0x105/0x120
> 
> [  132.098853]  ? unwind_get_return_address+0x1b/0x30
> 
> [  132.112924]  ? register_lock_class+0x46/0x780
> 
> [  132.113630]  ? __lock_acquire+0x1552/0x1ac0
> 
> [  132.114342]  __lock_acquire+0x1552/0x1ac0
> 
> [  132.115050]  lock_acquire+0x26c/0x300
> 
> [  132.115755]  ? __flush_work+0x2f5/0x470
> 
> [  132.116460]  ? lock_is_held_type+0xdf/0x130
> 
> [  132.117177]  __flush_work+0x315/0x470
> 
> [  132.117890]  ? __flush_work+0x2f5/0x470
> 
> [  132.118604]  ? lock_is_held_type+0xdf/0x130
> 
> [  132.119305]  ? mark_held_locks+0x49/0x70
> 
> [  132.119981]  ? queue_work_on+0x2f/0x70
> 
> [  132.120645]  ? lockdep_hardirqs_on+0x79/0x100
> 
> [  132.121300]  work_on_cpu+0x98/0xc0
> 
> [  132.121702] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR
> 
> [  132.121947]  ? __traceiter_workqueue_execute_end+0x40/0x40
> 
> [  132.123270]  ? pci_device_shutdown+0x60/0x60
> 
> [  132.123880]  pci_device_probe+0x1bc/0x1d0
> 
> [  132.124475]  really_probe+0x102/0x450
> 
> [  132.125060]  __driver_probe_device+0x100/0x170
> 
> [  132.125641]  driver_probe_device+0x1f/0xa0
> 
> [  132.126215]  __device_attach_driver+0x6b/0xe0
> 
> [  132.126797]  ? driver_allows_async_probing+0x50/0x50
> 
> [  132.127383]  ? driver_allows_async_probing+0x50/0x50
> 
> [  132.127960]  bus_for_each_drv+0x6a/0xb0
> 
> [  132.128528]  __device_attach+0xe2/0x160
> 
> [  132.129095]  pci_bus_add_device+0x4a/0x80
> 
> [  132.129659]  pci_bus_add_devices+0x2c/0x70
> 
> [  132.130213]  pci_bus_add_devices+0x65/0x70
> 
> [  132.130753]  pci_bus_add_devices+0x65/0x70
> 
> [  132.131283]  pci_bus_add_devices+0x65/0x70
> 
> [  132.131780]  pci_bus_add_devices+0x65/0x70
> 
> [  132.132270]  pci_bus_add_devices+0x65/0x70
> 
> [  132.132757]  pci_rescan_bus+0x23/0x30
> 
> [  132.133233]  rescan_store+0x61/0x90
> 
> [  132.133701]  kernfs_fop_write_iter+0x132/0x1b0
> 
> [  132.134167]  new_sync_write+0x11f/0x1b0
> 
> [  132.134627]  vfs_write+0x35b/0x3b0
> 
> [  132.135062]  ksys_write+0xa7/0xe0
> 
> [  132.135503]  do_syscall_64+0x34/0x80
> 
> [  132.135933]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> [  132.136358] RIP: 0033:0x7f0058a73224
> 
> [  132.136775] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 
> 00 00 00 66 90 48 8d 05 c1 07 2e 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 
> 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
> 
> [  132.137663] RSP: 002b:00007ffc4f4c71a8 EFLAGS: 00000246 ORIG_RAX: 
> 0000000000000001
> 
> [  132.138121] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 
> 00007f0058a73224
> 
> [  132.138590] RDX: 0000000000000002 RSI: 000055d466c24450 RDI: 
> 0000000000000001
> 
> [  132.139064] RBP: 000055d466c24450 R08: 000000000000000a R09: 
> 0000000000000001
> 
> [  132.139532] R10: 000000000000000a R11: 0000000000000246 R12: 
> 00007f0058d4f760
> 
> [  132.140003] R13: 0000000000000002 R14: 00007f0058d4b2a0 R15: 
> 00007f0058d4a760
> 
> [  132.140485]  </TASK>
> 
> [  132.183669] amdgpu 0000:43:00.0: amdgpu: HDCP: optional hdcp ta ucode 
> is not available
> 
> [  132.184214] amdgpu 0000:43:00.0: amdgpu: DTM: optional dtm ta ucode 
> is not available
> 
> [  132.184735] amdgpu 0000:43:00.0: amdgpu: RAP: optional rap ta ucode 
> is not available
> 
> [  132.185234] amdgpu 0000:43:00.0: amdgpu: SECUREDISPLAY: securedisplay 
> ta ucode is not available
> 
> [  132.185823] amdgpu 0000:43:00.0: amdgpu: use vbios provided pptable
> 
> [  132.186327] amdgpu 0000:43:00.0: amdgpu: smc_dpm_info table 
> revision(format.content): 4.6
> 
> [  132.188783] amdgpu 0000:43:00.0: amdgpu: SMU is initialized successfully!
> 
> [  132.190039] [drm] kiq ring mec 2 pipe 1 q 0
> 
> [  132.203608] [drm] VCN decode and encode initialized 
> successfully(under DPG Mode).
> 
> [  132.204178] [drm] JPEG decode initialized successfully.
> 
> [  132.246079] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
> 
> [  132.327589] memmap_init_zone_device initialised 8388608 pages in 64ms
> 
> [  132.328139] amdgpu: HMM registered 32752MB device memory
> 
> [  132.328784] amdgpu: Virtual CRAT table created for GPU
> 
> [  132.329844] amdgpu: Topology: Add dGPU node [0x738c:0x1002]
> 
> [  132.330387] kfd kfd: amdgpu: added device 1002:738c
> 
> [  132.330965] amdgpu 0000:43:00.0: amdgpu: SE 8, SH per SE 1, CU per SH 
> 16, active_cu_number 72
> 
> [  132.331725] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.0 uses VM inv 
> eng 0 on hub 0
> 
> [  132.332296] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.0 uses VM inv 
> eng 1 on hub 0
> 
> [  132.332856] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.0 uses VM inv 
> eng 4 on hub 0
> 
> [  132.333414] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.0 uses VM inv 
> eng 5 on hub 0
> 
> [  132.333965] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.1 uses VM inv 
> eng 6 on hub 0
> 
> [  132.334507] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.1 uses VM inv 
> eng 7 on hub 0
> 
> [  132.335057] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.1 uses VM inv 
> eng 8 on hub 0
> 
> [  132.335594] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.1 uses VM inv 
> eng 9 on hub 0
> 
> [  132.336137] amdgpu 0000:43:00.0: amdgpu: ring kiq_2.1.0 uses VM inv 
> eng 10 on hub 0
> 
> [  132.336679] amdgpu 0000:43:00.0: amdgpu: ring sdma0 uses VM inv eng 0 
> on hub 1
> 
> [  132.337234] amdgpu 0000:43:00.0: amdgpu: ring sdma1 uses VM inv eng 1 
> on hub 1
> 
> [  132.337790] amdgpu 0000:43:00.0: amdgpu: ring sdma2 uses VM inv eng 4 
> on hub 1
> 
> [  132.338343] amdgpu 0000:43:00.0: amdgpu: ring sdma3 uses VM inv eng 5 
> on hub 1
> 
> [  132.338906] amdgpu 0000:43:00.0: amdgpu: ring sdma4 uses VM inv eng 6 
> on hub 1
> 
> [  132.339448] amdgpu 0000:43:00.0: amdgpu: ring sdma5 uses VM inv eng 0 
> on hub 2
> 
> [  132.339987] amdgpu 0000:43:00.0: amdgpu: ring sdma6 uses VM inv eng 1 
> on hub 2
> 
> [  132.340519] amdgpu 0000:43:00.0: amdgpu: ring sdma7 uses VM inv eng 4 
> on hub 2
> 
> [  132.341041] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_0 uses VM inv 
> eng 5 on hub 2
> 
> [  132.341570] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv 
> eng 6 on hub 2
> 
> [  132.342101] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv 
> eng 7 on hub 2
> 
> [  132.342630] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_1 uses VM inv 
> eng 8 on hub 2
> 
> [  132.343152] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv 
> eng 9 on hub 2
> 
> [  132.343657] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv 
> eng 10 on hub 2
> 
> [  132.344136] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_0 uses VM inv 
> eng 11 on hub 2
> 
> [  132.344610] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_1 uses VM inv 
> eng 12 on hub 2
> 
> [  132.378213] amdgpu: Detected AMDGPU 6 Perf Events.
> 
> [  132.387349] [drm] Initialized amdgpu 3.46.0 20150101 for 0000:43:00.0 
> on minor 1
> 
> [  132.388530] pcieport 0000:d7:00.0: bridge window [io  0x1000-0x0fff] 
> to [bus d8] add_size 1000
> 
> [  132.389078] pcieport 0000:d7:00.0: BAR 13: no space for [io  size 0x1000]
> 
> [  132.389600] pcieport 0000:d7:00.0: BAR 13: failed to assign [io  size 
> 0x1000]
> 
> [  132.390091] pcieport 0000:d7:00.0: BAR 13: no space for [io  size 0x1000]
> 
> [  132.390568] pcieport 0000:d7:00.0: BAR 13: failed to assign [io  size 
> 0x1000]
> 
> [  155.359200] HelloWorld[3824]: segfault at 68 ip 00007f4c979f764e sp 
> 00007ffc9b3bb610 error 4 in 
> libamdhip64.so.4.4.21432-f9dccde4[7f4c979b3000+2eb000]
> 
> [  155.360268] Code: 48 8b 45 e8 64 48 33 04 25 28 00 00 00 74 05 e8 b8 
> c7 fb ff 48 8b 5d f8 c9 c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45 
> f8 <48> 8b 40 68 5d c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45 f8
> 
> Best regards,
> 
> Shuotao
> 
> *From: *Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> *Date: *Wednesday, April 6, 2022 at 10:36 PM
> *To: *Shuotao Xu <shuotaoxu@microsoft.com>, 
> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
> *Cc: *Ziyue Yang <Ziyue.Yang@microsoft.com>, Lei Qu 
> <Lei.Qu@microsoft.com>, Peng Cheng <pengc@microsoft.com>, Ran Shu 
> <Ran.Shu@microsoft.com>
> *Subject: *Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
> 
> 
> Can you attach dmesg for the failure without your patch against
> amd-staging-drm-next ?
> 
> Also, in general, patches for amdgpu upstream branches should be
> submitted to amd-gfx mailing list inline using git-send which makes it
> easy to comment and review them inline.
> 
> Andrey
> 
> On 2022-04-06 10:25, Shuotao Xu wrote:
>> Hi Andrey,
>>
>> We just tried kernel 5.16 based on
>> https://gitlab.freedesktop.org/agd5f/linux.git
>> amd-staging-drm-next branch, and found out that hotplug did not work out
>> of box for Rocm compute stack.
>>
>> We did not try the rendering stack since we currently are more focused
>> on AI workloads.
>>
>> We have also created a patch against the amd-staging-drm-next branch to
>> enable hotplug for ROCM stack, which were sent in another later email
>> with same subject. I am attaching the patch in this email, in case that
>> you would want to delete that later email.
>>
>> Best regards,
>>
>> Shuotao
>>
>> *From: *Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> *Date: *Wednesday, April 6, 2022 at 10:13 PM
>> *To: *Shuotao Xu <shuotaoxu@microsoft.com>,
>> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
>> *Cc: *Ziyue Yang <Ziyue.Yang@microsoft.com>, Lei Qu
>> <Lei.Qu@microsoft.com>, Peng Cheng <pengc@microsoft.com>, Ran Shu
>> <Ran.Shu@microsoft.com>
>> *Subject: *[EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
>>
>>
>> Looks like you are using 5.13 kernel for this work, FYI we added
>> hot plug support for the graphic stack in 5.14 kernel (see
>> https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.14-AMDGPU-Hot-Unplug)
>>
>>
>> I am not sure about the code part since it all touches KFD driver (KFD
>> team can comment on that) - but I was just wondering if you try 5.14
>> kernel would things just work for you out of the box ?
>>
>> Andrey
>>
>> On 2022-04-05 22:45, Shuotao Xu wrote:
>>> Dear AMD Colleagues,
>>>
>>> We are from Microsoft Research, and are working on GPU disaggregation
>>> technology.
>>>
>>> We have created a new pull requestAdd PCIe hotplug support for amdgpu by
>>> xushuotao · Pull Request #131 · RadeonOpenCompute/ROCK-Kernel-Driver
>>> (github.com)
>>> <https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131> in
>>> ROCK-Kernel-Driver, which will enable PCIe hot-plug support for amdgpu.
>>>
>>> We believe the support of hot-plug of GPU devices can open doors for
>>> many advanced applications in data center in the next few years, and we
>>> would like to have some reviewers on this PR so we can continue further
>>> technical discussions around this feature.
>>>
>>> Would you please help review this PR?
>>>
>>> Thank you very much!
>>>
>>> Best regards,
>>>
>>> Shuotao Xu
>>>
>>
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
  2022-04-06 15:11       ` Shuotao Xu
  2022-04-06 15:39         ` Andrey Grodzovsky
@ 2022-04-07  2:22         ` Joshi, Mukul
  2022-04-07 15:08           ` Shuotao Xu
  1 sibling, 1 reply; 13+ messages in thread
From: Joshi, Mukul @ 2022-04-07  2:22 UTC (permalink / raw)
  To: Shuotao Xu, Grodzovsky, Andrey, amd-gfx
  Cc: Ziyue Yang, Lei Qu, Peng Cheng, Ran Shu

[-- Attachment #1: Type: text/plain, Size: 28097 bytes --]


Hi Shuotao,

Thanks for your patch.
I have been working on something similar, as I also found that we don't clean up IO links upon device removal.

The IO-links cleanup change in your patch would work only on a single-GPU system, or on a multi-GPU system where the last node (in the sysfs topology) is hot-plugged out. That's because of the way the atomic counter, topology_crat_proximity_domain, is used in the code.

I have a patch which takes care of these issues on a multi-GPU system.
I should be able to send that out for review soon.

Thanks,
Mukul

From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Shuotao Xu
Sent: Wednesday, April 6, 2022 11:12 AM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com>; Lei Qu <Lei.Qu@microsoft.com>; Peng Cheng <pengc@microsoft.com>; Ran Shu <Ran.Shu@microsoft.com>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

Hi Andrey,

Thanks for your kind comment on the Linux patch submission protocol; please let me know if there is any way to rectify it.

dmesg is fine, except for some warnings during PCI rescan after PCI removal of an AMD MI100.
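For reference, a minimal sketch of the remove/rescan sequence involved, assuming it is driven through the standard PCI sysfs interface with the MI100 at 0000:43:00.0 (run as root):

# hot-remove the GPU from the PCI tree; amdgpu/kfd tear the device down
echo 1 > /sys/bus/pci/devices/0000:43:00.0/remove
# later, rediscover it with a rescan (a write to the PCI rescan attribute,
# the rescan_store path seen in the trace below)
echo 1 > /sys/bus/pci/rescan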

The issue is that after this, a ROCm application will segfault with the amdgpu driver unless the entire amdgpu kernel module is unloaded and reloaded, which does not meet our hotplug requirement. Upon investigation, the issues are:
1) kfd_lock is left locked after hotplug, so kfd_open returns a fault right away to libhsakmt.
2) iolink/p2plink has anomalies after hotplug, and libhsakmt detects these anomalies and returns an error.

Our patch has been tested on a system with a single AMD MI100 GPU and shown to work there.

I am attaching the dmesg after rescan anyway, which will show the warning and the segfault.

[  132.054822] pci 0000:43:00.0: [1002:738c] type 00 class 0x038000
[  132.054856] pci 0000:43:00.0: reg 0x10: [mem 0x38b000000000-0x38b7ffffffff 64bit pref]
[  132.054877] pci 0000:43:00.0: reg 0x18: [mem 0x38b800000000-0x38b8001fffff 64bit pref]
[  132.054890] pci 0000:43:00.0: reg 0x20: [io  0xa000-0xa0ff]
[  132.054904] pci 0000:43:00.0: reg 0x24: [mem 0xb8400000-0xb847ffff]
[  132.054918] pci 0000:43:00.0: reg 0x30: [mem 0xb8480000-0xb849ffff pref]
[  132.055134] pci 0000:43:00.0: PME# supported from D1 D2 D3hot D3cold
[  132.055217] pci 0000:43:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:3c:14.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[  132.056001] pci 0000:43:00.0: Adding to iommu group 73
[  132.057943] pci 0000:43:00.0: BAR 0: assigned [mem 0x38b000000000-0x38b7ffffffff 64bit pref]
[  132.057960] pci 0000:43:00.0: BAR 2: assigned [mem 0x38b800000000-0x38b8001fffff 64bit pref]
[  132.057974] pci 0000:43:00.0: BAR 5: assigned [mem 0xb8400000-0xb847ffff]
[  132.057981] pci 0000:43:00.0: BAR 6: assigned [mem 0xb8480000-0xb849ffff pref]
[  132.057984] pci 0000:43:00.0: BAR 4: assigned [io  0xa000-0xa0ff]

[  132.058429] ======================================================
[  132.058453] WARNING: possible circular locking dependency detected
[  132.058477] 5.16.0-kfd+ #1 Not tainted
[  132.058492] ------------------------------------------------------
[  132.058515] bash/3632 is trying to acquire lock:
[  132.058534] ffffadee20adfb50 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: __flush_work+0x2f5/0x470
[  132.058554] [drm] initializing kernel modesetting (ARCTURUS 0x1002:0x738C 0x1002:0x0C34 0x01).
[  132.058577]
               but task is already holding lock:
[  132.058580] ffffffffa3c62308
[  132.058630] amdgpu 0000:43:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[  132.058638]  (
[  132.058678] [drm] register mmio base: 0xB8400000
[  132.058683] pci_rescan_remove_lock
[  132.058694] [drm] register mmio size: 524288
[  132.058713] ){+.+.}-{3:3}, at: rescan_store+0x55/0x90
[  132.058773]
               which lock already depends on the new lock.

[  132.058804]
               the existing dependency chain (in reverse order) is:
[  132.058819] [drm] add ip block number 0 <soc15_common>
[  132.058831]
               -> #1 (
[  132.058854] [drm] add ip block number 1 <gmc_v9_0>
[  132.058858] [drm] add ip block number 2 <vega20_ih>
[  132.058874] pci_rescan_remove_lock
[  132.058894] [drm] add ip block number 3 <psp>
[  132.058915] ){+.+.}-{3:3}
[  132.058931] [drm] add ip block number 4 <smu>
[  132.058951] :
[  132.058965] [drm] add ip block number 5 <gfx_v9_0>
[  132.058986]        __mutex_lock+0xa4/0x990
[  132.058996] [drm] add ip block number 6 <sdma_v4_0>
[  132.059016]        i801_add_tco_spt.isra.20+0x2a/0x1a0
[  132.059033] [drm] add ip block number 7 <vcn_v2_5>
[  132.059054]        i801_add_tco+0xf6/0x110
[  132.059075] [drm] add ip block number 8 <jpeg_v2_5>
[  132.059096]        i801_probe+0x402/0x860
[  132.059151]        local_pci_probe+0x40/0x90
[  132.059170]        work_for_cpu_fn+0x10/0x20
[  132.059189]        process_one_work+0x2a4/0x640
[  132.059208]        worker_thread+0x228/0x3f0
[  132.059227]        kthread+0x16d/0x1a0
[  132.059795]        ret_from_fork+0x1f/0x30
[  132.060337]
               -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
[  132.061405]        __lock_acquire+0x1552/0x1ac0
[  132.061950]        lock_acquire+0x26c/0x300
[  132.062484]        __flush_work+0x315/0x470
[  132.063009]        work_on_cpu+0x98/0xc0
[  132.063526]        pci_device_probe+0x1bc/0x1d0
[  132.064036]        really_probe+0x102/0x450
[  132.064532]        __driver_probe_device+0x100/0x170
[  132.065020]        driver_probe_device+0x1f/0xa0
[  132.065497]        __device_attach_driver+0x6b/0xe0
[  132.065975]        bus_for_each_drv+0x6a/0xb0
[  132.066449]        __device_attach+0xe2/0x160
[  132.066912]        pci_bus_add_device+0x4a/0x80
[  132.067365]        pci_bus_add_devices+0x2c/0x70
[  132.067812]        pci_bus_add_devices+0x65/0x70
[  132.068253]        pci_bus_add_devices+0x65/0x70
[  132.068688]        pci_bus_add_devices+0x65/0x70
[  132.068936] amdgpu 0000:43:00.0: amdgpu: Fetched VBIOS from ROM BAR
[  132.069109]        pci_bus_add_devices+0x65/0x70
[  132.069602] amdgpu: ATOM BIOS: 113-D3431401-X00
[  132.070058]        pci_bus_add_devices+0x65/0x70
[  132.070572] [drm] VCN(0) decode is enabled in VM mode
[  132.070997]        pci_rescan_bus+0x23/0x30
[  132.071000]        rescan_store+0x61/0x90
[  132.071003]        kernfs_fop_write_iter+0x132/0x1b0
[  132.071501] [drm] VCN(1) decode is enabled in VM mode
[  132.071964]        new_sync_write+0x11f/0x1b0
[  132.072432] [drm] VCN(0) encode is enabled in VM mode
[  132.072900]        vfs_write+0x35b/0x3b0
[  132.073376] [drm] VCN(1) encode is enabled in VM mode
[  132.073847]        ksys_write+0xa7/0xe0
[  132.074335] [drm] JPEG(0) JPEG decode is enabled in VM mode
[  132.074803]        do_syscall_64+0x34/0x80
[  132.074808]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  132.074811]
               other info that might help us debug this:

[  132.074813]  Possible unsafe locking scenario:

[  132.075302] [drm] JPEG(1) JPEG decode is enabled in VM mode
[  132.075779]        CPU0                    CPU1
[  132.076361] amdgpu 0000:43:00.0: amdgpu: MEM ECC is active.
[  132.076765]        ----                    ----
[  132.077265] amdgpu 0000:43:00.0: amdgpu: SRAM ECC is active.
[  132.078649]   lock(pci_rescan_remove_lock);
[  132.078652]                                lock((work_completion)(&wfc.work));
[  132.078653]                                lock(pci_rescan_remove_lock);
[  132.078655]   lock((work_completion)(&wfc.work));
[  132.078656]
                *** DEADLOCK ***

[  132.078656] 5 locks held by bash/3632:
[  132.078658]  #0: ffff9c39c7b89438
[  132.079612] amdgpu 0000:43:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
[  132.080089]  (
[  132.080602] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[  132.081082] sb_writers
[  132.081601] amdgpu 0000:43:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
[  132.082102] #6
[  132.082630] amdgpu 0000:43:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[  132.083152] ){.+.+}-{0:0}
[  132.083687] amdgpu 0000:43:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
[  132.084210] , at: ksys_write+0xa7/0xe0
[  132.085708] [drm] Detected VRAM RAM=32752M, BAR=32768M
[  132.086205]  #1:
[  132.086733] [drm] RAM width 4096bits HBM
[  132.087269] ffff9c5959011088
[  132.087890] [drm] amdgpu: 32752M of VRAM memory ready
[  132.088389]  (
[  132.088972] [drm] amdgpu: 32752M of GTT memory ready.
[  132.089572] &of->mutex
[  132.090206] [drm] GART: num cpu pages 131072, num gpu pages 131072
[  132.090804] ){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x103/0x1b0
[  132.090808]  #2: ffff9c39c882a9e0 (kn->active#423){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x10c/0x1b0
[  132.091639] [drm] PCIE GART of 512M enabled.
[  132.092117]  #3:
[  132.092801] [drm] PTB located at 0x0000008000000000
[  132.093480] ffffffffa3c62308
[  132.094566] amdgpu 0000:43:00.0: amdgpu: PSP runtime database doesn't exist
[  132.094822]  (pci_rescan_remove_lock){+.+.}-{3:3}, at: rescan_store+0x55/0x90
[  132.094827]  #4: ffff9c597392b248 (&dev->mutex){....}-{3:3}, at: __device_attach+0x39/0x160
[  132.094835]
               stack backtrace:
[  132.097098] [drm] Found VCN firmware Version ENC: 1.1 DEC: 1 VEP: 0 Revision: 21
[  132.097467] CPU: 47 PID: 3632 Comm: bash Not tainted 5.16.0-kfd+ #1
[  132.098169] amdgpu 0000:43:00.0: amdgpu: Will use PSP to load VCN firmware
[  132.098839] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  132.098841] Call Trace:
[  132.098842]  <TASK>
[  132.098843]  dump_stack_lvl+0x44/0x57
[  132.098848]  check_noncircular+0x105/0x120
[  132.098853]  ? unwind_get_return_address+0x1b/0x30
[  132.112924]  ? register_lock_class+0x46/0x780
[  132.113630]  ? __lock_acquire+0x1552/0x1ac0
[  132.114342]  __lock_acquire+0x1552/0x1ac0
[  132.115050]  lock_acquire+0x26c/0x300
[  132.115755]  ? __flush_work+0x2f5/0x470
[  132.116460]  ? lock_is_held_type+0xdf/0x130
[  132.117177]  __flush_work+0x315/0x470
[  132.117890]  ? __flush_work+0x2f5/0x470
[  132.118604]  ? lock_is_held_type+0xdf/0x130
[  132.119305]  ? mark_held_locks+0x49/0x70
[  132.119981]  ? queue_work_on+0x2f/0x70
[  132.120645]  ? lockdep_hardirqs_on+0x79/0x100
[  132.121300]  work_on_cpu+0x98/0xc0
[  132.121702] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR
[  132.121947]  ? __traceiter_workqueue_execute_end+0x40/0x40
[  132.123270]  ? pci_device_shutdown+0x60/0x60
[  132.123880]  pci_device_probe+0x1bc/0x1d0
[  132.124475]  really_probe+0x102/0x450
[  132.125060]  __driver_probe_device+0x100/0x170
[  132.125641]  driver_probe_device+0x1f/0xa0
[  132.126215]  __device_attach_driver+0x6b/0xe0
[  132.126797]  ? driver_allows_async_probing+0x50/0x50
[  132.127383]  ? driver_allows_async_probing+0x50/0x50
[  132.127960]  bus_for_each_drv+0x6a/0xb0
[  132.128528]  __device_attach+0xe2/0x160
[  132.129095]  pci_bus_add_device+0x4a/0x80
[  132.129659]  pci_bus_add_devices+0x2c/0x70
[  132.130213]  pci_bus_add_devices+0x65/0x70
[  132.130753]  pci_bus_add_devices+0x65/0x70
[  132.131283]  pci_bus_add_devices+0x65/0x70
[  132.131780]  pci_bus_add_devices+0x65/0x70
[  132.132270]  pci_bus_add_devices+0x65/0x70
[  132.132757]  pci_rescan_bus+0x23/0x30
[  132.133233]  rescan_store+0x61/0x90
[  132.133701]  kernfs_fop_write_iter+0x132/0x1b0
[  132.134167]  new_sync_write+0x11f/0x1b0
[  132.134627]  vfs_write+0x35b/0x3b0
[  132.135062]  ksys_write+0xa7/0xe0
[  132.135503]  do_syscall_64+0x34/0x80
[  132.135933]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  132.136358] RIP: 0033:0x7f0058a73224
[  132.136775] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 8d 05 c1 07 2e 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
[  132.137663] RSP: 002b:00007ffc4f4c71a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  132.138121] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f0058a73224
[  132.138590] RDX: 0000000000000002 RSI: 000055d466c24450 RDI: 0000000000000001
[  132.139064] RBP: 000055d466c24450 R08: 000000000000000a R09: 0000000000000001
[  132.139532] R10: 000000000000000a R11: 0000000000000246 R12: 00007f0058d4f760
[  132.140003] R13: 0000000000000002 R14: 00007f0058d4b2a0 R15: 00007f0058d4a760
[  132.140485]  </TASK>
[  132.183669] amdgpu 0000:43:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
[  132.184214] amdgpu 0000:43:00.0: amdgpu: DTM: optional dtm ta ucode is not available
[  132.184735] amdgpu 0000:43:00.0: amdgpu: RAP: optional rap ta ucode is not available
[  132.185234] amdgpu 0000:43:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[  132.185823] amdgpu 0000:43:00.0: amdgpu: use vbios provided pptable
[  132.186327] amdgpu 0000:43:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.6
[  132.188783] amdgpu 0000:43:00.0: amdgpu: SMU is initialized successfully!
[  132.190039] [drm] kiq ring mec 2 pipe 1 q 0
[  132.203608] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[  132.204178] [drm] JPEG decode initialized successfully.
[  132.246079] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[  132.327589] memmap_init_zone_device initialised 8388608 pages in 64ms
[  132.328139] amdgpu: HMM registered 32752MB device memory
[  132.328784] amdgpu: Virtual CRAT table created for GPU
[  132.329844] amdgpu: Topology: Add dGPU node [0x738c:0x1002]
[  132.330387] kfd kfd: amdgpu: added device 1002:738c
[  132.330965] amdgpu 0000:43:00.0: amdgpu: SE 8, SH per SE 1, CU per SH 16, active_cu_number 72
[  132.331725] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 0 on hub 0
[  132.332296] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 1 on hub 0
[  132.332856] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 4 on hub 0
[  132.333414] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 5 on hub 0
[  132.333965] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 6 on hub 0
[  132.334507] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 7 on hub 0
[  132.335057] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 8 on hub 0
[  132.335594] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 9 on hub 0
[  132.336137] amdgpu 0000:43:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 10 on hub 0
[  132.336679] amdgpu 0000:43:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
[  132.337234] amdgpu 0000:43:00.0: amdgpu: ring sdma1 uses VM inv eng 1 on hub 1
[  132.337790] amdgpu 0000:43:00.0: amdgpu: ring sdma2 uses VM inv eng 4 on hub 1
[  132.338343] amdgpu 0000:43:00.0: amdgpu: ring sdma3 uses VM inv eng 5 on hub 1
[  132.338906] amdgpu 0000:43:00.0: amdgpu: ring sdma4 uses VM inv eng 6 on hub 1
[  132.339448] amdgpu 0000:43:00.0: amdgpu: ring sdma5 uses VM inv eng 0 on hub 2
[  132.339987] amdgpu 0000:43:00.0: amdgpu: ring sdma6 uses VM inv eng 1 on hub 2
[  132.340519] amdgpu 0000:43:00.0: amdgpu: ring sdma7 uses VM inv eng 4 on hub 2
[  132.341041] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 5 on hub 2
[  132.341570] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 6 on hub 2
[  132.342101] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 7 on hub 2
[  132.342630] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_1 uses VM inv eng 8 on hub 2
[  132.343152] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 9 on hub 2
[  132.343657] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv eng 10 on hub 2
[  132.344136] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_0 uses VM inv eng 11 on hub 2
[  132.344610] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_1 uses VM inv eng 12 on hub 2
[  132.378213] amdgpu: Detected AMDGPU 6 Perf Events.
[  132.387349] [drm] Initialized amdgpu 3.46.0 20150101 for 0000:43:00.0 on minor 1
[  132.388530] pcieport 0000:d7:00.0: bridge window [io  0x1000-0x0fff] to [bus d8] add_size 1000
[  132.389078] pcieport 0000:d7:00.0: BAR 13: no space for [io  size 0x1000]
[  132.389600] pcieport 0000:d7:00.0: BAR 13: failed to assign [io  size 0x1000]
[  132.390091] pcieport 0000:d7:00.0: BAR 13: no space for [io  size 0x1000]
[  132.390568] pcieport 0000:d7:00.0: BAR 13: failed to assign [io  size 0x1000]
[  155.359200] HelloWorld[3824]: segfault at 68 ip 00007f4c979f764e sp 00007ffc9b3bb610 error 4 in libamdhip64.so.4.4.21432-f9dccde4[7f4c979b3000+2eb000]
[  155.360268] Code: 48 8b 45 e8 64 48 33 04 25 28 00 00 00 74 05 e8 b8 c7 fb ff 48 8b 5d f8 c9 c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45 f8 <48> 8b 40 68 5d c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45 f8

Best regards,
Shuotao

From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Date: Wednesday, April 6, 2022 at 10:36 PM
To: Shuotao Xu <shuotaoxu@microsoft.com>, amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com>, Lei Qu <Lei.Qu@microsoft.com>, Peng Cheng <pengc@microsoft.com>, Ran Shu <Ran.Shu@microsoft.com>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

Can you attach dmesg for the failure without your patch against
amd-staging-drm-next ?

Also, in general, patches for amdgpu upstream branches should be
submitted to amd-gfx mailing list inline using git-send which makes it
easy to comment and review them inline.

Andrey

On 2022-04-06 10:25, Shuotao Xu wrote:
> Hi Andrey,
>
> We just tried kernel 5.16 based on
> https://gitlab.freedesktop.org/agd5f/linux.git
> amd-staging-drm-next branch, and found out that hotplug did not work out
> of box for Rocm compute stack.
>
> We did not try the rendering stack since we currently are more focused
> on AI workloads.
>
> We have also created a patch against the amd-staging-drm-next branch to
> enable hotplug for ROCM stack, which were sent in another later email
> with same subject. I am attaching the patch in this email, in case that
> you would want to delete that later email.
>
> Best regards,
>
> Shuotao
>
> *From: *Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> *Date: *Wednesday, April 6, 2022 at 10:13 PM
> *To: *Shuotao Xu <shuotaoxu@microsoft.com>,
> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
> *Cc: *Ziyue Yang <Ziyue.Yang@microsoft.com>, Lei Qu
> <Lei.Qu@microsoft.com>, Peng Cheng <pengc@microsoft.com>, Ran Shu
> <Ran.Shu@microsoft.com>
> *Subject: *[EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
>
>
> Looks like you are using 5.13 kernel for this work, FYI we added
> hot plug support for the graphic stack in 5.14 kernel (see
> https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.14-AMDGPU-Hot-Unplug)
>
>
> I am not sure about the code part since it all touches KFD driver (KFD
> team can comment on that) - but I was just wondering if you try 5.14
> kernel would things just work for you out of the box ?
>
> Andrey
>
> On 2022-04-05 22:45, Shuotao Xu wrote:
>> Dear AMD Colleagues,
>>
>> We are from Microsoft Research, and are working on GPU disaggregation
>> technology.
>>
>> We have created a new pull requestAdd PCIe hotplug support for amdgpu by
>> xushuotao * Pull Request #131 * RadeonOpenCompute/ROCK-Kernel-Driver
>> (github.com)
>> <https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131> in
>> ROCK-Kernel-Driver, which will enable PCIe hot-plug support for amdgpu.
>>
>> We believe the support of hot-plug of GPU devices can open doors for
>> many advanced applications in data center in the next few years, and we
>> would like to have some reviewers on this PR so we can continue further
>> technical discussions around this feature.
>>
>> Would you please help review this PR?
>>
>> Thank you very much!
>>
>> Best regards,
>>
>> Shuotao Xu
>>
>

[-- Attachment #2: Type: text/html, Size: 57498 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [EXTERNAL] Code Review Request for AMDGPU Hotplug Support
  2022-04-06 15:39         ` Andrey Grodzovsky
@ 2022-04-07  2:55           ` Shuotao Xu
  0 siblings, 0 replies; 13+ messages in thread
From: Shuotao Xu @ 2022-04-07  2:55 UTC (permalink / raw)
  To: Andrey Grodzovsky, Mukul.Joshi
  Cc: Kuehling, Felix, Peng Cheng, amd-gfx, Lei Qu, Ran Shu, Ziyue Yang

[-- Attachment #1: Type: text/plain, Size: 31157 bytes --]

Please see my inline comments. :)


On Apr 6, 2022, at 11:39 PM, Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:

+ Felix

On 2022-04-06 11:11, Shuotao Xu wrote:
Hi Andrey,
Thanks for your kind comment on the Linux patch submission protocol; please let me know if there is any way to rectify it.

Just resend your patch to the amd-gfx mailing list using
git-send (see here for how to use it: https://burzalodowa.wordpress.com/2013/10/05/how-to-send-patches-with-git-send-email/)

I suggest adding --cover-letter so you will be able to explain the
story behind the patch.

Yes, thanks for the introduction. We will follow the protocol to resubmit the patch.

dmesg is fine, except for some warnings during PCI rescan after PCI removal of an AMD MI100.
The issue is that after this, a ROCm application will segfault with the amdgpu driver unless the entire amdgpu kernel module is unloaded and reloaded, which does not meet our hotplug requirement. Upon investigation, the issues are:
1) kfd_lock is left locked after hotplug, so kfd_open returns a fault right away to libhsakmt.

I see now: kfd_lock is static, and so a single instance across all devices,
which does not go away after device removal but only after driver unload.
In this case I am not sure it's the best idea to just decrement kfd_lock
on device init, since on a multi-GPU system it might be locked on purpose
because another device is going through a reset, for example, right at this moment.

Felix, kgd2kfd_suspend is also called during device PCI remove, meaning an
unbalanced decrement of the lock. Maybe we should not decrement it; I suggest
adding a drm_dev_enter guard in kgd2kfd_suspend to avoid the decrement of
kfd_locked if we are in the middle of a PCI remove.

Yes, I don't fully understand kfd_lock's purpose, and I saw that the driver never decrements the lock after hotplug-out.
This is just a temporary fix on my part, and I think AMD has a better understanding of what a correct fix would look like.
Ideally, we would like to flexibly add and remove GPU nodes in a bare-metal system, which enables autoscaling of GPU nodes.
This is one of the key features we are working on for GPU disaggregation.

2) iolink/p2plink has anomalies after hotplug, and libhsakmt detects these anomalies and returns an error.

Can you point to what abnormalities? As part of PCI hot unplug we clean up
all the sysfs files, and this looks like part of that. Do you see a "sysfs file
already exists" error on the next pci_rescan?
The following demonstrates the anomalies:
We have three kfd nodes in the system (a quick sysfs check for this is sketched right after this list):
0: cpu
1: cpu
2: amdgpu
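A minimal sysfs check for the node types, assuming the usual convention that CPU-only nodes report simd_count 0 and GPU nodes report a non-zero value:

for n in /sys/devices/virtual/kfd/kfd/topology/nodes/*; do
        # print the node index followed by its simd_count line
        printf 'node %s: ' "$(basename $n)"
        grep '^simd_count ' "$n/properties"
done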

--Before hotplug out: there is an io-link between 0 and 2
$cat /sys/devices/virtual/kfd/kfd/topology/nodes/2/io_links/0/properties
type 2
version_major 0
version_minor 0
node_from 2
node_to 0
weight 20
min_latency 0
max_latency 0
min_bandwidth 312
max_bandwidth 8000
recommended_transfer_size 0
flags 1

--After hotplug out:
Anomaly 0: this io-link between 0 and 2 should have been removed, but it is not
$ cat /sys/devices/virtual/kfd/kfd/topology/nodes/0/io_links/1/properties
type 2
version_major 0
version_minor 0
node_from 0
node_to 2
weight 20
min_latency 0
max_latency 0
min_bandwidth 312
max_bandwidth 8000
recommended_transfer_size 0
flags 3

--After hotplug in:
Anomaly 1: another io-link between 0 and 2 is created.
Anomaly 2: the node id of the GPU node (formerly 2) is now 3, when it should have been 2, although libhsakmt won't pick this error up.
                   This is because of the atomic counter topology_crat_proximity_domain, as mentioned by Joshi.
$ cat /sys/devices/virtual/kfd/kfd/topology/nodes/0/io_links/2/properties
type 2
version_major 0
version_minor 0
node_from 0
node_to 3
weight 20
min_latency 0
max_latency 0
min_bandwidth 312
max_bandwidth 8000
recommended_transfer_size 0
flags 3

Anomaly 3: node_from should have been 2 instead of 3.
                   libhsakmt will pick up this anomaly and return an error when taking the topology snapshot.
$ cat /sys/devices/virtual/kfd/kfd/topology/nodes/2/io_links/0/properties
type 2
version_major 0
version_minor 0
node_from 3
node_to 0
weight 20
min_latency 0
max_latency 0
min_bandwidth 312
max_bandwidth 8000
recommended_transfer_size 0
flags 1

Our patch at least fixes those three anomalies in the kfd topology on a single-GPU system.
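For completeness, the node_from/node_to view used above can be collected in one shot after each plug event (same sysfs paths as in the dumps):

for f in /sys/devices/virtual/kfd/kfd/topology/nodes/*/io_links/*/properties; do
        # dump the from/to endpoints of every io_link in the topology
        echo "== $f"
        grep -E '^node_(from|to) ' "$f"
done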


Our patch has been tested on a system with a single AMD MI100 GPU and shown to work there.

Exactly; on a multi-GPU system, arbitrarily decrementing kfd_lock on device load
can be problematic.
I was hoping AMD would do the correct fix; the driver code is quite convoluted from an outsider's perspective.
Reverse-engineering it without much documentation has been quite painful for me.


Andrey
Thank you very much for the prompt discussion! We really appreciate it!

Best regards,
Shuotao


I am attaching the dmesg after rescan anyway, which will show the warning and the segfault.
[  132.054822] pci 0000:43:00.0: [1002:738c] type 00 class 0x038000
[  132.054856] pci 0000:43:00.0: reg 0x10: [mem 0x38b000000000-0x38b7ffffffff 64bit pref]
[  132.054877] pci 0000:43:00.0: reg 0x18: [mem 0x38b800000000-0x38b8001fffff 64bit pref]
[  132.054890] pci 0000:43:00.0: reg 0x20: [io  0xa000-0xa0ff]
[  132.054904] pci 0000:43:00.0: reg 0x24: [mem 0xb8400000-0xb847ffff]
[  132.054918] pci 0000:43:00.0: reg 0x30: [mem 0xb8480000-0xb849ffff pref]
[  132.055134] pci 0000:43:00.0: PME# supported from D1 D2 D3hot D3cold
[  132.055217] pci 0000:43:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:3c:14.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[  132.056001] pci 0000:43:00.0: Adding to iommu group 73
[  132.057943] pci 0000:43:00.0: BAR 0: assigned [mem 0x38b000000000-0x38b7ffffffff 64bit pref]
[  132.057960] pci 0000:43:00.0: BAR 2: assigned [mem 0x38b800000000-0x38b8001fffff 64bit pref]
[  132.057974] pci 0000:43:00.0: BAR 5: assigned [mem 0xb8400000-0xb847ffff]
[  132.057981] pci 0000:43:00.0: BAR 6: assigned [mem 0xb8480000-0xb849ffff pref]
[  132.057984] pci 0000:43:00.0: BAR 4: assigned [io  0xa000-0xa0ff]
[  132.058429] ======================================================
[  132.058453] WARNING: possible circular locking dependency detected
[  132.058477] 5.16.0-kfd+ #1 Not tainted
[  132.058492] ------------------------------------------------------
[  132.058515] bash/3632 is trying to acquire lock:
[  132.058534] ffffadee20adfb50 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: __flush_work+0x2f5/0x470
[  132.058554] [drm] initializing kernel modesetting (ARCTURUS 0x1002:0x738C 0x1002:0x0C34 0x01).
[  132.058577]
               but task is already holding lock:
[  132.058580] ffffffffa3c62308
[  132.058630] amdgpu 0000:43:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[  132.058638]  (
[  132.058678] [drm] register mmio base: 0xB8400000
[  132.058683] pci_rescan_remove_lock
[  132.058694] [drm] register mmio size: 524288
[  132.058713] ){+.+.}-{3:3}, at: rescan_store+0x55/0x90
[  132.058773]
               which lock already depends on the new lock.
[  132.058804]
               the existing dependency chain (in reverse order) is:
[  132.058819] [drm] add ip block number 0 <soc15_common>
[  132.058831]
               -> #1 (
[  132.058854] [drm] add ip block number 1 <gmc_v9_0>
[  132.058858] [drm] add ip block number 2 <vega20_ih>
[  132.058874] pci_rescan_remove_lock
[  132.058894] [drm] add ip block number 3 <psp>
[  132.058915] ){+.+.}-{3:3}
[  132.058931] [drm] add ip block number 4 <smu>
[  132.058951] :
[  132.058965] [drm] add ip block number 5 <gfx_v9_0>
[  132.058986]        __mutex_lock+0xa4/0x990
[  132.058996] [drm] add ip block number 6 <sdma_v4_0>
[  132.059016]        i801_add_tco_spt.isra.20+0x2a/0x1a0
[  132.059033] [drm] add ip block number 7 <vcn_v2_5>
[  132.059054]        i801_add_tco+0xf6/0x110
[  132.059075] [drm] add ip block number 8 <jpeg_v2_5>
[  132.059096]        i801_probe+0x402/0x860
[  132.059151]        local_pci_probe+0x40/0x90
[  132.059170]        work_for_cpu_fn+0x10/0x20
[  132.059189]        process_one_work+0x2a4/0x640
[  132.059208]        worker_thread+0x228/0x3f0
[  132.059227]        kthread+0x16d/0x1a0
[  132.059795]        ret_from_fork+0x1f/0x30
[  132.060337]
               -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
[  132.061405]        __lock_acquire+0x1552/0x1ac0
[  132.061950]        lock_acquire+0x26c/0x300
[  132.062484]        __flush_work+0x315/0x470
[  132.063009]        work_on_cpu+0x98/0xc0
[  132.063526]        pci_device_probe+0x1bc/0x1d0
[  132.064036]        really_probe+0x102/0x450
[  132.064532]        __driver_probe_device+0x100/0x170
[  132.065020]        driver_probe_device+0x1f/0xa0
[  132.065497]        __device_attach_driver+0x6b/0xe0
[  132.065975]        bus_for_each_drv+0x6a/0xb0
[  132.066449]        __device_attach+0xe2/0x160
[  132.066912]        pci_bus_add_device+0x4a/0x80
[  132.067365]        pci_bus_add_devices+0x2c/0x70
[  132.067812]        pci_bus_add_devices+0x65/0x70
[  132.068253]        pci_bus_add_devices+0x65/0x70
[  132.068688]        pci_bus_add_devices+0x65/0x70
[  132.068936] amdgpu 0000:43:00.0: amdgpu: Fetched VBIOS from ROM BAR
[  132.069109]        pci_bus_add_devices+0x65/0x70
[  132.069602] amdgpu: ATOM BIOS: 113-D3431401-X00
[  132.070058]        pci_bus_add_devices+0x65/0x70
[  132.070572] [drm] VCN(0) decode is enabled in VM mode
[  132.070997]        pci_rescan_bus+0x23/0x30
[  132.071000]        rescan_store+0x61/0x90
[  132.071003]        kernfs_fop_write_iter+0x132/0x1b0
[  132.071501] [drm] VCN(1) decode is enabled in VM mode
[  132.071964]        new_sync_write+0x11f/0x1b0
[  132.072432] [drm] VCN(0) encode is enabled in VM mode
[  132.072900]        vfs_write+0x35b/0x3b0
[  132.073376] [drm] VCN(1) encode is enabled in VM mode
[  132.073847]        ksys_write+0xa7/0xe0
[  132.074335] [drm] JPEG(0) JPEG decode is enabled in VM mode
[  132.074803]        do_syscall_64+0x34/0x80
[  132.074808]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  132.074811]
               other info that might help us debug this:
[  132.074813]  Possible unsafe locking scenario:
[  132.075302] [drm] JPEG(1) JPEG decode is enabled in VM mode
[  132.075779]        CPU0                    CPU1
[  132.076361] amdgpu 0000:43:00.0: amdgpu: MEM ECC is active.
[  132.076765]        ----                    ----
[  132.077265] amdgpu 0000:43:00.0: amdgpu: SRAM ECC is active.
[  132.078649]   lock(pci_rescan_remove_lock);
[  132.078652]                                lock((work_completion)(&wfc.work));
[  132.078653]                                lock(pci_rescan_remove_lock);
[  132.078655]   lock((work_completion)(&wfc.work));
[  132.078656]
                *** DEADLOCK ***
[  132.078656] 5 locks held by bash/3632:
[  132.078658]  #0: ffff9c39c7b89438
[  132.079612] amdgpu 0000:43:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
[  132.080089]  (
[  132.080602] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[  132.081082] sb_writers
[  132.081601] amdgpu 0000:43:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
[  132.082102] #6
[  132.082630] amdgpu 0000:43:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[  132.083152] ){.+.+}-{0:0}
[  132.083687] amdgpu 0000:43:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
[  132.084210] , at: ksys_write+0xa7/0xe0
[  132.085708] [drm] Detected VRAM RAM=32752M, BAR=32768M
[  132.086205]  #1:
[  132.086733] [drm] RAM width 4096bits HBM
[  132.087269] ffff9c5959011088
[  132.087890] [drm] amdgpu: 32752M of VRAM memory ready
[  132.088389]  (
[  132.088972] [drm] amdgpu: 32752M of GTT memory ready.
[  132.089572] &of->mutex
[  132.090206] [drm] GART: num cpu pages 131072, num gpu pages 131072
[  132.090804] ){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x103/0x1b0
[  132.090808]  #2: ffff9c39c882a9e0 (kn->active#423){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x10c/0x1b0
[  132.091639] [drm] PCIE GART of 512M enabled.
[  132.092117]  #3:
[  132.092801] [drm] PTB located at 0x0000008000000000
[  132.093480] ffffffffa3c62308
[  132.094566] amdgpu 0000:43:00.0: amdgpu: PSP runtime database doesn't exist
[  132.094822]  (pci_rescan_remove_lock){+.+.}-{3:3}, at: rescan_store+0x55/0x90
[  132.094827]  #4: ffff9c597392b248 (&dev->mutex){....}-{3:3}, at: __device_attach+0x39/0x160
[  132.094835]
               stack backtrace:
[  132.097098] [drm] Found VCN firmware Version ENC: 1.1 DEC: 1 VEP: 0 Revision: 21
[  132.097467] CPU: 47 PID: 3632 Comm: bash Not tainted 5.16.0-kfd+ #1
[  132.098169] amdgpu 0000:43:00.0: amdgpu: Will use PSP to load VCN firmware
[  132.098839] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  132.098841] Call Trace:
[  132.098842]  <TASK>
[  132.098843]  dump_stack_lvl+0x44/0x57
[  132.098848]  check_noncircular+0x105/0x120
[  132.098853]  ? unwind_get_return_address+0x1b/0x30
[  132.112924]  ? register_lock_class+0x46/0x780
[  132.113630]  ? __lock_acquire+0x1552/0x1ac0
[  132.114342]  __lock_acquire+0x1552/0x1ac0
[  132.115050]  lock_acquire+0x26c/0x300
[  132.115755]  ? __flush_work+0x2f5/0x470
[  132.116460]  ? lock_is_held_type+0xdf/0x130
[  132.117177]  __flush_work+0x315/0x470
[  132.117890]  ? __flush_work+0x2f5/0x470
[  132.118604]  ? lock_is_held_type+0xdf/0x130
[  132.119305]  ? mark_held_locks+0x49/0x70
[  132.119981]  ? queue_work_on+0x2f/0x70
[  132.120645]  ? lockdep_hardirqs_on+0x79/0x100
[  132.121300]  work_on_cpu+0x98/0xc0
[  132.121702] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR
[  132.121947]  ? __traceiter_workqueue_execute_end+0x40/0x40
[  132.123270]  ? pci_device_shutdown+0x60/0x60
[  132.123880]  pci_device_probe+0x1bc/0x1d0
[  132.124475]  really_probe+0x102/0x450
[  132.125060]  __driver_probe_device+0x100/0x170
[  132.125641]  driver_probe_device+0x1f/0xa0
[  132.126215]  __device_attach_driver+0x6b/0xe0
[  132.126797]  ? driver_allows_async_probing+0x50/0x50
[  132.127383]  ? driver_allows_async_probing+0x50/0x50
[  132.127960]  bus_for_each_drv+0x6a/0xb0
[  132.128528]  __device_attach+0xe2/0x160
[  132.129095]  pci_bus_add_device+0x4a/0x80
[  132.129659]  pci_bus_add_devices+0x2c/0x70
[  132.130213]  pci_bus_add_devices+0x65/0x70
[  132.130753]  pci_bus_add_devices+0x65/0x70
[  132.131283]  pci_bus_add_devices+0x65/0x70
[  132.131780]  pci_bus_add_devices+0x65/0x70
[  132.132270]  pci_bus_add_devices+0x65/0x70
[  132.132757]  pci_rescan_bus+0x23/0x30
[  132.133233]  rescan_store+0x61/0x90
[  132.133701]  kernfs_fop_write_iter+0x132/0x1b0
[  132.134167]  new_sync_write+0x11f/0x1b0
[  132.134627]  vfs_write+0x35b/0x3b0
[  132.135062]  ksys_write+0xa7/0xe0
[  132.135503]  do_syscall_64+0x34/0x80
[  132.135933]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  132.136358] RIP: 0033:0x7f0058a73224
[  132.136775] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 8d 05 c1 07 2e 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
[  132.137663] RSP: 002b:00007ffc4f4c71a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  132.138121] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f0058a73224
[  132.138590] RDX: 0000000000000002 RSI: 000055d466c24450 RDI: 0000000000000001
[  132.139064] RBP: 000055d466c24450 R08: 000000000000000a R09: 0000000000000001
[  132.139532] R10: 000000000000000a R11: 0000000000000246 R12: 00007f0058d4f760
[  132.140003] R13: 0000000000000002 R14: 00007f0058d4b2a0 R15: 00007f0058d4a760
[  132.140485]  </TASK>
[  132.183669] amdgpu 0000:43:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
[  132.184214] amdgpu 0000:43:00.0: amdgpu: DTM: optional dtm ta ucode is not available
[  132.184735] amdgpu 0000:43:00.0: amdgpu: RAP: optional rap ta ucode is not available
[  132.185234] amdgpu 0000:43:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[  132.185823] amdgpu 0000:43:00.0: amdgpu: use vbios provided pptable
[  132.186327] amdgpu 0000:43:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.6
[  132.188783] amdgpu 0000:43:00.0: amdgpu: SMU is initialized successfully!
[  132.190039] [drm] kiq ring mec 2 pipe 1 q 0
[  132.203608] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[  132.204178] [drm] JPEG decode initialized successfully.
[  132.246079] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[  132.327589] memmap_init_zone_device initialised 8388608 pages in 64ms
[  132.328139] amdgpu: HMM registered 32752MB device memory
[  132.328784] amdgpu: Virtual CRAT table created for GPU
[  132.329844] amdgpu: Topology: Add dGPU node [0x738c:0x1002]
[  132.330387] kfd kfd: amdgpu: added device 1002:738c
[  132.330965] amdgpu 0000:43:00.0: amdgpu: SE 8, SH per SE 1, CU per SH 16, active_cu_number 72
[  132.331725] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 0 on hub 0
[  132.332296] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 1 on hub 0
[  132.332856] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 4 on hub 0
[  132.333414] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 5 on hub 0
[  132.333965] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 6 on hub 0
[  132.334507] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 7 on hub 0
[  132.335057] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 8 on hub 0
[  132.335594] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 9 on hub 0
[  132.336137] amdgpu 0000:43:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 10 on hub 0
[  132.336679] amdgpu 0000:43:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
[  132.337234] amdgpu 0000:43:00.0: amdgpu: ring sdma1 uses VM inv eng 1 on hub 1
[  132.337790] amdgpu 0000:43:00.0: amdgpu: ring sdma2 uses VM inv eng 4 on hub 1
[  132.338343] amdgpu 0000:43:00.0: amdgpu: ring sdma3 uses VM inv eng 5 on hub 1
[  132.338906] amdgpu 0000:43:00.0: amdgpu: ring sdma4 uses VM inv eng 6 on hub 1
[  132.339448] amdgpu 0000:43:00.0: amdgpu: ring sdma5 uses VM inv eng 0 on hub 2
[  132.339987] amdgpu 0000:43:00.0: amdgpu: ring sdma6 uses VM inv eng 1 on hub 2
[  132.340519] amdgpu 0000:43:00.0: amdgpu: ring sdma7 uses VM inv eng 4 on hub 2
[  132.341041] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 5 on hub 2
[  132.341570] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 6 on hub 2
[  132.342101] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 7 on hub 2
[  132.342630] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_1 uses VM inv eng 8 on hub 2
[  132.343152] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 9 on hub 2
[  132.343657] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv eng 10 on hub 2
[  132.344136] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_0 uses VM inv eng 11 on hub 2
[  132.344610] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_1 uses VM inv eng 12 on hub 2
[  132.378213] amdgpu: Detected AMDGPU 6 Perf Events.
[  132.387349] [drm] Initialized amdgpu 3.46.0 20150101 for 0000:43:00.0 on minor 1
[  132.388530] pcieport 0000:d7:00.0: bridge window [io  0x1000-0x0fff] to [bus d8] add_size 1000
[  132.389078] pcieport 0000:d7:00.0: BAR 13: no space for [io  size 0x1000]
[  132.389600] pcieport 0000:d7:00.0: BAR 13: failed to assign [io  size 0x1000]
[  132.390091] pcieport 0000:d7:00.0: BAR 13: no space for [io  size 0x1000]
[  132.390568] pcieport 0000:d7:00.0: BAR 13: failed to assign [io  size 0x1000]
[  155.359200] HelloWorld[3824]: segfault at 68 ip 00007f4c979f764e sp 00007ffc9b3bb610 error 4 in libamdhip64.so.4.4.21432-f9dccde4[7f4c979b3000+2eb000]
[  155.360268] Code: 48 8b 45 e8 64 48 33 04 25 28 00 00 00 74 05 e8 b8 c7 fb ff 48 8b 5d f8 c9 c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45 f8 <48> 8b 40 68 5d c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45 f8
Best regards,
Shuotao
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Date: Wednesday, April 6, 2022 at 10:36 PM
To: Shuotao Xu <shuotaoxu@microsoft.com>, amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com>, Lei Qu <Lei.Qu@microsoft.com>, Peng Cheng <pengc@microsoft.com>, Ran Shu <Ran.Shu@microsoft.com>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
Can you attach the dmesg for the failure without your patch against
amd-staging-drm-next?
Also, in general, patches for amdgpu upstream branches should be
submitted to the amd-gfx mailing list inline using git send-email, which
makes it easy to comment on and review them inline.
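
For reference, that inline workflow is roughly the following sketch (the
output directory and Cc address below are only an illustration):

  # generate the patch and mail it inline to the list
  git format-patch -1 -o outgoing/ HEAD
  git send-email --to="amd-gfx@lists.freedesktop.org" \
                 --cc="Andrey.Grodzovsky@amd.com" outgoing/*.patch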
Andrey
On 2022-04-06 10:25, Shuotao Xu wrote:
Hi Andrey,

We just tried kernel 5.16 based on the amd-staging-drm-next branch of
https://gitlab.freedesktop.org/agd5f/linux.git, and found out that hotplug
did not work out of the box for the ROCm compute stack.
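
A minimal sketch of how that tree can be fetched, using the URL and branch
named above:

  git clone --branch amd-staging-drm-next --single-branch \
      https://gitlab.freedesktop.org/agd5f/linux.git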

We did not try the rendering stack since we are currently more focused
on AI workloads.

We have also created a patch against the amd-staging-drm-next branch to
enable hotplug for the ROCm stack, which was sent in another, later email
with the same subject. I am attaching the patch in this email, in case
you would want to delete that later email.

Best regards,

Shuotao

From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Date: Wednesday, April 6, 2022 at 10:13 PM
To: Shuotao Xu <shuotaoxu@microsoft.com>, amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com>, Lei Qu <Lei.Qu@microsoft.com>, Peng Cheng <pengc@microsoft.com>, Ran Shu <Ran.Shu@microsoft.com>
Subject: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support


Looks like you are using the 5.13 kernel for this work. FYI, we added
hot-plug support for the graphics stack in the 5.14 kernel (see
https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.14-AMDGPU-Hot-Unplug).


I am not sure about the code part, since it all touches the KFD driver (the
KFD team can comment on that) - but I was just wondering: if you try the
5.14 kernel, would things just work for you out of the box?

Andrey

On 2022-04-05 22:45, Shuotao Xu wrote:
Dear AMD Colleagues,

We are from Microsoft Research, and are working on GPU disaggregation
technology.

We have created a new pull request, "Add PCIe hotplug support for amdgpu by
xushuotao · Pull Request #131 · RadeonOpenCompute/ROCK-Kernel-Driver (github.com)"
<https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131> in
ROCK-Kernel-Driver, which will enable PCIe hot-plug support for amdgpu.

We believe the support of hot-plug of GPU devices can open doors for
many advanced applications in data center in the next few years, and we
would like to have some reviewers on this PR so we can continue further
technical discussions around this feature.

Would you please help review this PR?

Thank you very much!

Best regards,

Shuotao Xu


[-- Attachment #2: Type: text/html, Size: 73216 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
  2022-04-07  2:22         ` [EXTERNAL] " Joshi, Mukul
@ 2022-04-07 15:08           ` Shuotao Xu
  2022-04-07 16:22             ` Joshi, Mukul
  0 siblings, 1 reply; 13+ messages in thread
From: Shuotao Xu @ 2022-04-07 15:08 UTC (permalink / raw)
  To: Joshi, Mukul, Grodzovsky, Andrey, amd-gfx
  Cc: Ziyue Yang, Lei Qu, Peng Cheng, Ran Shu

[-- Attachment #1: Type: text/plain, Size: 29250 bytes --]

Hi Joshi,

Per your comment, I produced a fix for hotplug support on multi-GPU systems, for our group's internal usage.

I have tested it on a 4-node MI100 system, and it seems to be working. It has been pushed to the GitHub PR.
The details are in: https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131#issuecomment-1091843803

I will also send the patch to the mailing list.

May I know when your patch will be ready for public review?

All the best,
Shuotao


From: Joshi, Mukul <Mukul.Joshi@amd.com>
Date: Thursday, April 7, 2022 at 10:24 AM
To: Shuotao Xu <shuotaoxu@microsoft.com>, Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>, amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com>, Lei Qu <Lei.Qu@microsoft.com>, Peng Cheng <pengc@microsoft.com>, Ran Shu <Ran.Shu@microsoft.com>
Subject: RE: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

[AMD Official Use Only]

Hi Shuotao,

Thanks for your patch.
I have been working on something similar, as I also found that we don't clean up IO links upon device removal.

The IO-links cleanup change in your patch would work only on a single-GPU system, or on a multi-GPU system where the last node (in the sysfs topology) is hot-plugged out. That's because of the way the atomic counter, topology_crat_proximity_domain, is used in the code.
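
As a rough way to see this, the KFD topology (including each node's IO links)
is exposed through sysfs, so the state after an unplug can be inspected
directly; the paths below follow the standard KFD sysfs layout and are only
an illustration:

  # list the topology nodes KFD still advertises
  ls /sys/class/kfd/kfd/topology/nodes/
  # dump the io_links properties of every node; stale links left behind after
  # a hot-unplug are the kind of anomaly libhsakmt trips over
  cat /sys/class/kfd/kfd/topology/nodes/*/io_links/*/properties 2>/dev/null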

I have a patch which takes care of these issues on a multi-GPU system.
I should be able to send that out for review shortly.

Thanks,
Mukul

From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Shuotao Xu
Sent: Wednesday, April 6, 2022 11:12 AM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com>; Lei Qu <Lei.Qu@microsoft.com>; Peng Cheng <pengc@microsoft.com>; Ran Shu <Ran.Shu@microsoft.com>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

Hi Andrey,

Thanks for your kind comment on the Linux patch submission protocol; please let me know if there is any way to rectify it.

The dmesg is fine except for some warnings during PCI rescan after PCI removal of an AMD MI100.

The issue is that after this, a ROCm application will segfault with the amdgpu driver unless the entire amdgpu kernel module is unloaded and reloaded, which does not meet our hotplug requirement. The issues found upon investigation are (a minimal reproduction sketch follows the list):
1) kfd_lock is locked after hotplug, and kfd_open will return a fault right away to libhsakmt.
2) iolink/p2plink has anomalies after hotplug, and libhsakmt will find such anomalies and return an error.
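
The minimal reproduction sketch, assuming a single MI100 at 0000:43:00.0 as
in the attached log (the BDF is only an example):

  # hot-unplug the GPU from the PCI bus, then rescan to re-add it
  echo 1 > /sys/bus/pci/devices/0000:43:00.0/remove
  echo 1 > /sys/bus/pci/rescan
  # after the rescan, a ROCm/HIP test program (e.g. the HelloWorld in the log)
  # segfaults until the amdgpu module is fully unloaded and reloaded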

Our patch has been tested with a single AMD MI100 GPU and was shown to work.

I am attaching the dmesg after the rescan anyway, which shows the warning and the segfault.


Best regards,
Shuotao


[-- Attachment #2: Type: text/html, Size: 61103 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
  2022-04-07 15:08           ` Shuotao Xu
@ 2022-04-07 16:22             ` Joshi, Mukul
  2022-04-07 16:28               ` Shuotao Xu
  0 siblings, 1 reply; 13+ messages in thread
From: Joshi, Mukul @ 2022-04-07 16:22 UTC (permalink / raw)
  To: Shuotao Xu, Grodzovsky, Andrey, amd-gfx
  Cc: Ziyue Yang, Lei Qu, Peng Cheng, Ran Shu

[-- Attachment #1: Type: text/plain, Size: 30829 bytes --]

[AMD Official Use Only]

Hi Shuotao,

Just sent out the patch to clean up IO links upon KFD device removal to the public mailing list.
Please try it, review it, and let us know how it goes for you.
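
One possible way to fetch and try the patch once it appears in the list
archive (the lore URL below is only a placeholder for the actual message):

  curl -o kfd-iolink-cleanup.patch "https://lore.kernel.org/amd-gfx/<message-id>/raw"
  git am kfd-iolink-cleanup.patch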

Thank you.

Regards,
Mukul

From: Shuotao Xu <shuotaoxu@microsoft.com>
Sent: Thursday, April 7, 2022 11:09 AM
To: Joshi, Mukul <Mukul.Joshi@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com>; Lei Qu <Lei.Qu@microsoft.com>; Peng Cheng <pengc@microsoft.com>; Ran Shu <Ran.Shu@microsoft.com>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

Hi Joshi,

Per your comment, I produced a fix for hotplug support on multi-GPU systems, for our group's internal usage.

I have tested it on a 4-node MI100 system, and it seems to be working. It has been pushed to the GitHub PR.
The details are in: https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131#issuecomment-1091843803

I will also send the patch to the mailing list.

May I know when your patch will be ready for public review?

All the best,
Shuotao


From: Joshi, Mukul <Mukul.Joshi@amd.com>
Date: Thursday, April 7, 2022 at 10:24 AM
To: Shuotao Xu <shuotaoxu@microsoft.com>, Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>, amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com>, Lei Qu <Lei.Qu@microsoft.com>, Peng Cheng <pengc@microsoft.com>, Ran Shu <Ran.Shu@microsoft.com>
Subject: RE: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

[AMD Official Use Only]

Hi Shuotao,

Thanks for your patch.
I have been working on something similar, as I also found that we don't clean up IO links upon device removal.

The IO-links cleanup change in your patch would work only on a single-GPU system, or on a multi-GPU system where the last node (in the sysfs topology) is hot-plugged out. That's because of the way the atomic counter, topology_crat_proximity_domain, is used in the code.

I have a patch which takes care of these issues on a multi-GPU system.
I should be able to send that out for review shortly.

Thanks,
Mukul

From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Shuotao Xu
Sent: Wednesday, April 6, 2022 11:12 AM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com>; Lei Qu <Lei.Qu@microsoft.com>; Peng Cheng <pengc@microsoft.com>; Ran Shu <Ran.Shu@microsoft.com>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

Hi Andrey,

Thanks for your kind comment on the Linux patch submission protocol; please let me know if there is any way to rectify it.

The dmesg is fine except for some warnings during PCI rescan after PCI removal of an AMD MI100.

The issue is that after this, a ROCm application will segfault with the amdgpu driver unless the entire amdgpu kernel module is unloaded and reloaded, which does not meet our hotplug requirement. The issues found upon investigation are:
1) kfd_lock is locked after hotplug, and kfd_open will return a fault right away to libhsakmt.
2) iolink/p2plink has anomalies after hotplug, and libhsakmt will find such anomalies and return an error.

Our patch has been tested with a single AMD MI100 GPU and was shown to work.

I am attaching the dmesg after the rescan anyway, which shows the warning and the segfault.

[  132.054822] pci 0000:43:00.0: [1002:738c] type 00 class 0x038000
[  132.054856] pci 0000:43:00.0: reg 0x10: [mem 0x38b000000000-0x38b7ffffffff 64bit pref]
[  132.054877] pci 0000:43:00.0: reg 0x18: [mem 0x38b800000000-0x38b8001fffff 64bit pref]
[  132.054890] pci 0000:43:00.0: reg 0x20: [io  0xa000-0xa0ff]
[  132.054904] pci 0000:43:00.0: reg 0x24: [mem 0xb8400000-0xb847ffff]
[  132.054918] pci 0000:43:00.0: reg 0x30: [mem 0xb8480000-0xb849ffff pref]
[  132.055134] pci 0000:43:00.0: PME# supported from D1 D2 D3hot D3cold
[  132.055217] pci 0000:43:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:3c:14.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[  132.056001] pci 0000:43:00.0: Adding to iommu group 73
[  132.057943] pci 0000:43:00.0: BAR 0: assigned [mem 0x38b000000000-0x38b7ffffffff 64bit pref]
[  132.057960] pci 0000:43:00.0: BAR 2: assigned [mem 0x38b800000000-0x38b8001fffff 64bit pref]
[  132.057974] pci 0000:43:00.0: BAR 5: assigned [mem 0xb8400000-0xb847ffff]
[  132.057981] pci 0000:43:00.0: BAR 6: assigned [mem 0xb8480000-0xb849ffff pref]
[  132.057984] pci 0000:43:00.0: BAR 4: assigned [io  0xa000-0xa0ff]

[  132.058429] ======================================================
[  132.058453] WARNING: possible circular locking dependency detected
[  132.058477] 5.16.0-kfd+ #1 Not tainted
[  132.058492] ------------------------------------------------------
[  132.058515] bash/3632 is trying to acquire lock:
[  132.058534] ffffadee20adfb50 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: __flush_work+0x2f5/0x470
[  132.058554] [drm] initializing kernel modesetting (ARCTURUS 0x1002:0x738C 0x1002:0x0C34 0x01).
[  132.058577]
               but task is already holding lock:
[  132.058580] ffffffffa3c62308
[  132.058630] amdgpu 0000:43:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[  132.058638]  (
[  132.058678] [drm] register mmio base: 0xB8400000
[  132.058683] pci_rescan_remove_lock
[  132.058694] [drm] register mmio size: 524288
[  132.058713] ){+.+.}-{3:3}, at: rescan_store+0x55/0x90
[  132.058773]
               which lock already depends on the new lock.

[  132.058804]
               the existing dependency chain (in reverse order) is:
[  132.058819] [drm] add ip block number 0 <soc15_common>
[  132.058831]
               -> #1 (
[  132.058854] [drm] add ip block number 1 <gmc_v9_0>
[  132.058858] [drm] add ip block number 2 <vega20_ih>
[  132.058874] pci_rescan_remove_lock
[  132.058894] [drm] add ip block number 3 <psp>
[  132.058915] ){+.+.}-{3:3}
[  132.058931] [drm] add ip block number 4 <smu>
[  132.058951] :
[  132.058965] [drm] add ip block number 5 <gfx_v9_0>
[  132.058986]        __mutex_lock+0xa4/0x990
[  132.058996] [drm] add ip block number 6 <sdma_v4_0>
[  132.059016]        i801_add_tco_spt.isra.20+0x2a/0x1a0
[  132.059033] [drm] add ip block number 7 <vcn_v2_5>
[  132.059054]        i801_add_tco+0xf6/0x110
[  132.059075] [drm] add ip block number 8 <jpeg_v2_5>
[  132.059096]        i801_probe+0x402/0x860
[  132.059151]        local_pci_probe+0x40/0x90
[  132.059170]        work_for_cpu_fn+0x10/0x20
[  132.059189]        process_one_work+0x2a4/0x640
[  132.059208]        worker_thread+0x228/0x3f0
[  132.059227]        kthread+0x16d/0x1a0
[  132.059795]        ret_from_fork+0x1f/0x30
[  132.060337]
               -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
[  132.061405]        __lock_acquire+0x1552/0x1ac0
[  132.061950]        lock_acquire+0x26c/0x300
[  132.062484]        __flush_work+0x315/0x470
[  132.063009]        work_on_cpu+0x98/0xc0
[  132.063526]        pci_device_probe+0x1bc/0x1d0
[  132.064036]        really_probe+0x102/0x450
[  132.064532]        __driver_probe_device+0x100/0x170
[  132.065020]        driver_probe_device+0x1f/0xa0
[  132.065497]        __device_attach_driver+0x6b/0xe0
[  132.065975]        bus_for_each_drv+0x6a/0xb0
[  132.066449]        __device_attach+0xe2/0x160
[  132.066912]        pci_bus_add_device+0x4a/0x80
[  132.067365]        pci_bus_add_devices+0x2c/0x70
[  132.067812]        pci_bus_add_devices+0x65/0x70
[  132.068253]        pci_bus_add_devices+0x65/0x70
[  132.068688]        pci_bus_add_devices+0x65/0x70
[  132.068936] amdgpu 0000:43:00.0: amdgpu: Fetched VBIOS from ROM BAR
[  132.069109]        pci_bus_add_devices+0x65/0x70
[  132.069602] amdgpu: ATOM BIOS: 113-D3431401-X00
[  132.070058]        pci_bus_add_devices+0x65/0x70
[  132.070572] [drm] VCN(0) decode is enabled in VM mode
[  132.070997]        pci_rescan_bus+0x23/0x30
[  132.071000]        rescan_store+0x61/0x90
[  132.071003]        kernfs_fop_write_iter+0x132/0x1b0
[  132.071501] [drm] VCN(1) decode is enabled in VM mode
[  132.071964]        new_sync_write+0x11f/0x1b0
[  132.072432] [drm] VCN(0) encode is enabled in VM mode
[  132.072900]        vfs_write+0x35b/0x3b0
[  132.073376] [drm] VCN(1) encode is enabled in VM mode
[  132.073847]        ksys_write+0xa7/0xe0
[  132.074335] [drm] JPEG(0) JPEG decode is enabled in VM mode
[  132.074803]        do_syscall_64+0x34/0x80
[  132.074808]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  132.074811]
               other info that might help us debug this:

[  132.074813]  Possible unsafe locking scenario:

[  132.075302] [drm] JPEG(1) JPEG decode is enabled in VM mode
[  132.075779]        CPU0                    CPU1
[  132.076361] amdgpu 0000:43:00.0: amdgpu: MEM ECC is active.
[  132.076765]        ----                    ----
[  132.077265] amdgpu 0000:43:00.0: amdgpu: SRAM ECC is active.
[  132.078649]   lock(pci_rescan_remove_lock);
[  132.078652]                                lock((work_completion)(&wfc.work));
[  132.078653]                                lock(pci_rescan_remove_lock);
[  132.078655]   lock((work_completion)(&wfc.work));
[  132.078656]
                *** DEADLOCK ***

[  132.078656] 5 locks held by bash/3632:
[  132.078658]  #0: ffff9c39c7b89438
[  132.079612] amdgpu 0000:43:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
[  132.080089]  (
[  132.080602] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[  132.081082] sb_writers
[  132.081601] amdgpu 0000:43:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
[  132.082102] #6
[  132.082630] amdgpu 0000:43:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[  132.083152] ){.+.+}-{0:0}
[  132.083687] amdgpu 0000:43:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
[  132.084210] , at: ksys_write+0xa7/0xe0
[  132.085708] [drm] Detected VRAM RAM=32752M, BAR=32768M
[  132.086205]  #1:
[  132.086733] [drm] RAM width 4096bits HBM
[  132.087269] ffff9c5959011088
[  132.087890] [drm] amdgpu: 32752M of VRAM memory ready
[  132.088389]  (
[  132.088972] [drm] amdgpu: 32752M of GTT memory ready.
[  132.089572] &of->mutex
[  132.090206] [drm] GART: num cpu pages 131072, num gpu pages 131072
[  132.090804] ){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x103/0x1b0
[  132.090808]  #2: ffff9c39c882a9e0 (kn->active#423){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x10c/0x1b0
[  132.091639] [drm] PCIE GART of 512M enabled.
[  132.092117]  #3:
[  132.092801] [drm] PTB located at 0x0000008000000000
[  132.093480] ffffffffa3c62308
[  132.094566] amdgpu 0000:43:00.0: amdgpu: PSP runtime database doesn't exist
[  132.094822]  (pci_rescan_remove_lock){+.+.}-{3:3}, at: rescan_store+0x55/0x90
[  132.094827]  #4: ffff9c597392b248 (&dev->mutex){....}-{3:3}, at: __device_attach+0x39/0x160
[  132.094835]
               stack backtrace:
[  132.097098] [drm] Found VCN firmware Version ENC: 1.1 DEC: 1 VEP: 0 Revision: 21
[  132.097467] CPU: 47 PID: 3632 Comm: bash Not tainted 5.16.0-kfd+ #1
[  132.098169] amdgpu 0000:43:00.0: amdgpu: Will use PSP to load VCN firmware
[  132.098839] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  132.098841] Call Trace:
[  132.098842]  <TASK>
[  132.098843]  dump_stack_lvl+0x44/0x57
[  132.098848]  check_noncircular+0x105/0x120
[  132.098853]  ? unwind_get_return_address+0x1b/0x30
[  132.112924]  ? register_lock_class+0x46/0x780
[  132.113630]  ? __lock_acquire+0x1552/0x1ac0
[  132.114342]  __lock_acquire+0x1552/0x1ac0
[  132.115050]  lock_acquire+0x26c/0x300
[  132.115755]  ? __flush_work+0x2f5/0x470
[  132.116460]  ? lock_is_held_type+0xdf/0x130
[  132.117177]  __flush_work+0x315/0x470
[  132.117890]  ? __flush_work+0x2f5/0x470
[  132.118604]  ? lock_is_held_type+0xdf/0x130
[  132.119305]  ? mark_held_locks+0x49/0x70
[  132.119981]  ? queue_work_on+0x2f/0x70
[  132.120645]  ? lockdep_hardirqs_on+0x79/0x100
[  132.121300]  work_on_cpu+0x98/0xc0
[  132.121702] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR
[  132.121947]  ? __traceiter_workqueue_execute_end+0x40/0x40
[  132.123270]  ? pci_device_shutdown+0x60/0x60
[  132.123880]  pci_device_probe+0x1bc/0x1d0
[  132.124475]  really_probe+0x102/0x450
[  132.125060]  __driver_probe_device+0x100/0x170
[  132.125641]  driver_probe_device+0x1f/0xa0
[  132.126215]  __device_attach_driver+0x6b/0xe0
[  132.126797]  ? driver_allows_async_probing+0x50/0x50
[  132.127383]  ? driver_allows_async_probing+0x50/0x50
[  132.127960]  bus_for_each_drv+0x6a/0xb0
[  132.128528]  __device_attach+0xe2/0x160
[  132.129095]  pci_bus_add_device+0x4a/0x80
[  132.129659]  pci_bus_add_devices+0x2c/0x70
[  132.130213]  pci_bus_add_devices+0x65/0x70
[  132.130753]  pci_bus_add_devices+0x65/0x70
[  132.131283]  pci_bus_add_devices+0x65/0x70
[  132.131780]  pci_bus_add_devices+0x65/0x70
[  132.132270]  pci_bus_add_devices+0x65/0x70
[  132.132757]  pci_rescan_bus+0x23/0x30
[  132.133233]  rescan_store+0x61/0x90
[  132.133701]  kernfs_fop_write_iter+0x132/0x1b0
[  132.134167]  new_sync_write+0x11f/0x1b0
[  132.134627]  vfs_write+0x35b/0x3b0
[  132.135062]  ksys_write+0xa7/0xe0
[  132.135503]  do_syscall_64+0x34/0x80
[  132.135933]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  132.136358] RIP: 0033:0x7f0058a73224
[  132.136775] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 8d 05 c1 07 2e 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
[  132.137663] RSP: 002b:00007ffc4f4c71a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  132.138121] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f0058a73224
[  132.138590] RDX: 0000000000000002 RSI: 000055d466c24450 RDI: 0000000000000001
[  132.139064] RBP: 000055d466c24450 R08: 000000000000000a R09: 0000000000000001
[  132.139532] R10: 000000000000000a R11: 0000000000000246 R12: 00007f0058d4f760
[  132.140003] R13: 0000000000000002 R14: 00007f0058d4b2a0 R15: 00007f0058d4a760
[  132.140485]  </TASK>
[  132.183669] amdgpu 0000:43:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
[  132.184214] amdgpu 0000:43:00.0: amdgpu: DTM: optional dtm ta ucode is not available
[  132.184735] amdgpu 0000:43:00.0: amdgpu: RAP: optional rap ta ucode is not available
[  132.185234] amdgpu 0000:43:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[  132.185823] amdgpu 0000:43:00.0: amdgpu: use vbios provided pptable
[  132.186327] amdgpu 0000:43:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.6
[  132.188783] amdgpu 0000:43:00.0: amdgpu: SMU is initialized successfully!
[  132.190039] [drm] kiq ring mec 2 pipe 1 q 0
[  132.203608] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[  132.204178] [drm] JPEG decode initialized successfully.
[  132.246079] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[  132.327589] memmap_init_zone_device initialised 8388608 pages in 64ms
[  132.328139] amdgpu: HMM registered 32752MB device memory
[  132.328784] amdgpu: Virtual CRAT table created for GPU
[  132.329844] amdgpu: Topology: Add dGPU node [0x738c:0x1002]
[  132.330387] kfd kfd: amdgpu: added device 1002:738c
[  132.330965] amdgpu 0000:43:00.0: amdgpu: SE 8, SH per SE 1, CU per SH 16, active_cu_number 72
[  132.331725] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 0 on hub 0
[  132.332296] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 1 on hub 0
[  132.332856] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 4 on hub 0
[  132.333414] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 5 on hub 0
[  132.333965] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 6 on hub 0
[  132.334507] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 7 on hub 0
[  132.335057] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 8 on hub 0
[  132.335594] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 9 on hub 0
[  132.336137] amdgpu 0000:43:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 10 on hub 0
[  132.336679] amdgpu 0000:43:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
[  132.337234] amdgpu 0000:43:00.0: amdgpu: ring sdma1 uses VM inv eng 1 on hub 1
[  132.337790] amdgpu 0000:43:00.0: amdgpu: ring sdma2 uses VM inv eng 4 on hub 1
[  132.338343] amdgpu 0000:43:00.0: amdgpu: ring sdma3 uses VM inv eng 5 on hub 1
[  132.338906] amdgpu 0000:43:00.0: amdgpu: ring sdma4 uses VM inv eng 6 on hub 1
[  132.339448] amdgpu 0000:43:00.0: amdgpu: ring sdma5 uses VM inv eng 0 on hub 2
[  132.339987] amdgpu 0000:43:00.0: amdgpu: ring sdma6 uses VM inv eng 1 on hub 2
[  132.340519] amdgpu 0000:43:00.0: amdgpu: ring sdma7 uses VM inv eng 4 on hub 2
[  132.341041] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 5 on hub 2
[  132.341570] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 6 on hub 2
[  132.342101] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 7 on hub 2
[  132.342630] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_1 uses VM inv eng 8 on hub 2
[  132.343152] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 9 on hub 2
[  132.343657] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv eng 10 on hub 2
[  132.344136] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_0 uses VM inv eng 11 on hub 2
[  132.344610] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_1 uses VM inv eng 12 on hub 2
[  132.378213] amdgpu: Detected AMDGPU 6 Perf Events.
[  132.387349] [drm] Initialized amdgpu 3.46.0 20150101 for 0000:43:00.0 on minor 1
[  132.388530] pcieport 0000:d7:00.0: bridge window [io  0x1000-0x0fff] to [bus d8] add_size 1000
[  132.389078] pcieport 0000:d7:00.0: BAR 13: no space for [io  size 0x1000]
[  132.389600] pcieport 0000:d7:00.0: BAR 13: failed to assign [io  size 0x1000]
[  132.390091] pcieport 0000:d7:00.0: BAR 13: no space for [io  size 0x1000]
[  132.390568] pcieport 0000:d7:00.0: BAR 13: failed to assign [io  size 0x1000]
[  155.359200] HelloWorld[3824]: segfault at 68 ip 00007f4c979f764e sp 00007ffc9b3bb610 error 4 in libamdhip64.so.4.4.21432-f9dccde4[7f4c979b3000+2eb000]
[  155.360268] Code: 48 8b 45 e8 64 48 33 04 25 28 00 00 00 74 05 e8 b8 c7 fb ff 48 8b 5d f8 c9 c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45 f8 <48> 8b 40 68 5d c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45 f8

Best regards,
Shuotao

From: Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>>
Date: Wednesday, April 6, 2022 at 10:36 PM
To: Shuotao Xu <shuotaoxu@microsoft.com<mailto:shuotaoxu@microsoft.com>>, amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com<mailto:Ziyue.Yang@microsoft.com>>, Lei Qu <Lei.Qu@microsoft.com<mailto:Lei.Qu@microsoft.com>>, Peng Cheng <pengc@microsoft.com<mailto:pengc@microsoft.com>>, Ran Shu <Ran.Shu@microsoft.com<mailto:Ran.Shu@microsoft.com>>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

Can you attach dmesg for the failure without your patch against
amd-staging-drm-next ?

Also, in general, patches for amdgpu upstream branches should be
submitted to the amd-gfx mailing list inline using git send-email, which
makes it easy to comment on and review them inline.

Andrey

On 2022-04-06 10:25, Shuotao Xu wrote:
> Hi Andrey,
>
> We just tried kernel 5.16 based on
> https://gitlab.freedesktop.org/agd5f/linux.git
> amd-staging-drm-next branch, and found out that hotplug did not work out
> of the box for the ROCm compute stack.
>
> We did not try the rendering stack since we currently are more focused
> on AI workloads.
>
> We have also created a patch against the amd-staging-drm-next branch to
> enable hotplug for the ROCm stack, which was sent in another, later email
> with the same subject. I am attaching the patch in this email, in case
> you would want to delete that later email.
>
> Best regards,
>
> Shuotao
>
> *From: *Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>>
> *Date: *Wednesday, April 6, 2022 at 10:13 PM
> *To: *Shuotao Xu <shuotaoxu@microsoft.com<mailto:shuotaoxu@microsoft.com>>,
> amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
> *Cc: *Ziyue Yang <Ziyue.Yang@microsoft.com<mailto:Ziyue.Yang@microsoft.com>>, Lei Qu
> <Lei.Qu@microsoft.com<mailto:Lei.Qu@microsoft.com>>, Peng Cheng <pengc@microsoft.com<mailto:pengc@microsoft.com>>, Ran Shu
> <Ran.Shu@microsoft.com<mailto:Ran.Shu@microsoft.com>>
> *Subject: *[EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
>
>
> Looks like you are using 5.13 kernel for this work, FYI we added
> hot plug support for the graphic stack in 5.14 kernel (see
> https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.14-AMDGPU-Hot-Unplug)
>
>
> I am not sure about the code part since it all touches KFD driver (KFD
> team can comment on that) - but I was just wondering if you try 5.14
> kernel would things just work for you out of the box ?
>
> Andrey
>
> On 2022-04-05 22:45, Shuotao Xu wrote:
>> Dear AMD Colleagues,
>>
>> We are from Microsoft Research, and are working on GPU disaggregation
>> technology.
>>
>> We have created a new pull request Add PCIe hotplug support for amdgpu by
>> xushuotao · Pull Request #131 · RadeonOpenCompute/ROCK-Kernel-Driver
>> (github.com)
>> <https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131> in
>> ROCK-Kernel-Driver, which will enable PCIe hot-plug support for amdgpu.
>>
>> We believe the support of hot-plug of GPU devices can open doors for
>> many advanced applications in data center in the next few years, and we
>> would like to have some reviewers on this PR so we can continue further
>> technical discussions around this feature.
>>
>> Would you please help review this PR?
>>
>> Thank you very much!
>>
>> Best regards,
>>
>> Shuotao Xu
>>
>

[-- Attachment #2: Type: text/html, Size: 64507 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
  2022-04-07 16:22             ` Joshi, Mukul
@ 2022-04-07 16:28               ` Shuotao Xu
  2022-04-07 16:33                 ` Joshi, Mukul
  0 siblings, 1 reply; 13+ messages in thread
From: Shuotao Xu @ 2022-04-07 16:28 UTC (permalink / raw)
  To: Joshi, Mukul, Grodzovsky, Andrey, amd-gfx
  Cc: Ziyue Yang, Lei Qu, Peng Cheng, Ran Shu

[-- Attachment #1: Type: text/plain, Size: 31557 bytes --]

Thanks Mukul very much!

The code looks neat, although kfd_locked still looks like it could cause trouble. I will try it.

Best,
Shuotao

From: Joshi, Mukul <Mukul.Joshi@amd.com>
Date: Friday, April 8, 2022 at 12:23 AM
To: Shuotao Xu <shuotaoxu@microsoft.com>, Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>, amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com>, Lei Qu <Lei.Qu@microsoft.com>, Peng Cheng <pengc@microsoft.com>, Ran Shu <Ran.Shu@microsoft.com>
Subject: RE: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

[AMD Official Use Only]

Hi Shuotao,

Just sent out the patch to cleanup IO links upon KFD device removal to the public mailing list.
Please try it, review it and let us know how it goes for you.

Thank you.

Regards,
Mukul

From: Shuotao Xu <shuotaoxu@microsoft.com>
Sent: Thursday, April 7, 2022 11:09 AM
To: Joshi, Mukul <Mukul.Joshi@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com>; Lei Qu <Lei.Qu@microsoft.com>; Peng Cheng <pengc@microsoft.com>; Ran Shu <Ran.Shu@microsoft.com>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

[CAUTION: External Email]
Hi Joshi,

Per your comment, I produced a fix so that hotplug support works on a multi-GPU system, for our group's internal usage.

I have tested it on a 4-node MI100 system, and it seems to be working. It is pushed to the GitHub PR.
The details are in: https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131#issuecomment-1091843803

I will also send the patch to the mailing list.

May I know when your patch is ready for public review?

All the best,
Shuotao


From: Joshi, Mukul <Mukul.Joshi@amd.com<mailto:Mukul.Joshi@amd.com>>
Date: Thursday, April 7, 2022 at 10:24 AM
To: Shuotao Xu <shuotaoxu@microsoft.com<mailto:shuotaoxu@microsoft.com>>, Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com<mailto:Andrey.Grodzovsky@amd.com>>, amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com<mailto:Ziyue.Yang@microsoft.com>>, Lei Qu <Lei.Qu@microsoft.com<mailto:Lei.Qu@microsoft.com>>, Peng Cheng <pengc@microsoft.com<mailto:pengc@microsoft.com>>, Ran Shu <Ran.Shu@microsoft.com<mailto:Ran.Shu@microsoft.com>>
Subject: RE: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

[AMD Official Use Only]

Hi Shuotao,

Thanks for your patch.
I have been working on something similar, as I also found that we don't clean up IO links upon device removal.

The IO-link cleanup change in your patch would work only on a single-GPU system, or on a multi-GPU system where the last node (in the sysfs topology) is the one hot-plugged out. That's because of the way the atomic counter, topology_crat_proximity_domain, is used in the code.
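
As a rough illustration of that limitation, here is a simplified,
hypothetical user-space sketch (not the actual kfd_topology.c code;
everything except the counter name topology_crat_proximity_domain is an
assumption made only for illustration):

#include <stdatomic.h>
#include <stdbool.h>

/* Next proximity domain id to hand out; bumped once per discovered node. */
static atomic_int topology_crat_proximity_domain;

static int sketch_add_gpu_node(void)
{
        /* Each new GPU node takes the next consecutive domain id. */
        return atomic_fetch_add(&topology_crat_proximity_domain, 1);
}

static bool sketch_remove_gpu_node(int domain)
{
        int next = atomic_load(&topology_crat_proximity_domain);

        if (domain == next - 1) {
                /* Last node in the topology: the numbering stays dense. */
                atomic_fetch_sub(&topology_crat_proximity_domain, 1);
                return true;
        }

        /*
         * Removing a middle node would leave a hole in the domain ids and
         * stale IO-link entries pointing at it; a counter that only grows
         * and shrinks at the end cannot describe that topology.
         */
        return false;
}

In other words, with a single monotonically assigned counter the cleanup is
only well defined when the highest-numbered (last) node is the one removed.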

I have a patch which takes care of these issues on a multi-GPU system.
I should be able to send that out for review shortly.

Thanks,
Mukul

From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> On Behalf Of Shuotao Xu
Sent: Wednesday, April 6, 2022 11:12 AM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com<mailto:Andrey.Grodzovsky@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com<mailto:Ziyue.Yang@microsoft.com>>; Lei Qu <Lei.Qu@microsoft.com<mailto:Lei.Qu@microsoft.com>>; Peng Cheng <pengc@microsoft.com<mailto:pengc@microsoft.com>>; Ran Shu <Ran.Shu@microsoft.com<mailto:Ran.Shu@microsoft.com>>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

[CAUTION: External Email]
Hi Andrey,

Thanks for your kind comment on the Linux patch submission protocol; please let me know if there is any way to rectify it.

dmesg is fine except for some warnings during PCI rescan after PCI removal of an AMD MI100.

The issue is that after this, a ROCm application will segfault with the amdgpu driver unless the entire amdgpu kernel module is unloaded and reloaded, which does not meet our hotplug requirement. The issues found upon investigation are:
1) kfd_locked stays locked after hotplug, so kfd_open returns a fault right away to libhsakmt (a rough sketch of this follows below).
2) iolink/p2plink has anomalies after hotplug, which libhsakmt detects and then returns an error for.
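
As a minimal sketch of issue 1 (hypothetical user-space code, not the
in-kernel amdkfd implementation; the helper names and the -EAGAIN return
value are assumptions used only to illustrate the effect of an elevated
kfd_locked count):

#include <errno.h>
#include <stdatomic.h>

/* Greater than zero means opens of /dev/kfd are rejected. */
static atomic_int kfd_locked;

static void sketch_gpu_unplug(void)
{
        /* Raised when the GPU is hot-unplugged ... */
        atomic_fetch_add(&kfd_locked, 1);
        /* ... but nothing lowers it again for a device that is gone. */
}

static int sketch_kfd_open(void)
{
        if (atomic_load(&kfd_locked) > 0)
                return -EAGAIN;   /* libhsakmt sees the failure immediately */

        return 0;                 /* normal open path */
}

With the counter stuck above zero after hot-unplug, every later ROCm process
fails at open time until the whole amdgpu module is reloaded, which matches
the behaviour described above.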

Our patch has been tested with a single AMD MI100 GPU instance and shown to work.

I am attaching the dmesg after rescan anyway; it shows the warning and the segfault.

[  132.054822] pci 0000:43:00.0: [1002:738c] type 00 class 0x038000
[  132.054856] pci 0000:43:00.0: reg 0x10: [mem 0x38b000000000-0x38b7ffffffff 64bit pref]
[  132.054877] pci 0000:43:00.0: reg 0x18: [mem 0x38b800000000-0x38b8001fffff 64bit pref]
[  132.054890] pci 0000:43:00.0: reg 0x20: [io  0xa000-0xa0ff]
[  132.054904] pci 0000:43:00.0: reg 0x24: [mem 0xb8400000-0xb847ffff]
[  132.054918] pci 0000:43:00.0: reg 0x30: [mem 0xb8480000-0xb849ffff pref]
[  132.055134] pci 0000:43:00.0: PME# supported from D1 D2 D3hot D3cold
[  132.055217] pci 0000:43:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:3c:14.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[  132.056001] pci 0000:43:00.0: Adding to iommu group 73
[  132.057943] pci 0000:43:00.0: BAR 0: assigned [mem 0x38b000000000-0x38b7ffffffff 64bit pref]
[  132.057960] pci 0000:43:00.0: BAR 2: assigned [mem 0x38b800000000-0x38b8001fffff 64bit pref]
[  132.057974] pci 0000:43:00.0: BAR 5: assigned [mem 0xb8400000-0xb847ffff]
[  132.057981] pci 0000:43:00.0: BAR 6: assigned [mem 0xb8480000-0xb849ffff pref]
[  132.057984] pci 0000:43:00.0: BAR 4: assigned [io  0xa000-0xa0ff]

[  132.058429] ======================================================
[  132.058453] WARNING: possible circular locking dependency detected
[  132.058477] 5.16.0-kfd+ #1 Not tainted
[  132.058492] ------------------------------------------------------
[  132.058515] bash/3632 is trying to acquire lock:
[  132.058534] ffffadee20adfb50 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: __flush_work+0x2f5/0x470
[  132.058554] [drm] initializing kernel modesetting (ARCTURUS 0x1002:0x738C 0x1002:0x0C34 0x01).
[  132.058577]
               but task is already holding lock:
[  132.058580] ffffffffa3c62308
[  132.058630] amdgpu 0000:43:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[  132.058638]  (
[  132.058678] [drm] register mmio base: 0xB8400000
[  132.058683] pci_rescan_remove_lock
[  132.058694] [drm] register mmio size: 524288
[  132.058713] ){+.+.}-{3:3}, at: rescan_store+0x55/0x90
[  132.058773]
               which lock already depends on the new lock.

[  132.058804]
               the existing dependency chain (in reverse order) is:
[  132.058819] [drm] add ip block number 0 <soc15_common>
[  132.058831]
               -> #1 (
[  132.058854] [drm] add ip block number 1 <gmc_v9_0>
[  132.058858] [drm] add ip block number 2 <vega20_ih>
[  132.058874] pci_rescan_remove_lock
[  132.058894] [drm] add ip block number 3 <psp>
[  132.058915] ){+.+.}-{3:3}
[  132.058931] [drm] add ip block number 4 <smu>
[  132.058951] :
[  132.058965] [drm] add ip block number 5 <gfx_v9_0>
[  132.058986]        __mutex_lock+0xa4/0x990
[  132.058996] [drm] add ip block number 6 <sdma_v4_0>
[  132.059016]        i801_add_tco_spt.isra.20+0x2a/0x1a0
[  132.059033] [drm] add ip block number 7 <vcn_v2_5>
[  132.059054]        i801_add_tco+0xf6/0x110
[  132.059075] [drm] add ip block number 8 <jpeg_v2_5>
[  132.059096]        i801_probe+0x402/0x860
[  132.059151]        local_pci_probe+0x40/0x90
[  132.059170]        work_for_cpu_fn+0x10/0x20
[  132.059189]        process_one_work+0x2a4/0x640
[  132.059208]        worker_thread+0x228/0x3f0
[  132.059227]        kthread+0x16d/0x1a0
[  132.059795]        ret_from_fork+0x1f/0x30
[  132.060337]
               -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
[  132.061405]        __lock_acquire+0x1552/0x1ac0
[  132.061950]        lock_acquire+0x26c/0x300
[  132.062484]        __flush_work+0x315/0x470
[  132.063009]        work_on_cpu+0x98/0xc0
[  132.063526]        pci_device_probe+0x1bc/0x1d0
[  132.064036]        really_probe+0x102/0x450
[  132.064532]        __driver_probe_device+0x100/0x170
[  132.065020]        driver_probe_device+0x1f/0xa0
[  132.065497]        __device_attach_driver+0x6b/0xe0
[  132.065975]        bus_for_each_drv+0x6a/0xb0
[  132.066449]        __device_attach+0xe2/0x160
[  132.066912]        pci_bus_add_device+0x4a/0x80
[  132.067365]        pci_bus_add_devices+0x2c/0x70
[  132.067812]        pci_bus_add_devices+0x65/0x70
[  132.068253]        pci_bus_add_devices+0x65/0x70
[  132.068688]        pci_bus_add_devices+0x65/0x70
[  132.068936] amdgpu 0000:43:00.0: amdgpu: Fetched VBIOS from ROM BAR
[  132.069109]        pci_bus_add_devices+0x65/0x70
[  132.069602] amdgpu: ATOM BIOS: 113-D3431401-X00
[  132.070058]        pci_bus_add_devices+0x65/0x70
[  132.070572] [drm] VCN(0) decode is enabled in VM mode
[  132.070997]        pci_rescan_bus+0x23/0x30
[  132.071000]        rescan_store+0x61/0x90
[  132.071003]        kernfs_fop_write_iter+0x132/0x1b0
[  132.071501] [drm] VCN(1) decode is enabled in VM mode
[  132.071964]        new_sync_write+0x11f/0x1b0
[  132.072432] [drm] VCN(0) encode is enabled in VM mode
[  132.072900]        vfs_write+0x35b/0x3b0
[  132.073376] [drm] VCN(1) encode is enabled in VM mode
[  132.073847]        ksys_write+0xa7/0xe0
[  132.074335] [drm] JPEG(0) JPEG decode is enabled in VM mode
[  132.074803]        do_syscall_64+0x34/0x80
[  132.074808]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  132.074811]
               other info that might help us debug this:

[  132.074813]  Possible unsafe locking scenario:

[  132.075302] [drm] JPEG(1) JPEG decode is enabled in VM mode
[  132.075779]        CPU0                    CPU1
[  132.076361] amdgpu 0000:43:00.0: amdgpu: MEM ECC is active.
[  132.076765]        ----                    ----
[  132.077265] amdgpu 0000:43:00.0: amdgpu: SRAM ECC is active.
[  132.078649]   lock(pci_rescan_remove_lock);
[  132.078652]                                lock((work_completion)(&wfc.work));
[  132.078653]                                lock(pci_rescan_remove_lock);
[  132.078655]   lock((work_completion)(&wfc.work));
[  132.078656]
                *** DEADLOCK ***

[  132.078656] 5 locks held by bash/3632:
[  132.078658]  #0: ffff9c39c7b89438
[  132.079612] amdgpu 0000:43:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
[  132.080089]  (
[  132.080602] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[  132.081082] sb_writers
[  132.081601] amdgpu 0000:43:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
[  132.082102] #6
[  132.082630] amdgpu 0000:43:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[  132.083152] ){.+.+}-{0:0}
[  132.083687] amdgpu 0000:43:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
[  132.084210] , at: ksys_write+0xa7/0xe0
[  132.085708] [drm] Detected VRAM RAM=32752M, BAR=32768M
[  132.086205]  #1:
[  132.086733] [drm] RAM width 4096bits HBM
[  132.087269] ffff9c5959011088
[  132.087890] [drm] amdgpu: 32752M of VRAM memory ready
[  132.088389]  (
[  132.088972] [drm] amdgpu: 32752M of GTT memory ready.
[  132.089572] &of->mutex
[  132.090206] [drm] GART: num cpu pages 131072, num gpu pages 131072
[  132.090804] ){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x103/0x1b0
[  132.090808]  #2: ffff9c39c882a9e0 (kn->active#423){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x10c/0x1b0
[  132.091639] [drm] PCIE GART of 512M enabled.
[  132.092117]  #3:
[  132.092801] [drm] PTB located at 0x0000008000000000
[  132.093480] ffffffffa3c62308
[  132.094566] amdgpu 0000:43:00.0: amdgpu: PSP runtime database doesn't exist
[  132.094822]  (pci_rescan_remove_lock){+.+.}-{3:3}, at: rescan_store+0x55/0x90
[  132.094827]  #4: ffff9c597392b248 (&dev->mutex){....}-{3:3}, at: __device_attach+0x39/0x160
[  132.094835]
               stack backtrace:
[  132.097098] [drm] Found VCN firmware Version ENC: 1.1 DEC: 1 VEP: 0 Revision: 21
[  132.097467] CPU: 47 PID: 3632 Comm: bash Not tainted 5.16.0-kfd+ #1
[  132.098169] amdgpu 0000:43:00.0: amdgpu: Will use PSP to load VCN firmware
[  132.098839] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  132.098841] Call Trace:
[  132.098842]  <TASK>
[  132.098843]  dump_stack_lvl+0x44/0x57
[  132.098848]  check_noncircular+0x105/0x120
[  132.098853]  ? unwind_get_return_address+0x1b/0x30
[  132.112924]  ? register_lock_class+0x46/0x780
[  132.113630]  ? __lock_acquire+0x1552/0x1ac0
[  132.114342]  __lock_acquire+0x1552/0x1ac0
[  132.115050]  lock_acquire+0x26c/0x300
[  132.115755]  ? __flush_work+0x2f5/0x470
[  132.116460]  ? lock_is_held_type+0xdf/0x130
[  132.117177]  __flush_work+0x315/0x470
[  132.117890]  ? __flush_work+0x2f5/0x470
[  132.118604]  ? lock_is_held_type+0xdf/0x130
[  132.119305]  ? mark_held_locks+0x49/0x70
[  132.119981]  ? queue_work_on+0x2f/0x70
[  132.120645]  ? lockdep_hardirqs_on+0x79/0x100
[  132.121300]  work_on_cpu+0x98/0xc0
[  132.121702] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR
[  132.121947]  ? __traceiter_workqueue_execute_end+0x40/0x40
[  132.123270]  ? pci_device_shutdown+0x60/0x60
[  132.123880]  pci_device_probe+0x1bc/0x1d0
[  132.124475]  really_probe+0x102/0x450
[  132.125060]  __driver_probe_device+0x100/0x170
[  132.125641]  driver_probe_device+0x1f/0xa0
[  132.126215]  __device_attach_driver+0x6b/0xe0
[  132.126797]  ? driver_allows_async_probing+0x50/0x50
[  132.127383]  ? driver_allows_async_probing+0x50/0x50
[  132.127960]  bus_for_each_drv+0x6a/0xb0
[  132.128528]  __device_attach+0xe2/0x160
[  132.129095]  pci_bus_add_device+0x4a/0x80
[  132.129659]  pci_bus_add_devices+0x2c/0x70
[  132.130213]  pci_bus_add_devices+0x65/0x70
[  132.130753]  pci_bus_add_devices+0x65/0x70
[  132.131283]  pci_bus_add_devices+0x65/0x70
[  132.131780]  pci_bus_add_devices+0x65/0x70
[  132.132270]  pci_bus_add_devices+0x65/0x70
[  132.132757]  pci_rescan_bus+0x23/0x30
[  132.133233]  rescan_store+0x61/0x90
[  132.133701]  kernfs_fop_write_iter+0x132/0x1b0
[  132.134167]  new_sync_write+0x11f/0x1b0
[  132.134627]  vfs_write+0x35b/0x3b0
[  132.135062]  ksys_write+0xa7/0xe0
[  132.135503]  do_syscall_64+0x34/0x80
[  132.135933]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  132.136358] RIP: 0033:0x7f0058a73224
[  132.136775] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 8d 05 c1 07 2e 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
[  132.137663] RSP: 002b:00007ffc4f4c71a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  132.138121] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f0058a73224
[  132.138590] RDX: 0000000000000002 RSI: 000055d466c24450 RDI: 0000000000000001
[  132.139064] RBP: 000055d466c24450 R08: 000000000000000a R09: 0000000000000001
[  132.139532] R10: 000000000000000a R11: 0000000000000246 R12: 00007f0058d4f760
[  132.140003] R13: 0000000000000002 R14: 00007f0058d4b2a0 R15: 00007f0058d4a760
[  132.140485]  </TASK>
[  132.183669] amdgpu 0000:43:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
[  132.184214] amdgpu 0000:43:00.0: amdgpu: DTM: optional dtm ta ucode is not available
[  132.184735] amdgpu 0000:43:00.0: amdgpu: RAP: optional rap ta ucode is not available
[  132.185234] amdgpu 0000:43:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[  132.185823] amdgpu 0000:43:00.0: amdgpu: use vbios provided pptable
[  132.186327] amdgpu 0000:43:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.6
[  132.188783] amdgpu 0000:43:00.0: amdgpu: SMU is initialized successfully!
[  132.190039] [drm] kiq ring mec 2 pipe 1 q 0
[  132.203608] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[  132.204178] [drm] JPEG decode initialized successfully.
[  132.246079] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[  132.327589] memmap_init_zone_device initialised 8388608 pages in 64ms
[  132.328139] amdgpu: HMM registered 32752MB device memory
[  132.328784] amdgpu: Virtual CRAT table created for GPU
[  132.329844] amdgpu: Topology: Add dGPU node [0x738c:0x1002]
[  132.330387] kfd kfd: amdgpu: added device 1002:738c
[  132.330965] amdgpu 0000:43:00.0: amdgpu: SE 8, SH per SE 1, CU per SH 16, active_cu_number 72
[  132.331725] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 0 on hub 0
[  132.332296] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 1 on hub 0
[  132.332856] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 4 on hub 0
[  132.333414] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 5 on hub 0
[  132.333965] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 6 on hub 0
[  132.334507] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 7 on hub 0
[  132.335057] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 8 on hub 0
[  132.335594] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 9 on hub 0
[  132.336137] amdgpu 0000:43:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 10 on hub 0
[  132.336679] amdgpu 0000:43:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
[  132.337234] amdgpu 0000:43:00.0: amdgpu: ring sdma1 uses VM inv eng 1 on hub 1
[  132.337790] amdgpu 0000:43:00.0: amdgpu: ring sdma2 uses VM inv eng 4 on hub 1
[  132.338343] amdgpu 0000:43:00.0: amdgpu: ring sdma3 uses VM inv eng 5 on hub 1
[  132.338906] amdgpu 0000:43:00.0: amdgpu: ring sdma4 uses VM inv eng 6 on hub 1
[  132.339448] amdgpu 0000:43:00.0: amdgpu: ring sdma5 uses VM inv eng 0 on hub 2
[  132.339987] amdgpu 0000:43:00.0: amdgpu: ring sdma6 uses VM inv eng 1 on hub 2
[  132.340519] amdgpu 0000:43:00.0: amdgpu: ring sdma7 uses VM inv eng 4 on hub 2
[  132.341041] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 5 on hub 2
[  132.341570] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 6 on hub 2
[  132.342101] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 7 on hub 2
[  132.342630] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_1 uses VM inv eng 8 on hub 2
[  132.343152] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 9 on hub 2
[  132.343657] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv eng 10 on hub 2
[  132.344136] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_0 uses VM inv eng 11 on hub 2
[  132.344610] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_1 uses VM inv eng 12 on hub 2
[  132.378213] amdgpu: Detected AMDGPU 6 Perf Events.
[  132.387349] [drm] Initialized amdgpu 3.46.0 20150101 for 0000:43:00.0 on minor 1
[  132.388530] pcieport 0000:d7:00.0: bridge window [io  0x1000-0x0fff] to [bus d8] add_size 1000
[  132.389078] pcieport 0000:d7:00.0: BAR 13: no space for [io  size 0x1000]
[  132.389600] pcieport 0000:d7:00.0: BAR 13: failed to assign [io  size 0x1000]
[  132.390091] pcieport 0000:d7:00.0: BAR 13: no space for [io  size 0x1000]
[  132.390568] pcieport 0000:d7:00.0: BAR 13: failed to assign [io  size 0x1000]
[  155.359200] HelloWorld[3824]: segfault at 68 ip 00007f4c979f764e sp 00007ffc9b3bb610 error 4 in libamdhip64.so.4.4.21432-f9dccde4[7f4c979b3000+2eb000]
[  155.360268] Code: 48 8b 45 e8 64 48 33 04 25 28 00 00 00 74 05 e8 b8 c7 fb ff 48 8b 5d f8 c9 c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45 f8 <48> 8b 40 68 5d c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45 f8

Best regards,
Shuotao

From: Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>>
Date: Wednesday, April 6, 2022 at 10:36 PM
To: Shuotao Xu <shuotaoxu@microsoft.com<mailto:shuotaoxu@microsoft.com>>, amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com<mailto:Ziyue.Yang@microsoft.com>>, Lei Qu <Lei.Qu@microsoft.com<mailto:Lei.Qu@microsoft.com>>, Peng Cheng <pengc@microsoft.com<mailto:pengc@microsoft.com>>, Ran Shu <Ran.Shu@microsoft.com<mailto:Ran.Shu@microsoft.com>>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

Can you attach dmesg for the failure without your patch against
amd-staging-drm-next ?

Also, in general, patches for amdgpu upstream branches should be
submitted to the amd-gfx mailing list inline using git send-email, which
makes it easy to comment on and review them inline.

Andrey

On 2022-04-06 10:25, Shuotao Xu wrote:
> Hi Andrey,
>
> We just tried kernel 5.16 based on
> https://gitlab.freedesktop.org/agd5f/linux.git
> amd-staging-drm-next branch, and found out that hotplug did not work out
> of the box for the ROCm compute stack.
>
> We did not try the rendering stack since we currently are more focused
> on AI workloads.
>
> We have also created a patch against the amd-staging-drm-next branch to
> enable hotplug for the ROCm stack, which was sent in another, later email
> with the same subject. I am attaching the patch in this email, in case
> you would want to delete that later email.
>
> Best regards,
>
> Shuotao
>
> *From: *Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>>
> *Date: *Wednesday, April 6, 2022 at 10:13 PM
> *To: *Shuotao Xu <shuotaoxu@microsoft.com<mailto:shuotaoxu@microsoft.com>>,
> amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
> *Cc: *Ziyue Yang <Ziyue.Yang@microsoft.com<mailto:Ziyue.Yang@microsoft.com>>, Lei Qu
> <Lei.Qu@microsoft.com<mailto:Lei.Qu@microsoft.com>>, Peng Cheng <pengc@microsoft.com<mailto:pengc@microsoft.com>>, Ran Shu
> <Ran.Shu@microsoft.com<mailto:Ran.Shu@microsoft.com>>
> *Subject: *[EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
>
>
> Looks like you are using 5.13 kernel for this work, FYI we added
> hot plug support for the graphic stack in 5.14 kernel (see
> https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.14-AMDGPU-Hot-Unplug)
>
>
> I am not sure about the code part since it all touches KFD driver (KFD
> team can comment on that) - but I was just wondering if you try 5.14
> kernel would things just work for you out of the box ?
>
> Andrey
>
> On 2022-04-05 22:45, Shuotao Xu wrote:
>> Dear AMD Colleagues,
>>
>> We are from Microsoft Research, and are working on GPU disaggregation
>> technology.
>>
>> We have created a new pull request Add PCIe hotplug support for amdgpu by
>> xushuotao · Pull Request #131 · RadeonOpenCompute/ROCK-Kernel-Driver
>> (github.com)
>> <https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131> in
>> ROCK-Kernel-Driver, which will enable PCIe hot-plug support for amdgpu.
>>
>> We believe the support of hot-plug of GPU devices can open doors for
>> many advanced applications in data center in the next few years, and we
>> would like to have some reviewers on this PR so we can continue further
>> technical discussions around this feature.
>>
>> Would you please help review this PR?
>>
>> Thank you very much!
>>
>> Best regards,
>>
>> Shuotao Xu
>>
>

[-- Attachment #2: Type: text/html, Size: 65984 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
  2022-04-07 16:28               ` Shuotao Xu
@ 2022-04-07 16:33                 ` Joshi, Mukul
  2022-04-07 16:37                   ` Shuotao Xu
  0 siblings, 1 reply; 13+ messages in thread
From: Joshi, Mukul @ 2022-04-07 16:33 UTC (permalink / raw)
  To: Shuotao Xu, Grodzovsky, Andrey, amd-gfx
  Cc: Ziyue Yang, Lei Qu, Peng Cheng, Ran Shu

[-- Attachment #1: Type: text/plain, Size: 32645 bytes --]

[AMD Official Use Only]

Thanks Shuotao.
If the IO-link cleanup works OK for you, you can use this patch as the base for your changes that add hot-plug support. You can send a separate patch for that.

Regards,
Mukul

From: Shuotao Xu <shuotaoxu@microsoft.com>
Sent: Thursday, April 7, 2022 12:28 PM
To: Joshi, Mukul <Mukul.Joshi@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com>; Lei Qu <Lei.Qu@microsoft.com>; Peng Cheng <pengc@microsoft.com>; Ran Shu <Ran.Shu@microsoft.com>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

[CAUTION: External Email]
Thanks Mukul very much!

The code looks neat, although kfd_locked still looks like it could cause trouble. I will try it.

Best,
Shuotao

From: Joshi, Mukul <Mukul.Joshi@amd.com<mailto:Mukul.Joshi@amd.com>>
Date: Friday, April 8, 2022 at 12:23 AM
To: Shuotao Xu <shuotaoxu@microsoft.com<mailto:shuotaoxu@microsoft.com>>, Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com<mailto:Andrey.Grodzovsky@amd.com>>, amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com<mailto:Ziyue.Yang@microsoft.com>>, Lei Qu <Lei.Qu@microsoft.com<mailto:Lei.Qu@microsoft.com>>, Peng Cheng <pengc@microsoft.com<mailto:pengc@microsoft.com>>, Ran Shu <Ran.Shu@microsoft.com<mailto:Ran.Shu@microsoft.com>>
Subject: RE: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

[AMD Official Use Only]

Hi Shuotao,

Just sent out the patch to cleanup IO links upon KFD device removal to the public mailing list.
Please try it, review it and let us know how it goes for you.

Thank you.

Regards,
Mukul

From: Shuotao Xu <shuotaoxu@microsoft.com<mailto:shuotaoxu@microsoft.com>>
Sent: Thursday, April 7, 2022 11:09 AM
To: Joshi, Mukul <Mukul.Joshi@amd.com<mailto:Mukul.Joshi@amd.com>>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com<mailto:Andrey.Grodzovsky@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com<mailto:Ziyue.Yang@microsoft.com>>; Lei Qu <Lei.Qu@microsoft.com<mailto:Lei.Qu@microsoft.com>>; Peng Cheng <pengc@microsoft.com<mailto:pengc@microsoft.com>>; Ran Shu <Ran.Shu@microsoft.com<mailto:Ran.Shu@microsoft.com>>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

[CAUTION: External Email]
Hi Joshi,

Per your comment, I produced a fix so that hotplug support works on a multi-GPU system, for our group's internal usage.

I have tested it on a 4-node MI100 system, and it seems to be working. It is pushed to the GitHub PR.
The details are in: https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131#issuecomment-1091843803

I will also send the patch to the mailing list.

May I know when your patch is ready for public review?

All the best,
Shuotao


From: Joshi, Mukul <Mukul.Joshi@amd.com<mailto:Mukul.Joshi@amd.com>>
Date: Thursday, April 7, 2022 at 10:24 AM
To: Shuotao Xu <shuotaoxu@microsoft.com<mailto:shuotaoxu@microsoft.com>>, Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com<mailto:Andrey.Grodzovsky@amd.com>>, amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com<mailto:Ziyue.Yang@microsoft.com>>, Lei Qu <Lei.Qu@microsoft.com<mailto:Lei.Qu@microsoft.com>>, Peng Cheng <pengc@microsoft.com<mailto:pengc@microsoft.com>>, Ran Shu <Ran.Shu@microsoft.com<mailto:Ran.Shu@microsoft.com>>
Subject: RE: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

[AMD Official Use Only]

Hi Shuotao,

Thanks for your patch.
I have been working on something similar, as I also found that we don't clean up IO links upon device removal.

The IO-link cleanup change in your patch would work only on a single-GPU system, or on a multi-GPU system where the last node (in the sysfs topology) is the one hot-plugged out. That's because of the way the atomic counter, topology_crat_proximity_domain, is used in the code.

I have a patch which takes care of these issues on a multi-GPU system.
I should be able to send that out for review shortly.

Thanks,
Mukul

From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> On Behalf Of Shuotao Xu
Sent: Wednesday, April 6, 2022 11:12 AM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com<mailto:Andrey.Grodzovsky@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com<mailto:Ziyue.Yang@microsoft.com>>; Lei Qu <Lei.Qu@microsoft.com<mailto:Lei.Qu@microsoft.com>>; Peng Cheng <pengc@microsoft.com<mailto:pengc@microsoft.com>>; Ran Shu <Ran.Shu@microsoft.com<mailto:Ran.Shu@microsoft.com>>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

[CAUTION: External Email]
Hi Andrey,

Thanks for your kind comment on the Linux patch submission protocol; please let me know if there is any way to rectify it.

dmesg is fine except for some warnings during PCI rescan after PCI removal of an AMD MI100.

The issue is that after this, a ROCm application will segfault with the amdgpu driver unless the entire amdgpu kernel module is unloaded and reloaded, which does not meet our hotplug requirement. Upon investigation, the issues are:
1) kfd_lock is left locked after hotplug, so kfd_open returns a fault right away to libhsakmt (see the sketch just after this list).
2) iolink/p2plink entries have anomalies after hotplug, and libhsakmt will find such anomalies and return an error.
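
To make issue 1) concrete, here is a minimal sketch of the gating pattern we appear to be hitting (simplified pseudo-driver code with approximate identifiers, not the exact KFD implementation):

    #include <linux/errno.h>
    #include <linux/mutex.h>

    /* Sketch of issue 1): a "device locked" count that gates every open. */
    static int kfd_locked;                  /* > 0 means KFD is unavailable */
    static DEFINE_MUTEX(kfd_lock_mutex);    /* invented name for the sketch */

    static int example_kfd_open(void)
    {
            int ret = 0;

            mutex_lock(&kfd_lock_mutex);
            if (kfd_locked > 0)
                    ret = -EAGAIN;          /* libhsakmt sees this right away */
            mutex_unlock(&kfd_lock_mutex);
            return ret;
    }

    static void example_kfd_device_removed(void)
    {
            /*
             * The unplug path bumps the count to quiesce the device, but if
             * nothing drops it again after the device is gone (or re-added on
             * rescan), every later open keeps failing until the whole module
             * is reloaded.
             */
            mutex_lock(&kfd_lock_mutex);
            kfd_locked++;
            mutex_unlock(&kfd_lock_mutex);
    }

If that is indeed the pattern, removal (or a later successful re-probe) needs to drop the count again, which is part of what our patch tries to address.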

Our patch has been tested with a single AMD MI100 GPU and was shown to work.

I am attaching the dmesg after rescan anyway, which shows the warning and the segfault.

[  132.054822] pci 0000:43:00.0: [1002:738c] type 00 class 0x038000
[  132.054856] pci 0000:43:00.0: reg 0x10: [mem 0x38b000000000-0x38b7ffffffff 64bit pref]
[  132.054877] pci 0000:43:00.0: reg 0x18: [mem 0x38b800000000-0x38b8001fffff 64bit pref]
[  132.054890] pci 0000:43:00.0: reg 0x20: [io  0xa000-0xa0ff]
[  132.054904] pci 0000:43:00.0: reg 0x24: [mem 0xb8400000-0xb847ffff]
[  132.054918] pci 0000:43:00.0: reg 0x30: [mem 0xb8480000-0xb849ffff pref]
[  132.055134] pci 0000:43:00.0: PME# supported from D1 D2 D3hot D3cold
[  132.055217] pci 0000:43:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:3c:14.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[  132.056001] pci 0000:43:00.0: Adding to iommu group 73
[  132.057943] pci 0000:43:00.0: BAR 0: assigned [mem 0x38b000000000-0x38b7ffffffff 64bit pref]
[  132.057960] pci 0000:43:00.0: BAR 2: assigned [mem 0x38b800000000-0x38b8001fffff 64bit pref]
[  132.057974] pci 0000:43:00.0: BAR 5: assigned [mem 0xb8400000-0xb847ffff]
[  132.057981] pci 0000:43:00.0: BAR 6: assigned [mem 0xb8480000-0xb849ffff pref]
[  132.057984] pci 0000:43:00.0: BAR 4: assigned [io  0xa000-0xa0ff]

[  132.058429] ======================================================
[  132.058453] WARNING: possible circular locking dependency detected
[  132.058477] 5.16.0-kfd+ #1 Not tainted
[  132.058492] ------------------------------------------------------
[  132.058515] bash/3632 is trying to acquire lock:
[  132.058534] ffffadee20adfb50 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: __flush_work+0x2f5/0x470
[  132.058554] [drm] initializing kernel modesetting (ARCTURUS 0x1002:0x738C 0x1002:0x0C34 0x01).
[  132.058577]
               but task is already holding lock:
[  132.058580] ffffffffa3c62308
[  132.058630] amdgpu 0000:43:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[  132.058638]  (
[  132.058678] [drm] register mmio base: 0xB8400000
[  132.058683] pci_rescan_remove_lock
[  132.058694] [drm] register mmio size: 524288
[  132.058713] ){+.+.}-{3:3}, at: rescan_store+0x55/0x90
[  132.058773]
               which lock already depends on the new lock.

[  132.058804]
               the existing dependency chain (in reverse order) is:
[  132.058819] [drm] add ip block number 0 <soc15_common>
[  132.058831]
               -> #1 (
[  132.058854] [drm] add ip block number 1 <gmc_v9_0>
[  132.058858] [drm] add ip block number 2 <vega20_ih>
[  132.058874] pci_rescan_remove_lock
[  132.058894] [drm] add ip block number 3 <psp>
[  132.058915] ){+.+.}-{3:3}
[  132.058931] [drm] add ip block number 4 <smu>
[  132.058951] :
[  132.058965] [drm] add ip block number 5 <gfx_v9_0>
[  132.058986]        __mutex_lock+0xa4/0x990
[  132.058996] [drm] add ip block number 6 <sdma_v4_0>
[  132.059016]        i801_add_tco_spt.isra.20+0x2a/0x1a0
[  132.059033] [drm] add ip block number 7 <vcn_v2_5>
[  132.059054]        i801_add_tco+0xf6/0x110
[  132.059075] [drm] add ip block number 8 <jpeg_v2_5>
[  132.059096]        i801_probe+0x402/0x860
[  132.059151]        local_pci_probe+0x40/0x90
[  132.059170]        work_for_cpu_fn+0x10/0x20
[  132.059189]        process_one_work+0x2a4/0x640
[  132.059208]        worker_thread+0x228/0x3f0
[  132.059227]        kthread+0x16d/0x1a0
[  132.059795]        ret_from_fork+0x1f/0x30
[  132.060337]
               -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
[  132.061405]        __lock_acquire+0x1552/0x1ac0
[  132.061950]        lock_acquire+0x26c/0x300
[  132.062484]        __flush_work+0x315/0x470
[  132.063009]        work_on_cpu+0x98/0xc0
[  132.063526]        pci_device_probe+0x1bc/0x1d0
[  132.064036]        really_probe+0x102/0x450
[  132.064532]        __driver_probe_device+0x100/0x170
[  132.065020]        driver_probe_device+0x1f/0xa0
[  132.065497]        __device_attach_driver+0x6b/0xe0
[  132.065975]        bus_for_each_drv+0x6a/0xb0
[  132.066449]        __device_attach+0xe2/0x160
[  132.066912]        pci_bus_add_device+0x4a/0x80
[  132.067365]        pci_bus_add_devices+0x2c/0x70
[  132.067812]        pci_bus_add_devices+0x65/0x70
[  132.068253]        pci_bus_add_devices+0x65/0x70
[  132.068688]        pci_bus_add_devices+0x65/0x70
[  132.068936] amdgpu 0000:43:00.0: amdgpu: Fetched VBIOS from ROM BAR
[  132.069109]        pci_bus_add_devices+0x65/0x70
[  132.069602] amdgpu: ATOM BIOS: 113-D3431401-X00
[  132.070058]        pci_bus_add_devices+0x65/0x70
[  132.070572] [drm] VCN(0) decode is enabled in VM mode
[  132.070997]        pci_rescan_bus+0x23/0x30
[  132.071000]        rescan_store+0x61/0x90
[  132.071003]        kernfs_fop_write_iter+0x132/0x1b0
[  132.071501] [drm] VCN(1) decode is enabled in VM mode
[  132.071964]        new_sync_write+0x11f/0x1b0
[  132.072432] [drm] VCN(0) encode is enabled in VM mode
[  132.072900]        vfs_write+0x35b/0x3b0
[  132.073376] [drm] VCN(1) encode is enabled in VM mode
[  132.073847]        ksys_write+0xa7/0xe0
[  132.074335] [drm] JPEG(0) JPEG decode is enabled in VM mode
[  132.074803]        do_syscall_64+0x34/0x80
[  132.074808]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  132.074811]
               other info that might help us debug this:

[  132.074813]  Possible unsafe locking scenario:

[  132.075302] [drm] JPEG(1) JPEG decode is enabled in VM mode
[  132.075779]        CPU0                    CPU1
[  132.076361] amdgpu 0000:43:00.0: amdgpu: MEM ECC is active.
[  132.076765]        ----                    ----
[  132.077265] amdgpu 0000:43:00.0: amdgpu: SRAM ECC is active.
[  132.078649]   lock(pci_rescan_remove_lock);
[  132.078652]                                lock((work_completion)(&wfc.work));
[  132.078653]                                lock(pci_rescan_remove_lock);
[  132.078655]   lock((work_completion)(&wfc.work));
[  132.078656]
                *** DEADLOCK ***

[  132.078656] 5 locks held by bash/3632:
[  132.078658]  #0: ffff9c39c7b89438
[  132.079612] amdgpu 0000:43:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
[  132.080089]  (
[  132.080602] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[  132.081082] sb_writers
[  132.081601] amdgpu 0000:43:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
[  132.082102] #6
[  132.082630] amdgpu 0000:43:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[  132.083152] ){.+.+}-{0:0}
[  132.083687] amdgpu 0000:43:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
[  132.084210] , at: ksys_write+0xa7/0xe0
[  132.085708] [drm] Detected VRAM RAM=32752M, BAR=32768M
[  132.086205]  #1:
[  132.086733] [drm] RAM width 4096bits HBM
[  132.087269] ffff9c5959011088
[  132.087890] [drm] amdgpu: 32752M of VRAM memory ready
[  132.088389]  (
[  132.088972] [drm] amdgpu: 32752M of GTT memory ready.
[  132.089572] &of->mutex
[  132.090206] [drm] GART: num cpu pages 131072, num gpu pages 131072
[  132.090804] ){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x103/0x1b0
[  132.090808]  #2: ffff9c39c882a9e0 (kn->active#423){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x10c/0x1b0
[  132.091639] [drm] PCIE GART of 512M enabled.
[  132.092117]  #3:
[  132.092801] [drm] PTB located at 0x0000008000000000
[  132.093480] ffffffffa3c62308
[  132.094566] amdgpu 0000:43:00.0: amdgpu: PSP runtime database doesn't exist
[  132.094822]  (pci_rescan_remove_lock){+.+.}-{3:3}, at: rescan_store+0x55/0x90
[  132.094827]  #4: ffff9c597392b248 (&dev->mutex){....}-{3:3}, at: __device_attach+0x39/0x160
[  132.094835]
               stack backtrace:
[  132.097098] [drm] Found VCN firmware Version ENC: 1.1 DEC: 1 VEP: 0 Revision: 21
[  132.097467] CPU: 47 PID: 3632 Comm: bash Not tainted 5.16.0-kfd+ #1
[  132.098169] amdgpu 0000:43:00.0: amdgpu: Will use PSP to load VCN firmware
[  132.098839] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  132.098841] Call Trace:
[  132.098842]  <TASK>
[  132.098843]  dump_stack_lvl+0x44/0x57
[  132.098848]  check_noncircular+0x105/0x120
[  132.098853]  ? unwind_get_return_address+0x1b/0x30
[  132.112924]  ? register_lock_class+0x46/0x780
[  132.113630]  ? __lock_acquire+0x1552/0x1ac0
[  132.114342]  __lock_acquire+0x1552/0x1ac0
[  132.115050]  lock_acquire+0x26c/0x300
[  132.115755]  ? __flush_work+0x2f5/0x470
[  132.116460]  ? lock_is_held_type+0xdf/0x130
[  132.117177]  __flush_work+0x315/0x470
[  132.117890]  ? __flush_work+0x2f5/0x470
[  132.118604]  ? lock_is_held_type+0xdf/0x130
[  132.119305]  ? mark_held_locks+0x49/0x70
[  132.119981]  ? queue_work_on+0x2f/0x70
[  132.120645]  ? lockdep_hardirqs_on+0x79/0x100
[  132.121300]  work_on_cpu+0x98/0xc0
[  132.121702] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR
[  132.121947]  ? __traceiter_workqueue_execute_end+0x40/0x40
[  132.123270]  ? pci_device_shutdown+0x60/0x60
[  132.123880]  pci_device_probe+0x1bc/0x1d0
[  132.124475]  really_probe+0x102/0x450
[  132.125060]  __driver_probe_device+0x100/0x170
[  132.125641]  driver_probe_device+0x1f/0xa0
[  132.126215]  __device_attach_driver+0x6b/0xe0
[  132.126797]  ? driver_allows_async_probing+0x50/0x50
[  132.127383]  ? driver_allows_async_probing+0x50/0x50
[  132.127960]  bus_for_each_drv+0x6a/0xb0
[  132.128528]  __device_attach+0xe2/0x160
[  132.129095]  pci_bus_add_device+0x4a/0x80
[  132.129659]  pci_bus_add_devices+0x2c/0x70
[  132.130213]  pci_bus_add_devices+0x65/0x70
[  132.130753]  pci_bus_add_devices+0x65/0x70
[  132.131283]  pci_bus_add_devices+0x65/0x70
[  132.131780]  pci_bus_add_devices+0x65/0x70
[  132.132270]  pci_bus_add_devices+0x65/0x70
[  132.132757]  pci_rescan_bus+0x23/0x30
[  132.133233]  rescan_store+0x61/0x90
[  132.133701]  kernfs_fop_write_iter+0x132/0x1b0
[  132.134167]  new_sync_write+0x11f/0x1b0
[  132.134627]  vfs_write+0x35b/0x3b0
[  132.135062]  ksys_write+0xa7/0xe0
[  132.135503]  do_syscall_64+0x34/0x80
[  132.135933]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  132.136358] RIP: 0033:0x7f0058a73224
[  132.136775] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 8d 05 c1 07 2e 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
[  132.137663] RSP: 002b:00007ffc4f4c71a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  132.138121] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f0058a73224
[  132.138590] RDX: 0000000000000002 RSI: 000055d466c24450 RDI: 0000000000000001
[  132.139064] RBP: 000055d466c24450 R08: 000000000000000a R09: 0000000000000001
[  132.139532] R10: 000000000000000a R11: 0000000000000246 R12: 00007f0058d4f760
[  132.140003] R13: 0000000000000002 R14: 00007f0058d4b2a0 R15: 00007f0058d4a760
[  132.140485]  </TASK>
[  132.183669] amdgpu 0000:43:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
[  132.184214] amdgpu 0000:43:00.0: amdgpu: DTM: optional dtm ta ucode is not available
[  132.184735] amdgpu 0000:43:00.0: amdgpu: RAP: optional rap ta ucode is not available
[  132.185234] amdgpu 0000:43:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[  132.185823] amdgpu 0000:43:00.0: amdgpu: use vbios provided pptable
[  132.186327] amdgpu 0000:43:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.6
[  132.188783] amdgpu 0000:43:00.0: amdgpu: SMU is initialized successfully!
[  132.190039] [drm] kiq ring mec 2 pipe 1 q 0
[  132.203608] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[  132.204178] [drm] JPEG decode initialized successfully.
[  132.246079] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[  132.327589] memmap_init_zone_device initialised 8388608 pages in 64ms
[  132.328139] amdgpu: HMM registered 32752MB device memory
[  132.328784] amdgpu: Virtual CRAT table created for GPU
[  132.329844] amdgpu: Topology: Add dGPU node [0x738c:0x1002]
[  132.330387] kfd kfd: amdgpu: added device 1002:738c
[  132.330965] amdgpu 0000:43:00.0: amdgpu: SE 8, SH per SE 1, CU per SH 16, active_cu_number 72
[  132.331725] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 0 on hub 0
[  132.332296] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 1 on hub 0
[  132.332856] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 4 on hub 0
[  132.333414] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 5 on hub 0
[  132.333965] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 6 on hub 0
[  132.334507] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 7 on hub 0
[  132.335057] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 8 on hub 0
[  132.335594] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 9 on hub 0
[  132.336137] amdgpu 0000:43:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 10 on hub 0
[  132.336679] amdgpu 0000:43:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
[  132.337234] amdgpu 0000:43:00.0: amdgpu: ring sdma1 uses VM inv eng 1 on hub 1
[  132.337790] amdgpu 0000:43:00.0: amdgpu: ring sdma2 uses VM inv eng 4 on hub 1
[  132.338343] amdgpu 0000:43:00.0: amdgpu: ring sdma3 uses VM inv eng 5 on hub 1
[  132.338906] amdgpu 0000:43:00.0: amdgpu: ring sdma4 uses VM inv eng 6 on hub 1
[  132.339448] amdgpu 0000:43:00.0: amdgpu: ring sdma5 uses VM inv eng 0 on hub 2
[  132.339987] amdgpu 0000:43:00.0: amdgpu: ring sdma6 uses VM inv eng 1 on hub 2
[  132.340519] amdgpu 0000:43:00.0: amdgpu: ring sdma7 uses VM inv eng 4 on hub 2
[  132.341041] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 5 on hub 2
[  132.341570] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 6 on hub 2
[  132.342101] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 7 on hub 2
[  132.342630] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_1 uses VM inv eng 8 on hub 2
[  132.343152] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 9 on hub 2
[  132.343657] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv eng 10 on hub 2
[  132.344136] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_0 uses VM inv eng 11 on hub 2
[  132.344610] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_1 uses VM inv eng 12 on hub 2
[  132.378213] amdgpu: Detected AMDGPU 6 Perf Events.
[  132.387349] [drm] Initialized amdgpu 3.46.0 20150101 for 0000:43:00.0 on minor 1
[  132.388530] pcieport 0000:d7:00.0: bridge window [io  0x1000-0x0fff] to [bus d8] add_size 1000
[  132.389078] pcieport 0000:d7:00.0: BAR 13: no space for [io  size 0x1000]
[  132.389600] pcieport 0000:d7:00.0: BAR 13: failed to assign [io  size 0x1000]
[  132.390091] pcieport 0000:d7:00.0: BAR 13: no space for [io  size 0x1000]
[  132.390568] pcieport 0000:d7:00.0: BAR 13: failed to assign [io  size 0x1000]
[  155.359200] HelloWorld[3824]: segfault at 68 ip 00007f4c979f764e sp 00007ffc9b3bb610 error 4 in libamdhip64.so.4.4.21432-f9dccde4[7f4c979b3000+2eb000]
[  155.360268] Code: 48 8b 45 e8 64 48 33 04 25 28 00 00 00 74 05 e8 b8 c7 fb ff 48 8b 5d f8 c9 c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45 f8 <48> 8b 40 68 5d c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45 f8

Best regards,
Shuotao

From: Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>>
Date: Wednesday, April 6, 2022 at 10:36 PM
To: Shuotao Xu <shuotaoxu@microsoft.com<mailto:shuotaoxu@microsoft.com>>, amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com<mailto:Ziyue.Yang@microsoft.com>>, Lei Qu <Lei.Qu@microsoft.com<mailto:Lei.Qu@microsoft.com>>, Peng Cheng <pengc@microsoft.com<mailto:pengc@microsoft.com>>, Ran Shu <Ran.Shu@microsoft.com<mailto:Ran.Shu@microsoft.com>>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
[You don't often get email from andrey.grodzovsky@amd.com. Learn why this is important at http://aka.ms/LearnAboutSenderIdentification.]

Can you attach dmesg for the failure without your patch against
amd-staging-drm-next ?

Also, in general, patches for amdgpu upstream branches should be
submitted to amd-gfx mailing list inline using git-send which makes it
easy to comment and review them inline.

Andrey

On 2022-04-06 10:25, Shuotao Xu wrote:
> Hi Andrey,
>
> We just tried kernel 5.16 based on
> https://gitlab.freedesktop.org/agd5f/linux.git
> amd-staging-drm-next branch, and found out that hotplug did not work out
> of box for Rocm compute stack.
>
> We did not try the rendering stack since we currently are more focused
> on AI workloads.
>
> We have also created a patch against the amd-staging-drm-next branch to
> enable hotplug for ROCM stack, which were sent in another later email
> with same subject. I am attaching the patch in this email, in case that
> you would want to delete that later email.
>
> Best regards,
>
> Shuotao
>
> *From: *Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>>
> *Date: *Wednesday, April 6, 2022 at 10:13 PM
> *To: *Shuotao Xu <shuotaoxu@microsoft.com<mailto:shuotaoxu@microsoft.com>>,
> amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
> *Cc: *Ziyue Yang <Ziyue.Yang@microsoft.com<mailto:Ziyue.Yang@microsoft.com>>, Lei Qu
> <Lei.Qu@microsoft.com<mailto:Lei.Qu@microsoft.com>>, Peng Cheng <pengc@microsoft.com<mailto:pengc@microsoft.com>>, Ran Shu
> <Ran.Shu@microsoft.com<mailto:Ran.Shu@microsoft.com>>
> *Subject: *[EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
>
> [You don't often get email from andrey.grodzovsky@amd.com. Learn why
> this is important at http://aka.ms/LearnAboutSenderIdentification.]
>
> Looks like you are using 5.13 kernel for this work, FYI we added
> hot plug support for the graphic stack in 5.14 kernel (see
> https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.14-AMDGPU-Hot-Unplug)
>
>
> I am not sure about the code part since it all touches KFD driver (KFD
> team can comment on that) - but I was just wondering if you try 5.14
> kernel would things just work for you out of the box ?
>
> Andrey
>
> On 2022-04-05 22:45, Shuotao Xu wrote:
>> Dear AMD Colleagues,
>>
>> We are from Microsoft Research, and are working on GPU disaggregation
>> technology.
>>
>> We have created a new pull requestAdd PCIe hotplug support for amdgpu by
>> xushuotao * Pull Request #131 * RadeonOpenCompute/ROCK-Kernel-Driver
>> (github.com)
>> <https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131> in
>> ROCK-Kernel-Driver, which will enable PCIe hot-plug support for amdgpu.
>>
>> We believe the support of hot-plug of GPU devices can open doors for
>> many advanced applications in data center in the next few years, and we
>> would like to have some reviewers on this PR so we can continue further
>> technical discussions around this feature.
>>
>> Would you please help review this PR?
>>
>> Thank you very much!
>>
>> Best regards,
>>
>> Shuotao Xu
>>
>

[-- Attachment #2: Type: text/html, Size: 68665 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
  2022-04-07 16:33                 ` Joshi, Mukul
@ 2022-04-07 16:37                   ` Shuotao Xu
  0 siblings, 0 replies; 13+ messages in thread
From: Shuotao Xu @ 2022-04-07 16:37 UTC (permalink / raw)
  To: Joshi, Mukul, Grodzovsky, Andrey, amd-gfx
  Cc: Ziyue Yang, Lei Qu, Peng Cheng, Ran Shu

[-- Attachment #1: Type: text/plain, Size: 33381 bytes --]

Thanks Joshi. I will just do that. I will hold onto my current patch, based on the one in the GitHub PR.

Best regards,
Shuotao

From: Joshi, Mukul <Mukul.Joshi@amd.com>
Date: Friday, April 8, 2022 at 12:33 AM
To: Shuotao Xu <shuotaoxu@microsoft.com>, Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>, amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com>, Lei Qu <Lei.Qu@microsoft.com>, Peng Cheng <pengc@microsoft.com>, Ran Shu <Ran.Shu@microsoft.com>
Subject: RE: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

[AMD Official Use Only]

Thanks Shuotao.
If the IO link cleanup works OK for you, you can use this patch as the base for adding your changes for hotplug support. You can send a separate patch for that.

Regards,
Mukul

From: Shuotao Xu <shuotaoxu@microsoft.com>
Sent: Thursday, April 7, 2022 12:28 PM
To: Joshi, Mukul <Mukul.Joshi@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com>; Lei Qu <Lei.Qu@microsoft.com>; Peng Cheng <pengc@microsoft.com>; Ran Shu <Ran.Shu@microsoft.com>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

[CAUTION: External Email]
Thanks Mukul very much!

The code looks neat, although kfd_locked still looks like it would cause trouble. I will try it.

Best,
Shuotao

From: Joshi, Mukul <Mukul.Joshi@amd.com<mailto:Mukul.Joshi@amd.com>>
Date: Friday, April 8, 2022 at 12:23 AM
To: Shuotao Xu <shuotaoxu@microsoft.com<mailto:shuotaoxu@microsoft.com>>, Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com<mailto:Andrey.Grodzovsky@amd.com>>, amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com<mailto:Ziyue.Yang@microsoft.com>>, Lei Qu <Lei.Qu@microsoft.com<mailto:Lei.Qu@microsoft.com>>, Peng Cheng <pengc@microsoft.com<mailto:pengc@microsoft.com>>, Ran Shu <Ran.Shu@microsoft.com<mailto:Ran.Shu@microsoft.com>>
Subject: RE: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

[AMD Official Use Only]

Hi Shuotao,

Just sent out the patch to clean up IO links upon KFD device removal to the public mailing list.
Please try it, review it, and let us know how it goes for you.

Thank you.

Regards,
Mukul

From: Shuotao Xu <shuotaoxu@microsoft.com<mailto:shuotaoxu@microsoft.com>>
Sent: Thursday, April 7, 2022 11:09 AM
To: Joshi, Mukul <Mukul.Joshi@amd.com<mailto:Mukul.Joshi@amd.com>>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com<mailto:Andrey.Grodzovsky@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com<mailto:Ziyue.Yang@microsoft.com>>; Lei Qu <Lei.Qu@microsoft.com<mailto:Lei.Qu@microsoft.com>>; Peng Cheng <pengc@microsoft.com<mailto:pengc@microsoft.com>>; Ran Shu <Ran.Shu@microsoft.com<mailto:Ran.Shu@microsoft.com>>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

[CAUTION: External Email]
Hi Joshi,

Per your comment, I produced a fix to work with multi-GPU system for hotplug support for our group’s internal usage.

I have tested on a 4-node MI100 system, which seems to be working. It is pushed in the github PR.
The details are in: https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131#issuecomment-1091843803<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FRadeonOpenCompute%2FROCK-Kernel-Driver%2Fpull%2F131%23issuecomment-1091843803&data=05%7C01%7Cshuotaoxu%40microsoft.com%7Cd08285235abb42c5648508da18b44dbf%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637849460024976886%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000%7C%7C%7C&sdata=DrnW5wawPcCu6%2BG0dVFc%2F%2FUphtNLpMq%2BYvV9PSbNzjk%3D&reserved=0>

I will send the patch to the mail-list also.

May I know when your patch is ready for public review?

All the best,
Shuotao


From: Joshi, Mukul <Mukul.Joshi@amd.com<mailto:Mukul.Joshi@amd.com>>
Date: Thursday, April 7, 2022 at 10:24 AM
To: Shuotao Xu <shuotaoxu@microsoft.com<mailto:shuotaoxu@microsoft.com>>, Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com<mailto:Andrey.Grodzovsky@amd.com>>, amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com<mailto:Ziyue.Yang@microsoft.com>>, Lei Qu <Lei.Qu@microsoft.com<mailto:Lei.Qu@microsoft.com>>, Peng Cheng <pengc@microsoft.com<mailto:pengc@microsoft.com>>, Ran Shu <Ran.Shu@microsoft.com<mailto:Ran.Shu@microsoft.com>>
Subject: RE: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
You don't often get email from mukul.joshi@amd.com<mailto:mukul.joshi@amd.com>. Learn why this is important<http://aka.ms/LearnAboutSenderIdentification>

[AMD Official Use Only]

Hi Shuotao,

Thanks for your patch.
I have been working on something similar as I also found that we don’t cleanup IO links upon device removal.

The IO-links cleanup change in your patch would work only either on a single GPU system or a multi-GPU system where the last node (in the sysfs topology) is hot-plugged out. That’s because of the way the atomic counter, topology_crat_proximity_domain, is used in the code.

I have a patch which takes care of these issues on a multi-GPU system.
I should be able to send that out for review in sometime.

Thanks,
Mukul

From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> On Behalf Of Shuotao Xu
Sent: Wednesday, April 6, 2022 11:12 AM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com<mailto:Andrey.Grodzovsky@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com<mailto:Ziyue.Yang@microsoft.com>>; Lei Qu <Lei.Qu@microsoft.com<mailto:Lei.Qu@microsoft.com>>; Peng Cheng <pengc@microsoft.com<mailto:pengc@microsoft.com>>; Ran Shu <Ran.Shu@microsoft.com<mailto:Ran.Shu@microsoft.com>>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

[CAUTION: External Email]
Hi Andrey,

Thanks for your kind comment on linux patch submission protocol, please let me know if there is anyway to rectify it.

dmesg is fine except with some warning during pci rescan after pci removal of an AMD MI100.

The issue is that after this rocm application will return segfault with the amdgpu driver unless the entire amdgpu kernel module is unloaded and loaded, which we did not meet our hotplug requirement. The issues upon investigation are
1) kfd_lock is locked after hotplug, and kfd_open will return fault right away to libhsakmt .
2) iolink/p2plink has anormalies after hotplug, and libhsakmt will found such anomalies and return error.

Our patch has been tested with a single-instance AMD MI100 GPU and showed it worked.

I am attaching the dmesg after rescan anyway, which will show the warning and the segfault.

[  132.054822] pci 0000:43:00.0: [1002:738c] type 00 class 0x038000
[  132.054856] pci 0000:43:00.0: reg 0x10: [mem 0x38b000000000-0x38b7ffffffff 64bit pref]
[  132.054877] pci 0000:43:00.0: reg 0x18: [mem 0x38b800000000-0x38b8001fffff 64bit pref]
[  132.054890] pci 0000:43:00.0: reg 0x20: [io  0xa000-0xa0ff]
[  132.054904] pci 0000:43:00.0: reg 0x24: [mem 0xb8400000-0xb847ffff]
[  132.054918] pci 0000:43:00.0: reg 0x30: [mem 0xb8480000-0xb849ffff pref]
[  132.055134] pci 0000:43:00.0: PME# supported from D1 D2 D3hot D3cold
[  132.055217] pci 0000:43:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:3c:14.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[  132.056001] pci 0000:43:00.0: Adding to iommu group 73
[  132.057943] pci 0000:43:00.0: BAR 0: assigned [mem 0x38b000000000-0x38b7ffffffff 64bit pref]
[  132.057960] pci 0000:43:00.0: BAR 2: assigned [mem 0x38b800000000-0x38b8001fffff 64bit pref]
[  132.057974] pci 0000:43:00.0: BAR 5: assigned [mem 0xb8400000-0xb847ffff]
[  132.057981] pci 0000:43:00.0: BAR 6: assigned [mem 0xb8480000-0xb849ffff pref]
[  132.057984] pci 0000:43:00.0: BAR 4: assigned [io  0xa000-0xa0ff]

[  132.058429] ======================================================
[  132.058453] WARNING: possible circular locking dependency detected
[  132.058477] 5.16.0-kfd+ #1 Not tainted
[  132.058492] ------------------------------------------------------
[  132.058515] bash/3632 is trying to acquire lock:
[  132.058534] ffffadee20adfb50 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: __flush_work+0x2f5/0x470
[  132.058554] [drm] initializing kernel modesetting (ARCTURUS 0x1002:0x738C 0x1002:0x0C34 0x01).
[  132.058577]
               but task is already holding lock:
[  132.058580] ffffffffa3c62308
[  132.058630] amdgpu 0000:43:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[  132.058638]  (
[  132.058678] [drm] register mmio base: 0xB8400000
[  132.058683] pci_rescan_remove_lock
[  132.058694] [drm] register mmio size: 524288
[  132.058713] ){+.+.}-{3:3}, at: rescan_store+0x55/0x90
[  132.058773]
               which lock already depends on the new lock.

[  132.058804]
               the existing dependency chain (in reverse order) is:
[  132.058819] [drm] add ip block number 0 <soc15_common>
[  132.058831]
               -> #1 (
[  132.058854] [drm] add ip block number 1 <gmc_v9_0>
[  132.058858] [drm] add ip block number 2 <vega20_ih>
[  132.058874] pci_rescan_remove_lock
[  132.058894] [drm] add ip block number 3 <psp>
[  132.058915] ){+.+.}-{3:3}
[  132.058931] [drm] add ip block number 4 <smu>
[  132.058951] :
[  132.058965] [drm] add ip block number 5 <gfx_v9_0>
[  132.058986]        __mutex_lock+0xa4/0x990
[  132.058996] [drm] add ip block number 6 <sdma_v4_0>
[  132.059016]        i801_add_tco_spt.isra.20+0x2a/0x1a0
[  132.059033] [drm] add ip block number 7 <vcn_v2_5>
[  132.059054]        i801_add_tco+0xf6/0x110
[  132.059075] [drm] add ip block number 8 <jpeg_v2_5>
[  132.059096]        i801_probe+0x402/0x860
[  132.059151]        local_pci_probe+0x40/0x90
[  132.059170]        work_for_cpu_fn+0x10/0x20
[  132.059189]        process_one_work+0x2a4/0x640
[  132.059208]        worker_thread+0x228/0x3f0
[  132.059227]        kthread+0x16d/0x1a0
[  132.059795]        ret_from_fork+0x1f/0x30
[  132.060337]
               -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
[  132.061405]        __lock_acquire+0x1552/0x1ac0
[  132.061950]        lock_acquire+0x26c/0x300
[  132.062484]        __flush_work+0x315/0x470
[  132.063009]        work_on_cpu+0x98/0xc0
[  132.063526]        pci_device_probe+0x1bc/0x1d0
[  132.064036]        really_probe+0x102/0x450
[  132.064532]        __driver_probe_device+0x100/0x170
[  132.065020]        driver_probe_device+0x1f/0xa0
[  132.065497]        __device_attach_driver+0x6b/0xe0
[  132.065975]        bus_for_each_drv+0x6a/0xb0
[  132.066449]        __device_attach+0xe2/0x160
[  132.066912]        pci_bus_add_device+0x4a/0x80
[  132.067365]        pci_bus_add_devices+0x2c/0x70
[  132.067812]        pci_bus_add_devices+0x65/0x70
[  132.068253]        pci_bus_add_devices+0x65/0x70
[  132.068688]        pci_bus_add_devices+0x65/0x70
[  132.068936] amdgpu 0000:43:00.0: amdgpu: Fetched VBIOS from ROM BAR
[  132.069109]        pci_bus_add_devices+0x65/0x70
[  132.069602] amdgpu: ATOM BIOS: 113-D3431401-X00
[  132.070058]        pci_bus_add_devices+0x65/0x70
[  132.070572] [drm] VCN(0) decode is enabled in VM mode
[  132.070997]        pci_rescan_bus+0x23/0x30
[  132.071000]        rescan_store+0x61/0x90
[  132.071003]        kernfs_fop_write_iter+0x132/0x1b0
[  132.071501] [drm] VCN(1) decode is enabled in VM mode
[  132.071964]        new_sync_write+0x11f/0x1b0
[  132.072432] [drm] VCN(0) encode is enabled in VM mode
[  132.072900]        vfs_write+0x35b/0x3b0
[  132.073376] [drm] VCN(1) encode is enabled in VM mode
[  132.073847]        ksys_write+0xa7/0xe0
[  132.074335] [drm] JPEG(0) JPEG decode is enabled in VM mode
[  132.074803]        do_syscall_64+0x34/0x80
[  132.074808]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  132.074811]
               other info that might help us debug this:

[  132.074813]  Possible unsafe locking scenario:

[  132.075302] [drm] JPEG(1) JPEG decode is enabled in VM mode
[  132.075779]        CPU0                    CPU1
[  132.076361] amdgpu 0000:43:00.0: amdgpu: MEM ECC is active.
[  132.076765]        ----                    ----
[  132.077265] amdgpu 0000:43:00.0: amdgpu: SRAM ECC is active.
[  132.078649]   lock(pci_rescan_remove_lock);
[  132.078652]                                lock((work_completion)(&wfc.work));
[  132.078653]                                lock(pci_rescan_remove_lock);
[  132.078655]   lock((work_completion)(&wfc.work));
[  132.078656]
                *** DEADLOCK ***

[  132.078656] 5 locks held by bash/3632:
[  132.078658]  #0: ffff9c39c7b89438
[  132.079612] amdgpu 0000:43:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
[  132.080089]  (
[  132.080602] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[  132.081082] sb_writers
[  132.081601] amdgpu 0000:43:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
[  132.082102] #6
[  132.082630] amdgpu 0000:43:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[  132.083152] ){.+.+}-{0:0}
[  132.083687] amdgpu 0000:43:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
[  132.084210] , at: ksys_write+0xa7/0xe0
[  132.085708] [drm] Detected VRAM RAM=32752M, BAR=32768M
[  132.086205]  #1:
[  132.086733] [drm] RAM width 4096bits HBM
[  132.087269] ffff9c5959011088
[  132.087890] [drm] amdgpu: 32752M of VRAM memory ready
[  132.088389]  (
[  132.088972] [drm] amdgpu: 32752M of GTT memory ready.
[  132.089572] &of->mutex
[  132.090206] [drm] GART: num cpu pages 131072, num gpu pages 131072
[  132.090804] ){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x103/0x1b0
[  132.090808]  #2: ffff9c39c882a9e0 (kn->active#423){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x10c/0x1b0
[  132.091639] [drm] PCIE GART of 512M enabled.
[  132.092117]  #3:
[  132.092801] [drm] PTB located at 0x0000008000000000
[  132.093480] ffffffffa3c62308
[  132.094566] amdgpu 0000:43:00.0: amdgpu: PSP runtime database doesn't exist
[  132.094822]  (pci_rescan_remove_lock){+.+.}-{3:3}, at: rescan_store+0x55/0x90
[  132.094827]  #4: ffff9c597392b248 (&dev->mutex){....}-{3:3}, at: __device_attach+0x39/0x160
[  132.094835]
               stack backtrace:
[  132.097098] [drm] Found VCN firmware Version ENC: 1.1 DEC: 1 VEP: 0 Revision: 21
[  132.097467] CPU: 47 PID: 3632 Comm: bash Not tainted 5.16.0-kfd+ #1
[  132.098169] amdgpu 0000:43:00.0: amdgpu: Will use PSP to load VCN firmware
[  132.098839] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
[  132.098841] Call Trace:
[  132.098842]  <TASK>
[  132.098843]  dump_stack_lvl+0x44/0x57
[  132.098848]  check_noncircular+0x105/0x120
[  132.098853]  ? unwind_get_return_address+0x1b/0x30
[  132.112924]  ? register_lock_class+0x46/0x780
[  132.113630]  ? __lock_acquire+0x1552/0x1ac0
[  132.114342]  __lock_acquire+0x1552/0x1ac0
[  132.115050]  lock_acquire+0x26c/0x300
[  132.115755]  ? __flush_work+0x2f5/0x470
[  132.116460]  ? lock_is_held_type+0xdf/0x130
[  132.117177]  __flush_work+0x315/0x470
[  132.117890]  ? __flush_work+0x2f5/0x470
[  132.118604]  ? lock_is_held_type+0xdf/0x130
[  132.119305]  ? mark_held_locks+0x49/0x70
[  132.119981]  ? queue_work_on+0x2f/0x70
[  132.120645]  ? lockdep_hardirqs_on+0x79/0x100
[  132.121300]  work_on_cpu+0x98/0xc0
[  132.121702] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR
[  132.121947]  ? __traceiter_workqueue_execute_end+0x40/0x40
[  132.123270]  ? pci_device_shutdown+0x60/0x60
[  132.123880]  pci_device_probe+0x1bc/0x1d0
[  132.124475]  really_probe+0x102/0x450
[  132.125060]  __driver_probe_device+0x100/0x170
[  132.125641]  driver_probe_device+0x1f/0xa0
[  132.126215]  __device_attach_driver+0x6b/0xe0
[  132.126797]  ? driver_allows_async_probing+0x50/0x50
[  132.127383]  ? driver_allows_async_probing+0x50/0x50
[  132.127960]  bus_for_each_drv+0x6a/0xb0
[  132.128528]  __device_attach+0xe2/0x160
[  132.129095]  pci_bus_add_device+0x4a/0x80
[  132.129659]  pci_bus_add_devices+0x2c/0x70
[  132.130213]  pci_bus_add_devices+0x65/0x70
[  132.130753]  pci_bus_add_devices+0x65/0x70
[  132.131283]  pci_bus_add_devices+0x65/0x70
[  132.131780]  pci_bus_add_devices+0x65/0x70
[  132.132270]  pci_bus_add_devices+0x65/0x70
[  132.132757]  pci_rescan_bus+0x23/0x30
[  132.133233]  rescan_store+0x61/0x90
[  132.133701]  kernfs_fop_write_iter+0x132/0x1b0
[  132.134167]  new_sync_write+0x11f/0x1b0
[  132.134627]  vfs_write+0x35b/0x3b0
[  132.135062]  ksys_write+0xa7/0xe0
[  132.135503]  do_syscall_64+0x34/0x80
[  132.135933]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  132.136358] RIP: 0033:0x7f0058a73224
[  132.136775] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 8d 05 c1 07 2e 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
[  132.137663] RSP: 002b:00007ffc4f4c71a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  132.138121] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f0058a73224
[  132.138590] RDX: 0000000000000002 RSI: 000055d466c24450 RDI: 0000000000000001
[  132.139064] RBP: 000055d466c24450 R08: 000000000000000a R09: 0000000000000001
[  132.139532] R10: 000000000000000a R11: 0000000000000246 R12: 00007f0058d4f760
[  132.140003] R13: 0000000000000002 R14: 00007f0058d4b2a0 R15: 00007f0058d4a760
[  132.140485]  </TASK>
[  132.183669] amdgpu 0000:43:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
[  132.184214] amdgpu 0000:43:00.0: amdgpu: DTM: optional dtm ta ucode is not available
[  132.184735] amdgpu 0000:43:00.0: amdgpu: RAP: optional rap ta ucode is not available
[  132.185234] amdgpu 0000:43:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[  132.185823] amdgpu 0000:43:00.0: amdgpu: use vbios provided pptable
[  132.186327] amdgpu 0000:43:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.6
[  132.188783] amdgpu 0000:43:00.0: amdgpu: SMU is initialized successfully!
[  132.190039] [drm] kiq ring mec 2 pipe 1 q 0
[  132.203608] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[  132.204178] [drm] JPEG decode initialized successfully.
[  132.246079] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[  132.327589] memmap_init_zone_device initialised 8388608 pages in 64ms
[  132.328139] amdgpu: HMM registered 32752MB device memory
[  132.328784] amdgpu: Virtual CRAT table created for GPU
[  132.329844] amdgpu: Topology: Add dGPU node [0x738c:0x1002]
[  132.330387] kfd kfd: amdgpu: added device 1002:738c
[  132.330965] amdgpu 0000:43:00.0: amdgpu: SE 8, SH per SE 1, CU per SH 16, active_cu_number 72
[  132.331725] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 0 on hub 0
[  132.332296] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 1 on hub 0
[  132.332856] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 4 on hub 0
[  132.333414] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 5 on hub 0
[  132.333965] amdgpu 0000:43:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 6 on hub 0
[  132.334507] amdgpu 0000:43:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 7 on hub 0
[  132.335057] amdgpu 0000:43:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 8 on hub 0
[  132.335594] amdgpu 0000:43:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 9 on hub 0
[  132.336137] amdgpu 0000:43:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 10 on hub 0
[  132.336679] amdgpu 0000:43:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
[  132.337234] amdgpu 0000:43:00.0: amdgpu: ring sdma1 uses VM inv eng 1 on hub 1
[  132.337790] amdgpu 0000:43:00.0: amdgpu: ring sdma2 uses VM inv eng 4 on hub 1
[  132.338343] amdgpu 0000:43:00.0: amdgpu: ring sdma3 uses VM inv eng 5 on hub 1
[  132.338906] amdgpu 0000:43:00.0: amdgpu: ring sdma4 uses VM inv eng 6 on hub 1
[  132.339448] amdgpu 0000:43:00.0: amdgpu: ring sdma5 uses VM inv eng 0 on hub 2
[  132.339987] amdgpu 0000:43:00.0: amdgpu: ring sdma6 uses VM inv eng 1 on hub 2
[  132.340519] amdgpu 0000:43:00.0: amdgpu: ring sdma7 uses VM inv eng 4 on hub 2
[  132.341041] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 5 on hub 2
[  132.341570] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 6 on hub 2
[  132.342101] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 7 on hub 2
[  132.342630] amdgpu 0000:43:00.0: amdgpu: ring vcn_dec_1 uses VM inv eng 8 on hub 2
[  132.343152] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 9 on hub 2
[  132.343657] amdgpu 0000:43:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv eng 10 on hub 2
[  132.344136] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_0 uses VM inv eng 11 on hub 2
[  132.344610] amdgpu 0000:43:00.0: amdgpu: ring jpeg_dec_1 uses VM inv eng 12 on hub 2
[  132.378213] amdgpu: Detected AMDGPU 6 Perf Events.
[  132.387349] [drm] Initialized amdgpu 3.46.0 20150101 for 0000:43:00.0 on minor 1
[  132.388530] pcieport 0000:d7:00.0: bridge window [io  0x1000-0x0fff] to [bus d8] add_size 1000
[  132.389078] pcieport 0000:d7:00.0: BAR 13: no space for [io  size 0x1000]
[  132.389600] pcieport 0000:d7:00.0: BAR 13: failed to assign [io  size 0x1000]
[  132.390091] pcieport 0000:d7:00.0: BAR 13: no space for [io  size 0x1000]
[  132.390568] pcieport 0000:d7:00.0: BAR 13: failed to assign [io  size 0x1000]
[  155.359200] HelloWorld[3824]: segfault at 68 ip 00007f4c979f764e sp 00007ffc9b3bb610 error 4 in libamdhip64.so.4.4.21432-f9dccde4[7f4c979b3000+2eb000]
[  155.360268] Code: 48 8b 45 e8 64 48 33 04 25 28 00 00 00 74 05 e8 b8 c7 fb ff 48 8b 5d f8 c9 c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45 f8 <48> 8b 40 68 5d c3 f3 0f 1e fa 55 48 89 e5 48 89 7d f8 48 8b 45 f8

Best regards,
Shuotao

From: Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>>
Date: Wednesday, April 6, 2022 at 10:36 PM
To: Shuotao Xu <shuotaoxu@microsoft.com<mailto:shuotaoxu@microsoft.com>>, amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Ziyue Yang <Ziyue.Yang@microsoft.com<mailto:Ziyue.Yang@microsoft.com>>, Lei Qu <Lei.Qu@microsoft.com<mailto:Lei.Qu@microsoft.com>>, Peng Cheng <pengc@microsoft.com<mailto:pengc@microsoft.com>>, Ran Shu <Ran.Shu@microsoft.com<mailto:Ran.Shu@microsoft.com>>
Subject: Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
[You don't often get email from andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>. Learn why this is important at http://aka.ms/LearnAboutSenderIdentification.]<https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Faka.ms%2FLearnAboutSenderIdentification.%255d&data=05%7C01%7Cshuotaoxu%40microsoft.com%7Cd08285235abb42c5648508da18b44dbf%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637849460024976886%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000%7C%7C%7C&sdata=9QlEasOkkI2h1xjbZYBlyhIE3BLq7aiAc%2BXuWaK8aa0%3D&reserved=0>

Can you attach dmesg for the failure without your patch against
amd-staging-drm-next ?

Also, in general, patches for amdgpu upstream branches should be
submitted to amd-gfx mailing list inline using git-send which makes it
easy to comment and review them inline.

Andrey

On 2022-04-06 10:25, Shuotao Xu wrote:
> Hi Andrey,
>
> We just tried kernel 5.16 based on
> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fagd5f%2Flinux.git&amp;data=05%7C01%7Cshuotaoxu%40microsoft.com%7C93f1fcb8f60541f7b87308da17dae167%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637848526184858564%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=jz51aMtsG7PIEfLy1jLvLGd%2BsBREvOFf9Gc6BZlmsmU%3D&amp;reserved=0<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fagd5f%2Flinux.git&data=05%7C01%7Cshuotaoxu%40microsoft.com%7Cd08285235abb42c5648508da18b44dbf%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637849460024976886%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000%7C%7C%7C&sdata=k5wQHS5wX%2Bi%2BEm%2F8Kv24qr8qtkU%2BwPYggEtutld%2FwHA%3D&reserved=0>
> <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fagd5f%2Flinux.git&amp;data=05%7C01%7Cshuotaoxu%40microsoft.com%7C93f1fcb8f60541f7b87308da17dae167%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637848526184858564%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=jz51aMtsG7PIEfLy1jLvLGd%2BsBREvOFf9Gc6BZlmsmU%3D&amp;reserved=0<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fagd5f%2Flinux.git&data=05%7C01%7Cshuotaoxu%40microsoft.com%7Cd08285235abb42c5648508da18b44dbf%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637849460024976886%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000%7C%7C%7C&sdata=k5wQHS5wX%2Bi%2BEm%2F8Kv24qr8qtkU%2BwPYggEtutld%2FwHA%3D&reserved=0>>
> amd-staging-drm-next branch, and found out that hotplug did not work out
> of box for Rocm compute stack.
>
> We did not try the rendering stack since we currently are more focused
> on AI workloads.
>
> We have also created a patch against the amd-staging-drm-next branch to
> enable hotplug for ROCM stack, which were sent in another later email
> with same subject. I am attaching the patch in this email, in case that
> you would want to delete that later email.
>
> Best regards,
>
> Shuotao
>
> *From: *Andrey Grodzovsky <andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>>
> *Date: *Wednesday, April 6, 2022 at 10:13 PM
> *To: *Shuotao Xu <shuotaoxu@microsoft.com<mailto:shuotaoxu@microsoft.com>>,
> amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
> *Cc: *Ziyue Yang <Ziyue.Yang@microsoft.com<mailto:Ziyue.Yang@microsoft.com>>, Lei Qu
> <Lei.Qu@microsoft.com<mailto:Lei.Qu@microsoft.com>>, Peng Cheng <pengc@microsoft.com<mailto:pengc@microsoft.com>>, Ran Shu
> <Ran.Shu@microsoft.com<mailto:Ran.Shu@microsoft.com>>
> *Subject: *[EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support
>
> [You don't often get email from andrey.grodzovsky@amd.com<mailto:andrey.grodzovsky@amd.com>. Learn why
> this is important at http://aka.ms/LearnAboutSenderIdentification.]<https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Faka.ms%2FLearnAboutSenderIdentification.%255d&data=05%7C01%7Cshuotaoxu%40microsoft.com%7Cd08285235abb42c5648508da18b44dbf%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637849460024976886%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000%7C%7C%7C&sdata=9QlEasOkkI2h1xjbZYBlyhIE3BLq7aiAc%2BXuWaK8aa0%3D&reserved=0>
> <https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Faka.ms%2FLearnAboutSenderIdentification.%255d&amp;data=05%7C01%7Cshuotaoxu%40microsoft.com%7C93f1fcb8f60541f7b87308da17dae167%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637848526184858564%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=HfSwu6SWfoCYyscJqGFdKHBPtaj%2BKB4lyo13zkm6hi4%3D&amp;reserved=0<https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Faka.ms%2FLearnAboutSenderIdentification.%255d&data=05%7C01%7Cshuotaoxu%40microsoft.com%7Cd08285235abb42c5648508da18b44dbf%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637849460024976886%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000%7C%7C%7C&sdata=9QlEasOkkI2h1xjbZYBlyhIE3BLq7aiAc%2BXuWaK8aa0%3D&reserved=0>>
>
> Looks like you are using 5.13 kernel for this work, FYI we added
> hot plug support for the graphic stack in 5.14 kernel (see
> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.phoronix.com%2Fscan.php%3Fpage%3Dnews_item%26px%3DLinux-5.14-AMDGPU-Hot-Unplug&amp;data=05%7C01%7Cshuotaoxu%40microsoft.com%7C93f1fcb8f60541f7b87308da17dae167%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637848526184858564%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=4l9mT8zNR%2FvqsEFr7noIDqKf16IGN8xmO2T6jnpipzo%3D&amp;reserved=0)<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.phoronix.com%2Fscan.php%3Fpage%3Dnews_item%26px%3DLinux-5.14-AMDGPU-Hot-Unplug&data=05%7C01%7Cshuotaoxu%40microsoft.com%7Cd08285235abb42c5648508da18b44dbf%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637849460024976886%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000%7C%7C%7C&sdata=plmgugU8bJQbaqP%2FFiJAuwvsXtoBz%2FPY3Q5pRUSzlBc%3D&reserved=0>
> <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.phoronix.com%2Fscan.php%3Fpage%3Dnews_item%26px%3DLinux-5.14-AMDGPU-Hot-Unplug&amp;data=05%7C01%7Cshuotaoxu%40microsoft.com%7C93f1fcb8f60541f7b87308da17dae167%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637848526184858564%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=4l9mT8zNR%2FvqsEFr7noIDqKf16IGN8xmO2T6jnpipzo%3D&amp;reserved=0<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.phoronix.com%2Fscan.php%3Fpage%3Dnews_item%26px%3DLinux-5.14-AMDGPU-Hot-Unplug&data=05%7C01%7Cshuotaoxu%40microsoft.com%7Cd08285235abb42c5648508da18b44dbf%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637849460024976886%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000%7C%7C%7C&sdata=plmgugU8bJQbaqP%2FFiJAuwvsXtoBz%2FPY3Q5pRUSzlBc%3D&reserved=0>>
>
>
> I am not sure about the code part since it all touches KFD driver (KFD
> team can comment on that) - but I was just wondering if you try 5.14
> kernel would things just work for you out of the box ?
>
> Andrey
>
> On 2022-04-05 22:45, Shuotao Xu wrote:
>> Dear AMD Colleagues,
>>
>> We are from Microsoft Research, and are working on GPU disaggregation
>> technology.
>>
>> We have created a new pull request Add PCIe hotplug support for amdgpu by
>> xushuotao · Pull Request #131 · RadeonOpenCompute/ROCK-Kernel-Driver
>> (github.com)
>> <https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131> in
>> ROCK-Kernel-Driver, which will enable PCIe hot-plug support for amdgpu.
>>
>> We believe the support of hot-plug of GPU devices can open doors for
>> many advanced applications in data center in the next few years, and we
>> would like to have some reviewers on this PR so we can continue further
>> technical discussions around this feature.
>>
>> Would you please help review this PR?
>>
>> Thank you very much!
>>
>> Best regards,
>>
>> Shuotao Xu
>>
>

[-- Attachment #2: Type: text/html, Size: 70437 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-04-07 17:19 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-06  2:45 Code Review Request for AMDGPU Hotplug Support Shuotao Xu
2022-04-06 14:13 ` Andrey Grodzovsky
2022-04-06 14:25   ` [EXTERNAL] " Shuotao Xu
2022-04-06 14:36     ` Andrey Grodzovsky
2022-04-06 15:11       ` Shuotao Xu
2022-04-06 15:39         ` Andrey Grodzovsky
2022-04-07  2:55           ` [EXTERNAL] " Shuotao Xu
2022-04-07  2:22         ` [EXTERNAL] " Joshi, Mukul
2022-04-07 15:08           ` Shuotao Xu
2022-04-07 16:22             ` Joshi, Mukul
2022-04-07 16:28               ` Shuotao Xu
2022-04-07 16:33                 ` Joshi, Mukul
2022-04-07 16:37                   ` Shuotao Xu
