* Code Review Request for AMDGPU Hotplug Support
@ 2022-04-06  8:11 Shuotao Xu
From: Shuotao Xu @ 2022-04-06  8:11 UTC
  To: amd-gfx, Felix.Kuehling; +Cc: Ziyue Yang, Lei Qu, Peng Cheng, Ran Shu


[-- Attachment #1.1: Type: text/plain, Size: 1264 bytes --]

Dear AMD Colleagues,

We are from Microsoft Research and are working on GPU disaggregation technology.

We have created a patch against the drm-staging-drm-next branch of https://gitlab.freedesktop.org/agd5f/linux.git, which enables PCIe hot-plug support for amdgpu.

We have also created a pull request, Add PCIe hotplug support for amdgpu by xushuotao · Pull Request #131 · RadeonOpenCompute/ROCK-Kernel-Driver (github.com) <https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131>, in ROCK-Kernel-Driver, against rocm-5.0.x.

We believe that hot-plug support for GPU devices can open the door to many advanced data center applications in the next few years, and we would like reviewers for this PR so that we can continue technical discussions around this feature.

Would you please help review this patch?

Thank you very much!

Best regards,
Shuotao Xu


[-- Attachment #2: 0001-drm-amdkfd-Add-PCIe-Hotplug-Support-for-AMDGPU.patch --]
[-- Type: application/octet-stream, Size: 4008 bytes --]

From a4e53bda6f65b72b1f6a344c19677574d7842cd3 Mon Sep 17 00:00:00 2001
From: Shuotao Xu <shuotaoxu@microsoft.com>
Date: Wed, 6 Apr 2022 12:42:10 +0900
Subject: [PATCH] drm/amdkfd: Add PCIe Hotplug Support for AMDGPU

1. During PCIe probing, decrement KFD lock which was incremented when
   the PCIe device was removed; otherwise kfd_open is going to fail.
2. Remove p2p links in sysfs when device is hotplugged out.

Signed-off-by: Shuotao Xu <shuotaoxu@microsoft.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c   |  4 ++
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 50 +++++++++++++++++++++--
 2 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 62aa6c9d5123..c9638bc299dd 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -575,6 +575,10 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
 	if (kfd_resume(kfd))
 		goto kfd_resume_error;
 
+	/* Release the KFD lock that was taken when the device was hot-unplugged */
+	if (kfd_is_locked())
+		atomic_dec(&kfd_locked);
+
 	if (kfd_topology_add_device(kfd)) {
 		dev_err(kfd_device, "Error adding device to topology\n");
 		goto kfd_topology_add_device_error;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index 3bdcae239bc0..cfa3b16f6939 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -132,6 +132,21 @@ struct kfd_dev *kfd_device_by_adev(const struct amdgpu_device *adev)
 	return device;
 }
 
+/* Called with write topology_lock acquired */
+static void kfd_release_link_prop(struct kfd_topology_device *dev, uint32_t node_id)
+{
+	struct kfd_iolink_properties *iolink, *tmp;
+
+	list_for_each_entry_safe(iolink, tmp, &dev->io_link_props, list) {
+		if (iolink->node_to == node_id) {
+			pr_debug("%s, io_link from_node = %u, to_node = %u\n", __func__, iolink->node_from, iolink->node_to);
+			list_del(&iolink->list);
+			kfree(iolink);
+			dev->node_props.io_links_count--;
+		}
+	}
+}
+
 /* Called with write topology_lock acquired */
 static void kfd_release_topology_device(struct kfd_topology_device *dev)
 {
@@ -556,6 +571,21 @@ static void kfd_remove_sysfs_file(struct kobject *kobj, struct attribute *attr)
 	kobject_put(kobj);
 }
 
+static void kfd_remove_sysfs_link_to(struct kfd_topology_device *dev, uint32_t node_id)
+{
+	struct kfd_iolink_properties *iolink;
+
+	if (dev->kobj_iolink) {
+		list_for_each_entry(iolink, &dev->io_link_props, list)
+			if (iolink->kobj && iolink->node_to == node_id) {
+				pr_debug("%s, io_link from_node = %u, to_node = %u\n", __func__, iolink->node_from, iolink->node_to);
+				kfd_remove_sysfs_file(iolink->kobj,
+						      &iolink->attr);
+				iolink->kobj = NULL;
+			}
+	}
+}
+
 static void kfd_remove_sysfs_node_entry(struct kfd_topology_device *dev)
 {
 	struct kfd_iolink_properties *iolink;
@@ -1490,20 +1520,34 @@ int kfd_topology_remove_device(struct kfd_dev *gpu)
 	struct kfd_topology_device *dev, *tmp;
 	uint32_t gpu_id;
 	int res = -ENODEV;
+	uint32_t node_id = 0;
+	bool found = false;
 
 	down_write(&topology_lock);
 
-	list_for_each_entry_safe(dev, tmp, &topology_device_list, list)
+	list_for_each_entry_safe(dev, tmp, &topology_device_list, list) {
 		if (dev->gpu == gpu) {
 			gpu_id = dev->gpu_id;
 			kfd_remove_sysfs_node_entry(dev);
 			kfd_release_topology_device(dev);
 			sys_props.num_devices--;
 			res = 0;
-			if (kfd_topology_update_sysfs() < 0)
-				kfd_topology_release_sysfs();
+			pr_debug("kfd_topology: removing gpu node, node id = %u\n", node_id);
+			found = true;
 			break;
 		}
+		node_id++;
+	}
+
+	if (found) {
+		list_for_each_entry(dev, &topology_device_list, list) {
+			kfd_remove_sysfs_link_to(dev, node_id);
+			kfd_release_link_prop(dev, node_id);
+		}
+		atomic_dec(&topology_crat_proximity_domain);
+		if (kfd_topology_update_sysfs() < 0)
+			kfd_topology_release_sysfs();
+	}
 
 	up_write(&topology_lock);
 
-- 
2.17.1
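
Note on item 1 of the commit message: the following is a minimal userspace sketch of the counter imbalance the patch describes, not driver code. It assumes the lock is a plain counter that must be zero for an open to succeed. The names model_locked, model_open, model_remove and model_probe are hypothetical stand-ins, and the -EAGAIN return value is an assumption; only the pairing of "incremented on removal, decremented on probe" comes from the patch description.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver's global lock counter. */
static atomic_int model_locked;

static bool model_is_locked(void)
{
	return atomic_load(&model_locked) > 0;
}

/* Opening the device fails for as long as the counter is non-zero. */
static int model_open(void)
{
	return model_is_locked() ? -EAGAIN : 0;
}

/* Hot-unplug path: takes the lock and leaves it held. */
static void model_remove(void)
{
	atomic_fetch_add(&model_locked, 1);
}

/*
 * Re-probe path. With fix_enabled set, it drops the count left behind
 * by the removal, mirroring the hunk added to kgd2kfd_device_init().
 */
static void model_probe(bool fix_enabled)
{
	if (fix_enabled && model_is_locked())
		atomic_fetch_sub(&model_locked, 1);
}
```

In this model, calling model_remove() and then model_probe(false) leaves model_open() failing, while model_probe(true) restores a successful open.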

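Note on item 2 of the commit message: the sketch below models, in plain userspace C, the idea of removing one node from a topology list and then pruning every link on the surviving nodes that still targets it. The structures and functions (model_node, model_link, model_add_link, model_release_links_to, model_remove_node) are hypothetical simplifications, and the node is selected by list position here rather than by matching a device pointer as the patch does.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the topology structures. */
struct model_link {
	uint32_t node_to;
	struct model_link *next;
};

struct model_node {
	struct model_link *links;
	uint32_t link_count;
	struct model_node *next;
};

/* Prepend a link from node to node_to; returns 0 on success, -1 on OOM. */
static int model_add_link(struct model_node *node, uint32_t node_to)
{
	struct model_link *l = malloc(sizeof(*l));

	if (!l)
		return -1;
	l->node_to = node_to;
	l->next = node->links;
	node->links = l;
	node->link_count++;
	return 0;
}

/* Free every link of node that targets node_id and fix up the count. */
static void model_release_links_to(struct model_node *node, uint32_t node_id)
{
	struct model_link **pp = &node->links;

	while (*pp) {
		struct model_link *l = *pp;

		if (l->node_to == node_id) {
			*pp = l->next;	/* unlink before freeing */
			free(l);
			node->link_count--;
		} else {
			pp = &l->next;
		}
	}
}

/*
 * Unlink the node at list position node_id, then prune links to it from
 * all remaining nodes. Returns the unlinked node (still owned by the
 * caller) or NULL if the position does not exist.
 */
static struct model_node *model_remove_node(struct model_node **head,
					    uint32_t node_id)
{
	struct model_node **pp = head;
	struct model_node *victim, *n;
	uint32_t idx = 0;

	while (*pp && idx < node_id) {
		pp = &(*pp)->next;
		idx++;
	}
	if (!*pp)
		return NULL;

	victim = *pp;
	*pp = victim->next;
	victim->next = NULL;

	for (n = *head; n; n = n->next)
		model_release_links_to(n, node_id);

	return victim;
}
```

The two-phase order (unlink the node first, then walk the survivors) follows the structure of the patched kfd_topology_remove_device(), where the second loop runs only once the node has been found.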


* Re: Code Review Request for AMDGPU Hotplug Support
  2022-04-06  2:45 Shuotao Xu
@ 2022-04-06 14:13 ` Andrey Grodzovsky
From: Andrey Grodzovsky @ 2022-04-06 14:13 UTC
  To: Shuotao Xu, amd-gfx; +Cc: Ziyue Yang, Lei Qu, Peng Cheng, Ran Shu

It looks like you are using the 5.13 kernel for this work. FYI, we added
hot-plug support for the graphics stack in the 5.14 kernel (see
https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.14-AMDGPU-Hot-Unplug).

I am not sure about the code itself, since it all touches the KFD driver
(the KFD team can comment on that), but I was wondering: if you try the
5.14 kernel, would things just work for you out of the box?

Andrey

On 2022-04-05 22:45, Shuotao Xu wrote:
> Dear AMD Colleagues,
> 
> We are from Microsoft Research, and are working on GPU disaggregation 
> technology.
> 
> We have created a new pull request, Add PCIe hotplug support for amdgpu by
> xushuotao · Pull Request #131 · RadeonOpenCompute/ROCK-Kernel-Driver
> (github.com)
> <https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131>, in
> ROCK-Kernel-Driver, which will enable PCIe hot-plug support for amdgpu.
> 
> We believe that hot-plug support for GPU devices can open the door to
> many advanced data center applications in the next few years, and we
> would like reviewers for this PR so that we can continue technical
> discussions around this feature.
> 
> Would you please help review this PR?
> 
> Thank you very much!
> 
> Best regards,
> 
> Shuotao Xu
> 


* Code Review Request for AMDGPU Hotplug Support
@ 2022-04-06  2:45 Shuotao Xu
  2022-04-06 14:13 ` Andrey Grodzovsky
From: Shuotao Xu @ 2022-04-06  2:45 UTC
  To: amd-gfx; +Cc: Ziyue Yang, Lei Qu, Peng Cheng, Ran Shu

[-- Attachment #1: Type: text/plain, Size: 1128 bytes --]

Dear AMD Colleagues,

We are from Microsoft Research, and are working on GPU disaggregation technology.

We have created a new pull request, Add PCIe hotplug support for amdgpu by xushuotao · Pull Request #131 · RadeonOpenCompute/ROCK-Kernel-Driver (github.com) <https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131>, in ROCK-Kernel-Driver, which will enable PCIe hot-plug support for amdgpu.

We believe that hot-plug support for GPU devices can open the door to many advanced data center applications in the next few years, and we would like reviewers for this PR so that we can continue technical discussions around this feature.

Would you please help review this PR?

Thank you very much!

Best regards,
Shuotao Xu

