From: Shuotao Xu <xushuotao@gmail.com>
To: amd-gfx@lists.freedesktop.org
Cc: Mukul.Joshi@amd.com, Andrey.Grodzovsky@amd.com,
	Felix.Kuehling@amd.com, pengc@microsoft.com,
	Lei.Qu@microsoft.com, Shuotao Xu <shuotaoxu@microsoft.com>,
	Ran.Shu@microsoft.com, Ziyue.Yang@microsoft.com
Subject: [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
Date: Fri,  8 Apr 2022 08:45:44 +0000
Message-ID: <20220408084544.9313-2-shuotaoxu@microsoft.com>
In-Reply-To: <20220408084544.9313-1-shuotaoxu@microsoft.com>

Add PCIe hotplug support for AMDKFD. Hot-plugging GPU devices can open
the door to many advanced data center applications in the next few
years, such as GPU resource disaggregation. AMDKFD currently does not
support hot-unplug for the following reasons:

1. The KFD lock, which is incremented at the beginning of hw fini, is
   not decremented during PCIe removal, so a later kfd_open is going
   to fail.

2. Redundant p2p/io links are left in sysfs when a device is
   hotplugged out.

3. A new kfd node_id is not properly assigned when a device is added
   after a GPU has been hotplugged out of the system. libhsakmt finds
   this anomaly (i.e. node_from != <dev node id> in iolinks) when
   taking a topology_snapshot, and thus returns a fault to the ROCm
   stack.
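
The anomaly in issue 3 can be pictured with a small userspace model.
This is only a sketch: the struct layouts and the names model_io_link,
model_node and model_check_snapshot are hypothetical and are not taken
from libhsakmt or the kernel. It assumes that each node carries a list
of io links and that a snapshot is consistent only when every link's
node_from equals the id of the node that owns it.

```c
#include <stddef.h>

/* Hypothetical, simplified view of one io link entry. */
struct model_io_link {
	unsigned int node_from;
	unsigned int node_to;
};

/* Hypothetical, simplified view of one topology node. */
struct model_node {
	unsigned int node_id;
	const struct model_io_link *links;
	size_t num_links;
};

/*
 * Returns 0 if every link's node_from matches the id of its owning
 * node, -1 on the first mismatch (the anomaly described in issue 3).
 */
static int model_check_snapshot(const struct model_node *nodes,
				size_t num_nodes)
{
	for (size_t i = 0; i < num_nodes; i++)
		for (size_t j = 0; j < nodes[i].num_links; j++)
			if (nodes[i].links[j].node_from != nodes[i].node_id)
				return -1;
	return 0;
}
```

In this model, a node whose id is 1 but whose link still records
node_from 2 fails the check, which is the kind of mismatch that can
remain when ids are not reassigned consistently after a hot-unplug.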

-- This patch fixes issue 1; another patch by Mukul fixes issues 2 and 3.
-- Tested on a 4-GPU MI100 node with kernel 5.13.0-kfd; kernel
   5.16.0-kfd is unstable out of the box for MI100.
---
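
The reference counting behind issue 1 can be sketched in userspace as
follows. This is an illustrative model, not kernel code: it uses C11
atomics in place of atomic_t, and the model_* names are hypothetical
stand-ins. It assumes that the suspend side increments kfd_locked, and
that an open is refused while the counter is nonzero.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's kfd_locked counter. */
static atomic_int kfd_locked;

/* Counts how often the (hypothetical) resume-all step ran. */
static int resumed_calls;

/* Hypothetical stand-in for kfd_resume_all_processes(). */
static int model_resume_all_processes(void)
{
	resumed_calls++;
	return 0;
}

/* Models the suspend side taking one reference on the lock. */
static void model_suspend(void)
{
	atomic_fetch_add(&kfd_locked, 1);
}

/* Models kgd2kfd_resume_processes() as added by this patch. */
static int model_resume_processes(void)
{
	/*
	 * atomic_fetch_sub() returns the old value, so subtract 1 to
	 * get the new value, as atomic_dec_return() does in the kernel.
	 */
	int count = atomic_fetch_sub(&kfd_locked, 1) - 1;

	if (count < 0)
		fprintf(stderr, "suspend / resume ref. error\n");
	if (count == 0)
		return model_resume_all_processes();

	return 0;
}

/* Models the gate a later open would hit: nonzero means allowed. */
static int model_open_allowed(void)
{
	return atomic_load(&kfd_locked) == 0;
}
```

In this model, one model_suspend() without a matching
model_resume_processes() leaves model_open_allowed() returning 0, which
corresponds to the failure described in issue 1. With nested suspends,
only the final decrement to zero resumes the processes.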
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  5 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  7 +++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
 drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 13 +++++++++++++
 4 files changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index c18c4be1e4ac..d50011bdb5c4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
 	return r;
 }
 
+int amdgpu_amdkfd_resume_processes(void)
+{
+	return kgd2kfd_resume_processes();
+}
+
 int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
 {
 	int r = 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index f8b9f27adcf5..803306e011c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
 void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
 int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
 int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
+int amdgpu_amdkfd_resume_processes(void);
 void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
 			const void *ih_ring_entry);
 void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
@@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
 void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
 int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
 int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
+int kgd2kfd_resume_processes(void);
 int kgd2kfd_pre_reset(struct kfd_dev *kfd);
 int kgd2kfd_post_reset(struct kfd_dev *kfd);
 void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
@@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
 	return 0;
 }
 
+static inline int kgd2kfd_resume_processes(void)
+{
+	return 0;
+}
+
 static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
 {
 	return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index fa4a9f13c922..5827b65b7489 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 	if (drm_dev_is_unplugged(adev_to_drm(adev)))
 		amdgpu_device_unmap_mmio(adev);
 
+	amdgpu_amdkfd_resume_processes();
 }
 
 void amdgpu_device_fini_sw(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 62aa6c9d5123..ef05aae9255e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
 	return ret;
 }
 
+/* for non-runtime resume only */
+int kgd2kfd_resume_processes(void)
+{
+	int count;
+
+	count = atomic_dec_return(&kfd_locked);
+	WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
+	if (count == 0)
+		return kfd_resume_all_processes();
+
+	return 0;
+}
+
 int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
 {
 	int err = 0;
-- 
2.25.1

