All of lore.kernel.org
 help / color / mirror / Atom feed
* [Patch v2 0/4] Enable BACO with KFD
@ 2020-02-01  3:37 Rajneesh Bhardwaj
  2020-02-01  3:37 ` [Patch v2 1/4] drm/amdgpu: Fix missing error check in suspend Rajneesh Bhardwaj
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Rajneesh Bhardwaj @ 2020-02-01  3:37 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, Felix.Kuehling, Rajneesh Bhardwaj

Changes in v2:
 * Rebased on latest amd-staging-drm-next
 * Addressed review comments from Felix, Oak and Alex for v1
 * Removed 60 second hack for auto-suspend delay and simplified the
   logic
 * Dropped kfd debugfs patch
 * Folded in Alex's patch from this series to enable and test with kfd.
   https://patchwork.freedesktop.org/series/67885/ also fixed a
   checkpatch warning.

Link to v1: https://www.spinics.net/lists/amd-gfx/msg44551.html

This series aims to enable BACO by default on supported AMD platforms
and ensures that the AMD Kernel Fusion Driver can co-exist with this
feature when the GPU devices are runtime suspended and firmware pushes
the envelop to save more power with BACO entry sequence. Current
approach makes sure that if KFD is using GPU services for compute, it
won't let AMDGPU driver suspend and if the AMDGPU driver is already
runtime suspended with GPUs in deep power saving mode with BACO, the KFD
driver wakes up the AMDGPU and then starts the compute workload
execution.

This series has been tested with a single GPU (fiji), Dual GPUs (fiji
and fiji/tonga) and seems to work fine with kfdtest. Also tested system
wide suspend to mem and suspend to idle and with this series and it
worked fine.

Alex Deucher (1):
  drm/amdgpu/runpm: enable runpm on baco capable VI+ asics

Rajneesh Bhardwaj (3):
  drm/amdgpu: Fix missing error check in suspend
  drm/amdkfd: show warning when kfd is locked
  drm/amdkfd: refactor runtime pm for baco

 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 ++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  3 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |  8 ++++--
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c   |  2 ++
 drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 29 +++++++++++--------
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h      |  1 +
 drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 33 ++++++++++++++++++++--
 9 files changed, 71 insertions(+), 29 deletions(-)

-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Patch v2 1/4] drm/amdgpu: Fix missing error check in suspend
  2020-02-01  3:37 [Patch v2 0/4] Enable BACO with KFD Rajneesh Bhardwaj
@ 2020-02-01  3:37 ` Rajneesh Bhardwaj
  2020-02-01  3:37 ` [Patch v2 2/4] drm/amdkfd: show warning when kfd is locked Rajneesh Bhardwaj
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Rajneesh Bhardwaj @ 2020-02-01  3:37 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, Felix.Kuehling, Rajneesh Bhardwaj

amdgpu_device_suspend might return an error code since it can be called
from both runtime and system suspend flows. Add the missing return code
in case of a failure.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index f28d040de3ce..0026ff56542c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1235,6 +1235,9 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
 	drm_kms_helper_poll_disable(drm_dev);
 
 	ret = amdgpu_device_suspend(drm_dev, false);
+	if (ret)
+		return ret;
+
 	if (amdgpu_device_supports_boco(drm_dev)) {
 		/* Only need to handle PCI state in the driver for ATPX
 		 * PCI core handles it for _PR3.
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [Patch v2 2/4] drm/amdkfd: show warning when kfd is locked
  2020-02-01  3:37 [Patch v2 0/4] Enable BACO with KFD Rajneesh Bhardwaj
  2020-02-01  3:37 ` [Patch v2 1/4] drm/amdgpu: Fix missing error check in suspend Rajneesh Bhardwaj
@ 2020-02-01  3:37 ` Rajneesh Bhardwaj
  2020-02-01  3:37 ` [Patch v2 3/4] drm/amdkfd: refactor runtime pm for baco Rajneesh Bhardwaj
  2020-02-01  3:37 ` [Patch v2 4/4] drm/amdgpu/runpm: enable runpm on baco capable VI+ asics Rajneesh Bhardwaj
  3 siblings, 0 replies; 13+ messages in thread
From: Rajneesh Bhardwaj @ 2020-02-01  3:37 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, Felix.Kuehling, Rajneesh Bhardwaj

During system suspend the kfd driver aquires a lock that prohibits
further kfd actions unless the gpu is resumed. This adds some info which
can be useful while debugging.

Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 275f79ab0900..86b919d82129 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -127,6 +127,8 @@ static int kfd_open(struct inode *inode, struct file *filep)
 		return PTR_ERR(process);
 
 	if (kfd_is_locked()) {
+		dev_dbg(kfd_device, "kfd is locked!\n"
+				"process %d unreferenced", process->pasid);
 		kfd_unref_process(process);
 		return -EAGAIN;
 	}
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [Patch v2 3/4] drm/amdkfd: refactor runtime pm for baco
  2020-02-01  3:37 [Patch v2 0/4] Enable BACO with KFD Rajneesh Bhardwaj
  2020-02-01  3:37 ` [Patch v2 1/4] drm/amdgpu: Fix missing error check in suspend Rajneesh Bhardwaj
  2020-02-01  3:37 ` [Patch v2 2/4] drm/amdkfd: show warning when kfd is locked Rajneesh Bhardwaj
@ 2020-02-01  3:37 ` Rajneesh Bhardwaj
  2020-02-01  4:21   ` Zeng, Oak
  2020-02-04 21:28   ` Felix Kuehling
  2020-02-01  3:37 ` [Patch v2 4/4] drm/amdgpu/runpm: enable runpm on baco capable VI+ asics Rajneesh Bhardwaj
  3 siblings, 2 replies; 13+ messages in thread
From: Rajneesh Bhardwaj @ 2020-02-01  3:37 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, Felix.Kuehling, Rajneesh Bhardwaj

So far the kfd driver implemented same routines for runtime and system
wide suspend and resume (s2idle or mem). During system wide suspend the
kfd aquires an atomic lock that prevents any more user processes to
create queues and interact with kfd driver and amd gpu. This mechanism
created problem when amdgpu device is runtime suspended with BACO
enabled. Any application that relies on kfd driver fails to load because
the driver reports a locked kfd device since gpu is runtime suspended.

However, in an ideal case, when gpu is runtime  suspended the kfd driver
should be able to:

 - auto resume amdgpu driver whenever a client requests compute service
 - prevent runtime suspend for amdgpu  while kfd is in use

This change refactors the amdgpu and amdkfd drivers to support BACO and
runtime power management.

Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 29 +++++++++------
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h      |  1 +
 drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 43 ++++++++++++++++++++--
 6 files changed, 70 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 8609287620ea..314c4a2a0354 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
 		kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
 }
 
-void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
+void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
 {
 	if (adev->kfd.dev)
-		kgd2kfd_suspend(adev->kfd.dev);
+		kgd2kfd_suspend(adev->kfd.dev, run_pm);
 }
 
-int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
+int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
 {
 	int r = 0;
 
 	if (adev->kfd.dev)
-		r = kgd2kfd_resume(adev->kfd.dev);
+		r = kgd2kfd_resume(adev->kfd.dev, run_pm);
 
 	return r;
 }
@@ -713,11 +713,11 @@ void kgd2kfd_exit(void)
 {
 }
 
-void kgd2kfd_suspend(struct kfd_dev *kfd)
+void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
 {
 }
 
-int kgd2kfd_resume(struct kfd_dev *kfd)
+int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
 {
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 47b0f2957d1f..9e8db702d878 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -122,8 +122,8 @@ struct amdkfd_process_info {
 int amdgpu_amdkfd_init(void);
 void amdgpu_amdkfd_fini(void);
 
-void amdgpu_amdkfd_suspend(struct amdgpu_device *adev);
-int amdgpu_amdkfd_resume(struct amdgpu_device *adev);
+void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
+int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
 void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
 			const void *ih_ring_entry);
 void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
@@ -249,8 +249,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
 			 struct drm_device *ddev,
 			 const struct kgd2kfd_shared_resources *gpu_resources);
 void kgd2kfd_device_exit(struct kfd_dev *kfd);
-void kgd2kfd_suspend(struct kfd_dev *kfd);
-int kgd2kfd_resume(struct kfd_dev *kfd);
+void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
+int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
 int kgd2kfd_pre_reset(struct kfd_dev *kfd);
 int kgd2kfd_post_reset(struct kfd_dev *kfd);
 void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5030a09babb8..43843e6c4bcd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3311,7 +3311,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
 		}
 	}
 
-	amdgpu_amdkfd_suspend(adev);
+	amdgpu_amdkfd_suspend(adev, !fbcon);
 
 	amdgpu_ras_suspend(adev);
 
@@ -3395,7 +3395,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
 			}
 		}
 	}
-	r = amdgpu_amdkfd_resume(adev);
+	r = amdgpu_amdkfd_resume(adev, !fbcon);
 	if (r)
 		return r;
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 798ad1c8f799..42ee9ea5c45a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -732,7 +732,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
 void kgd2kfd_device_exit(struct kfd_dev *kfd)
 {
 	if (kfd->init_complete) {
-		kgd2kfd_suspend(kfd);
+		kgd2kfd_suspend(kfd, false);
 		device_queue_manager_uninit(kfd->dqm);
 		kfd_interrupt_exit(kfd);
 		kfd_topology_remove_device(kfd);
@@ -753,7 +753,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
 
 	kfd->dqm->ops.pre_reset(kfd->dqm);
 
-	kgd2kfd_suspend(kfd);
+	kgd2kfd_suspend(kfd, false);
 
 	kfd_signal_reset_event(kfd);
 	return 0;
@@ -787,21 +787,23 @@ bool kfd_is_locked(void)
 	return  (atomic_read(&kfd_locked) > 0);
 }
 
-void kgd2kfd_suspend(struct kfd_dev *kfd)
+void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
 {
 	if (!kfd->init_complete)
 		return;
 
-	/* For first KFD device suspend all the KFD processes */
-	if (atomic_inc_return(&kfd_locked) == 1)
-		kfd_suspend_all_processes();
+	/* for runtime suspend, skip locking kfd */
+	if (!run_pm) {
+		/* For first KFD device suspend all the KFD processes */
+		if (atomic_inc_return(&kfd_locked) == 1)
+			kfd_suspend_all_processes();
+	}
 
 	kfd->dqm->ops.stop(kfd->dqm);
-
 	kfd_iommu_suspend(kfd);
 }
 
-int kgd2kfd_resume(struct kfd_dev *kfd)
+int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
 {
 	int ret, count;
 
@@ -812,10 +814,13 @@ int kgd2kfd_resume(struct kfd_dev *kfd)
 	if (ret)
 		return ret;
 
-	count = atomic_dec_return(&kfd_locked);
-	WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
-	if (count == 0)
-		ret = kfd_resume_all_processes();
+	/* for runtime resume, skip unlocking kfd */
+	if (!run_pm) {
+		count = atomic_dec_return(&kfd_locked);
+		WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
+		if (count == 0)
+			ret = kfd_resume_all_processes();
+	}
 
 	return ret;
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index c0b0defc8f7a..20dd4747250d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -647,6 +647,7 @@ struct kfd_process_device {
 	 * function.
 	 */
 	bool already_dequeued;
+	bool runtime_inuse;
 
 	/* Is this process/pasid bound to this device? (amd_iommu_bind_pasid) */
 	enum kfd_pdd_bound bound;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 25b90f70aecd..6907a5a2cbc8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -31,6 +31,7 @@
 #include <linux/compat.h>
 #include <linux/mman.h>
 #include <linux/file.h>
+#include <linux/pm_runtime.h>
 #include "amdgpu_amdkfd.h"
 #include "amdgpu.h"
 
@@ -440,6 +441,16 @@ static void kfd_process_destroy_pdds(struct kfd_process *p)
 		kfree(pdd->qpd.doorbell_bitmap);
 		idr_destroy(&pdd->alloc_idr);
 
+		/*
+		 * before destroying pdd, make sure to report availability
+		 * for auto suspend
+		 */
+		if (pdd->runtime_inuse) {
+			pm_runtime_mark_last_busy(pdd->dev->ddev->dev);
+			pm_runtime_put_autosuspend(pdd->dev->ddev->dev);
+			pdd->runtime_inuse = false;
+		}
+
 		kfree(pdd);
 	}
 }
@@ -754,6 +765,7 @@ struct kfd_process_device *kfd_create_process_device_data(struct kfd_dev *dev,
 	pdd->process = p;
 	pdd->bound = PDD_UNBOUND;
 	pdd->already_dequeued = false;
+	pdd->runtime_inuse = false;
 	list_add(&pdd->per_device_list, &p->per_device_data);
 
 	/* Init idr used for memory handle translation */
@@ -843,15 +855,40 @@ struct kfd_process_device *kfd_bind_process_to_device(struct kfd_dev *dev,
 		return ERR_PTR(-ENOMEM);
 	}
 
+	/*
+	 * signal runtime-pm system to auto resume and prevent
+	 * further runtime suspend once device pdd is created until
+	 * pdd is destroyed.
+	 */
+	if (!pdd->runtime_inuse) {
+		err = pm_runtime_get_sync(dev->ddev->dev);
+		if (err < 0)
+			return ERR_PTR(err);
+	}
+
 	err = kfd_iommu_bind_process_to_device(pdd);
 	if (err)
-		return ERR_PTR(err);
+		goto out;
 
 	err = kfd_process_device_init_vm(pdd, NULL);
 	if (err)
-		return ERR_PTR(err);
+		goto out;
 
-	return pdd;
+	if (!err) {
+		/*
+		 * make sure that runtime_usage counter is incremented
+		 * just once per pdd
+		 */
+		if (!pdd->runtime_inuse)
+			pdd->runtime_inuse = true;
+
+		return pdd;
+	}
+out:
+	/* balance runpm reference count and exit with error */
+	pm_runtime_mark_last_busy(dev->ddev->dev);
+	pm_runtime_put_autosuspend(dev->ddev->dev);
+	return ERR_PTR(err);
 }
 
 struct kfd_process_device *kfd_get_first_process_device_data(
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [Patch v2 4/4] drm/amdgpu/runpm: enable runpm on baco capable VI+ asics
  2020-02-01  3:37 [Patch v2 0/4] Enable BACO with KFD Rajneesh Bhardwaj
                   ` (2 preceding siblings ...)
  2020-02-01  3:37 ` [Patch v2 3/4] drm/amdkfd: refactor runtime pm for baco Rajneesh Bhardwaj
@ 2020-02-01  3:37 ` Rajneesh Bhardwaj
  3 siblings, 0 replies; 13+ messages in thread
From: Rajneesh Bhardwaj @ 2020-02-01  3:37 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher, Alex Deucher, Felix.Kuehling

From: Alex Deucher <alexdeucher@gmail.com>

Seems to work reliably on VI+.

[rajneesh] Picked https://patchwork.freedesktop.org/patch/335402/ to
enable runtime pm with baco for kfd. Also fixed a checkpatch warning and
dropped below patch from previous series in favor of Alex's patch.
https://www.spinics.net/lists/amd-gfx/msg44552.html

Reviewed-and-tested-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 3a0ea9096498..f4c039f8815c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -170,10 +170,14 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
 	}
 
 	if (amdgpu_device_supports_boco(dev) &&
-	    (amdgpu_runtime_pm != 0)) /* enable runpm by default */
+	    (amdgpu_runtime_pm != 0)) /* enable runpm by default for boco */
 		adev->runpm = true;
 	else if (amdgpu_device_supports_baco(dev) &&
-		 (amdgpu_runtime_pm > 0)) /* enable runpm if runpm=1 */
+		 (amdgpu_runtime_pm != 0) &&
+		 (adev->asic_type >= CHIP_TOPAZ)) /* enable runpm on VI+ */
+		adev->runpm = true;
+	else if (amdgpu_device_supports_baco(dev) &&
+		 (amdgpu_runtime_pm > 0))  /* enable runpm if runpm=1 on CI */
 		adev->runpm = true;
 
 	/* Call ACPI methods: require modeset init
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* RE: [Patch v2 3/4] drm/amdkfd: refactor runtime pm for baco
  2020-02-01  3:37 ` [Patch v2 3/4] drm/amdkfd: refactor runtime pm for baco Rajneesh Bhardwaj
@ 2020-02-01  4:21   ` Zeng, Oak
  2020-02-01 15:21     ` Bhardwaj, Rajneesh
  2020-02-04 21:28   ` Felix Kuehling
  1 sibling, 1 reply; 13+ messages in thread
From: Zeng, Oak @ 2020-02-01  4:21 UTC (permalink / raw)
  To: Bhardwaj, Rajneesh, amd-gfx
  Cc: Deucher, Alexander, Kuehling, Felix, Bhardwaj, Rajneesh

[AMD Official Use Only - Internal Distribution Only]

Patch 1,2,3 work for me. See one comment inline, otherwise Reviewed-by: Oak Zeng <Oak.Zeng@amd.com>

Regards,
Oak

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Rajneesh Bhardwaj
Sent: Friday, January 31, 2020 10:37 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Bhardwaj, Rajneesh <Rajneesh.Bhardwaj@amd.com>
Subject: [Patch v2 3/4] drm/amdkfd: refactor runtime pm for baco

So far the kfd driver implemented same routines for runtime and system wide suspend and resume (s2idle or mem). During system wide suspend the kfd aquires an atomic lock that prevents any more user processes to create queues and interact with kfd driver and amd gpu. This mechanism created problem when amdgpu device is runtime suspended with BACO enabled. Any application that relies on kfd driver fails to load because the driver reports a locked kfd device since gpu is runtime suspended.

However, in an ideal case, when gpu is runtime  suspended the kfd driver should be able to:

 - auto resume amdgpu driver whenever a client requests compute service
 - prevent runtime suspend for amdgpu  while kfd is in use

This change refactors the amdgpu and amdkfd drivers to support BACO and runtime power management.

Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 +++---  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 ++--  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 29 +++++++++------
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h      |  1 +
 drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 43 ++++++++++++++++++++--
 6 files changed, 70 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 8609287620ea..314c4a2a0354 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
 		kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);  }
 
-void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
+void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
 {
 	if (adev->kfd.dev)
-		kgd2kfd_suspend(adev->kfd.dev);
+		kgd2kfd_suspend(adev->kfd.dev, run_pm);
 }
 
-int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
+int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
 {
 	int r = 0;
 
 	if (adev->kfd.dev)
-		r = kgd2kfd_resume(adev->kfd.dev);
+		r = kgd2kfd_resume(adev->kfd.dev, run_pm);
 
 	return r;
 }
@@ -713,11 +713,11 @@ void kgd2kfd_exit(void)  {  }
 
-void kgd2kfd_suspend(struct kfd_dev *kfd)
+void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
 {
 }
 
-int kgd2kfd_resume(struct kfd_dev *kfd)
+int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
 {
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 47b0f2957d1f..9e8db702d878 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -122,8 +122,8 @@ struct amdkfd_process_info {  int amdgpu_amdkfd_init(void);  void amdgpu_amdkfd_fini(void);
 
-void amdgpu_amdkfd_suspend(struct amdgpu_device *adev); -int amdgpu_amdkfd_resume(struct amdgpu_device *adev);
+void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm); 
+int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
 void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
 			const void *ih_ring_entry);
 void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev); @@ -249,8 +249,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
 			 struct drm_device *ddev,
 			 const struct kgd2kfd_shared_resources *gpu_resources);  void kgd2kfd_device_exit(struct kfd_dev *kfd); -void kgd2kfd_suspend(struct kfd_dev *kfd); -int kgd2kfd_resume(struct kfd_dev *kfd);
+void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm); int 
+kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
 int kgd2kfd_pre_reset(struct kfd_dev *kfd);  int kgd2kfd_post_reset(struct kfd_dev *kfd);  void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5030a09babb8..43843e6c4bcd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3311,7 +3311,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
 		}
 	}
 
-	amdgpu_amdkfd_suspend(adev);
+	amdgpu_amdkfd_suspend(adev, !fbcon);
 
 	amdgpu_ras_suspend(adev);
 
@@ -3395,7 +3395,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
 			}
 		}
 	}
-	r = amdgpu_amdkfd_resume(adev);
+	r = amdgpu_amdkfd_resume(adev, !fbcon);
 	if (r)
 		return r;
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 798ad1c8f799..42ee9ea5c45a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -732,7 +732,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,  void kgd2kfd_device_exit(struct kfd_dev *kfd)  {
 	if (kfd->init_complete) {
-		kgd2kfd_suspend(kfd);
+		kgd2kfd_suspend(kfd, false);
 		device_queue_manager_uninit(kfd->dqm);
 		kfd_interrupt_exit(kfd);
 		kfd_topology_remove_device(kfd);
@@ -753,7 +753,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
 
 	kfd->dqm->ops.pre_reset(kfd->dqm);
 
-	kgd2kfd_suspend(kfd);
+	kgd2kfd_suspend(kfd, false);
 
 	kfd_signal_reset_event(kfd);
 	return 0;
@@ -787,21 +787,23 @@ bool kfd_is_locked(void)
 	return  (atomic_read(&kfd_locked) > 0);  }
 
-void kgd2kfd_suspend(struct kfd_dev *kfd)
+void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
[Oak] I think parameter run_pm is better to called: call_from_pm_runtime. Or you can add some comments to the parameter of this function to say when run_pm==true it is called from pm_runtime.

 {
 	if (!kfd->init_complete)
 		return;
 
-	/* For first KFD device suspend all the KFD processes */
-	if (atomic_inc_return(&kfd_locked) == 1)
-		kfd_suspend_all_processes();
+	/* for runtime suspend, skip locking kfd */
+	if (!run_pm) {
+		/* For first KFD device suspend all the KFD processes */
+		if (atomic_inc_return(&kfd_locked) == 1)
+			kfd_suspend_all_processes();
+	}
 
 	kfd->dqm->ops.stop(kfd->dqm);
-
 	kfd_iommu_suspend(kfd);
 }
 
-int kgd2kfd_resume(struct kfd_dev *kfd)
+int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
 {
 	int ret, count;
 
@@ -812,10 +814,13 @@ int kgd2kfd_resume(struct kfd_dev *kfd)
 	if (ret)
 		return ret;
 
-	count = atomic_dec_return(&kfd_locked);
-	WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
-	if (count == 0)
-		ret = kfd_resume_all_processes();
+	/* for runtime resume, skip unlocking kfd */
+	if (!run_pm) {
+		count = atomic_dec_return(&kfd_locked);
+		WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
+		if (count == 0)
+			ret = kfd_resume_all_processes();
+	}
 
 	return ret;
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index c0b0defc8f7a..20dd4747250d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -647,6 +647,7 @@ struct kfd_process_device {
 	 * function.
 	 */
 	bool already_dequeued;
+	bool runtime_inuse;
 
 	/* Is this process/pasid bound to this device? (amd_iommu_bind_pasid) */
 	enum kfd_pdd_bound bound;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 25b90f70aecd..6907a5a2cbc8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -31,6 +31,7 @@
 #include <linux/compat.h>
 #include <linux/mman.h>
 #include <linux/file.h>
+#include <linux/pm_runtime.h>
 #include "amdgpu_amdkfd.h"
 #include "amdgpu.h"
 
@@ -440,6 +441,16 @@ static void kfd_process_destroy_pdds(struct kfd_process *p)
 		kfree(pdd->qpd.doorbell_bitmap);
 		idr_destroy(&pdd->alloc_idr);
 
+		/*
+		 * before destroying pdd, make sure to report availability
+		 * for auto suspend
+		 */
+		if (pdd->runtime_inuse) {
+			pm_runtime_mark_last_busy(pdd->dev->ddev->dev);
+			pm_runtime_put_autosuspend(pdd->dev->ddev->dev);
+			pdd->runtime_inuse = false;
+		}
+
 		kfree(pdd);
 	}
 }
@@ -754,6 +765,7 @@ struct kfd_process_device *kfd_create_process_device_data(struct kfd_dev *dev,
 	pdd->process = p;
 	pdd->bound = PDD_UNBOUND;
 	pdd->already_dequeued = false;
+	pdd->runtime_inuse = false;
 	list_add(&pdd->per_device_list, &p->per_device_data);
 
 	/* Init idr used for memory handle translation */ @@ -843,15 +855,40 @@ struct kfd_process_device *kfd_bind_process_to_device(struct kfd_dev *dev,
 		return ERR_PTR(-ENOMEM);
 	}
 
+	/*
+	 * signal runtime-pm system to auto resume and prevent
+	 * further runtime suspend once device pdd is created until
+	 * pdd is destroyed.
+	 */
+	if (!pdd->runtime_inuse) {
+		err = pm_runtime_get_sync(dev->ddev->dev);
+		if (err < 0)
+			return ERR_PTR(err);
+	}
+
 	err = kfd_iommu_bind_process_to_device(pdd);
 	if (err)
-		return ERR_PTR(err);
+		goto out;
 
 	err = kfd_process_device_init_vm(pdd, NULL);
 	if (err)
-		return ERR_PTR(err);
+		goto out;
 
-	return pdd;
+	if (!err) {
+		/*
+		 * make sure that runtime_usage counter is incremented
+		 * just once per pdd
+		 */
+		if (!pdd->runtime_inuse)
+			pdd->runtime_inuse = true;
+
+		return pdd;
+	}
+out:
+	/* balance runpm reference count and exit with error */
+	pm_runtime_mark_last_busy(dev->ddev->dev);
+	pm_runtime_put_autosuspend(dev->ddev->dev);
+	return ERR_PTR(err);
 }
 
 struct kfd_process_device *kfd_get_first_process_device_data(
--
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Coak.zeng%40amd.com%7C922e0392f4b949e4b6ae08d7a6c813a4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637161250615739404&amp;sdata=nb9Lmc5sGR3C5zXDSAjib28tnxP%2F%2FTAljdO3%2BIgHB78%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [Patch v2 3/4] drm/amdkfd: refactor runtime pm for baco
  2020-02-01  4:21   ` Zeng, Oak
@ 2020-02-01 15:21     ` Bhardwaj, Rajneesh
  0 siblings, 0 replies; 13+ messages in thread
From: Bhardwaj, Rajneesh @ 2020-02-01 15:21 UTC (permalink / raw)
  To: Zeng, Oak, amd-gfx; +Cc: Deucher, Alexander, Kuehling, Felix


On 1/31/2020 11:21 PM, Zeng, Oak wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
> Patch 1,2,3 work for me. See one comment inline, otherwise Reviewed-by: Oak Zeng <Oak.Zeng@amd.com>
>
> Regards,
> Oak


Thanks for the review and testing! My response below.

----8< -------------snip ------------------------>8-----------

>   
> -void kgd2kfd_suspend(struct kfd_dev *kfd)
> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
> [Oak] I think parameter run_pm is better to called: call_from_pm_runtime. Or you can add some comments to the parameter of this function to say when run_pm==true it is called from pm_runtime.


The main purpose of passing this arg is to skip locking kfd for runtime 
suspend and its already described in the comment above the block that 
skips locking.

Thanks,

Rajneesh

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Patch v2 3/4] drm/amdkfd: refactor runtime pm for baco
  2020-02-01  3:37 ` [Patch v2 3/4] drm/amdkfd: refactor runtime pm for baco Rajneesh Bhardwaj
  2020-02-01  4:21   ` Zeng, Oak
@ 2020-02-04 21:28   ` Felix Kuehling
  2020-02-05  0:57     ` Bhardwaj, Rajneesh
  2020-02-05  4:46     ` Alex Deucher
  1 sibling, 2 replies; 13+ messages in thread
From: Felix Kuehling @ 2020-02-04 21:28 UTC (permalink / raw)
  To: Rajneesh Bhardwaj, amd-gfx; +Cc: Alexander.Deucher

On 2020-01-31 10:37 p.m., Rajneesh Bhardwaj wrote:
> So far the kfd driver implemented same routines for runtime and system
> wide suspend and resume (s2idle or mem). During system wide suspend the
> kfd aquires an atomic lock that prevents any more user processes to
> create queues and interact with kfd driver and amd gpu. This mechanism
> created problem when amdgpu device is runtime suspended with BACO
> enabled. Any application that relies on kfd driver fails to load because
> the driver reports a locked kfd device since gpu is runtime suspended.
>
> However, in an ideal case, when gpu is runtime  suspended the kfd driver
> should be able to:
>
>   - auto resume amdgpu driver whenever a client requests compute service
>   - prevent runtime suspend for amdgpu  while kfd is in use
>
> This change refactors the amdgpu and amdkfd drivers to support BACO and
> runtime power management.
>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>

One small comment inline. Other than that patches 1-3 are

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>

Also, I believe patch 1 is unchanged from v1 and already got a 
Reviewed-by from Alex. Please remember to add that tag before you submit.

The last patch that enabled runtime PM by default, I'd leave the 
decision to submit that up to Alex. There may be other considerations 
than just KFD.

See inline ...


> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 +++---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 ++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +-
>   drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 29 +++++++++------
>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h      |  1 +
>   drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 43 ++++++++++++++++++++--
>   6 files changed, 70 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 8609287620ea..314c4a2a0354 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>   		kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
>   }
>   
> -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
>   {
>   	if (adev->kfd.dev)
> -		kgd2kfd_suspend(adev->kfd.dev);
> +		kgd2kfd_suspend(adev->kfd.dev, run_pm);
>   }
>   
> -int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
>   {
>   	int r = 0;
>   
>   	if (adev->kfd.dev)
> -		r = kgd2kfd_resume(adev->kfd.dev);
> +		r = kgd2kfd_resume(adev->kfd.dev, run_pm);
>   
>   	return r;
>   }
> @@ -713,11 +713,11 @@ void kgd2kfd_exit(void)
>   {
>   }
>   
> -void kgd2kfd_suspend(struct kfd_dev *kfd)
> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>   {
>   }
>   
> -int kgd2kfd_resume(struct kfd_dev *kfd)
> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>   {
>   	return 0;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 47b0f2957d1f..9e8db702d878 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -122,8 +122,8 @@ struct amdkfd_process_info {
>   int amdgpu_amdkfd_init(void);
>   void amdgpu_amdkfd_fini(void);
>   
> -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev);
> -int amdgpu_amdkfd_resume(struct amdgpu_device *adev);
> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>   void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>   			const void *ih_ring_entry);
>   void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
> @@ -249,8 +249,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>   			 struct drm_device *ddev,
>   			 const struct kgd2kfd_shared_resources *gpu_resources);
>   void kgd2kfd_device_exit(struct kfd_dev *kfd);
> -void kgd2kfd_suspend(struct kfd_dev *kfd);
> -int kgd2kfd_resume(struct kfd_dev *kfd);
> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>   int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>   int kgd2kfd_post_reset(struct kfd_dev *kfd);
>   void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5030a09babb8..43843e6c4bcd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3311,7 +3311,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>   		}
>   	}
>   
> -	amdgpu_amdkfd_suspend(adev);
> +	amdgpu_amdkfd_suspend(adev, !fbcon);
>   
>   	amdgpu_ras_suspend(adev);
>   
> @@ -3395,7 +3395,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
>   			}
>   		}
>   	}
> -	r = amdgpu_amdkfd_resume(adev);
> +	r = amdgpu_amdkfd_resume(adev, !fbcon);
>   	if (r)
>   		return r;
>   
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 798ad1c8f799..42ee9ea5c45a 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -732,7 +732,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>   void kgd2kfd_device_exit(struct kfd_dev *kfd)
>   {
>   	if (kfd->init_complete) {
> -		kgd2kfd_suspend(kfd);
> +		kgd2kfd_suspend(kfd, false);
>   		device_queue_manager_uninit(kfd->dqm);
>   		kfd_interrupt_exit(kfd);
>   		kfd_topology_remove_device(kfd);
> @@ -753,7 +753,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>   
>   	kfd->dqm->ops.pre_reset(kfd->dqm);
>   
> -	kgd2kfd_suspend(kfd);
> +	kgd2kfd_suspend(kfd, false);
>   
>   	kfd_signal_reset_event(kfd);
>   	return 0;
> @@ -787,21 +787,23 @@ bool kfd_is_locked(void)
>   	return  (atomic_read(&kfd_locked) > 0);
>   }
>   
> -void kgd2kfd_suspend(struct kfd_dev *kfd)
> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>   {
>   	if (!kfd->init_complete)
>   		return;
>   
> -	/* For first KFD device suspend all the KFD processes */
> -	if (atomic_inc_return(&kfd_locked) == 1)
> -		kfd_suspend_all_processes();
> +	/* for runtime suspend, skip locking kfd */
> +	if (!run_pm) {
> +		/* For first KFD device suspend all the KFD processes */
> +		if (atomic_inc_return(&kfd_locked) == 1)
> +			kfd_suspend_all_processes();
> +	}
>   
>   	kfd->dqm->ops.stop(kfd->dqm);
> -
>   	kfd_iommu_suspend(kfd);
>   }
>   
> -int kgd2kfd_resume(struct kfd_dev *kfd)
> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>   {
>   	int ret, count;
>   
> @@ -812,10 +814,13 @@ int kgd2kfd_resume(struct kfd_dev *kfd)
>   	if (ret)
>   		return ret;
>   
> -	count = atomic_dec_return(&kfd_locked);
> -	WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
> -	if (count == 0)
> -		ret = kfd_resume_all_processes();
> +	/* for runtime resume, skip unlocking kfd */
> +	if (!run_pm) {
> +		count = atomic_dec_return(&kfd_locked);
> +		WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
> +		if (count == 0)
> +			ret = kfd_resume_all_processes();
> +	}
>   
>   	return ret;
>   }
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index c0b0defc8f7a..20dd4747250d 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -647,6 +647,7 @@ struct kfd_process_device {
>   	 * function.
>   	 */
>   	bool already_dequeued;
> +	bool runtime_inuse;
>   
>   	/* Is this process/pasid bound to this device? (amd_iommu_bind_pasid) */
>   	enum kfd_pdd_bound bound;
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index 25b90f70aecd..6907a5a2cbc8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -31,6 +31,7 @@
>   #include <linux/compat.h>
>   #include <linux/mman.h>
>   #include <linux/file.h>
> +#include <linux/pm_runtime.h>
>   #include "amdgpu_amdkfd.h"
>   #include "amdgpu.h"
>   
> @@ -440,6 +441,16 @@ static void kfd_process_destroy_pdds(struct kfd_process *p)
>   		kfree(pdd->qpd.doorbell_bitmap);
>   		idr_destroy(&pdd->alloc_idr);
>   
> +		/*
> +		 * before destroying pdd, make sure to report availability
> +		 * for auto suspend
> +		 */
> +		if (pdd->runtime_inuse) {
> +			pm_runtime_mark_last_busy(pdd->dev->ddev->dev);
> +			pm_runtime_put_autosuspend(pdd->dev->ddev->dev);
> +			pdd->runtime_inuse = false;
> +		}
> +
>   		kfree(pdd);
>   	}
>   }
> @@ -754,6 +765,7 @@ struct kfd_process_device *kfd_create_process_device_data(struct kfd_dev *dev,
>   	pdd->process = p;
>   	pdd->bound = PDD_UNBOUND;
>   	pdd->already_dequeued = false;
> +	pdd->runtime_inuse = false;
>   	list_add(&pdd->per_device_list, &p->per_device_data);
>   
>   	/* Init idr used for memory handle translation */
> @@ -843,15 +855,40 @@ struct kfd_process_device *kfd_bind_process_to_device(struct kfd_dev *dev,
>   		return ERR_PTR(-ENOMEM);
>   	}
>   
> +	/*
> +	 * signal runtime-pm system to auto resume and prevent
> +	 * further runtime suspend once device pdd is created until
> +	 * pdd is destroyed.
> +	 */
> +	if (!pdd->runtime_inuse) {
> +		err = pm_runtime_get_sync(dev->ddev->dev);
> +		if (err < 0)
> +			return ERR_PTR(err);
> +	}
> +
>   	err = kfd_iommu_bind_process_to_device(pdd);
>   	if (err)
> -		return ERR_PTR(err);
> +		goto out;
>   
>   	err = kfd_process_device_init_vm(pdd, NULL);
>   	if (err)
> -		return ERR_PTR(err);
> +		goto out;
>   
> -	return pdd;
> +	if (!err) {
> +		/*
> +		 * make sure that runtime_usage counter is incremented
> +		 * just once per pdd
> +		 */
> +		if (!pdd->runtime_inuse)
> +			pdd->runtime_inuse = true;

The "if" is redundant here. You can just set pdd->runtime_inuse = true 
unconditionally.

Regards,
   Felix

> +
> +		return pdd;
> +	}
> +out:
> +	/* balance runpm reference count and exit with error */
> +	pm_runtime_mark_last_busy(dev->ddev->dev);
> +	pm_runtime_put_autosuspend(dev->ddev->dev);
> +	return ERR_PTR(err);
>   }
>   
>   struct kfd_process_device *kfd_get_first_process_device_data(
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Patch v2 3/4] drm/amdkfd: refactor runtime pm for baco
  2020-02-04 21:28   ` Felix Kuehling
@ 2020-02-05  0:57     ` Bhardwaj, Rajneesh
  2020-02-05  1:59       ` Felix Kuehling
  2020-02-05  4:46     ` Alex Deucher
  1 sibling, 1 reply; 13+ messages in thread
From: Bhardwaj, Rajneesh @ 2020-02-05  0:57 UTC (permalink / raw)
  To: Felix Kuehling, amd-gfx; +Cc: Alexander.Deucher

Hi Felix,

Thanks for the review feedback!

On 2/4/2020 4:28 PM, Felix Kuehling wrote:
> On 2020-01-31 10:37 p.m., Rajneesh Bhardwaj wrote:
>> So far the kfd driver implemented same routines for runtime and system
>> wide suspend and resume (s2idle or mem). During system wide suspend the
>> kfd aquires an atomic lock that prevents any more user processes to
>> create queues and interact with kfd driver and amd gpu. This mechanism
>> created problem when amdgpu device is runtime suspended with BACO
>> enabled. Any application that relies on kfd driver fails to load because
>> the driver reports a locked kfd device since gpu is runtime suspended.
>>
>> However, in an ideal case, when gpu is runtime  suspended the kfd driver
>> should be able to:
>>
>>   - auto resume amdgpu driver whenever a client requests compute service
>>   - prevent runtime suspend for amdgpu  while kfd is in use
>>
>> This change refactors the amdgpu and amdkfd drivers to support BACO and
>> runtime power management.
>>
>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>
> One small comment inline. Other than that patches 1-3 are
>
> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
>
> Also, I believe patch 1 is unchanged from v1 and already got a 
> Reviewed-by from Alex. Please remember to add that tag before you submit.


You mean https://www.spinics.net/lists/amd-gfx/msg44689.html? I have 
already added Alex's RB tag.


>
>
> The last patch that enabled runtime PM by default, I'd leave the 
> decision to submit that up to Alex. There may be other considerations 
> than just KFD.


Sure, but its needed along with the series otherwise we can't test/use it.


>
>
> See inline ...
>
>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 +++---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 ++--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +-
>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 29 +++++++++------
>>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h      |  1 +
>>   drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 43 ++++++++++++++++++++--
>>   6 files changed, 70 insertions(+), 27 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> index 8609287620ea..314c4a2a0354 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> @@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct 
>> amdgpu_device *adev,
>>           kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
>>   }
>>   -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
>> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
>>   {
>>       if (adev->kfd.dev)
>> -        kgd2kfd_suspend(adev->kfd.dev);
>> +        kgd2kfd_suspend(adev->kfd.dev, run_pm);
>>   }
>>   -int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
>> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
>>   {
>>       int r = 0;
>>         if (adev->kfd.dev)
>> -        r = kgd2kfd_resume(adev->kfd.dev);
>> +        r = kgd2kfd_resume(adev->kfd.dev, run_pm);
>>         return r;
>>   }
>> @@ -713,11 +713,11 @@ void kgd2kfd_exit(void)
>>   {
>>   }
>>   -void kgd2kfd_suspend(struct kfd_dev *kfd)
>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>   {
>>   }
>>   -int kgd2kfd_resume(struct kfd_dev *kfd)
>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>   {
>>       return 0;
>>   }
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> index 47b0f2957d1f..9e8db702d878 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> @@ -122,8 +122,8 @@ struct amdkfd_process_info {
>>   int amdgpu_amdkfd_init(void);
>>   void amdgpu_amdkfd_fini(void);
>>   -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev);
>> -int amdgpu_amdkfd_resume(struct amdgpu_device *adev);
>> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>>   void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>               const void *ih_ring_entry);
>>   void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>> @@ -249,8 +249,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>                struct drm_device *ddev,
>>                const struct kgd2kfd_shared_resources *gpu_resources);
>>   void kgd2kfd_device_exit(struct kfd_dev *kfd);
>> -void kgd2kfd_suspend(struct kfd_dev *kfd);
>> -int kgd2kfd_resume(struct kfd_dev *kfd);
>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>>   int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>   int kgd2kfd_post_reset(struct kfd_dev *kfd);
>>   void kgd2kfd_interrupt(struct kfd_dev *kfd, const void 
>> *ih_ring_entry);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 5030a09babb8..43843e6c4bcd 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3311,7 +3311,7 @@ int amdgpu_device_suspend(struct drm_device 
>> *dev, bool fbcon)
>>           }
>>       }
>>   -    amdgpu_amdkfd_suspend(adev);
>> +    amdgpu_amdkfd_suspend(adev, !fbcon);
>>         amdgpu_ras_suspend(adev);
>>   @@ -3395,7 +3395,7 @@ int amdgpu_device_resume(struct drm_device 
>> *dev, bool fbcon)
>>               }
>>           }
>>       }
>> -    r = amdgpu_amdkfd_resume(adev);
>> +    r = amdgpu_amdkfd_resume(adev, !fbcon);
>>       if (r)
>>           return r;
>>   diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index 798ad1c8f799..42ee9ea5c45a 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -732,7 +732,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>   void kgd2kfd_device_exit(struct kfd_dev *kfd)
>>   {
>>       if (kfd->init_complete) {
>> -        kgd2kfd_suspend(kfd);
>> +        kgd2kfd_suspend(kfd, false);
>>           device_queue_manager_uninit(kfd->dqm);
>>           kfd_interrupt_exit(kfd);
>>           kfd_topology_remove_device(kfd);
>> @@ -753,7 +753,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>         kfd->dqm->ops.pre_reset(kfd->dqm);
>>   -    kgd2kfd_suspend(kfd);
>> +    kgd2kfd_suspend(kfd, false);
>>         kfd_signal_reset_event(kfd);
>>       return 0;
>> @@ -787,21 +787,23 @@ bool kfd_is_locked(void)
>>       return  (atomic_read(&kfd_locked) > 0);
>>   }
>>   -void kgd2kfd_suspend(struct kfd_dev *kfd)
>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>   {
>>       if (!kfd->init_complete)
>>           return;
>>   -    /* For first KFD device suspend all the KFD processes */
>> -    if (atomic_inc_return(&kfd_locked) == 1)
>> -        kfd_suspend_all_processes();
>> +    /* for runtime suspend, skip locking kfd */
>> +    if (!run_pm) {
>> +        /* For first KFD device suspend all the KFD processes */
>> +        if (atomic_inc_return(&kfd_locked) == 1)
>> +            kfd_suspend_all_processes();
>> +    }
>>         kfd->dqm->ops.stop(kfd->dqm);
>> -
>>       kfd_iommu_suspend(kfd);
>>   }
>>   -int kgd2kfd_resume(struct kfd_dev *kfd)
>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>   {
>>       int ret, count;
>>   @@ -812,10 +814,13 @@ int kgd2kfd_resume(struct kfd_dev *kfd)
>>       if (ret)
>>           return ret;
>>   -    count = atomic_dec_return(&kfd_locked);
>> -    WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>> -    if (count == 0)
>> -        ret = kfd_resume_all_processes();
>> +    /* for runtime resume, skip unlocking kfd */
>> +    if (!run_pm) {
>> +        count = atomic_dec_return(&kfd_locked);
>> +        WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>> +        if (count == 0)
>> +            ret = kfd_resume_all_processes();
>> +    }
>>         return ret;
>>   }
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> index c0b0defc8f7a..20dd4747250d 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> @@ -647,6 +647,7 @@ struct kfd_process_device {
>>        * function.
>>        */
>>       bool already_dequeued;
>> +    bool runtime_inuse;
>>         /* Is this process/pasid bound to this device? 
>> (amd_iommu_bind_pasid) */
>>       enum kfd_pdd_bound bound;
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> index 25b90f70aecd..6907a5a2cbc8 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> @@ -31,6 +31,7 @@
>>   #include <linux/compat.h>
>>   #include <linux/mman.h>
>>   #include <linux/file.h>
>> +#include <linux/pm_runtime.h>
>>   #include "amdgpu_amdkfd.h"
>>   #include "amdgpu.h"
>>   @@ -440,6 +441,16 @@ static void kfd_process_destroy_pdds(struct 
>> kfd_process *p)
>>           kfree(pdd->qpd.doorbell_bitmap);
>>           idr_destroy(&pdd->alloc_idr);
>>   +        /*
>> +         * before destroying pdd, make sure to report availability
>> +         * for auto suspend
>> +         */
>> +        if (pdd->runtime_inuse) {
>> + pm_runtime_mark_last_busy(pdd->dev->ddev->dev);
>> + pm_runtime_put_autosuspend(pdd->dev->ddev->dev);
>> +            pdd->runtime_inuse = false;
>> +        }
>> +
>>           kfree(pdd);
>>       }
>>   }
>> @@ -754,6 +765,7 @@ struct kfd_process_device 
>> *kfd_create_process_device_data(struct kfd_dev *dev,
>>       pdd->process = p;
>>       pdd->bound = PDD_UNBOUND;
>>       pdd->already_dequeued = false;
>> +    pdd->runtime_inuse = false;
>>       list_add(&pdd->per_device_list, &p->per_device_data);
>>         /* Init idr used for memory handle translation */
>> @@ -843,15 +855,40 @@ struct kfd_process_device 
>> *kfd_bind_process_to_device(struct kfd_dev *dev,
>>           return ERR_PTR(-ENOMEM);
>>       }
>>   +    /*
>> +     * signal runtime-pm system to auto resume and prevent
>> +     * further runtime suspend once device pdd is created until
>> +     * pdd is destroyed.
>> +     */
>> +    if (!pdd->runtime_inuse) {
>> +        err = pm_runtime_get_sync(dev->ddev->dev);
>> +        if (err < 0)
>> +            return ERR_PTR(err);
>> +    }
>> +
>>       err = kfd_iommu_bind_process_to_device(pdd);
>>       if (err)
>> -        return ERR_PTR(err);
>> +        goto out;
>>         err = kfd_process_device_init_vm(pdd, NULL);
>>       if (err)
>> -        return ERR_PTR(err);
>> +        goto out;
>>   -    return pdd;
>> +    if (!err) {
>> +        /*
>> +         * make sure that runtime_usage counter is incremented
>> +         * just once per pdd
>> +         */
>> +        if (!pdd->runtime_inuse)
>> +            pdd->runtime_inuse = true;
>
> The "if" is redundant here. You can just set pdd->runtime_inuse = true 
> unconditionally.
> Regards,
>   Felix
>
Thanks

Rajneesh

>> +
>> +        return pdd;
>> +    }
>> +out:
>> +    /* balance runpm reference count and exit with error */
>> +    pm_runtime_mark_last_busy(dev->ddev->dev);
>> +    pm_runtime_put_autosuspend(dev->ddev->dev);
>> +    return ERR_PTR(err);
>>   }
>>     struct kfd_process_device *kfd_get_first_process_device_data(
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Patch v2 3/4] drm/amdkfd: refactor runtime pm for baco
  2020-02-05  0:57     ` Bhardwaj, Rajneesh
@ 2020-02-05  1:59       ` Felix Kuehling
  0 siblings, 0 replies; 13+ messages in thread
From: Felix Kuehling @ 2020-02-05  1:59 UTC (permalink / raw)
  To: Bhardwaj, Rajneesh, amd-gfx; +Cc: Alexander.Deucher

On 2020-02-04 7:57 p.m., Bhardwaj, Rajneesh wrote:
> Hi Felix,
>
> Thanks for the review feedback!
>
> On 2/4/2020 4:28 PM, Felix Kuehling wrote:
>> On 2020-01-31 10:37 p.m., Rajneesh Bhardwaj wrote:
>>> So far the kfd driver implemented same routines for runtime and system
>>> wide suspend and resume (s2idle or mem). During system wide suspend the
>>> kfd aquires an atomic lock that prevents any more user processes to
>>> create queues and interact with kfd driver and amd gpu. This mechanism
>>> created problem when amdgpu device is runtime suspended with BACO
>>> enabled. Any application that relies on kfd driver fails to load 
>>> because
>>> the driver reports a locked kfd device since gpu is runtime suspended.
>>>
>>> However, in an ideal case, when gpu is runtime  suspended the kfd 
>>> driver
>>> should be able to:
>>>
>>>   - auto resume amdgpu driver whenever a client requests compute 
>>> service
>>>   - prevent runtime suspend for amdgpu  while kfd is in use
>>>
>>> This change refactors the amdgpu and amdkfd drivers to support BACO and
>>> runtime power management.
>>>
>>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>>
>> One small comment inline. Other than that patches 1-3 are
>>
>> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>
>> Also, I believe patch 1 is unchanged from v1 and already got a 
>> Reviewed-by from Alex. Please remember to add that tag before you 
>> submit.
>
>
> You mean https://www.spinics.net/lists/amd-gfx/msg44689.html? I have 
> already added Alex's RB tag.

Right, I missed it. I'm not used to seeing a Reviewed-by tag before the 
Signed-off-by tag.

Thanks,
   Felix


>
>
>>
>>
>> The last patch that enabled runtime PM by default, I'd leave the 
>> decision to submit that up to Alex. There may be other considerations 
>> than just KFD.
>
>
> Sure, but its needed along with the series otherwise we can't test/use 
> it.
>
>
>>
>>
>> See inline ...
>>
>>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 +++---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 ++--
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +-
>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 29 +++++++++------
>>>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h      |  1 +
>>>   drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 43 
>>> ++++++++++++++++++++--
>>>   6 files changed, 70 insertions(+), 27 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> index 8609287620ea..314c4a2a0354 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> @@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct 
>>> amdgpu_device *adev,
>>>           kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
>>>   }
>>>   -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
>>> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
>>>   {
>>>       if (adev->kfd.dev)
>>> -        kgd2kfd_suspend(adev->kfd.dev);
>>> +        kgd2kfd_suspend(adev->kfd.dev, run_pm);
>>>   }
>>>   -int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
>>> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
>>>   {
>>>       int r = 0;
>>>         if (adev->kfd.dev)
>>> -        r = kgd2kfd_resume(adev->kfd.dev);
>>> +        r = kgd2kfd_resume(adev->kfd.dev, run_pm);
>>>         return r;
>>>   }
>>> @@ -713,11 +713,11 @@ void kgd2kfd_exit(void)
>>>   {
>>>   }
>>>   -void kgd2kfd_suspend(struct kfd_dev *kfd)
>>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>   {
>>>   }
>>>   -int kgd2kfd_resume(struct kfd_dev *kfd)
>>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>   {
>>>       return 0;
>>>   }
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> index 47b0f2957d1f..9e8db702d878 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> @@ -122,8 +122,8 @@ struct amdkfd_process_info {
>>>   int amdgpu_amdkfd_init(void);
>>>   void amdgpu_amdkfd_fini(void);
>>>   -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev);
>>> -int amdgpu_amdkfd_resume(struct amdgpu_device *adev);
>>> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>>> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>>>   void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>>               const void *ih_ring_entry);
>>>   void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>>> @@ -249,8 +249,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>>                struct drm_device *ddev,
>>>                const struct kgd2kfd_shared_resources *gpu_resources);
>>>   void kgd2kfd_device_exit(struct kfd_dev *kfd);
>>> -void kgd2kfd_suspend(struct kfd_dev *kfd);
>>> -int kgd2kfd_resume(struct kfd_dev *kfd);
>>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>>>   int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>>   int kgd2kfd_post_reset(struct kfd_dev *kfd);
>>>   void kgd2kfd_interrupt(struct kfd_dev *kfd, const void 
>>> *ih_ring_entry);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 5030a09babb8..43843e6c4bcd 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3311,7 +3311,7 @@ int amdgpu_device_suspend(struct drm_device 
>>> *dev, bool fbcon)
>>>           }
>>>       }
>>>   -    amdgpu_amdkfd_suspend(adev);
>>> +    amdgpu_amdkfd_suspend(adev, !fbcon);
>>>         amdgpu_ras_suspend(adev);
>>>   @@ -3395,7 +3395,7 @@ int amdgpu_device_resume(struct drm_device 
>>> *dev, bool fbcon)
>>>               }
>>>           }
>>>       }
>>> -    r = amdgpu_amdkfd_resume(adev);
>>> +    r = amdgpu_amdkfd_resume(adev, !fbcon);
>>>       if (r)
>>>           return r;
>>>   diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index 798ad1c8f799..42ee9ea5c45a 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -732,7 +732,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>>   void kgd2kfd_device_exit(struct kfd_dev *kfd)
>>>   {
>>>       if (kfd->init_complete) {
>>> -        kgd2kfd_suspend(kfd);
>>> +        kgd2kfd_suspend(kfd, false);
>>>           device_queue_manager_uninit(kfd->dqm);
>>>           kfd_interrupt_exit(kfd);
>>>           kfd_topology_remove_device(kfd);
>>> @@ -753,7 +753,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>>         kfd->dqm->ops.pre_reset(kfd->dqm);
>>>   -    kgd2kfd_suspend(kfd);
>>> +    kgd2kfd_suspend(kfd, false);
>>>         kfd_signal_reset_event(kfd);
>>>       return 0;
>>> @@ -787,21 +787,23 @@ bool kfd_is_locked(void)
>>>       return  (atomic_read(&kfd_locked) > 0);
>>>   }
>>>   -void kgd2kfd_suspend(struct kfd_dev *kfd)
>>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>   {
>>>       if (!kfd->init_complete)
>>>           return;
>>>   -    /* For first KFD device suspend all the KFD processes */
>>> -    if (atomic_inc_return(&kfd_locked) == 1)
>>> -        kfd_suspend_all_processes();
>>> +    /* for runtime suspend, skip locking kfd */
>>> +    if (!run_pm) {
>>> +        /* For first KFD device suspend all the KFD processes */
>>> +        if (atomic_inc_return(&kfd_locked) == 1)
>>> +            kfd_suspend_all_processes();
>>> +    }
>>>         kfd->dqm->ops.stop(kfd->dqm);
>>> -
>>>       kfd_iommu_suspend(kfd);
>>>   }
>>>   -int kgd2kfd_resume(struct kfd_dev *kfd)
>>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>   {
>>>       int ret, count;
>>>   @@ -812,10 +814,13 @@ int kgd2kfd_resume(struct kfd_dev *kfd)
>>>       if (ret)
>>>           return ret;
>>>   -    count = atomic_dec_return(&kfd_locked);
>>> -    WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>> -    if (count == 0)
>>> -        ret = kfd_resume_all_processes();
>>> +    /* for runtime resume, skip unlocking kfd */
>>> +    if (!run_pm) {
>>> +        count = atomic_dec_return(&kfd_locked);
>>> +        WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>> +        if (count == 0)
>>> +            ret = kfd_resume_all_processes();
>>> +    }
>>>         return ret;
>>>   }
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> index c0b0defc8f7a..20dd4747250d 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> @@ -647,6 +647,7 @@ struct kfd_process_device {
>>>        * function.
>>>        */
>>>       bool already_dequeued;
>>> +    bool runtime_inuse;
>>>         /* Is this process/pasid bound to this device? 
>>> (amd_iommu_bind_pasid) */
>>>       enum kfd_pdd_bound bound;
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> index 25b90f70aecd..6907a5a2cbc8 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> @@ -31,6 +31,7 @@
>>>   #include <linux/compat.h>
>>>   #include <linux/mman.h>
>>>   #include <linux/file.h>
>>> +#include <linux/pm_runtime.h>
>>>   #include "amdgpu_amdkfd.h"
>>>   #include "amdgpu.h"
>>>   @@ -440,6 +441,16 @@ static void kfd_process_destroy_pdds(struct 
>>> kfd_process *p)
>>>           kfree(pdd->qpd.doorbell_bitmap);
>>>           idr_destroy(&pdd->alloc_idr);
>>>   +        /*
>>> +         * before destroying pdd, make sure to report availability
>>> +         * for auto suspend
>>> +         */
>>> +        if (pdd->runtime_inuse) {
>>> + pm_runtime_mark_last_busy(pdd->dev->ddev->dev);
>>> + pm_runtime_put_autosuspend(pdd->dev->ddev->dev);
>>> +            pdd->runtime_inuse = false;
>>> +        }
>>> +
>>>           kfree(pdd);
>>>       }
>>>   }
>>> @@ -754,6 +765,7 @@ struct kfd_process_device 
>>> *kfd_create_process_device_data(struct kfd_dev *dev,
>>>       pdd->process = p;
>>>       pdd->bound = PDD_UNBOUND;
>>>       pdd->already_dequeued = false;
>>> +    pdd->runtime_inuse = false;
>>>       list_add(&pdd->per_device_list, &p->per_device_data);
>>>         /* Init idr used for memory handle translation */
>>> @@ -843,15 +855,40 @@ struct kfd_process_device 
>>> *kfd_bind_process_to_device(struct kfd_dev *dev,
>>>           return ERR_PTR(-ENOMEM);
>>>       }
>>>   +    /*
>>> +     * signal runtime-pm system to auto resume and prevent
>>> +     * further runtime suspend once device pdd is created until
>>> +     * pdd is destroyed.
>>> +     */
>>> +    if (!pdd->runtime_inuse) {
>>> +        err = pm_runtime_get_sync(dev->ddev->dev);
>>> +        if (err < 0)
>>> +            return ERR_PTR(err);
>>> +    }
>>> +
>>>       err = kfd_iommu_bind_process_to_device(pdd);
>>>       if (err)
>>> -        return ERR_PTR(err);
>>> +        goto out;
>>>         err = kfd_process_device_init_vm(pdd, NULL);
>>>       if (err)
>>> -        return ERR_PTR(err);
>>> +        goto out;
>>>   -    return pdd;
>>> +    if (!err) {
>>> +        /*
>>> +         * make sure that runtime_usage counter is incremented
>>> +         * just once per pdd
>>> +         */
>>> +        if (!pdd->runtime_inuse)
>>> +            pdd->runtime_inuse = true;
>>
>> The "if" is redundant here. You can just set pdd->runtime_inuse = 
>> true unconditionally.
>> Regards,
>>   Felix
>>
> Thanks
>
> Rajneesh
>
>>> +
>>> +        return pdd;
>>> +    }
>>> +out:
>>> +    /* balance runpm reference count and exit with error */
>>> +    pm_runtime_mark_last_busy(dev->ddev->dev);
>>> +    pm_runtime_put_autosuspend(dev->ddev->dev);
>>> +    return ERR_PTR(err);
>>>   }
>>>     struct kfd_process_device *kfd_get_first_process_device_data(
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Patch v2 3/4] drm/amdkfd: refactor runtime pm for baco
  2020-02-04 21:28   ` Felix Kuehling
  2020-02-05  0:57     ` Bhardwaj, Rajneesh
@ 2020-02-05  4:46     ` Alex Deucher
  2020-02-06 14:27       ` Alex Deucher
  1 sibling, 1 reply; 13+ messages in thread
From: Alex Deucher @ 2020-02-05  4:46 UTC (permalink / raw)
  To: Felix Kuehling; +Cc: Deucher, Alexander, Rajneesh Bhardwaj, amd-gfx list

On Tue, Feb 4, 2020 at 4:28 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>
> On 2020-01-31 10:37 p.m., Rajneesh Bhardwaj wrote:
> > So far the kfd driver implemented same routines for runtime and system
> > wide suspend and resume (s2idle or mem). During system wide suspend the
> > kfd aquires an atomic lock that prevents any more user processes to
> > create queues and interact with kfd driver and amd gpu. This mechanism
> > created problem when amdgpu device is runtime suspended with BACO
> > enabled. Any application that relies on kfd driver fails to load because
> > the driver reports a locked kfd device since gpu is runtime suspended.
> >
> > However, in an ideal case, when gpu is runtime  suspended the kfd driver
> > should be able to:
> >
> >   - auto resume amdgpu driver whenever a client requests compute service
> >   - prevent runtime suspend for amdgpu  while kfd is in use
> >
> > This change refactors the amdgpu and amdkfd drivers to support BACO and
> > runtime power management.
> >
> > Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>
> One small comment inline. Other than that patches 1-3 are
>
> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
>
> Also, I believe patch 1 is unchanged from v1 and already got a
> Reviewed-by from Alex. Please remember to add that tag before you submit.
>
> The last patch that enabled runtime PM by default, I'd leave the
> decision to submit that up to Alex. There may be other considerations
> than just KFD.

KFD was the only thing left.  I've been running with runpm forced on
for a while now with no problems across a wide variety of hardware.

Alex

>
> See inline ...
>
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 +++---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 ++--
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +-
> >   drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 29 +++++++++------
> >   drivers/gpu/drm/amd/amdkfd/kfd_priv.h      |  1 +
> >   drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 43 ++++++++++++++++++++--
> >   6 files changed, 70 insertions(+), 27 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> > index 8609287620ea..314c4a2a0354 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> > @@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
> >               kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
> >   }
> >
> > -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
> > +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
> >   {
> >       if (adev->kfd.dev)
> > -             kgd2kfd_suspend(adev->kfd.dev);
> > +             kgd2kfd_suspend(adev->kfd.dev, run_pm);
> >   }
> >
> > -int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
> > +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
> >   {
> >       int r = 0;
> >
> >       if (adev->kfd.dev)
> > -             r = kgd2kfd_resume(adev->kfd.dev);
> > +             r = kgd2kfd_resume(adev->kfd.dev, run_pm);
> >
> >       return r;
> >   }
> > @@ -713,11 +713,11 @@ void kgd2kfd_exit(void)
> >   {
> >   }
> >
> > -void kgd2kfd_suspend(struct kfd_dev *kfd)
> > +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
> >   {
> >   }
> >
> > -int kgd2kfd_resume(struct kfd_dev *kfd)
> > +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
> >   {
> >       return 0;
> >   }
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> > index 47b0f2957d1f..9e8db702d878 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> > @@ -122,8 +122,8 @@ struct amdkfd_process_info {
> >   int amdgpu_amdkfd_init(void);
> >   void amdgpu_amdkfd_fini(void);
> >
> > -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev);
> > -int amdgpu_amdkfd_resume(struct amdgpu_device *adev);
> > +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
> > +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
> >   void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
> >                       const void *ih_ring_entry);
> >   void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
> > @@ -249,8 +249,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
> >                        struct drm_device *ddev,
> >                        const struct kgd2kfd_shared_resources *gpu_resources);
> >   void kgd2kfd_device_exit(struct kfd_dev *kfd);
> > -void kgd2kfd_suspend(struct kfd_dev *kfd);
> > -int kgd2kfd_resume(struct kfd_dev *kfd);
> > +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
> > +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
> >   int kgd2kfd_pre_reset(struct kfd_dev *kfd);
> >   int kgd2kfd_post_reset(struct kfd_dev *kfd);
> >   void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 5030a09babb8..43843e6c4bcd 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -3311,7 +3311,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
> >               }
> >       }
> >
> > -     amdgpu_amdkfd_suspend(adev);
> > +     amdgpu_amdkfd_suspend(adev, !fbcon);
> >
> >       amdgpu_ras_suspend(adev);
> >
> > @@ -3395,7 +3395,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
> >                       }
> >               }
> >       }
> > -     r = amdgpu_amdkfd_resume(adev);
> > +     r = amdgpu_amdkfd_resume(adev, !fbcon);
> >       if (r)
> >               return r;
> >
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > index 798ad1c8f799..42ee9ea5c45a 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > @@ -732,7 +732,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
> >   void kgd2kfd_device_exit(struct kfd_dev *kfd)
> >   {
> >       if (kfd->init_complete) {
> > -             kgd2kfd_suspend(kfd);
> > +             kgd2kfd_suspend(kfd, false);
> >               device_queue_manager_uninit(kfd->dqm);
> >               kfd_interrupt_exit(kfd);
> >               kfd_topology_remove_device(kfd);
> > @@ -753,7 +753,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
> >
> >       kfd->dqm->ops.pre_reset(kfd->dqm);
> >
> > -     kgd2kfd_suspend(kfd);
> > +     kgd2kfd_suspend(kfd, false);
> >
> >       kfd_signal_reset_event(kfd);
> >       return 0;
> > @@ -787,21 +787,23 @@ bool kfd_is_locked(void)
> >       return  (atomic_read(&kfd_locked) > 0);
> >   }
> >
> > -void kgd2kfd_suspend(struct kfd_dev *kfd)
> > +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
> >   {
> >       if (!kfd->init_complete)
> >               return;
> >
> > -     /* For first KFD device suspend all the KFD processes */
> > -     if (atomic_inc_return(&kfd_locked) == 1)
> > -             kfd_suspend_all_processes();
> > +     /* for runtime suspend, skip locking kfd */
> > +     if (!run_pm) {
> > +             /* For first KFD device suspend all the KFD processes */
> > +             if (atomic_inc_return(&kfd_locked) == 1)
> > +                     kfd_suspend_all_processes();
> > +     }
> >
> >       kfd->dqm->ops.stop(kfd->dqm);
> > -
> >       kfd_iommu_suspend(kfd);
> >   }
> >
> > -int kgd2kfd_resume(struct kfd_dev *kfd)
> > +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
> >   {
> >       int ret, count;
> >
> > @@ -812,10 +814,13 @@ int kgd2kfd_resume(struct kfd_dev *kfd)
> >       if (ret)
> >               return ret;
> >
> > -     count = atomic_dec_return(&kfd_locked);
> > -     WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
> > -     if (count == 0)
> > -             ret = kfd_resume_all_processes();
> > +     /* for runtime resume, skip unlocking kfd */
> > +     if (!run_pm) {
> > +             count = atomic_dec_return(&kfd_locked);
> > +             WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
> > +             if (count == 0)
> > +                     ret = kfd_resume_all_processes();
> > +     }
> >
> >       return ret;
> >   }
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > index c0b0defc8f7a..20dd4747250d 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > @@ -647,6 +647,7 @@ struct kfd_process_device {
> >        * function.
> >        */
> >       bool already_dequeued;
> > +     bool runtime_inuse;
> >
> >       /* Is this process/pasid bound to this device? (amd_iommu_bind_pasid) */
> >       enum kfd_pdd_bound bound;
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> > index 25b90f70aecd..6907a5a2cbc8 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> > @@ -31,6 +31,7 @@
> >   #include <linux/compat.h>
> >   #include <linux/mman.h>
> >   #include <linux/file.h>
> > +#include <linux/pm_runtime.h>
> >   #include "amdgpu_amdkfd.h"
> >   #include "amdgpu.h"
> >
> > @@ -440,6 +441,16 @@ static void kfd_process_destroy_pdds(struct kfd_process *p)
> >               kfree(pdd->qpd.doorbell_bitmap);
> >               idr_destroy(&pdd->alloc_idr);
> >
> > +             /*
> > +              * before destroying pdd, make sure to report availability
> > +              * for auto suspend
> > +              */
> > +             if (pdd->runtime_inuse) {
> > +                     pm_runtime_mark_last_busy(pdd->dev->ddev->dev);
> > +                     pm_runtime_put_autosuspend(pdd->dev->ddev->dev);
> > +                     pdd->runtime_inuse = false;
> > +             }
> > +
> >               kfree(pdd);
> >       }
> >   }
> > @@ -754,6 +765,7 @@ struct kfd_process_device *kfd_create_process_device_data(struct kfd_dev *dev,
> >       pdd->process = p;
> >       pdd->bound = PDD_UNBOUND;
> >       pdd->already_dequeued = false;
> > +     pdd->runtime_inuse = false;
> >       list_add(&pdd->per_device_list, &p->per_device_data);
> >
> >       /* Init idr used for memory handle translation */
> > @@ -843,15 +855,40 @@ struct kfd_process_device *kfd_bind_process_to_device(struct kfd_dev *dev,
> >               return ERR_PTR(-ENOMEM);
> >       }
> >
> > +     /*
> > +      * signal runtime-pm system to auto resume and prevent
> > +      * further runtime suspend once device pdd is created until
> > +      * pdd is destroyed.
> > +      */
> > +     if (!pdd->runtime_inuse) {
> > +             err = pm_runtime_get_sync(dev->ddev->dev);
> > +             if (err < 0)
> > +                     return ERR_PTR(err);
> > +     }
> > +
> >       err = kfd_iommu_bind_process_to_device(pdd);
> >       if (err)
> > -             return ERR_PTR(err);
> > +             goto out;
> >
> >       err = kfd_process_device_init_vm(pdd, NULL);
> >       if (err)
> > -             return ERR_PTR(err);
> > +             goto out;
> >
> > -     return pdd;
> > +     if (!err) {
> > +             /*
> > +              * make sure that runtime_usage counter is incremented
> > +              * just once per pdd
> > +              */
> > +             if (!pdd->runtime_inuse)
> > +                     pdd->runtime_inuse = true;
>
> The "if" is redundant here. You can just set pdd->runtime_inuse = true
> unconditionally.
>
> Regards,
>    Felix
>
> > +
> > +             return pdd;
> > +     }
> > +out:
> > +     /* balance runpm reference count and exit with error */
> > +     pm_runtime_mark_last_busy(dev->ddev->dev);
> > +     pm_runtime_put_autosuspend(dev->ddev->dev);
> > +     return ERR_PTR(err);
> >   }
> >
> >   struct kfd_process_device *kfd_get_first_process_device_data(
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Patch v2 3/4] drm/amdkfd: refactor runtime pm for baco
  2020-02-05  4:46     ` Alex Deucher
@ 2020-02-06 14:27       ` Alex Deucher
  2020-02-06 15:43         ` Zeng, Oak
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Deucher @ 2020-02-06 14:27 UTC (permalink / raw)
  To: Felix Kuehling; +Cc: Deucher, Alexander, Rajneesh Bhardwaj, amd-gfx list

On Tue, Feb 4, 2020 at 11:46 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Tue, Feb 4, 2020 at 4:28 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
> >
> > On 2020-01-31 10:37 p.m., Rajneesh Bhardwaj wrote:
> > > So far the kfd driver implemented same routines for runtime and system
> > > wide suspend and resume (s2idle or mem). During system wide suspend the
> > > kfd aquires an atomic lock that prevents any more user processes to
> > > create queues and interact with kfd driver and amd gpu. This mechanism
> > > created problem when amdgpu device is runtime suspended with BACO
> > > enabled. Any application that relies on kfd driver fails to load because
> > > the driver reports a locked kfd device since gpu is runtime suspended.
> > >
> > > However, in an ideal case, when gpu is runtime  suspended the kfd driver
> > > should be able to:
> > >
> > >   - auto resume amdgpu driver whenever a client requests compute service
> > >   - prevent runtime suspend for amdgpu  while kfd is in use
> > >
> > > This change refactors the amdgpu and amdkfd drivers to support BACO and
> > > runtime power management.
> > >
> > > Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
> >
> > One small comment inline. Other than that patches 1-3 are
> >
> > Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
> >
> > Also, I believe patch 1 is unchanged from v1 and already got a
> > Reviewed-by from Alex. Please remember to add that tag before you submit.
> >
> > The last patch that enabled runtime PM by default, I'd leave the
> > decision to submit that up to Alex. There may be other considerations
> > than just KFD.
>
> KFD was the only thing left.  I've been running with runpm forced on
> for a while now with no problems across a wide variety of hardware.

Actually, we need to prevent runtime pm if xgmi is active.  One more
thing to check.

Alex


>
> Alex
>
> >
> > See inline ...
> >
> >
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 +++---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 ++--
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +-
> > >   drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 29 +++++++++------
> > >   drivers/gpu/drm/amd/amdkfd/kfd_priv.h      |  1 +
> > >   drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 43 ++++++++++++++++++++--
> > >   6 files changed, 70 insertions(+), 27 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> > > index 8609287620ea..314c4a2a0354 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> > > @@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
> > >               kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
> > >   }
> > >
> > > -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
> > > +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
> > >   {
> > >       if (adev->kfd.dev)
> > > -             kgd2kfd_suspend(adev->kfd.dev);
> > > +             kgd2kfd_suspend(adev->kfd.dev, run_pm);
> > >   }
> > >
> > > -int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
> > > +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
> > >   {
> > >       int r = 0;
> > >
> > >       if (adev->kfd.dev)
> > > -             r = kgd2kfd_resume(adev->kfd.dev);
> > > +             r = kgd2kfd_resume(adev->kfd.dev, run_pm);
> > >
> > >       return r;
> > >   }
> > > @@ -713,11 +713,11 @@ void kgd2kfd_exit(void)
> > >   {
> > >   }
> > >
> > > -void kgd2kfd_suspend(struct kfd_dev *kfd)
> > > +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
> > >   {
> > >   }
> > >
> > > -int kgd2kfd_resume(struct kfd_dev *kfd)
> > > +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
> > >   {
> > >       return 0;
> > >   }
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> > > index 47b0f2957d1f..9e8db702d878 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> > > @@ -122,8 +122,8 @@ struct amdkfd_process_info {
> > >   int amdgpu_amdkfd_init(void);
> > >   void amdgpu_amdkfd_fini(void);
> > >
> > > -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev);
> > > -int amdgpu_amdkfd_resume(struct amdgpu_device *adev);
> > > +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
> > > +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
> > >   void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
> > >                       const void *ih_ring_entry);
> > >   void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
> > > @@ -249,8 +249,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
> > >                        struct drm_device *ddev,
> > >                        const struct kgd2kfd_shared_resources *gpu_resources);
> > >   void kgd2kfd_device_exit(struct kfd_dev *kfd);
> > > -void kgd2kfd_suspend(struct kfd_dev *kfd);
> > > -int kgd2kfd_resume(struct kfd_dev *kfd);
> > > +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
> > > +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
> > >   int kgd2kfd_pre_reset(struct kfd_dev *kfd);
> > >   int kgd2kfd_post_reset(struct kfd_dev *kfd);
> > >   void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index 5030a09babb8..43843e6c4bcd 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -3311,7 +3311,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
> > >               }
> > >       }
> > >
> > > -     amdgpu_amdkfd_suspend(adev);
> > > +     amdgpu_amdkfd_suspend(adev, !fbcon);
> > >
> > >       amdgpu_ras_suspend(adev);
> > >
> > > @@ -3395,7 +3395,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
> > >                       }
> > >               }
> > >       }
> > > -     r = amdgpu_amdkfd_resume(adev);
> > > +     r = amdgpu_amdkfd_resume(adev, !fbcon);
> > >       if (r)
> > >               return r;
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > > index 798ad1c8f799..42ee9ea5c45a 100644
> > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > > @@ -732,7 +732,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
> > >   void kgd2kfd_device_exit(struct kfd_dev *kfd)
> > >   {
> > >       if (kfd->init_complete) {
> > > -             kgd2kfd_suspend(kfd);
> > > +             kgd2kfd_suspend(kfd, false);
> > >               device_queue_manager_uninit(kfd->dqm);
> > >               kfd_interrupt_exit(kfd);
> > >               kfd_topology_remove_device(kfd);
> > > @@ -753,7 +753,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
> > >
> > >       kfd->dqm->ops.pre_reset(kfd->dqm);
> > >
> > > -     kgd2kfd_suspend(kfd);
> > > +     kgd2kfd_suspend(kfd, false);
> > >
> > >       kfd_signal_reset_event(kfd);
> > >       return 0;
> > > @@ -787,21 +787,23 @@ bool kfd_is_locked(void)
> > >       return  (atomic_read(&kfd_locked) > 0);
> > >   }
> > >
> > > -void kgd2kfd_suspend(struct kfd_dev *kfd)
> > > +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
> > >   {
> > >       if (!kfd->init_complete)
> > >               return;
> > >
> > > -     /* For first KFD device suspend all the KFD processes */
> > > -     if (atomic_inc_return(&kfd_locked) == 1)
> > > -             kfd_suspend_all_processes();
> > > +     /* for runtime suspend, skip locking kfd */
> > > +     if (!run_pm) {
> > > +             /* For first KFD device suspend all the KFD processes */
> > > +             if (atomic_inc_return(&kfd_locked) == 1)
> > > +                     kfd_suspend_all_processes();
> > > +     }
> > >
> > >       kfd->dqm->ops.stop(kfd->dqm);
> > > -
> > >       kfd_iommu_suspend(kfd);
> > >   }
> > >
> > > -int kgd2kfd_resume(struct kfd_dev *kfd)
> > > +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
> > >   {
> > >       int ret, count;
> > >
> > > @@ -812,10 +814,13 @@ int kgd2kfd_resume(struct kfd_dev *kfd)
> > >       if (ret)
> > >               return ret;
> > >
> > > -     count = atomic_dec_return(&kfd_locked);
> > > -     WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
> > > -     if (count == 0)
> > > -             ret = kfd_resume_all_processes();
> > > +     /* for runtime resume, skip unlocking kfd */
> > > +     if (!run_pm) {
> > > +             count = atomic_dec_return(&kfd_locked);
> > > +             WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
> > > +             if (count == 0)
> > > +                     ret = kfd_resume_all_processes();
> > > +     }
> > >
> > >       return ret;
> > >   }
> > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > > index c0b0defc8f7a..20dd4747250d 100644
> > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > > @@ -647,6 +647,7 @@ struct kfd_process_device {
> > >        * function.
> > >        */
> > >       bool already_dequeued;
> > > +     bool runtime_inuse;
> > >
> > >       /* Is this process/pasid bound to this device? (amd_iommu_bind_pasid) */
> > >       enum kfd_pdd_bound bound;
> > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> > > index 25b90f70aecd..6907a5a2cbc8 100644
> > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> > > @@ -31,6 +31,7 @@
> > >   #include <linux/compat.h>
> > >   #include <linux/mman.h>
> > >   #include <linux/file.h>
> > > +#include <linux/pm_runtime.h>
> > >   #include "amdgpu_amdkfd.h"
> > >   #include "amdgpu.h"
> > >
> > > @@ -440,6 +441,16 @@ static void kfd_process_destroy_pdds(struct kfd_process *p)
> > >               kfree(pdd->qpd.doorbell_bitmap);
> > >               idr_destroy(&pdd->alloc_idr);
> > >
> > > +             /*
> > > +              * before destroying pdd, make sure to report availability
> > > +              * for auto suspend
> > > +              */
> > > +             if (pdd->runtime_inuse) {
> > > +                     pm_runtime_mark_last_busy(pdd->dev->ddev->dev);
> > > +                     pm_runtime_put_autosuspend(pdd->dev->ddev->dev);
> > > +                     pdd->runtime_inuse = false;
> > > +             }
> > > +
> > >               kfree(pdd);
> > >       }
> > >   }
> > > @@ -754,6 +765,7 @@ struct kfd_process_device *kfd_create_process_device_data(struct kfd_dev *dev,
> > >       pdd->process = p;
> > >       pdd->bound = PDD_UNBOUND;
> > >       pdd->already_dequeued = false;
> > > +     pdd->runtime_inuse = false;
> > >       list_add(&pdd->per_device_list, &p->per_device_data);
> > >
> > >       /* Init idr used for memory handle translation */
> > > @@ -843,15 +855,40 @@ struct kfd_process_device *kfd_bind_process_to_device(struct kfd_dev *dev,
> > >               return ERR_PTR(-ENOMEM);
> > >       }
> > >
> > > +     /*
> > > +      * signal runtime-pm system to auto resume and prevent
> > > +      * further runtime suspend once device pdd is created until
> > > +      * pdd is destroyed.
> > > +      */
> > > +     if (!pdd->runtime_inuse) {
> > > +             err = pm_runtime_get_sync(dev->ddev->dev);
> > > +             if (err < 0)
> > > +                     return ERR_PTR(err);
> > > +     }
> > > +
> > >       err = kfd_iommu_bind_process_to_device(pdd);
> > >       if (err)
> > > -             return ERR_PTR(err);
> > > +             goto out;
> > >
> > >       err = kfd_process_device_init_vm(pdd, NULL);
> > >       if (err)
> > > -             return ERR_PTR(err);
> > > +             goto out;
> > >
> > > -     return pdd;
> > > +     if (!err) {
> > > +             /*
> > > +              * make sure that runtime_usage counter is incremented
> > > +              * just once per pdd
> > > +              */
> > > +             if (!pdd->runtime_inuse)
> > > +                     pdd->runtime_inuse = true;
> >
> > The "if" is redundant here. You can just set pdd->runtime_inuse = true
> > unconditionally.
> >
> > Regards,
> >    Felix
> >
> > > +
> > > +             return pdd;
> > > +     }
> > > +out:
> > > +     /* balance runpm reference count and exit with error */
> > > +     pm_runtime_mark_last_busy(dev->ddev->dev);
> > > +     pm_runtime_put_autosuspend(dev->ddev->dev);
> > > +     return ERR_PTR(err);
> > >   }
> > >
> > >   struct kfd_process_device *kfd_get_first_process_device_data(
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [Patch v2 3/4] drm/amdkfd: refactor runtime pm for baco
  2020-02-06 14:27       ` Alex Deucher
@ 2020-02-06 15:43         ` Zeng, Oak
  0 siblings, 0 replies; 13+ messages in thread
From: Zeng, Oak @ 2020-02-06 15:43 UTC (permalink / raw)
  To: Alex Deucher, Kuehling, Felix
  Cc: Deucher, Alexander, Bhardwaj, Rajneesh, amd-gfx list

[AMD Official Use Only - Internal Distribution Only]

Hi Alex,

I am trying to understand why prevent runtime pm when xgmi is active. Is it because other device's accessing suspended device's HBM? Here is my understanding: after device is suspend, the DF and HBM will still be alive. So as long as the gL2 probing to device is disabled (this should be guaranteed by gL2 flush during suspend), other device should still be able to access HBM on the suspended device.  

Regards,
Oak

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Alex Deucher
Sent: Thursday, February 6, 2020 9:27 AM
To: Kuehling, Felix <Felix.Kuehling@amd.com>
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Bhardwaj, Rajneesh <Rajneesh.Bhardwaj@amd.com>; amd-gfx list <amd-gfx@lists.freedesktop.org>
Subject: Re: [Patch v2 3/4] drm/amdkfd: refactor runtime pm for baco

On Tue, Feb 4, 2020 at 11:46 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Tue, Feb 4, 2020 at 4:28 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
> >
> > On 2020-01-31 10:37 p.m., Rajneesh Bhardwaj wrote:
> > > So far the kfd driver implemented same routines for runtime and 
> > > system wide suspend and resume (s2idle or mem). During system wide 
> > > suspend the kfd aquires an atomic lock that prevents any more user 
> > > processes to create queues and interact with kfd driver and amd 
> > > gpu. This mechanism created problem when amdgpu device is runtime 
> > > suspended with BACO enabled. Any application that relies on kfd 
> > > driver fails to load because the driver reports a locked kfd device since gpu is runtime suspended.
> > >
> > > However, in an ideal case, when gpu is runtime  suspended the kfd 
> > > driver should be able to:
> > >
> > >   - auto resume amdgpu driver whenever a client requests compute service
> > >   - prevent runtime suspend for amdgpu  while kfd is in use
> > >
> > > This change refactors the amdgpu and amdkfd drivers to support 
> > > BACO and runtime power management.
> > >
> > > Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
> >
> > One small comment inline. Other than that patches 1-3 are
> >
> > Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
> >
> > Also, I believe patch 1 is unchanged from v1 and already got a 
> > Reviewed-by from Alex. Please remember to add that tag before you submit.
> >
> > The last patch that enabled runtime PM by default, I'd leave the 
> > decision to submit that up to Alex. There may be other 
> > considerations than just KFD.
>
> KFD was the only thing left.  I've been running with runpm forced on 
> for a while now with no problems across a wide variety of hardware.

Actually, we need to prevent runtime pm if xgmi is active.  One more thing to check.

Alex


>
> Alex
>
> >
> > See inline ...
> >
> >
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 +++---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 ++--
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +-
> > >   drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 29 +++++++++------
> > >   drivers/gpu/drm/amd/amdkfd/kfd_priv.h      |  1 +
> > >   drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 43 ++++++++++++++++++++--
> > >   6 files changed, 70 insertions(+), 27 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> > > index 8609287620ea..314c4a2a0354 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> > > @@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
> > >               kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
> > >   }
> > >
> > > -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
> > > +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool 
> > > +run_pm)
> > >   {
> > >       if (adev->kfd.dev)
> > > -             kgd2kfd_suspend(adev->kfd.dev);
> > > +             kgd2kfd_suspend(adev->kfd.dev, run_pm);
> > >   }
> > >
> > > -int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
> > > +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
> > >   {
> > >       int r = 0;
> > >
> > >       if (adev->kfd.dev)
> > > -             r = kgd2kfd_resume(adev->kfd.dev);
> > > +             r = kgd2kfd_resume(adev->kfd.dev, run_pm);
> > >
> > >       return r;
> > >   }
> > > @@ -713,11 +713,11 @@ void kgd2kfd_exit(void)
> > >   {
> > >   }
> > >
> > > -void kgd2kfd_suspend(struct kfd_dev *kfd)
> > > +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
> > >   {
> > >   }
> > >
> > > -int kgd2kfd_resume(struct kfd_dev *kfd)
> > > +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
> > >   {
> > >       return 0;
> > >   }
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> > > index 47b0f2957d1f..9e8db702d878 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> > > @@ -122,8 +122,8 @@ struct amdkfd_process_info {
> > >   int amdgpu_amdkfd_init(void);
> > >   void amdgpu_amdkfd_fini(void);
> > >
> > > -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev); -int 
> > > amdgpu_amdkfd_resume(struct amdgpu_device *adev);
> > > +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool 
> > > +run_pm); int amdgpu_amdkfd_resume(struct amdgpu_device *adev, 
> > > +bool run_pm);
> > >   void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
> > >                       const void *ih_ring_entry);
> > >   void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev); @@ 
> > > -249,8 +249,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
> > >                        struct drm_device *ddev,
> > >                        const struct kgd2kfd_shared_resources *gpu_resources);
> > >   void kgd2kfd_device_exit(struct kfd_dev *kfd); -void 
> > > kgd2kfd_suspend(struct kfd_dev *kfd); -int kgd2kfd_resume(struct 
> > > kfd_dev *kfd);
> > > +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm); int 
> > > +kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
> > >   int kgd2kfd_pre_reset(struct kfd_dev *kfd);
> > >   int kgd2kfd_post_reset(struct kfd_dev *kfd);
> > >   void kgd2kfd_interrupt(struct kfd_dev *kfd, const void 
> > > *ih_ring_entry); diff --git 
> > > a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index 5030a09babb8..43843e6c4bcd 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -3311,7 +3311,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
> > >               }
> > >       }
> > >
> > > -     amdgpu_amdkfd_suspend(adev);
> > > +     amdgpu_amdkfd_suspend(adev, !fbcon);
> > >
> > >       amdgpu_ras_suspend(adev);
> > >
> > > @@ -3395,7 +3395,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
> > >                       }
> > >               }
> > >       }
> > > -     r = amdgpu_amdkfd_resume(adev);
> > > +     r = amdgpu_amdkfd_resume(adev, !fbcon);
> > >       if (r)
> > >               return r;
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
> > > b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > > index 798ad1c8f799..42ee9ea5c45a 100644
> > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > > @@ -732,7 +732,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
> > >   void kgd2kfd_device_exit(struct kfd_dev *kfd)
> > >   {
> > >       if (kfd->init_complete) {
> > > -             kgd2kfd_suspend(kfd);
> > > +             kgd2kfd_suspend(kfd, false);
> > >               device_queue_manager_uninit(kfd->dqm);
> > >               kfd_interrupt_exit(kfd);
> > >               kfd_topology_remove_device(kfd); @@ -753,7 +753,7 @@ 
> > > int kgd2kfd_pre_reset(struct kfd_dev *kfd)
> > >
> > >       kfd->dqm->ops.pre_reset(kfd->dqm);
> > >
> > > -     kgd2kfd_suspend(kfd);
> > > +     kgd2kfd_suspend(kfd, false);
> > >
> > >       kfd_signal_reset_event(kfd);
> > >       return 0;
> > > @@ -787,21 +787,23 @@ bool kfd_is_locked(void)
> > >       return  (atomic_read(&kfd_locked) > 0);
> > >   }
> > >
> > > -void kgd2kfd_suspend(struct kfd_dev *kfd)
> > > +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
> > >   {
> > >       if (!kfd->init_complete)
> > >               return;
> > >
> > > -     /* For first KFD device suspend all the KFD processes */
> > > -     if (atomic_inc_return(&kfd_locked) == 1)
> > > -             kfd_suspend_all_processes();
> > > +     /* for runtime suspend, skip locking kfd */
> > > +     if (!run_pm) {
> > > +             /* For first KFD device suspend all the KFD processes */
> > > +             if (atomic_inc_return(&kfd_locked) == 1)
> > > +                     kfd_suspend_all_processes();
> > > +     }
> > >
> > >       kfd->dqm->ops.stop(kfd->dqm);
> > > -
> > >       kfd_iommu_suspend(kfd);
> > >   }
> > >
> > > -int kgd2kfd_resume(struct kfd_dev *kfd)
> > > +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
> > >   {
> > >       int ret, count;
> > >
> > > @@ -812,10 +814,13 @@ int kgd2kfd_resume(struct kfd_dev *kfd)
> > >       if (ret)
> > >               return ret;
> > >
> > > -     count = atomic_dec_return(&kfd_locked);
> > > -     WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
> > > -     if (count == 0)
> > > -             ret = kfd_resume_all_processes();
> > > +     /* for runtime resume, skip unlocking kfd */
> > > +     if (!run_pm) {
> > > +             count = atomic_dec_return(&kfd_locked);
> > > +             WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
> > > +             if (count == 0)
> > > +                     ret = kfd_resume_all_processes();
> > > +     }
> > >
> > >       return ret;
> > >   }
> > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
> > > b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > > index c0b0defc8f7a..20dd4747250d 100644
> > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > > @@ -647,6 +647,7 @@ struct kfd_process_device {
> > >        * function.
> > >        */
> > >       bool already_dequeued;
> > > +     bool runtime_inuse;
> > >
> > >       /* Is this process/pasid bound to this device? (amd_iommu_bind_pasid) */
> > >       enum kfd_pdd_bound bound;
> > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
> > > b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> > > index 25b90f70aecd..6907a5a2cbc8 100644
> > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> > > @@ -31,6 +31,7 @@
> > >   #include <linux/compat.h>
> > >   #include <linux/mman.h>
> > >   #include <linux/file.h>
> > > +#include <linux/pm_runtime.h>
> > >   #include "amdgpu_amdkfd.h"
> > >   #include "amdgpu.h"
> > >
> > > @@ -440,6 +441,16 @@ static void kfd_process_destroy_pdds(struct kfd_process *p)
> > >               kfree(pdd->qpd.doorbell_bitmap);
> > >               idr_destroy(&pdd->alloc_idr);
> > >
> > > +             /*
> > > +              * before destroying pdd, make sure to report availability
> > > +              * for auto suspend
> > > +              */
> > > +             if (pdd->runtime_inuse) {
> > > +                     pm_runtime_mark_last_busy(pdd->dev->ddev->dev);
> > > +                     pm_runtime_put_autosuspend(pdd->dev->ddev->dev);
> > > +                     pdd->runtime_inuse = false;
> > > +             }
> > > +
> > >               kfree(pdd);
> > >       }
> > >   }
> > > @@ -754,6 +765,7 @@ struct kfd_process_device *kfd_create_process_device_data(struct kfd_dev *dev,
> > >       pdd->process = p;
> > >       pdd->bound = PDD_UNBOUND;
> > >       pdd->already_dequeued = false;
> > > +     pdd->runtime_inuse = false;
> > >       list_add(&pdd->per_device_list, &p->per_device_data);
> > >
> > >       /* Init idr used for memory handle translation */ @@ -843,15 
> > > +855,40 @@ struct kfd_process_device *kfd_bind_process_to_device(struct kfd_dev *dev,
> > >               return ERR_PTR(-ENOMEM);
> > >       }
> > >
> > > +     /*
> > > +      * signal runtime-pm system to auto resume and prevent
> > > +      * further runtime suspend once device pdd is created until
> > > +      * pdd is destroyed.
> > > +      */
> > > +     if (!pdd->runtime_inuse) {
> > > +             err = pm_runtime_get_sync(dev->ddev->dev);
> > > +             if (err < 0)
> > > +                     return ERR_PTR(err);
> > > +     }
> > > +
> > >       err = kfd_iommu_bind_process_to_device(pdd);
> > >       if (err)
> > > -             return ERR_PTR(err);
> > > +             goto out;
> > >
> > >       err = kfd_process_device_init_vm(pdd, NULL);
> > >       if (err)
> > > -             return ERR_PTR(err);
> > > +             goto out;
> > >
> > > -     return pdd;
> > > +     if (!err) {
> > > +             /*
> > > +              * make sure that runtime_usage counter is incremented
> > > +              * just once per pdd
> > > +              */
> > > +             if (!pdd->runtime_inuse)
> > > +                     pdd->runtime_inuse = true;
> >
> > The "if" is redundant here. You can just set pdd->runtime_inuse = 
> > true unconditionally.
> >
> > Regards,
> >    Felix
> >
> > > +
> > > +             return pdd;
> > > +     }
> > > +out:
> > > +     /* balance runpm reference count and exit with error */
> > > +     pm_runtime_mark_last_busy(dev->ddev->dev);
> > > +     pm_runtime_put_autosuspend(dev->ddev->dev);
> > > +     return ERR_PTR(err);
> > >   }
> > >
> > >   struct kfd_process_device *kfd_get_first_process_device_data(
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fli
> > sts.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%
> > 7Coak.zeng%40amd.com%7Cd5179ff2709543f0871008d7ab10b0b7%7C3dd8961fe4
> > 884e608e11a82d994e183d%7C0%7C0%7C637165960494222799&amp;sdata=YVNSWu
> > 0FM92F0wiMQE70TcNvOmjUiw3kkO%2ByQik2Mmw%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Coak.zeng%40amd.com%7Cd5179ff2709543f0871008d7ab10b0b7%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637165960494222799&amp;sdata=YVNSWu0FM92F0wiMQE70TcNvOmjUiw3kkO%2ByQik2Mmw%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-02-06 15:43 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-01  3:37 [Patch v2 0/4] Enable BACO with KFD Rajneesh Bhardwaj
2020-02-01  3:37 ` [Patch v2 1/4] drm/amdgpu: Fix missing error check in suspend Rajneesh Bhardwaj
2020-02-01  3:37 ` [Patch v2 2/4] drm/amdkfd: show warning when kfd is locked Rajneesh Bhardwaj
2020-02-01  3:37 ` [Patch v2 3/4] drm/amdkfd: refactor runtime pm for baco Rajneesh Bhardwaj
2020-02-01  4:21   ` Zeng, Oak
2020-02-01 15:21     ` Bhardwaj, Rajneesh
2020-02-04 21:28   ` Felix Kuehling
2020-02-05  0:57     ` Bhardwaj, Rajneesh
2020-02-05  1:59       ` Felix Kuehling
2020-02-05  4:46     ` Alex Deucher
2020-02-06 14:27       ` Alex Deucher
2020-02-06 15:43         ` Zeng, Oak
2020-02-01  3:37 ` [Patch v2 4/4] drm/amdgpu/runpm: enable runpm on baco capable VI+ asics Rajneesh Bhardwaj

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.