All of lore.kernel.org
 help / color / mirror / Atom feed
* [Patch v1 0/5] Enable BACO with KFD
@ 2020-01-28  1:29 Rajneesh Bhardwaj
  2020-01-28  1:29 ` [Patch v1 1/5] drm/amdgpu: always enable runtime power management Rajneesh Bhardwaj
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Rajneesh Bhardwaj @ 2020-01-28  1:29 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, Felix.Kuehling, Rajneesh Bhardwaj

This series aims to enable BACO by default on supported AMD platforms
and ensures that the AMD Kernel Fusion Driver can co-exist with this
feature when the GPU devices are runtime suspended and firmware pushes
the envelop to save more power with BACO entry sequence. Current
approach makes sure that if KFD is using GPU services for compute, it
won't let AMDGPU driver suspend and if the AMDGPU driver is already
runtime suspended with GPUs in deep power saving mode with BACO, the KFD
driver wakes up the AMDGPU and then starts the compute workload
execution.

This series has been tested with a single GPU (fiji), Dual GPUs (fiji
and fiji/tonga) and seems to work fine. I have not seen but expect some
corner cases where for some reason a KFD client binds a process but is
unable to complete the task within 60 seconds. Ideally we should put the
runtime_put and auto_suspend code in some function which is logical
opposite of bind_process_to_device but during my experiments, it ended
up with random gpu hang, machine reboot etc so any comments for
improvement are welcome.

Todo: 

rebase on top of https://patchwork.freedesktop.org/patch/338037/

Rajneesh Bhardwaj (5):
  drm/amdgpu: always enable runtime power management
  drm/amdgpu: Fix missing error check in suspend
  drm/amdkfd: Introduce debugfs option to disable baco
  drm/amdkfd: show warning when kfd is locked
  drm/amdkfd: refactor runtime pm for baco

 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 ++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  9 ++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  7 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |  9 +++----
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c   |  2 ++
 drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c   | 27 +++++++++++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 31 +++++++++++++---------
 drivers/gpu/drm/amd/amdkfd/kfd_iommu.c     |  5 +++-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 19 ++++++++++---
 10 files changed, 91 insertions(+), 33 deletions(-)

-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Patch v1 1/5] drm/amdgpu: always enable runtime power management
  2020-01-28  1:29 [Patch v1 0/5] Enable BACO with KFD Rajneesh Bhardwaj
@ 2020-01-28  1:29 ` Rajneesh Bhardwaj
  2020-01-28 20:14   ` Alex Deucher
  2020-01-28  1:29 ` [Patch v1 2/5] drm/amdgpu: Fix missing error check in suspend Rajneesh Bhardwaj
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Rajneesh Bhardwaj @ 2020-01-28  1:29 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, Felix.Kuehling, Rajneesh Bhardwaj

This allows runtime power management to kick in on amdgpu driver when
the underlying hardware supports either BOCO or BACO. This can still be
avoided if boot arg amdgpu.runpm = 0 is supplied.

	BOCO: Bus Off, Chip Off
	BACO: Bus Alive, Chip Off

Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 3a0ea9096498..7958d508486e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -169,11 +169,10 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
 		goto out;
 	}
 
-	if (amdgpu_device_supports_boco(dev) &&
-	    (amdgpu_runtime_pm != 0)) /* enable runpm by default */
-		adev->runpm = true;
-	else if (amdgpu_device_supports_baco(dev) &&
-		 (amdgpu_runtime_pm > 0)) /* enable runpm if runpm=1 */
+	/* always enable runtime power management except when amdgpu.runpm=0 */
+	if ((amdgpu_device_supports_boco(dev) ||
+			amdgpu_device_supports_baco(dev))
+			&& (amdgpu_runtime_pm != 0))
 		adev->runpm = true;
 
 	/* Call ACPI methods: require modeset init
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [Patch v1 2/5] drm/amdgpu: Fix missing error check in suspend
  2020-01-28  1:29 [Patch v1 0/5] Enable BACO with KFD Rajneesh Bhardwaj
  2020-01-28  1:29 ` [Patch v1 1/5] drm/amdgpu: always enable runtime power management Rajneesh Bhardwaj
@ 2020-01-28  1:29 ` Rajneesh Bhardwaj
  2020-01-28 20:15   ` Alex Deucher
  2020-01-28  1:29 ` [Patch v1 3/5] drm/amdkfd: Introduce debugfs option to disable baco Rajneesh Bhardwaj
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Rajneesh Bhardwaj @ 2020-01-28  1:29 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, Felix.Kuehling, Rajneesh Bhardwaj

amdgpu_device_suspend might return an error code since it can be called
from both runtime and system suspend flows. Add the missing return code
in case of a failure.

Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 137e76f0e3db..71466df6dff6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1236,6 +1236,9 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
 	drm_kms_helper_poll_disable(drm_dev);
 
 	ret = amdgpu_device_suspend(drm_dev, false);
+	if (ret)
+		return ret;
+
 	if (amdgpu_device_supports_boco(drm_dev)) {
 		/* Only need to handle PCI state in the driver for ATPX
 		 * PCI core handles it for _PR3.
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [Patch v1 3/5] drm/amdkfd: Introduce debugfs option to disable baco
  2020-01-28  1:29 [Patch v1 0/5] Enable BACO with KFD Rajneesh Bhardwaj
  2020-01-28  1:29 ` [Patch v1 1/5] drm/amdgpu: always enable runtime power management Rajneesh Bhardwaj
  2020-01-28  1:29 ` [Patch v1 2/5] drm/amdgpu: Fix missing error check in suspend Rajneesh Bhardwaj
@ 2020-01-28  1:29 ` Rajneesh Bhardwaj
  2020-01-28 20:22   ` Alex Deucher
  2020-01-28  1:29 ` [Patch v1 4/5] drm/amdkfd: show warning when kfd is locked Rajneesh Bhardwaj
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Rajneesh Bhardwaj @ 2020-01-28  1:29 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, Felix.Kuehling, Rajneesh Bhardwaj

When BACO is enabled by default, sometimes it can cause additional
trouble to debug KFD issues. This debugfs override allows to temporarily
disable BACO for debug purpose without having to reboot the machine.

However, in some cases one suspend-resume cycle might be needed if
the device is already runtime suspended.

e.g

sudo rtcwake -m < mem or freeze > -s 15

or

by triggering autosuspend event from user space, by doing something
like:

echo 6000 > /sys/bus/pci/devices/0000\:03\:00.0/power/autosuspend_delay_ms

    Usage:

echo 0 > /sys/kernel/debug/kfd/enable_baco and run
cat /sys/kernel/debug/kfd/baco_status to verify whether BACO is
enabled or disabled by kfd driver.

It should be noted that while enabling baco again via kfd override, we
must do the following steps:

1. echo 0 > /sys/kernel/debug/kfd/enable_baco
2. sudo rtcwake -m < mem > -s 15

In this case, we need GPU to be fully reset which is done by BIOS. This
is not possible in case of S2idle i.e. freeze so we must use mem for
sleep.

Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  3 +++
 drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c   | 27 ++++++++++++++++++++++
 3 files changed, 31 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 47b0f2957d1f..0fa898d30207 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -34,6 +34,7 @@
 #include "amdgpu_vm.h"
 
 extern uint64_t amdgpu_amdkfd_total_mem_size;
+extern bool kfd_enable_baco;
 
 struct amdgpu_device;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 2c64d2a83d61..d9e5eac182d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3259,6 +3259,9 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
 		return -ENODEV;
 	}
 
+	if (!kfd_enable_baco)
+		return -EBUSY;
+
 	adev = dev->dev_private;
 
 	if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c b/drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c
index 511712c2e382..c6f6ff2ff603 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c
@@ -26,6 +26,7 @@
 #include "kfd_priv.h"
 
 static struct dentry *debugfs_root;
+bool kfd_enable_baco = true;
 
 static int kfd_debugfs_open(struct inode *inode, struct file *file)
 {
@@ -83,6 +84,28 @@ static const struct file_operations kfd_debugfs_hang_hws_fops = {
 	.release = single_release,
 };
 
+static int baco_sts_set(void *data, u64 val)
+{
+	if (!val)
+		kfd_enable_baco = false;
+	else
+		kfd_enable_baco = true;
+
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(fops_baco_enable, NULL, baco_sts_set, "%lld\n");
+
+static int baco_sts_show(struct seq_file *s, void *unused)
+{
+	if (kfd_enable_baco)
+		seq_puts(s, "Baco is Enabled\n");
+	else
+		seq_puts(s, "Baco is Disabled\n");
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(baco_sts);
+
 void kfd_debugfs_init(void)
 {
 	debugfs_root = debugfs_create_dir("kfd", NULL);
@@ -95,6 +118,10 @@ void kfd_debugfs_init(void)
 			    kfd_debugfs_rls_by_device, &kfd_debugfs_fops);
 	debugfs_create_file("hang_hws", S_IFREG | 0200, debugfs_root,
 			    NULL, &kfd_debugfs_hang_hws_fops);
+	debugfs_create_file("enable_baco", 0644, debugfs_root, NULL,
+			    &fops_baco_enable);
+	debugfs_create_file("baco_status", 0444, debugfs_root, NULL,
+			    &baco_sts_fops);
 }
 
 void kfd_debugfs_fini(void)
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [Patch v1 4/5] drm/amdkfd: show warning when kfd is locked
  2020-01-28  1:29 [Patch v1 0/5] Enable BACO with KFD Rajneesh Bhardwaj
                   ` (2 preceding siblings ...)
  2020-01-28  1:29 ` [Patch v1 3/5] drm/amdkfd: Introduce debugfs option to disable baco Rajneesh Bhardwaj
@ 2020-01-28  1:29 ` Rajneesh Bhardwaj
  2020-01-28 22:42   ` Felix Kuehling
  2020-01-28  1:29 ` [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco Rajneesh Bhardwaj
  2020-01-28 17:39 ` [Patch v1 0/5] Enable BACO with KFD Mike Lothian
  5 siblings, 1 reply; 23+ messages in thread
From: Rajneesh Bhardwaj @ 2020-01-28  1:29 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, Felix.Kuehling, Rajneesh Bhardwaj

During system suspend the kfd driver aquires a lock that prohibits
further kfd actions unless the gpu is resumed. This adds some info which
can be useful while debugging.

Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 1aebda4bbbe7..081cc5f40d18 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -127,6 +127,8 @@ static int kfd_open(struct inode *inode, struct file *filep)
 		return PTR_ERR(process);
 
 	if (kfd_is_locked()) {
+		dev_warn(kfd_device, "kfd is locked!\n"
+				"process %d unreferenced", process->pasid);
 		kfd_unref_process(process);
 		return -EAGAIN;
 	}
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco
  2020-01-28  1:29 [Patch v1 0/5] Enable BACO with KFD Rajneesh Bhardwaj
                   ` (3 preceding siblings ...)
  2020-01-28  1:29 ` [Patch v1 4/5] drm/amdkfd: show warning when kfd is locked Rajneesh Bhardwaj
@ 2020-01-28  1:29 ` Rajneesh Bhardwaj
  2020-01-28 20:09   ` Zeng, Oak
  2020-01-29 22:52   ` Felix Kuehling
  2020-01-28 17:39 ` [Patch v1 0/5] Enable BACO with KFD Mike Lothian
  5 siblings, 2 replies; 23+ messages in thread
From: Rajneesh Bhardwaj @ 2020-01-28  1:29 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, Felix.Kuehling, Rajneesh Bhardwaj

So far the kfd driver implemented same routines for runtime and system
wide suspend and resume (s2idle or mem). During system wide suspend the
kfd aquires an atomic lock that prevents any more user processes to
create queues and interact with kfd driver and amd gpu. This mechanism
created problem when amdgpu device is runtime suspended with BACO
enabled. Any application that relies on kfd driver fails to load because
the driver reports a locked kfd device since gpu is runtime suspended.

However, in an ideal case, when gpu is runtime  suspended the kfd driver
should be able to:

 - auto resume amdgpu driver whenever a client requests compute service
 - prevent runtime suspend for amdgpu  while kfd is in use

This change refactors the amdgpu and amdkfd drivers to support BACO and
runtime runtime power management.

Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 ++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +--
 drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 31 +++++++++++++---------
 drivers/gpu/drm/amd/amdkfd/kfd_iommu.c     |  5 +++-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 19 ++++++++++---
 6 files changed, 51 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 8609287620ea..314c4a2a0354 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
 		kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
 }
 
-void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
+void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
 {
 	if (adev->kfd.dev)
-		kgd2kfd_suspend(adev->kfd.dev);
+		kgd2kfd_suspend(adev->kfd.dev, run_pm);
 }
 
-int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
+int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
 {
 	int r = 0;
 
 	if (adev->kfd.dev)
-		r = kgd2kfd_resume(adev->kfd.dev);
+		r = kgd2kfd_resume(adev->kfd.dev, run_pm);
 
 	return r;
 }
@@ -713,11 +713,11 @@ void kgd2kfd_exit(void)
 {
 }
 
-void kgd2kfd_suspend(struct kfd_dev *kfd)
+void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
 {
 }
 
-int kgd2kfd_resume(struct kfd_dev *kfd)
+int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
 {
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 0fa898d30207..3dce4a51e522 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -123,8 +123,8 @@ struct amdkfd_process_info {
 int amdgpu_amdkfd_init(void);
 void amdgpu_amdkfd_fini(void);
 
-void amdgpu_amdkfd_suspend(struct amdgpu_device *adev);
-int amdgpu_amdkfd_resume(struct amdgpu_device *adev);
+void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
+int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
 void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
 			const void *ih_ring_entry);
 void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
@@ -250,8 +250,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
 			 struct drm_device *ddev,
 			 const struct kgd2kfd_shared_resources *gpu_resources);
 void kgd2kfd_device_exit(struct kfd_dev *kfd);
-void kgd2kfd_suspend(struct kfd_dev *kfd);
-int kgd2kfd_resume(struct kfd_dev *kfd);
+void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
+int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
 int kgd2kfd_pre_reset(struct kfd_dev *kfd);
 int kgd2kfd_post_reset(struct kfd_dev *kfd);
 void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d9e5eac182d3..787d49e9f4de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3314,7 +3314,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
 		}
 	}
 
-	amdgpu_amdkfd_suspend(adev);
+	amdgpu_amdkfd_suspend(adev, fbcon);
 
 	amdgpu_ras_suspend(adev);
 
@@ -3398,7 +3398,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
 			}
 		}
 	}
-	r = amdgpu_amdkfd_resume(adev);
+	r = amdgpu_amdkfd_resume(adev, fbcon);
 	if (r)
 		return r;
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 2a9e40131735..e9f00c3a067a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -23,6 +23,7 @@
 #include <linux/bsearch.h>
 #include <linux/pci.h>
 #include <linux/slab.h>
+#include <linux/pm_runtime.h>
 #include "kfd_priv.h"
 #include "kfd_device_queue_manager.h"
 #include "kfd_pm4_headers_vi.h"
@@ -710,7 +711,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
 void kgd2kfd_device_exit(struct kfd_dev *kfd)
 {
 	if (kfd->init_complete) {
-		kgd2kfd_suspend(kfd);
+		kgd2kfd_suspend(kfd, true);
 		device_queue_manager_uninit(kfd->dqm);
 		kfd_interrupt_exit(kfd);
 		kfd_topology_remove_device(kfd);
@@ -731,7 +732,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
 
 	kfd->dqm->ops.pre_reset(kfd->dqm);
 
-	kgd2kfd_suspend(kfd);
+	kgd2kfd_suspend(kfd, true);
 
 	kfd_signal_reset_event(kfd);
 	return 0;
@@ -765,21 +766,24 @@ bool kfd_is_locked(void)
 	return  (atomic_read(&kfd_locked) > 0);
 }
 
-void kgd2kfd_suspend(struct kfd_dev *kfd)
+void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
 {
 	if (!kfd->init_complete)
 		return;
 
-	/* For first KFD device suspend all the KFD processes */
-	if (atomic_inc_return(&kfd_locked) == 1)
-		kfd_suspend_all_processes();
+	/* for runtime suspend, skip locking kfd */
+	if (run_pm) {
+		/* For first KFD device suspend all the KFD processes */
+		if (atomic_inc_return(&kfd_locked) == 1)
+			kfd_suspend_all_processes();
+	}
 
+	pm_runtime_set_autosuspend_delay(kfd->ddev->dev, 5000);
 	kfd->dqm->ops.stop(kfd->dqm);
-
 	kfd_iommu_suspend(kfd);
 }
 
-int kgd2kfd_resume(struct kfd_dev *kfd)
+int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
 {
 	int ret, count;
 
@@ -790,10 +794,13 @@ int kgd2kfd_resume(struct kfd_dev *kfd)
 	if (ret)
 		return ret;
 
-	count = atomic_dec_return(&kfd_locked);
-	WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
-	if (count == 0)
-		ret = kfd_resume_all_processes();
+	/* for runtime resume, skip unlocking kfd */
+	if (run_pm) {
+		count = atomic_dec_return(&kfd_locked);
+		WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
+		if (count == 0)
+			ret = kfd_resume_all_processes();
+	}
 
 	return ret;
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
index 8d871514671e..6301d77ed3d6 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
@@ -25,6 +25,7 @@
 #include <linux/slab.h>
 #include <linux/pci.h>
 #include <linux/amd-iommu.h>
+#include <linux/pm_runtime.h>
 #include "kfd_priv.h"
 #include "kfd_dbgmgr.h"
 #include "kfd_topology.h"
@@ -134,8 +135,10 @@ void kfd_iommu_unbind_process(struct kfd_process *p)
 	struct kfd_process_device *pdd;
 
 	list_for_each_entry(pdd, &p->per_device_data, per_device_list)
-		if (pdd->bound == PDD_BOUND)
+		if (pdd->bound == PDD_BOUND) {
 			amd_iommu_unbind_pasid(pdd->dev->pdev, p->pasid);
+			pm_runtime_set_autosuspend_delay(pdd->dev->ddev->dev, 5000);
+		}
 }
 
 /* Callback for process shutdown invoked by the IOMMU driver */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 25b90f70aecd..d19d5e97405c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -31,6 +31,7 @@
 #include <linux/compat.h>
 #include <linux/mman.h>
 #include <linux/file.h>
+#include <linux/pm_runtime.h>
 #include "amdgpu_amdkfd.h"
 #include "amdgpu.h"
 
@@ -843,15 +844,27 @@ struct kfd_process_device *kfd_bind_process_to_device(struct kfd_dev *dev,
 		return ERR_PTR(-ENOMEM);
 	}
 
+	err = pm_runtime_get_sync(dev->ddev->dev);
+	pm_runtime_set_autosuspend_delay(dev->ddev->dev, 60000);
+	if (err < 0)
+		return ERR_PTR(err);
+
 	err = kfd_iommu_bind_process_to_device(pdd);
 	if (err)
-		return ERR_PTR(err);
+		goto out;
 
 	err = kfd_process_device_init_vm(pdd, NULL);
 	if (err)
-		return ERR_PTR(err);
+		goto out;
 
-	return pdd;
+out:
+	pm_runtime_mark_last_busy(dev->ddev->dev);
+	pm_runtime_put_autosuspend(dev->ddev->dev);
+
+	if (!err)
+		return pdd;
+	else
+		return ERR_PTR(err);
 }
 
 struct kfd_process_device *kfd_get_first_process_device_data(
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [Patch v1 0/5] Enable BACO with KFD
  2020-01-28  1:29 [Patch v1 0/5] Enable BACO with KFD Rajneesh Bhardwaj
                   ` (4 preceding siblings ...)
  2020-01-28  1:29 ` [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco Rajneesh Bhardwaj
@ 2020-01-28 17:39 ` Mike Lothian
  2020-01-28 20:11   ` Alex Deucher
  5 siblings, 1 reply; 23+ messages in thread
From: Mike Lothian @ 2020-01-28 17:39 UTC (permalink / raw)
  To: Rajneesh Bhardwaj; +Cc: Alex Deucher, Felix Kuehling, amd-gfx list

Is this designed to work with PRIME laptops too?

On Tue, 28 Jan 2020 at 01:29, Rajneesh Bhardwaj
<rajneesh.bhardwaj@amd.com> wrote:
>
> This series aims to enable BACO by default on supported AMD platforms
> and ensures that the AMD Kernel Fusion Driver can co-exist with this
> feature when the GPU devices are runtime suspended and firmware pushes
> the envelop to save more power with BACO entry sequence. Current
> approach makes sure that if KFD is using GPU services for compute, it
> won't let AMDGPU driver suspend and if the AMDGPU driver is already
> runtime suspended with GPUs in deep power saving mode with BACO, the KFD
> driver wakes up the AMDGPU and then starts the compute workload
> execution.
>
> This series has been tested with a single GPU (fiji), Dual GPUs (fiji
> and fiji/tonga) and seems to work fine. I have not seen but expect some
> corner cases where for some reason a KFD client binds a process but is
> unable to complete the task within 60 seconds. Ideally we should put the
> runtime_put and auto_suspend code in some function which is logical
> opposite of bind_process_to_device but during my experiments, it ended
> up with random gpu hang, machine reboot etc so any comments for
> improvement are welcome.
>
> Todo:
>
> rebase on top of https://patchwork.freedesktop.org/patch/338037/
>
> Rajneesh Bhardwaj (5):
>   drm/amdgpu: always enable runtime power management
>   drm/amdgpu: Fix missing error check in suspend
>   drm/amdkfd: Introduce debugfs option to disable baco
>   drm/amdkfd: show warning when kfd is locked
>   drm/amdkfd: refactor runtime pm for baco
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 ++++-----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  9 ++++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  7 +++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  3 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |  9 +++----
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c   |  2 ++
>  drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c   | 27 +++++++++++++++++++
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 31 +++++++++++++---------
>  drivers/gpu/drm/amd/amdkfd/kfd_iommu.c     |  5 +++-
>  drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 19 ++++++++++---
>  10 files changed, 91 insertions(+), 33 deletions(-)
>
> --
> 2.17.1
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco
  2020-01-28  1:29 ` [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco Rajneesh Bhardwaj
@ 2020-01-28 20:09   ` Zeng, Oak
  2020-01-30 19:08     ` Bhardwaj, Rajneesh
  2020-01-29 22:52   ` Felix Kuehling
  1 sibling, 1 reply; 23+ messages in thread
From: Zeng, Oak @ 2020-01-28 20:09 UTC (permalink / raw)
  To: Bhardwaj, Rajneesh, amd-gfx
  Cc: Deucher, Alexander, Kuehling, Felix, Bhardwaj, Rajneesh

[AMD Official Use Only - Internal Distribution Only]



Regards,
Oak

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Rajneesh Bhardwaj
Sent: Monday, January 27, 2020 8:29 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Bhardwaj, Rajneesh <Rajneesh.Bhardwaj@amd.com>
Subject: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco

So far the kfd driver implemented same routines for runtime and system wide suspend and resume (s2idle or mem). During system wide suspend the kfd aquires an atomic lock that prevents any more user processes to create queues and interact with kfd driver and amd gpu. This mechanism created problem when amdgpu device is runtime suspended with BACO enabled. Any application that relies on kfd driver fails to load because the driver reports a locked kfd device since gpu is runtime suspended.

However, in an ideal case, when gpu is runtime  suspended the kfd driver should be able to:

 - auto resume amdgpu driver whenever a client requests compute service
 - prevent runtime suspend for amdgpu  while kfd is in use

This change refactors the amdgpu and amdkfd drivers to support BACO and runtime runtime power management.
[Oak] two runtime above

Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 ++++-----  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 +++---  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +--
 drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 31 +++++++++++++---------
 drivers/gpu/drm/amd/amdkfd/kfd_iommu.c     |  5 +++-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 19 ++++++++++---
 6 files changed, 51 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 8609287620ea..314c4a2a0354 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
 		kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);  }
 
-void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
+void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
[Oak] then name run_pm is a little bit confusing. Maybe system_wide_pm or none_runtime_pm?
 {
 	if (adev->kfd.dev)
-		kgd2kfd_suspend(adev->kfd.dev);
+		kgd2kfd_suspend(adev->kfd.dev, run_pm);
 }
 
-int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
+int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
 {
 	int r = 0;
 
 	if (adev->kfd.dev)
-		r = kgd2kfd_resume(adev->kfd.dev);
+		r = kgd2kfd_resume(adev->kfd.dev, run_pm);
 
 	return r;
 }
@@ -713,11 +713,11 @@ void kgd2kfd_exit(void)  {  }
 
-void kgd2kfd_suspend(struct kfd_dev *kfd)
+void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
 {
 }
 
-int kgd2kfd_resume(struct kfd_dev *kfd)
+int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
 {
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 0fa898d30207..3dce4a51e522 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -123,8 +123,8 @@ struct amdkfd_process_info {  int amdgpu_amdkfd_init(void);  void amdgpu_amdkfd_fini(void);
 
-void amdgpu_amdkfd_suspend(struct amdgpu_device *adev); -int amdgpu_amdkfd_resume(struct amdgpu_device *adev);
+void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm); 
+int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
 void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
 			const void *ih_ring_entry);
 void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev); @@ -250,8 +250,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
 			 struct drm_device *ddev,
 			 const struct kgd2kfd_shared_resources *gpu_resources);  void kgd2kfd_device_exit(struct kfd_dev *kfd); -void kgd2kfd_suspend(struct kfd_dev *kfd); -int kgd2kfd_resume(struct kfd_dev *kfd);
+void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm); int 
+kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
 int kgd2kfd_pre_reset(struct kfd_dev *kfd);  int kgd2kfd_post_reset(struct kfd_dev *kfd);  void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d9e5eac182d3..787d49e9f4de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3314,7 +3314,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
 		}
 	}
 
-	amdgpu_amdkfd_suspend(adev);
+	amdgpu_amdkfd_suspend(adev, fbcon);
 
 	amdgpu_ras_suspend(adev);
 
@@ -3398,7 +3398,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
 			}
 		}
 	}
-	r = amdgpu_amdkfd_resume(adev);
+	r = amdgpu_amdkfd_resume(adev, fbcon);
 	if (r)
 		return r;
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 2a9e40131735..e9f00c3a067a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -23,6 +23,7 @@
 #include <linux/bsearch.h>
 #include <linux/pci.h>
 #include <linux/slab.h>
+#include <linux/pm_runtime.h>
 #include "kfd_priv.h"
 #include "kfd_device_queue_manager.h"
 #include "kfd_pm4_headers_vi.h"
@@ -710,7 +711,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,  void kgd2kfd_device_exit(struct kfd_dev *kfd)  {
 	if (kfd->init_complete) {
-		kgd2kfd_suspend(kfd);
+		kgd2kfd_suspend(kfd, true);
 		device_queue_manager_uninit(kfd->dqm);
 		kfd_interrupt_exit(kfd);
 		kfd_topology_remove_device(kfd);
@@ -731,7 +732,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
 
 	kfd->dqm->ops.pre_reset(kfd->dqm);
 
-	kgd2kfd_suspend(kfd);
+	kgd2kfd_suspend(kfd, true);
 
 	kfd_signal_reset_event(kfd);
 	return 0;
@@ -765,21 +766,24 @@ bool kfd_is_locked(void)
 	return  (atomic_read(&kfd_locked) > 0);  }
 
-void kgd2kfd_suspend(struct kfd_dev *kfd)
+void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
 {
 	if (!kfd->init_complete)
 		return;
 
-	/* For first KFD device suspend all the KFD processes */
-	if (atomic_inc_return(&kfd_locked) == 1)
-		kfd_suspend_all_processes();
+	/* for runtime suspend, skip locking kfd */
+	if (run_pm) {
+		/* For first KFD device suspend all the KFD processes */
+		if (atomic_inc_return(&kfd_locked) == 1)
+			kfd_suspend_all_processes();
+	}
 
+	pm_runtime_set_autosuspend_delay(kfd->ddev->dev, 5000);
[Oak]: why this is necessary? This timeout value has already been set in driver load. See amdgpu_driver_load_kms
 	kfd->dqm->ops.stop(kfd->dqm);
-
 	kfd_iommu_suspend(kfd);
 }
 
-int kgd2kfd_resume(struct kfd_dev *kfd)
+int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
 {
 	int ret, count;
 
@@ -790,10 +794,13 @@ int kgd2kfd_resume(struct kfd_dev *kfd)
 	if (ret)
 		return ret;
 
-	count = atomic_dec_return(&kfd_locked);
-	WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
-	if (count == 0)
-		ret = kfd_resume_all_processes();
+	/* for runtime resume, skip unlocking kfd */
+	if (run_pm) {
+		count = atomic_dec_return(&kfd_locked);
+		WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
+		if (count == 0)
+			ret = kfd_resume_all_processes();
+	}
 
 	return ret;
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
index 8d871514671e..6301d77ed3d6 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
@@ -25,6 +25,7 @@
 #include <linux/slab.h>
 #include <linux/pci.h>
 #include <linux/amd-iommu.h>
+#include <linux/pm_runtime.h>
 #include "kfd_priv.h"
 #include "kfd_dbgmgr.h"
 #include "kfd_topology.h"
@@ -134,8 +135,10 @@ void kfd_iommu_unbind_process(struct kfd_process *p)
 	struct kfd_process_device *pdd;
 
 	list_for_each_entry(pdd, &p->per_device_data, per_device_list)
-		if (pdd->bound == PDD_BOUND)
+		if (pdd->bound == PDD_BOUND) {
 			amd_iommu_unbind_pasid(pdd->dev->pdev, p->pasid);
+			pm_runtime_set_autosuspend_delay(pdd->dev->ddev->dev, 5000);
[Oak] This only set the timeout. The correction function to call is pm_runtime_put_autosuspend?
I think it is better to call it at the end of kfd_process_wq_release directly and call pm_runtime_mark_last_busy first.
+		}
 }
 
 /* Callback for process shutdown invoked by the IOMMU driver */ diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 25b90f70aecd..d19d5e97405c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -31,6 +31,7 @@
 #include <linux/compat.h>
 #include <linux/mman.h>
 #include <linux/file.h>
+#include <linux/pm_runtime.h>
 #include "amdgpu_amdkfd.h"
 #include "amdgpu.h"
 
@@ -843,15 +844,27 @@ struct kfd_process_device *kfd_bind_process_to_device(struct kfd_dev *dev,
 		return ERR_PTR(-ENOMEM);
 	}
 
+	err = pm_runtime_get_sync(dev->ddev->dev);
+	pm_runtime_set_autosuspend_delay(dev->ddev->dev, 60000);
[Oak]: The second call is not necessary
+	if (err < 0)
+		return ERR_PTR(err);
+
 	err = kfd_iommu_bind_process_to_device(pdd);
 	if (err)
-		return ERR_PTR(err);
+		goto out;
 
 	err = kfd_process_device_init_vm(pdd, NULL);
 	if (err)
-		return ERR_PTR(err);
+		goto out;
 
-	return pdd;
+out:
+	pm_runtime_mark_last_busy(dev->ddev->dev);
+	pm_runtime_put_autosuspend(dev->ddev->dev);
+
+	if (!err)
+		return pdd;
+	else
+		return ERR_PTR(err);
 }
 
 struct kfd_process_device *kfd_get_first_process_device_data(
--
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Coak.zeng%40amd.com%7Ce57e1eeb5fd74b5d8dea08d7a391a089%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637157718236258423&amp;sdata=zqM5YTz7qofPZjG3PmWKbHQ4sMMZjol1IlzaNTwQtcw%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [Patch v1 0/5] Enable BACO with KFD
  2020-01-28 17:39 ` [Patch v1 0/5] Enable BACO with KFD Mike Lothian
@ 2020-01-28 20:11   ` Alex Deucher
  0 siblings, 0 replies; 23+ messages in thread
From: Alex Deucher @ 2020-01-28 20:11 UTC (permalink / raw)
  To: Mike Lothian
  Cc: Alex Deucher, Felix Kuehling, Rajneesh Bhardwaj, amd-gfx list

On Tue, Jan 28, 2020 at 12:39 PM Mike Lothian <mike@fireburn.co.uk> wrote:
>
> Is this designed to work with PRIME laptops too?
>

Yes.  The title should really be runtime pm rather than BACO
specifically.  The underlying power savings mechanism (BOCO, BACO,
etc.) varies from platform to platform.

Alex

> On Tue, 28 Jan 2020 at 01:29, Rajneesh Bhardwaj
> <rajneesh.bhardwaj@amd.com> wrote:
> >
> > This series aims to enable BACO by default on supported AMD platforms
> > and ensures that the AMD Kernel Fusion Driver can co-exist with this
> > feature when the GPU devices are runtime suspended and firmware pushes
> > the envelop to save more power with BACO entry sequence. Current
> > approach makes sure that if KFD is using GPU services for compute, it
> > won't let AMDGPU driver suspend and if the AMDGPU driver is already
> > runtime suspended with GPUs in deep power saving mode with BACO, the KFD
> > driver wakes up the AMDGPU and then starts the compute workload
> > execution.
> >
> > This series has been tested with a single GPU (fiji), Dual GPUs (fiji
> > and fiji/tonga) and seems to work fine. I have not seen but expect some
> > corner cases where for some reason a KFD client binds a process but is
> > unable to complete the task within 60 seconds. Ideally we should put the
> > runtime_put and auto_suspend code in some function which is logical
> > opposite of bind_process_to_device but during my experiments, it ended
> > up with random gpu hang, machine reboot etc so any comments for
> > improvement are welcome.
> >
> > Todo:
> >
> > rebase on top of https://patchwork.freedesktop.org/patch/338037/
> >
> > Rajneesh Bhardwaj (5):
> >   drm/amdgpu: always enable runtime power management
> >   drm/amdgpu: Fix missing error check in suspend
> >   drm/amdkfd: Introduce debugfs option to disable baco
> >   drm/amdkfd: show warning when kfd is locked
> >   drm/amdkfd: refactor runtime pm for baco
> >
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 ++++-----
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  9 ++++---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  7 +++--
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  3 +++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |  9 +++----
> >  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c   |  2 ++
> >  drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c   | 27 +++++++++++++++++++
> >  drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 31 +++++++++++++---------
> >  drivers/gpu/drm/amd/amdkfd/kfd_iommu.c     |  5 +++-
> >  drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 19 ++++++++++---
> >  10 files changed, 91 insertions(+), 33 deletions(-)
> >
> > --
> > 2.17.1
> >
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v1 1/5] drm/amdgpu: always enable runtime power management
  2020-01-28  1:29 ` [Patch v1 1/5] drm/amdgpu: always enable runtime power management Rajneesh Bhardwaj
@ 2020-01-28 20:14   ` Alex Deucher
  2020-01-30 19:06     ` Bhardwaj, Rajneesh
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Deucher @ 2020-01-28 20:14 UTC (permalink / raw)
  To: Rajneesh Bhardwaj; +Cc: Deucher, Alexander, Kuehling, Felix, amd-gfx list

On Mon, Jan 27, 2020 at 8:30 PM Rajneesh Bhardwaj
<rajneesh.bhardwaj@amd.com> wrote:
>
> This allows runtime power management to kick in on amdgpu driver when
> the underlying hardware supports either BOCO or BACO. This can still be
> avoided if boot arg amdgpu.runpm = 0 is supplied.
>
>         BOCO: Bus Off, Chip Off
>         BACO: Bus Alive, Chip Off
>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>

This patch should be the last one in the series, otherwise we'll
enable runpm on BACO capable devices before the KFD code is in place.
Also, it's only supported on VI and newer asics, so we should use this
patch instead:
https://patchwork.freedesktop.org/patch/335402/

Alex

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 3a0ea9096498..7958d508486e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -169,11 +169,10 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
>                 goto out;
>         }
>
> -       if (amdgpu_device_supports_boco(dev) &&
> -           (amdgpu_runtime_pm != 0)) /* enable runpm by default */
> -               adev->runpm = true;
> -       else if (amdgpu_device_supports_baco(dev) &&
> -                (amdgpu_runtime_pm > 0)) /* enable runpm if runpm=1 */
> +       /* always enable runtime power management except when amdgpu.runpm=0 */
> +       if ((amdgpu_device_supports_boco(dev) ||
> +                       amdgpu_device_supports_baco(dev))
> +                       && (amdgpu_runtime_pm != 0))
>                 adev->runpm = true;
>
>         /* Call ACPI methods: require modeset init
> --
> 2.17.1
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v1 2/5] drm/amdgpu: Fix missing error check in suspend
  2020-01-28  1:29 ` [Patch v1 2/5] drm/amdgpu: Fix missing error check in suspend Rajneesh Bhardwaj
@ 2020-01-28 20:15   ` Alex Deucher
  0 siblings, 0 replies; 23+ messages in thread
From: Alex Deucher @ 2020-01-28 20:15 UTC (permalink / raw)
  To: Rajneesh Bhardwaj; +Cc: Deucher, Alexander, Kuehling, Felix, amd-gfx list

On Mon, Jan 27, 2020 at 8:30 PM Rajneesh Bhardwaj
<rajneesh.bhardwaj@amd.com> wrote:
>
> amdgpu_device_suspend might return an error code since it can be called
> from both runtime and system suspend flows. Add the missing return code
> in case of a failure.
>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 137e76f0e3db..71466df6dff6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1236,6 +1236,9 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
>         drm_kms_helper_poll_disable(drm_dev);
>
>         ret = amdgpu_device_suspend(drm_dev, false);
> +       if (ret)
> +               return ret;
> +
>         if (amdgpu_device_supports_boco(drm_dev)) {
>                 /* Only need to handle PCI state in the driver for ATPX
>                  * PCI core handles it for _PR3.
> --
> 2.17.1
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v1 3/5] drm/amdkfd: Introduce debugfs option to disable baco
  2020-01-28  1:29 ` [Patch v1 3/5] drm/amdkfd: Introduce debugfs option to disable baco Rajneesh Bhardwaj
@ 2020-01-28 20:22   ` Alex Deucher
  2020-01-30 19:05     ` Bhardwaj, Rajneesh
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Deucher @ 2020-01-28 20:22 UTC (permalink / raw)
  To: Rajneesh Bhardwaj; +Cc: Deucher, Alexander, Kuehling, Felix, amd-gfx list

On Mon, Jan 27, 2020 at 8:30 PM Rajneesh Bhardwaj
<rajneesh.bhardwaj@amd.com> wrote:
>
> When BACO is enabled by default, sometimes it can cause additional
> trouble to debug KFD issues. This debugfs override allows to temporarily
> disable BACO for debug purpose without having to reboot the machine.
>
> However, in some cases one suspend-resume cycle might be needed if
> the device is already runtime suspended.
>
> e.g
>
> sudo rtcwake -m < mem or freeze > -s 15
>
> or
>
> by triggering autosuspend event from user space, by doing something
> like:
>
> echo 6000 > /sys/bus/pci/devices/0000\:03\:00.0/power/autosuspend_delay_ms
>
>     Usage:
>
> echo 0 > /sys/kernel/debug/kfd/enable_baco and run
> cat /sys/kernel/debug/kfd/baco_status to verify whether BACO is
> enabled or disabled by kfd driver.
>
> It should be noted that while enabling baco again via kfd override, we
> must do the following steps:
>
> 1. echo 0 > /sys/kernel/debug/kfd/enable_baco
> 2. sudo rtcwake -m < mem > -s 15
>
> In this case, we need GPU to be fully reset which is done by BIOS. This
> is not possible in case of S2idle i.e. freeze so we must use mem for
> sleep.
>

I think we can drop this patch in favor of just using the standard
runtime pm control.  E.g.,
/sys/class/drm/card0/device/power/control

Also, the underlying mechanism may not always be BACO.  E.g., hybrid
laptops use BOCO (d3cold).  So it would be better to make the naming
more generic (e.g., runpm_enable, runpm_status).  This is also kfd
specific.  If we actually do want something like this, I think it
should be at the device level in amdgpu.

Alex

> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  3 +++
>  drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c   | 27 ++++++++++++++++++++++
>  3 files changed, 31 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 47b0f2957d1f..0fa898d30207 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -34,6 +34,7 @@
>  #include "amdgpu_vm.h"
>
>  extern uint64_t amdgpu_amdkfd_total_mem_size;
> +extern bool kfd_enable_baco;
>
>  struct amdgpu_device;
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 2c64d2a83d61..d9e5eac182d3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3259,6 +3259,9 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>                 return -ENODEV;
>         }
>
> +       if (!kfd_enable_baco)
> +               return -EBUSY;
> +
>         adev = dev->dev_private;
>
>         if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c b/drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c
> index 511712c2e382..c6f6ff2ff603 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c
> @@ -26,6 +26,7 @@
>  #include "kfd_priv.h"
>
>  static struct dentry *debugfs_root;
> +bool kfd_enable_baco = true;
>
>  static int kfd_debugfs_open(struct inode *inode, struct file *file)
>  {
> @@ -83,6 +84,28 @@ static const struct file_operations kfd_debugfs_hang_hws_fops = {
>         .release = single_release,
>  };
>
> +static int baco_sts_set(void *data, u64 val)
> +{
> +       if (!val)
> +               kfd_enable_baco = false;
> +       else
> +               kfd_enable_baco = true;
> +
> +       return 0;
> +}
> +DEFINE_SIMPLE_ATTRIBUTE(fops_baco_enable, NULL, baco_sts_set, "%lld\n");
> +
> +static int baco_sts_show(struct seq_file *s, void *unused)
> +{
> +       if (kfd_enable_baco)
> +               seq_puts(s, "Baco is Enabled\n");
> +       else
> +               seq_puts(s, "Baco is Disabled\n");
> +
> +       return 0;
> +}
> +DEFINE_SHOW_ATTRIBUTE(baco_sts);
> +
>  void kfd_debugfs_init(void)
>  {
>         debugfs_root = debugfs_create_dir("kfd", NULL);
> @@ -95,6 +118,10 @@ void kfd_debugfs_init(void)
>                             kfd_debugfs_rls_by_device, &kfd_debugfs_fops);
>         debugfs_create_file("hang_hws", S_IFREG | 0200, debugfs_root,
>                             NULL, &kfd_debugfs_hang_hws_fops);
> +       debugfs_create_file("enable_baco", 0644, debugfs_root, NULL,
> +                           &fops_baco_enable);
> +       debugfs_create_file("baco_status", 0444, debugfs_root, NULL,
> +                           &baco_sts_fops);
>  }
>
>  void kfd_debugfs_fini(void)
> --
> 2.17.1
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v1 4/5] drm/amdkfd: show warning when kfd is locked
  2020-01-28  1:29 ` [Patch v1 4/5] drm/amdkfd: show warning when kfd is locked Rajneesh Bhardwaj
@ 2020-01-28 22:42   ` Felix Kuehling
  2020-01-30 19:02     ` Bhardwaj, Rajneesh
  0 siblings, 1 reply; 23+ messages in thread
From: Felix Kuehling @ 2020-01-28 22:42 UTC (permalink / raw)
  To: Rajneesh Bhardwaj, amd-gfx; +Cc: Alexander.Deucher

On 2020-01-27 20:29, Rajneesh Bhardwaj wrote:
> During system suspend the kfd driver aquires a lock that prohibits
> further kfd actions unless the gpu is resumed. This adds some info which
> can be useful while debugging.
>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> index 1aebda4bbbe7..081cc5f40d18 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> @@ -127,6 +127,8 @@ static int kfd_open(struct inode *inode, struct file *filep)
>   		return PTR_ERR(process);
>   
>   	if (kfd_is_locked()) {
> +		dev_warn(kfd_device, "kfd is locked!\n"
> +				"process %d unreferenced", process->pasid);

If this is for debugging, make it dev_dbg. Printing warnings like this 
usually confuses people reporting completely unrelated problems that 
aren't even kernel problems at all. "Look, I found a warning in the 
kernel log. It must be a kernel problem."

Regards,
   Felix


>   		kfd_unref_process(process);
>   		return -EAGAIN;
>   	}
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco
  2020-01-28  1:29 ` [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco Rajneesh Bhardwaj
  2020-01-28 20:09   ` Zeng, Oak
@ 2020-01-29 22:52   ` Felix Kuehling
  2020-01-30 19:01     ` Bhardwaj, Rajneesh
  1 sibling, 1 reply; 23+ messages in thread
From: Felix Kuehling @ 2020-01-29 22:52 UTC (permalink / raw)
  To: Rajneesh Bhardwaj, amd-gfx; +Cc: Alexander.Deucher

HI Rajneesh,

See comments inline ...

And a general question: Why do you need to set the autosuspend_delay in 
so many places? Amdgpu only has a single call to this function during 
initialization.

On 2020-01-27 20:29, Rajneesh Bhardwaj wrote:
> So far the kfd driver implemented same routines for runtime and system
> wide suspend and resume (s2idle or mem). During system wide suspend the
> kfd aquires an atomic lock that prevents any more user processes to
> create queues and interact with kfd driver and amd gpu. This mechanism
> created problem when amdgpu device is runtime suspended with BACO
> enabled. Any application that relies on kfd driver fails to load because
> the driver reports a locked kfd device since gpu is runtime suspended.
>
> However, in an ideal case, when gpu is runtime  suspended the kfd driver
> should be able to:
>
>   - auto resume amdgpu driver whenever a client requests compute service
>   - prevent runtime suspend for amdgpu  while kfd is in use
>
> This change refactors the amdgpu and amdkfd drivers to support BACO and
> runtime runtime power management.
>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 ++++-----
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 +++---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +--
>   drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 31 +++++++++++++---------
>   drivers/gpu/drm/amd/amdkfd/kfd_iommu.c     |  5 +++-
>   drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 19 ++++++++++---
>   6 files changed, 51 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 8609287620ea..314c4a2a0354 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>   		kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
>   }
>   
> -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
>   {
>   	if (adev->kfd.dev)
> -		kgd2kfd_suspend(adev->kfd.dev);
> +		kgd2kfd_suspend(adev->kfd.dev, run_pm);
>   }
>   
> -int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
>   {
>   	int r = 0;
>   
>   	if (adev->kfd.dev)
> -		r = kgd2kfd_resume(adev->kfd.dev);
> +		r = kgd2kfd_resume(adev->kfd.dev, run_pm);
>   
>   	return r;
>   }
> @@ -713,11 +713,11 @@ void kgd2kfd_exit(void)
>   {
>   }
>   
> -void kgd2kfd_suspend(struct kfd_dev *kfd)
> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>   {
>   }
>   
> -int kgd2kfd_resume(struct kfd_dev *kfd)
> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>   {
>   	return 0;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 0fa898d30207..3dce4a51e522 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -123,8 +123,8 @@ struct amdkfd_process_info {
>   int amdgpu_amdkfd_init(void);
>   void amdgpu_amdkfd_fini(void);
>   
> -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev);
> -int amdgpu_amdkfd_resume(struct amdgpu_device *adev);
> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>   void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>   			const void *ih_ring_entry);
>   void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
> @@ -250,8 +250,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>   			 struct drm_device *ddev,
>   			 const struct kgd2kfd_shared_resources *gpu_resources);
>   void kgd2kfd_device_exit(struct kfd_dev *kfd);
> -void kgd2kfd_suspend(struct kfd_dev *kfd);
> -int kgd2kfd_resume(struct kfd_dev *kfd);
> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>   int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>   int kgd2kfd_post_reset(struct kfd_dev *kfd);
>   void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index d9e5eac182d3..787d49e9f4de 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3314,7 +3314,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>   		}
>   	}
>   
> -	amdgpu_amdkfd_suspend(adev);
> +	amdgpu_amdkfd_suspend(adev, fbcon);

The logic seems inverted here. There are four different pmops callbacks 
that use different values for this parameter:

* amdgpu_pmops_suspend: true
* amdgpu_pmops_freeze: true
* amdgpu_pmops_poweroff: true
* amdgpu_pmops_runtime_suspend: false

It looks like runtime_suspend uses false, so you should pass the run_pm 
parameter as !fbcon.


>   
>   	amdgpu_ras_suspend(adev);
>   
> @@ -3398,7 +3398,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
>   			}
>   		}
>   	}
> -	r = amdgpu_amdkfd_resume(adev);
> +	r = amdgpu_amdkfd_resume(adev, fbcon);
>   	if (r)
>   		return r;
>   
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 2a9e40131735..e9f00c3a067a 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -23,6 +23,7 @@
>   #include <linux/bsearch.h>
>   #include <linux/pci.h>
>   #include <linux/slab.h>
> +#include <linux/pm_runtime.h>
>   #include "kfd_priv.h"
>   #include "kfd_device_queue_manager.h"
>   #include "kfd_pm4_headers_vi.h"
> @@ -710,7 +711,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>   void kgd2kfd_device_exit(struct kfd_dev *kfd)
>   {
>   	if (kfd->init_complete) {
> -		kgd2kfd_suspend(kfd);
> +		kgd2kfd_suspend(kfd, true);
>   		device_queue_manager_uninit(kfd->dqm);
>   		kfd_interrupt_exit(kfd);
>   		kfd_topology_remove_device(kfd);
> @@ -731,7 +732,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>   
>   	kfd->dqm->ops.pre_reset(kfd->dqm);
>   
> -	kgd2kfd_suspend(kfd);
> +	kgd2kfd_suspend(kfd, true);

This should use false. This is not runtime PM.


>   
>   	kfd_signal_reset_event(kfd);
>   	return 0;
> @@ -765,21 +766,24 @@ bool kfd_is_locked(void)
>   	return  (atomic_read(&kfd_locked) > 0);
>   }
>   
> -void kgd2kfd_suspend(struct kfd_dev *kfd)
> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>   {
>   	if (!kfd->init_complete)
>   		return;
>   
> -	/* For first KFD device suspend all the KFD processes */
> -	if (atomic_inc_return(&kfd_locked) == 1)
> -		kfd_suspend_all_processes();
> +	/* for runtime suspend, skip locking kfd */
> +	if (run_pm) {

This should be done for the non-runtime PM case: if (!run_pm).


> +		/* For first KFD device suspend all the KFD processes */
> +		if (atomic_inc_return(&kfd_locked) == 1)
> +			kfd_suspend_all_processes();
> +	}
>   
> +	pm_runtime_set_autosuspend_delay(kfd->ddev->dev, 5000);
>   	kfd->dqm->ops.stop(kfd->dqm);
> -
>   	kfd_iommu_suspend(kfd);
>   }
>   
> -int kgd2kfd_resume(struct kfd_dev *kfd)
> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>   {
>   	int ret, count;
>   
> @@ -790,10 +794,13 @@ int kgd2kfd_resume(struct kfd_dev *kfd)
>   	if (ret)
>   		return ret;
>   
> -	count = atomic_dec_return(&kfd_locked);
> -	WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
> -	if (count == 0)
> -		ret = kfd_resume_all_processes();
> +	/* for runtime resume, skip unlocking kfd */
> +	if (run_pm) {

Same as above.


> +		count = atomic_dec_return(&kfd_locked);
> +		WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
> +		if (count == 0)
> +			ret = kfd_resume_all_processes();
> +	}
>   
>   	return ret;
>   }
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
> index 8d871514671e..6301d77ed3d6 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
> @@ -25,6 +25,7 @@
>   #include <linux/slab.h>
>   #include <linux/pci.h>
>   #include <linux/amd-iommu.h>
> +#include <linux/pm_runtime.h>
>   #include "kfd_priv.h"
>   #include "kfd_dbgmgr.h"
>   #include "kfd_topology.h"
> @@ -134,8 +135,10 @@ void kfd_iommu_unbind_process(struct kfd_process *p)
>   	struct kfd_process_device *pdd;
>   
>   	list_for_each_entry(pdd, &p->per_device_data, per_device_list)
> -		if (pdd->bound == PDD_BOUND)
> +		if (pdd->bound == PDD_BOUND) {
>   			amd_iommu_unbind_pasid(pdd->dev->pdev, p->pasid);
> +			pm_runtime_set_autosuspend_delay(pdd->dev->ddev->dev, 5000);

I don't think this is the right place. This code should only run on APUs 
with IOMMUv2 enabled. Probably missing a check at the start of the 
function. On kernels with IOMMU disabled, this source file is not 
compiled at all.

I think this code should go into kfd_process_destroy_pdds.


> +		}
>   }
>   
>   /* Callback for process shutdown invoked by the IOMMU driver */
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index 25b90f70aecd..d19d5e97405c 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -31,6 +31,7 @@
>   #include <linux/compat.h>
>   #include <linux/mman.h>
>   #include <linux/file.h>
> +#include <linux/pm_runtime.h>
>   #include "amdgpu_amdkfd.h"
>   #include "amdgpu.h"
>   
> @@ -843,15 +844,27 @@ struct kfd_process_device *kfd_bind_process_to_device(struct kfd_dev *dev,
>   		return ERR_PTR(-ENOMEM);
>   	}
>   
> +	err = pm_runtime_get_sync(dev->ddev->dev);
> +	pm_runtime_set_autosuspend_delay(dev->ddev->dev, 60000);

Why are you using a different hard-coded delay (60s?) here?

Regards,
   Felix

> +	if (err < 0)
> +		return ERR_PTR(err);
> +
>   	err = kfd_iommu_bind_process_to_device(pdd);
>   	if (err)
> -		return ERR_PTR(err);
> +		goto out;
>   
>   	err = kfd_process_device_init_vm(pdd, NULL);
>   	if (err)
> -		return ERR_PTR(err);
> +		goto out;
>   
> -	return pdd;
> +out:
> +	pm_runtime_mark_last_busy(dev->ddev->dev);
> +	pm_runtime_put_autosuspend(dev->ddev->dev);
> +
> +	if (!err)
> +		return pdd;
> +	else
> +		return ERR_PTR(err);
>   }
>   
>   struct kfd_process_device *kfd_get_first_process_device_data(
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco
  2020-01-29 22:52   ` Felix Kuehling
@ 2020-01-30 19:01     ` Bhardwaj, Rajneesh
  2020-01-30 21:55       ` Felix Kuehling
  0 siblings, 1 reply; 23+ messages in thread
From: Bhardwaj, Rajneesh @ 2020-01-30 19:01 UTC (permalink / raw)
  To: Felix Kuehling, amd-gfx; +Cc: Alexander.Deucher

Hello Felix,

Thanks for your time to review and for your feedback.

On 1/29/2020 5:52 PM, Felix Kuehling wrote:
> HI Rajneesh,
>
> See comments inline ...
>
> And a general question: Why do you need to set the autosuspend_delay 
> in so many places? Amdgpu only has a single call to this function 
> during initialization.


We don't. I have called this out in cover letter since i too feel its 
not the best way to do what we want to do. I have seen weird issues on 
dual GPUs with KFDTest that sometimes results in system hang and gpu 
hang etc.

Even if i try with running kfdtest on just one node in a multi gpu 
system, the baco exit seems to wake up all gpus and then one goes to 
auto suspend again while other performs the desired kfdtest operation. I 
have never seen any issue with a single gpu card. So in current approch, 
i am making sure that the GPUs are not quickly auto-suspended while the 
kfd operations are ongoing but once the kfdtest is finished, the runtime 
suspend call with ensure to reset the timeout back to 5 seconds.


I would like to avoid this and appreciate some pointers where i can put 
get_sync calls while unbinding. I have tried a number of places but saw 
some issues. So any help will be highly appreciated, to identify the 
proper logical inverse of kfd_bind_process_to_device.


>
> On 2020-01-27 20:29, Rajneesh Bhardwaj wrote:
>> So far the kfd driver implemented same routines for runtime and system
>> wide suspend and resume (s2idle or mem). During system wide suspend the
>> kfd aquires an atomic lock that prevents any more user processes to
>> create queues and interact with kfd driver and amd gpu. This mechanism
>> created problem when amdgpu device is runtime suspended with BACO
>> enabled. Any application that relies on kfd driver fails to load because
>> the driver reports a locked kfd device since gpu is runtime suspended.
>>
>> However, in an ideal case, when gpu is runtime  suspended the kfd driver
>> should be able to:
>>
>>   - auto resume amdgpu driver whenever a client requests compute service
>>   - prevent runtime suspend for amdgpu  while kfd is in use
>>
>> This change refactors the amdgpu and amdkfd drivers to support BACO and
>> runtime runtime power management.
>>
>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 ++++-----
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 +++---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +--
>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 31 +++++++++++++---------
>>   drivers/gpu/drm/amd/amdkfd/kfd_iommu.c     |  5 +++-
>>   drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 19 ++++++++++---
>>   6 files changed, 51 insertions(+), 28 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> index 8609287620ea..314c4a2a0354 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> @@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct 
>> amdgpu_device *adev,
>>           kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
>>   }
>>   -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
>> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
>>   {
>>       if (adev->kfd.dev)
>> -        kgd2kfd_suspend(adev->kfd.dev);
>> +        kgd2kfd_suspend(adev->kfd.dev, run_pm);
>>   }
>>   -int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
>> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
>>   {
>>       int r = 0;
>>         if (adev->kfd.dev)
>> -        r = kgd2kfd_resume(adev->kfd.dev);
>> +        r = kgd2kfd_resume(adev->kfd.dev, run_pm);
>>         return r;
>>   }
>> @@ -713,11 +713,11 @@ void kgd2kfd_exit(void)
>>   {
>>   }
>>   -void kgd2kfd_suspend(struct kfd_dev *kfd)
>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>   {
>>   }
>>   -int kgd2kfd_resume(struct kfd_dev *kfd)
>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>   {
>>       return 0;
>>   }
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> index 0fa898d30207..3dce4a51e522 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> @@ -123,8 +123,8 @@ struct amdkfd_process_info {
>>   int amdgpu_amdkfd_init(void);
>>   void amdgpu_amdkfd_fini(void);
>>   -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev);
>> -int amdgpu_amdkfd_resume(struct amdgpu_device *adev);
>> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>>   void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>               const void *ih_ring_entry);
>>   void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>> @@ -250,8 +250,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>                struct drm_device *ddev,
>>                const struct kgd2kfd_shared_resources *gpu_resources);
>>   void kgd2kfd_device_exit(struct kfd_dev *kfd);
>> -void kgd2kfd_suspend(struct kfd_dev *kfd);
>> -int kgd2kfd_resume(struct kfd_dev *kfd);
>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>>   int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>   int kgd2kfd_post_reset(struct kfd_dev *kfd);
>>   void kgd2kfd_interrupt(struct kfd_dev *kfd, const void 
>> *ih_ring_entry);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index d9e5eac182d3..787d49e9f4de 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3314,7 +3314,7 @@ int amdgpu_device_suspend(struct drm_device 
>> *dev, bool fbcon)
>>           }
>>       }
>>   -    amdgpu_amdkfd_suspend(adev);
>> +    amdgpu_amdkfd_suspend(adev, fbcon);
>
> The logic seems inverted here. There are four different pmops 
> callbacks that use different values for this parameter:
>
> * amdgpu_pmops_suspend: true
> * amdgpu_pmops_freeze: true
> * amdgpu_pmops_poweroff: true
> * amdgpu_pmops_runtime_suspend: false
>
> It looks like runtime_suspend uses false, so you should pass the 
> run_pm parameter as !fbcon.

Ok. will fix in v2.


>
>
>>         amdgpu_ras_suspend(adev);
>>   @@ -3398,7 +3398,7 @@ int amdgpu_device_resume(struct drm_device 
>> *dev, bool fbcon)
>>               }
>>           }
>>       }
>> -    r = amdgpu_amdkfd_resume(adev);
>> +    r = amdgpu_amdkfd_resume(adev, fbcon);
>>       if (r)
>>           return r;
>>   diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index 2a9e40131735..e9f00c3a067a 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -23,6 +23,7 @@
>>   #include <linux/bsearch.h>
>>   #include <linux/pci.h>
>>   #include <linux/slab.h>
>> +#include <linux/pm_runtime.h>
>>   #include "kfd_priv.h"
>>   #include "kfd_device_queue_manager.h"
>>   #include "kfd_pm4_headers_vi.h"
>> @@ -710,7 +711,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>   void kgd2kfd_device_exit(struct kfd_dev *kfd)
>>   {
>>       if (kfd->init_complete) {
>> -        kgd2kfd_suspend(kfd);
>> +        kgd2kfd_suspend(kfd, true);
>>           device_queue_manager_uninit(kfd->dqm);
>>           kfd_interrupt_exit(kfd);
>>           kfd_topology_remove_device(kfd);
>> @@ -731,7 +732,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>         kfd->dqm->ops.pre_reset(kfd->dqm);
>>   -    kgd2kfd_suspend(kfd);
>> +    kgd2kfd_suspend(kfd, true);
>
> This should use false. This is not runtime PM.
>
>
>>         kfd_signal_reset_event(kfd);
>>       return 0;
>> @@ -765,21 +766,24 @@ bool kfd_is_locked(void)
>>       return  (atomic_read(&kfd_locked) > 0);
>>   }
>>   -void kgd2kfd_suspend(struct kfd_dev *kfd)
>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>   {
>>       if (!kfd->init_complete)
>>           return;
>>   -    /* For first KFD device suspend all the KFD processes */
>> -    if (atomic_inc_return(&kfd_locked) == 1)
>> -        kfd_suspend_all_processes();
>> +    /* for runtime suspend, skip locking kfd */
>> +    if (run_pm) {
>
> This should be done for the non-runtime PM case: if (!run_pm).
>
>
>> +        /* For first KFD device suspend all the KFD processes */
>> +        if (atomic_inc_return(&kfd_locked) == 1)
>> +            kfd_suspend_all_processes();
>> +    }
>>   +    pm_runtime_set_autosuspend_delay(kfd->ddev->dev, 5000);
>>       kfd->dqm->ops.stop(kfd->dqm);
>> -
>>       kfd_iommu_suspend(kfd);
>>   }
>>   -int kgd2kfd_resume(struct kfd_dev *kfd)
>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>   {
>>       int ret, count;
>>   @@ -790,10 +794,13 @@ int kgd2kfd_resume(struct kfd_dev *kfd)
>>       if (ret)
>>           return ret;
>>   -    count = atomic_dec_return(&kfd_locked);
>> -    WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>> -    if (count == 0)
>> -        ret = kfd_resume_all_processes();
>> +    /* for runtime resume, skip unlocking kfd */
>> +    if (run_pm) {
>
> Same as above.
>
>
>> +        count = atomic_dec_return(&kfd_locked);
>> +        WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>> +        if (count == 0)
>> +            ret = kfd_resume_all_processes();
>> +    }
>>         return ret;
>>   }
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
>> index 8d871514671e..6301d77ed3d6 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
>> @@ -25,6 +25,7 @@
>>   #include <linux/slab.h>
>>   #include <linux/pci.h>
>>   #include <linux/amd-iommu.h>
>> +#include <linux/pm_runtime.h>
>>   #include "kfd_priv.h"
>>   #include "kfd_dbgmgr.h"
>>   #include "kfd_topology.h"
>> @@ -134,8 +135,10 @@ void kfd_iommu_unbind_process(struct kfd_process 
>> *p)
>>       struct kfd_process_device *pdd;
>>         list_for_each_entry(pdd, &p->per_device_data, per_device_list)
>> -        if (pdd->bound == PDD_BOUND)
>> +        if (pdd->bound == PDD_BOUND) {
>>               amd_iommu_unbind_pasid(pdd->dev->pdev, p->pasid);
>> + pm_runtime_set_autosuspend_delay(pdd->dev->ddev->dev, 5000);
>
> I don't think this is the right place. This code should only run on 
> APUs with IOMMUv2 enabled. Probably missing a check at the start of 
> the function. On kernels with IOMMU disabled, this source file is not 
> compiled at all.
>
> I think this code should go into kfd_process_destroy_pdds.


Ok. will fix in v2. So can we consider this as logical inverse of 
bind_process_to_device?


>
>
>> +        }
>>   }
>>     /* Callback for process shutdown invoked by the IOMMU driver */
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> index 25b90f70aecd..d19d5e97405c 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> @@ -31,6 +31,7 @@
>>   #include <linux/compat.h>
>>   #include <linux/mman.h>
>>   #include <linux/file.h>
>> +#include <linux/pm_runtime.h>
>>   #include "amdgpu_amdkfd.h"
>>   #include "amdgpu.h"
>>   @@ -843,15 +844,27 @@ struct kfd_process_device 
>> *kfd_bind_process_to_device(struct kfd_dev *dev,
>>           return ERR_PTR(-ENOMEM);
>>       }
>>   +    err = pm_runtime_get_sync(dev->ddev->dev);
>> +    pm_runtime_set_autosuspend_delay(dev->ddev->dev, 60000);
>
> Why are you using a different hard-coded delay (60s?) here?


It needs to be fixed. Explained above the reason for such deviation.


>
> Regards,
>   Felix
>

Thanks
Rajneesh


>> +    if (err < 0)
>> +        return ERR_PTR(err);
>> +
>>       err = kfd_iommu_bind_process_to_device(pdd);
>>       if (err)
>> -        return ERR_PTR(err);
>> +        goto out;
>>         err = kfd_process_device_init_vm(pdd, NULL);
>>       if (err)
>> -        return ERR_PTR(err);
>> +        goto out;
>>   -    return pdd;
>> +out:
>> +    pm_runtime_mark_last_busy(dev->ddev->dev);
>> +    pm_runtime_put_autosuspend(dev->ddev->dev);
>> +
>> +    if (!err)
>> +        return pdd;
>> +    else
>> +        return ERR_PTR(err);
>>   }
>>     struct kfd_process_device *kfd_get_first_process_device_data(
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v1 4/5] drm/amdkfd: show warning when kfd is locked
  2020-01-28 22:42   ` Felix Kuehling
@ 2020-01-30 19:02     ` Bhardwaj, Rajneesh
  0 siblings, 0 replies; 23+ messages in thread
From: Bhardwaj, Rajneesh @ 2020-01-30 19:02 UTC (permalink / raw)
  To: Felix Kuehling, amd-gfx; +Cc: Alexander.Deucher


On 1/28/2020 5:42 PM, Felix Kuehling wrote:
> On 2020-01-27 20:29, Rajneesh Bhardwaj wrote:
>> During system suspend the kfd driver aquires a lock that prohibits
>> further kfd actions unless the gpu is resumed. This adds some info which
>> can be useful while debugging.
>>
>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> index 1aebda4bbbe7..081cc5f40d18 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>> @@ -127,6 +127,8 @@ static int kfd_open(struct inode *inode, struct 
>> file *filep)
>>           return PTR_ERR(process);
>>         if (kfd_is_locked()) {
>> +        dev_warn(kfd_device, "kfd is locked!\n"
>> +                "process %d unreferenced", process->pasid);
>
> If this is for debugging, make it dev_dbg. Printing warnings like this 
> usually confuses people reporting completely unrelated problems that 
> aren't even kernel problems at all. "Look, I found a warning in the 
> kernel log. It must be a kernel problem."

Agree. Will fix in v2.


>
> Regards,
>   Felix
>
>
>>           kfd_unref_process(process);
>>           return -EAGAIN;
>>       }
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v1 3/5] drm/amdkfd: Introduce debugfs option to disable baco
  2020-01-28 20:22   ` Alex Deucher
@ 2020-01-30 19:05     ` Bhardwaj, Rajneesh
  0 siblings, 0 replies; 23+ messages in thread
From: Bhardwaj, Rajneesh @ 2020-01-30 19:05 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Deucher, Alexander, Kuehling, Felix, amd-gfx list

Hi Alex


Thanks for your time and feedback!


On 1/28/2020 3:22 PM, Alex Deucher wrote:
> [CAUTION: External Email]
>
> On Mon, Jan 27, 2020 at 8:30 PM Rajneesh Bhardwaj
> <rajneesh.bhardwaj@amd.com> wrote:
>> When BACO is enabled by default, sometimes it can cause additional
>> trouble to debug KFD issues. This debugfs override allows to temporarily
>> disable BACO for debug purpose without having to reboot the machine.
>>
>> However, in some cases one suspend-resume cycle might be needed if
>> the device is already runtime suspended.
>>
>> e.g
>>
>> sudo rtcwake -m < mem or freeze > -s 15
>>
>> or
>>
>> by triggering autosuspend event from user space, by doing something
>> like:
>>
>> echo 6000 > /sys/bus/pci/devices/0000\:03\:00.0/power/autosuspend_delay_ms
>>
>>      Usage:
>>
>> echo 0 > /sys/kernel/debug/kfd/enable_baco and run
>> cat /sys/kernel/debug/kfd/baco_status to verify whether BACO is
>> enabled or disabled by kfd driver.
>>
>> It should be noted that while enabling baco again via kfd override, we
>> must do the following steps:
>>
>> 1. echo 0 > /sys/kernel/debug/kfd/enable_baco
>> 2. sudo rtcwake -m < mem > -s 15
>>
>> In this case, we need GPU to be fully reset which is done by BIOS. This
>> is not possible in case of S2idle i.e. freeze so we must use mem for
>> sleep.
>>
> I think we can drop this patch in favor of just using the standard
> runtime pm control.  E.g.,
> /sys/class/drm/card0/device/power/control


Sure, i was using the /sys/bus/pci way to do it and found it was not 
easy. Since this sysfs exists, will drop the patch.


Regards

Rajneesh


>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Crajneesh.bhardwaj%40amd.com%7Cfdaaf630ee6548c6bd9108d7a42fe314%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637158397891190611&amp;sdata=3jE9jZbbw9IiCu7geMeCCsTC4u4tTdippeWYeSnX3oE%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v1 1/5] drm/amdgpu: always enable runtime power management
  2020-01-28 20:14   ` Alex Deucher
@ 2020-01-30 19:06     ` Bhardwaj, Rajneesh
  0 siblings, 0 replies; 23+ messages in thread
From: Bhardwaj, Rajneesh @ 2020-01-30 19:06 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Deucher, Alexander, Kuehling, Felix, amd-gfx list


On 1/28/2020 3:14 PM, Alex Deucher wrote:
> [CAUTION: External Email]
>
> On Mon, Jan 27, 2020 at 8:30 PM Rajneesh Bhardwaj
> <rajneesh.bhardwaj@amd.com> wrote:
>> This allows runtime power management to kick in on amdgpu driver when
>> the underlying hardware supports either BOCO or BACO. This can still be
>> avoided if boot arg amdgpu.runpm = 0 is supplied.
>>
>>          BOCO: Bus Off, Chip Off
>>          BACO: Bus Alive, Chip Off
>>
>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
> This patch should be the last one in the series, otherwise we'll
> enable runpm on BACO capable devices before the KFD code is in place.
> Also, it's only supported on VI and newer asics, so we should use this
> patch instead:
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.freedesktop.org%2Fpatch%2F335402%2F&amp;data=02%7C01%7Crajneesh.bhardwaj%40amd.com%7C01f67fc720d3423ee6b908d7a42eb68d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637158392852429503&amp;sdata=aU07GE56Vfb0JTSDsVDdyCdhxUkjHEVMAHiBaBC4V7g%3D&amp;reserved=0
>
> Alex

Thanks, Will fix in v2.


>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 9 ++++-----
>>   1 file changed, 4 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> index 3a0ea9096498..7958d508486e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> @@ -169,11 +169,10 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
>>                  goto out;
>>          }
>>
>> -       if (amdgpu_device_supports_boco(dev) &&
>> -           (amdgpu_runtime_pm != 0)) /* enable runpm by default */
>> -               adev->runpm = true;
>> -       else if (amdgpu_device_supports_baco(dev) &&
>> -                (amdgpu_runtime_pm > 0)) /* enable runpm if runpm=1 */
>> +       /* always enable runtime power management except when amdgpu.runpm=0 */
>> +       if ((amdgpu_device_supports_boco(dev) ||
>> +                       amdgpu_device_supports_baco(dev))
>> +                       && (amdgpu_runtime_pm != 0))
>>                  adev->runpm = true;
>>
>>          /* Call ACPI methods: require modeset init
>> --
>> 2.17.1
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Crajneesh.bhardwaj%40amd.com%7C01f67fc720d3423ee6b908d7a42eb68d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637158392852429503&amp;sdata=%2BXwHkoDeyA9Q%2FwnSyaND6QOc1SxpGuAHkZ4JdaTM3wU%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco
  2020-01-28 20:09   ` Zeng, Oak
@ 2020-01-30 19:08     ` Bhardwaj, Rajneesh
  0 siblings, 0 replies; 23+ messages in thread
From: Bhardwaj, Rajneesh @ 2020-01-30 19:08 UTC (permalink / raw)
  To: Zeng, Oak, amd-gfx; +Cc: Deucher, Alexander, Kuehling, Felix


On 1/28/2020 3:09 PM, Zeng, Oak wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
>
>
> Regards,
> Oak
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Rajneesh Bhardwaj
> Sent: Monday, January 27, 2020 8:29 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Bhardwaj, Rajneesh <Rajneesh.Bhardwaj@amd.com>
> Subject: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco
>
> So far the kfd driver implemented same routines for runtime and system wide suspend and resume (s2idle or mem). During system wide suspend the kfd aquires an atomic lock that prevents any more user processes to create queues and interact with kfd driver and amd gpu. This mechanism created problem when amdgpu device is runtime suspended with BACO enabled. Any application that relies on kfd driver fails to load because the driver reports a locked kfd device since gpu is runtime suspended.
>
> However, in an ideal case, when gpu is runtime  suspended the kfd driver should be able to:
>
>   - auto resume amdgpu driver whenever a client requests compute service
>   - prevent runtime suspend for amdgpu  while kfd is in use
>
> This change refactors the amdgpu and amdkfd drivers to support BACO and runtime runtime power management.
> [Oak] two runtime above


Thanks, will fix it.


>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 ++++-----  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 +++---  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +--
>   drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 31 +++++++++++++---------
>   drivers/gpu/drm/amd/amdkfd/kfd_iommu.c     |  5 +++-
>   drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 19 ++++++++++---
>   6 files changed, 51 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 8609287620ea..314c4a2a0354 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>   		kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);  }
>   
> -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
> [Oak] then name run_pm is a little bit confusing. Maybe system_wide_pm or none_runtime_pm?
>   {
>   	if (adev->kfd.dev)
> -		kgd2kfd_suspend(adev->kfd.dev);
> +		kgd2kfd_suspend(adev->kfd.dev, run_pm);
>   }
>   
> -int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
>   {
>   	int r = 0;
>   
>   	if (adev->kfd.dev)
> -		r = kgd2kfd_resume(adev->kfd.dev);
> +		r = kgd2kfd_resume(adev->kfd.dev, run_pm);
>   
>   	return r;
>   }
> @@ -713,11 +713,11 @@ void kgd2kfd_exit(void)  {  }
>   
> -void kgd2kfd_suspend(struct kfd_dev *kfd)
> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>   {
>   }
>   
> -int kgd2kfd_resume(struct kfd_dev *kfd)
> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>   {
>   	return 0;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 0fa898d30207..3dce4a51e522 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -123,8 +123,8 @@ struct amdkfd_process_info {  int amdgpu_amdkfd_init(void);  void amdgpu_amdkfd_fini(void);
>   
> -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev); -int amdgpu_amdkfd_resume(struct amdgpu_device *adev);
> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>   void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>   			const void *ih_ring_entry);
>   void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev); @@ -250,8 +250,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>   			 struct drm_device *ddev,
>   			 const struct kgd2kfd_shared_resources *gpu_resources);  void kgd2kfd_device_exit(struct kfd_dev *kfd); -void kgd2kfd_suspend(struct kfd_dev *kfd); -int kgd2kfd_resume(struct kfd_dev *kfd);
> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm); int
> +kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>   int kgd2kfd_pre_reset(struct kfd_dev *kfd);  int kgd2kfd_post_reset(struct kfd_dev *kfd);  void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index d9e5eac182d3..787d49e9f4de 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3314,7 +3314,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>   		}
>   	}
>   
> -	amdgpu_amdkfd_suspend(adev);
> +	amdgpu_amdkfd_suspend(adev, fbcon);
>   
>   	amdgpu_ras_suspend(adev);
>   
> @@ -3398,7 +3398,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
>   			}
>   		}
>   	}
> -	r = amdgpu_amdkfd_resume(adev);
> +	r = amdgpu_amdkfd_resume(adev, fbcon);
>   	if (r)
>   		return r;
>   
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 2a9e40131735..e9f00c3a067a 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -23,6 +23,7 @@
>   #include <linux/bsearch.h>
>   #include <linux/pci.h>
>   #include <linux/slab.h>
> +#include <linux/pm_runtime.h>
>   #include "kfd_priv.h"
>   #include "kfd_device_queue_manager.h"
>   #include "kfd_pm4_headers_vi.h"
> @@ -710,7 +711,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,  void kgd2kfd_device_exit(struct kfd_dev *kfd)  {
>   	if (kfd->init_complete) {
> -		kgd2kfd_suspend(kfd);
> +		kgd2kfd_suspend(kfd, true);
>   		device_queue_manager_uninit(kfd->dqm);
>   		kfd_interrupt_exit(kfd);
>   		kfd_topology_remove_device(kfd);
> @@ -731,7 +732,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>   
>   	kfd->dqm->ops.pre_reset(kfd->dqm);
>   
> -	kgd2kfd_suspend(kfd);
> +	kgd2kfd_suspend(kfd, true);
>   
>   	kfd_signal_reset_event(kfd);
>   	return 0;
> @@ -765,21 +766,24 @@ bool kfd_is_locked(void)
>   	return  (atomic_read(&kfd_locked) > 0);  }
>   
> -void kgd2kfd_suspend(struct kfd_dev *kfd)
> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>   {
>   	if (!kfd->init_complete)
>   		return;
>   
> -	/* For first KFD device suspend all the KFD processes */
> -	if (atomic_inc_return(&kfd_locked) == 1)
> -		kfd_suspend_all_processes();
> +	/* for runtime suspend, skip locking kfd */
> +	if (run_pm) {
> +		/* For first KFD device suspend all the KFD processes */
> +		if (atomic_inc_return(&kfd_locked) == 1)
> +			kfd_suspend_all_processes();
> +	}
>   
> +	pm_runtime_set_autosuspend_delay(kfd->ddev->dev, 5000);
> [Oak]: why this is necessary? This timeout value has already been set in driver load. See amdgpu_driver_load_kms
>   	kfd->dqm->ops.stop(kfd->dqm);
> -
>   	kfd_iommu_suspend(kfd);
>   }
>   
> -int kgd2kfd_resume(struct kfd_dev *kfd)
> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>   {
>   	int ret, count;
>   
> @@ -790,10 +794,13 @@ int kgd2kfd_resume(struct kfd_dev *kfd)
>   	if (ret)
>   		return ret;
>   
> -	count = atomic_dec_return(&kfd_locked);
> -	WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
> -	if (count == 0)
> -		ret = kfd_resume_all_processes();
> +	/* for runtime resume, skip unlocking kfd */
> +	if (run_pm) {
> +		count = atomic_dec_return(&kfd_locked);
> +		WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
> +		if (count == 0)
> +			ret = kfd_resume_all_processes();
> +	}
>   
>   	return ret;
>   }
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
> index 8d871514671e..6301d77ed3d6 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
> @@ -25,6 +25,7 @@
>   #include <linux/slab.h>
>   #include <linux/pci.h>
>   #include <linux/amd-iommu.h>
> +#include <linux/pm_runtime.h>
>   #include "kfd_priv.h"
>   #include "kfd_dbgmgr.h"
>   #include "kfd_topology.h"
> @@ -134,8 +135,10 @@ void kfd_iommu_unbind_process(struct kfd_process *p)
>   	struct kfd_process_device *pdd;
>   
>   	list_for_each_entry(pdd, &p->per_device_data, per_device_list)
> -		if (pdd->bound == PDD_BOUND)
> +		if (pdd->bound == PDD_BOUND) {
>   			amd_iommu_unbind_pasid(pdd->dev->pdev, p->pasid);
> +			pm_runtime_set_autosuspend_delay(pdd->dev->ddev->dev, 5000);
> [Oak] This only set the timeout. The correction function to call is pm_runtime_put_autosuspend?
> I think it is better to call it at the end of kfd_process_wq_release directly and call pm_runtime_mark_last_busy first.
> +		}
>   }
>   
>   /* Callback for process shutdown invoked by the IOMMU driver */ diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index 25b90f70aecd..d19d5e97405c 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -31,6 +31,7 @@
>   #include <linux/compat.h>
>   #include <linux/mman.h>
>   #include <linux/file.h>
> +#include <linux/pm_runtime.h>
>   #include "amdgpu_amdkfd.h"
>   #include "amdgpu.h"
>   
> @@ -843,15 +844,27 @@ struct kfd_process_device *kfd_bind_process_to_device(struct kfd_dev *dev,
>   		return ERR_PTR(-ENOMEM);
>   	}
>   
> +	err = pm_runtime_get_sync(dev->ddev->dev);
> +	pm_runtime_set_autosuspend_delay(dev->ddev->dev, 60000);
> [Oak]: The second call is not necessary
> +	if (err < 0)
> +		return ERR_PTR(err);
> +
>   	err = kfd_iommu_bind_process_to_device(pdd);
>   	if (err)
> -		return ERR_PTR(err);
> +		goto out;
>   
>   	err = kfd_process_device_init_vm(pdd, NULL);
>   	if (err)
> -		return ERR_PTR(err);
> +		goto out;
>   
> -	return pdd;
> +out:
> +	pm_runtime_mark_last_busy(dev->ddev->dev);
> +	pm_runtime_put_autosuspend(dev->ddev->dev);
> +
> +	if (!err)
> +		return pdd;
> +	else
> +		return ERR_PTR(err);
>   }
>   
>   struct kfd_process_device *kfd_get_first_process_device_data(
> --
> 2.17.1
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Coak.zeng%40amd.com%7Ce57e1eeb5fd74b5d8dea08d7a391a089%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637157718236258423&amp;sdata=zqM5YTz7qofPZjG3PmWKbHQ4sMMZjol1IlzaNTwQtcw%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco
  2020-01-30 19:01     ` Bhardwaj, Rajneesh
@ 2020-01-30 21:55       ` Felix Kuehling
  2020-01-30 22:11         ` Alex Deucher
  0 siblings, 1 reply; 23+ messages in thread
From: Felix Kuehling @ 2020-01-30 21:55 UTC (permalink / raw)
  To: Bhardwaj, Rajneesh, amd-gfx; +Cc: Alexander.Deucher

On 2020-01-30 14:01, Bhardwaj, Rajneesh wrote:
> Hello Felix,
>
> Thanks for your time to review and for your feedback.
>
> On 1/29/2020 5:52 PM, Felix Kuehling wrote:
>> HI Rajneesh,
>>
>> See comments inline ...
>>
>> And a general question: Why do you need to set the autosuspend_delay 
>> in so many places? Amdgpu only has a single call to this function 
>> during initialization.
>
>
> We don't. I have called this out in cover letter since i too feel its 
> not the best way to do what we want to do. I have seen weird issues on 
> dual GPUs with KFDTest that sometimes results in system hang and gpu 
> hang etc.
>
> Even if i try with running kfdtest on just one node in a multi gpu 
> system, the baco exit seems to wake up all gpus and then one goes to 
> auto suspend again while other performs the desired kfdtest operation. 
> I have never seen any issue with a single gpu card. So in current 
> approch, i am making sure that the GPUs are not quickly auto-suspended 
> while the kfd operations are ongoing but once the kfdtest is finished, 
> the runtime suspend call with ensure to reset the timeout back to 5 
> seconds.

So by setting the timeout to 60s, you can fix applications that run for 
less than 60s without calling into KFD. What about applications that run 
for longer without calling into KFD? This doesn't sound like a solution, 
it's just a hack.


>
>
> I would like to avoid this and appreciate some pointers where i can 
> put get_sync calls while unbinding. I have tried a number of places 
> but saw some issues. So any help will be highly appreciated, to 
> identify the proper logical inverse of kfd_bind_process_to_device.

kfd_bind_process_to_device is called when KFD starts using a device. It 
only stops using that device when the process is destroyed. Therefore I 
recommended moving the cleanup code into kfd_process_destroy_pdds, which 
iterates over all devices (PDDs) during process termination.

Note that kfd_bind_process_to_device can be called many times for the 
same device. If you need to keep usage counters balanced, you need extra 
checks to make sure you only do runtime-PM stuff the first time it's 
called for a particular device in a particular process.

Also, in kfd_process_destroy_pdds you may need to check whether 
kfd_bind_process_to_device has been called for a particular PDD before 
releasing runtime PM.

Regards,
   Felix


>
>
>>
>> On 2020-01-27 20:29, Rajneesh Bhardwaj wrote:
>>> So far the kfd driver implemented same routines for runtime and system
>>> wide suspend and resume (s2idle or mem). During system wide suspend the
>>> kfd aquires an atomic lock that prevents any more user processes to
>>> create queues and interact with kfd driver and amd gpu. This mechanism
>>> created problem when amdgpu device is runtime suspended with BACO
>>> enabled. Any application that relies on kfd driver fails to load 
>>> because
>>> the driver reports a locked kfd device since gpu is runtime suspended.
>>>
>>> However, in an ideal case, when gpu is runtime  suspended the kfd 
>>> driver
>>> should be able to:
>>>
>>>   - auto resume amdgpu driver whenever a client requests compute 
>>> service
>>>   - prevent runtime suspend for amdgpu  while kfd is in use
>>>
>>> This change refactors the amdgpu and amdkfd drivers to support BACO and
>>> runtime runtime power management.
>>>
>>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 ++++-----
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 +++---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +--
>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 31 
>>> +++++++++++++---------
>>>   drivers/gpu/drm/amd/amdkfd/kfd_iommu.c     |  5 +++-
>>>   drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 19 ++++++++++---
>>>   6 files changed, 51 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> index 8609287620ea..314c4a2a0354 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> @@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct 
>>> amdgpu_device *adev,
>>>           kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
>>>   }
>>>   -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
>>> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
>>>   {
>>>       if (adev->kfd.dev)
>>> -        kgd2kfd_suspend(adev->kfd.dev);
>>> +        kgd2kfd_suspend(adev->kfd.dev, run_pm);
>>>   }
>>>   -int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
>>> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
>>>   {
>>>       int r = 0;
>>>         if (adev->kfd.dev)
>>> -        r = kgd2kfd_resume(adev->kfd.dev);
>>> +        r = kgd2kfd_resume(adev->kfd.dev, run_pm);
>>>         return r;
>>>   }
>>> @@ -713,11 +713,11 @@ void kgd2kfd_exit(void)
>>>   {
>>>   }
>>>   -void kgd2kfd_suspend(struct kfd_dev *kfd)
>>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>   {
>>>   }
>>>   -int kgd2kfd_resume(struct kfd_dev *kfd)
>>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>   {
>>>       return 0;
>>>   }
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> index 0fa898d30207..3dce4a51e522 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>> @@ -123,8 +123,8 @@ struct amdkfd_process_info {
>>>   int amdgpu_amdkfd_init(void);
>>>   void amdgpu_amdkfd_fini(void);
>>>   -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev);
>>> -int amdgpu_amdkfd_resume(struct amdgpu_device *adev);
>>> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>>> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>>>   void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>>               const void *ih_ring_entry);
>>>   void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>>> @@ -250,8 +250,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>>                struct drm_device *ddev,
>>>                const struct kgd2kfd_shared_resources *gpu_resources);
>>>   void kgd2kfd_device_exit(struct kfd_dev *kfd);
>>> -void kgd2kfd_suspend(struct kfd_dev *kfd);
>>> -int kgd2kfd_resume(struct kfd_dev *kfd);
>>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>>>   int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>>   int kgd2kfd_post_reset(struct kfd_dev *kfd);
>>>   void kgd2kfd_interrupt(struct kfd_dev *kfd, const void 
>>> *ih_ring_entry);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index d9e5eac182d3..787d49e9f4de 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3314,7 +3314,7 @@ int amdgpu_device_suspend(struct drm_device 
>>> *dev, bool fbcon)
>>>           }
>>>       }
>>>   -    amdgpu_amdkfd_suspend(adev);
>>> +    amdgpu_amdkfd_suspend(adev, fbcon);
>>
>> The logic seems inverted here. There are four different pmops 
>> callbacks that use different values for this parameter:
>>
>> * amdgpu_pmops_suspend: true
>> * amdgpu_pmops_freeze: true
>> * amdgpu_pmops_poweroff: true
>> * amdgpu_pmops_runtime_suspend: false
>>
>> It looks like runtime_suspend uses false, so you should pass the 
>> run_pm parameter as !fbcon.
>
> Ok. will fix in v2.
>
>
>>
>>
>>>         amdgpu_ras_suspend(adev);
>>>   @@ -3398,7 +3398,7 @@ int amdgpu_device_resume(struct drm_device 
>>> *dev, bool fbcon)
>>>               }
>>>           }
>>>       }
>>> -    r = amdgpu_amdkfd_resume(adev);
>>> +    r = amdgpu_amdkfd_resume(adev, fbcon);
>>>       if (r)
>>>           return r;
>>>   diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index 2a9e40131735..e9f00c3a067a 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -23,6 +23,7 @@
>>>   #include <linux/bsearch.h>
>>>   #include <linux/pci.h>
>>>   #include <linux/slab.h>
>>> +#include <linux/pm_runtime.h>
>>>   #include "kfd_priv.h"
>>>   #include "kfd_device_queue_manager.h"
>>>   #include "kfd_pm4_headers_vi.h"
>>> @@ -710,7 +711,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>>   void kgd2kfd_device_exit(struct kfd_dev *kfd)
>>>   {
>>>       if (kfd->init_complete) {
>>> -        kgd2kfd_suspend(kfd);
>>> +        kgd2kfd_suspend(kfd, true);
>>>           device_queue_manager_uninit(kfd->dqm);
>>>           kfd_interrupt_exit(kfd);
>>>           kfd_topology_remove_device(kfd);
>>> @@ -731,7 +732,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>>         kfd->dqm->ops.pre_reset(kfd->dqm);
>>>   -    kgd2kfd_suspend(kfd);
>>> +    kgd2kfd_suspend(kfd, true);
>>
>> This should use false. This is not runtime PM.
>>
>>
>>>         kfd_signal_reset_event(kfd);
>>>       return 0;
>>> @@ -765,21 +766,24 @@ bool kfd_is_locked(void)
>>>       return  (atomic_read(&kfd_locked) > 0);
>>>   }
>>>   -void kgd2kfd_suspend(struct kfd_dev *kfd)
>>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>   {
>>>       if (!kfd->init_complete)
>>>           return;
>>>   -    /* For first KFD device suspend all the KFD processes */
>>> -    if (atomic_inc_return(&kfd_locked) == 1)
>>> -        kfd_suspend_all_processes();
>>> +    /* for runtime suspend, skip locking kfd */
>>> +    if (run_pm) {
>>
>> This should be done for the non-runtime PM case: if (!run_pm).
>>
>>
>>> +        /* For first KFD device suspend all the KFD processes */
>>> +        if (atomic_inc_return(&kfd_locked) == 1)
>>> +            kfd_suspend_all_processes();
>>> +    }
>>>   +    pm_runtime_set_autosuspend_delay(kfd->ddev->dev, 5000);
>>>       kfd->dqm->ops.stop(kfd->dqm);
>>> -
>>>       kfd_iommu_suspend(kfd);
>>>   }
>>>   -int kgd2kfd_resume(struct kfd_dev *kfd)
>>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>   {
>>>       int ret, count;
>>>   @@ -790,10 +794,13 @@ int kgd2kfd_resume(struct kfd_dev *kfd)
>>>       if (ret)
>>>           return ret;
>>>   -    count = atomic_dec_return(&kfd_locked);
>>> -    WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>> -    if (count == 0)
>>> -        ret = kfd_resume_all_processes();
>>> +    /* for runtime resume, skip unlocking kfd */
>>> +    if (run_pm) {
>>
>> Same as above.
>>
>>
>>> +        count = atomic_dec_return(&kfd_locked);
>>> +        WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>> +        if (count == 0)
>>> +            ret = kfd_resume_all_processes();
>>> +    }
>>>         return ret;
>>>   }
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
>>> index 8d871514671e..6301d77ed3d6 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
>>> @@ -25,6 +25,7 @@
>>>   #include <linux/slab.h>
>>>   #include <linux/pci.h>
>>>   #include <linux/amd-iommu.h>
>>> +#include <linux/pm_runtime.h>
>>>   #include "kfd_priv.h"
>>>   #include "kfd_dbgmgr.h"
>>>   #include "kfd_topology.h"
>>> @@ -134,8 +135,10 @@ void kfd_iommu_unbind_process(struct 
>>> kfd_process *p)
>>>       struct kfd_process_device *pdd;
>>>         list_for_each_entry(pdd, &p->per_device_data, per_device_list)
>>> -        if (pdd->bound == PDD_BOUND)
>>> +        if (pdd->bound == PDD_BOUND) {
>>>               amd_iommu_unbind_pasid(pdd->dev->pdev, p->pasid);
>>> + pm_runtime_set_autosuspend_delay(pdd->dev->ddev->dev, 5000);
>>
>> I don't think this is the right place. This code should only run on 
>> APUs with IOMMUv2 enabled. Probably missing a check at the start of 
>> the function. On kernels with IOMMU disabled, this source file is not 
>> compiled at all.
>>
>> I think this code should go into kfd_process_destroy_pdds.
>
>
> Ok. will fix in v2. So can we consider this as logical inverse of 
> bind_process_to_device?
>
>
>>
>>
>>> +        }
>>>   }
>>>     /* Callback for process shutdown invoked by the IOMMU driver */
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> index 25b90f70aecd..d19d5e97405c 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> @@ -31,6 +31,7 @@
>>>   #include <linux/compat.h>
>>>   #include <linux/mman.h>
>>>   #include <linux/file.h>
>>> +#include <linux/pm_runtime.h>
>>>   #include "amdgpu_amdkfd.h"
>>>   #include "amdgpu.h"
>>>   @@ -843,15 +844,27 @@ struct kfd_process_device 
>>> *kfd_bind_process_to_device(struct kfd_dev *dev,
>>>           return ERR_PTR(-ENOMEM);
>>>       }
>>>   +    err = pm_runtime_get_sync(dev->ddev->dev);
>>> +    pm_runtime_set_autosuspend_delay(dev->ddev->dev, 60000);
>>
>> Why are you using a different hard-coded delay (60s?) here?
>
>
> It needs to be fixed. Explained above the reason for such deviation.
>
>
>>
>> Regards,
>>   Felix
>>
>
> Thanks
> Rajneesh
>
>
>>> +    if (err < 0)
>>> +        return ERR_PTR(err);
>>> +
>>>       err = kfd_iommu_bind_process_to_device(pdd);
>>>       if (err)
>>> -        return ERR_PTR(err);
>>> +        goto out;
>>>         err = kfd_process_device_init_vm(pdd, NULL);
>>>       if (err)
>>> -        return ERR_PTR(err);
>>> +        goto out;
>>>   -    return pdd;
>>> +out:
>>> +    pm_runtime_mark_last_busy(dev->ddev->dev);
>>> +    pm_runtime_put_autosuspend(dev->ddev->dev);
>>> +
>>> +    if (!err)
>>> +        return pdd;
>>> +    else
>>> +        return ERR_PTR(err);
>>>   }
>>>     struct kfd_process_device *kfd_get_first_process_device_data(
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco
  2020-01-30 21:55       ` Felix Kuehling
@ 2020-01-30 22:11         ` Alex Deucher
  2020-01-30 23:24           ` Felix Kuehling
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Deucher @ 2020-01-30 22:11 UTC (permalink / raw)
  To: Felix Kuehling; +Cc: Deucher, Alexander, Bhardwaj, Rajneesh, amd-gfx list

On Thu, Jan 30, 2020 at 4:55 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>
> On 2020-01-30 14:01, Bhardwaj, Rajneesh wrote:
> > Hello Felix,
> >
> > Thanks for your time to review and for your feedback.
> >
> > On 1/29/2020 5:52 PM, Felix Kuehling wrote:
> >> HI Rajneesh,
> >>
> >> See comments inline ...
> >>
> >> And a general question: Why do you need to set the autosuspend_delay
> >> in so many places? Amdgpu only has a single call to this function
> >> during initialization.
> >
> >
> > We don't. I have called this out in cover letter since i too feel its
> > not the best way to do what we want to do. I have seen weird issues on
> > dual GPUs with KFDTest that sometimes results in system hang and gpu
> > hang etc.
> >
> > Even if i try with running kfdtest on just one node in a multi gpu
> > system, the baco exit seems to wake up all gpus and then one goes to
> > auto suspend again while other performs the desired kfdtest operation.
> > I have never seen any issue with a single gpu card. So in current
> > approch, i am making sure that the GPUs are not quickly auto-suspended
> > while the kfd operations are ongoing but once the kfdtest is finished,
> > the runtime suspend call with ensure to reset the timeout back to 5
> > seconds.
>
> So by setting the timeout to 60s, you can fix applications that run for
> less than 60s without calling into KFD. What about applications that run
> for longer without calling into KFD? This doesn't sound like a solution,
> it's just a hack.
>
>
> >
> >
> > I would like to avoid this and appreciate some pointers where i can
> > put get_sync calls while unbinding. I have tried a number of places
> > but saw some issues. So any help will be highly appreciated, to
> > identify the proper logical inverse of kfd_bind_process_to_device.
>
> kfd_bind_process_to_device is called when KFD starts using a device. It
> only stops using that device when the process is destroyed. Therefore I
> recommended moving the cleanup code into kfd_process_destroy_pdds, which
> iterates over all devices (PDDs) during process termination.
>
> Note that kfd_bind_process_to_device can be called many times for the
> same device. If you need to keep usage counters balanced, you need extra
> checks to make sure you only do runtime-PM stuff the first time it's
> called for a particular device in a particular process.
>
> Also, in kfd_process_destroy_pdds you may need to check whether
> kfd_bind_process_to_device has been called for a particular PDD before
> releasing runtime PM.

runtimepm already has reference counting.  You can use something like
pm_runtime_get_sync() or pm_runtime_get_noresume() depending on
whether you want to make sure the GPU is powered up or not.  That will
increase the usage count on the device for runtime pm.  Then when you
want to decrement and possibly suspend the device, call
pm_runtime_mark_last_busy() and pm_runtime_put_autosuspend().  That is
basically what we do in amdgpu to handle device usage.  We call
pm_runtime_get_noresume() in our fence_emit function which is what
gets called when we submit work to the GPU.  That increments the usage
count for runpm.  Then we call pm_runtime_mark_last_busy() and
pm_runtime_put_autosuspend() in our fence_process function.  That gets
called as each job on the GPU is retired.  For kfd, I think you'll
want to call pm_runtime_get_sync() in at the start of your ioctl to
wake the device, then pm_runtime_mark_last_busy() and
pm_runtime_put_autosuspend() at the end of your ioctl to mark to match
the start.  Then in your actual ioctl work, you'll want to call
pm_runtime_get_noresume() everytime a queue gets created and
pm_runtime_mark_last_busy() and pm_runtime_put_autosuspend() each time
a queue is destroyed.  That should keep the device awake as long as
there are any queues around.  Due to stuff like xgmi, it's probably
best to just do this for all GPUs on the system just in case you have
any p2p stuff going on.

Alex

>
> Regards,
>    Felix
>
>
> >
> >
> >>
> >> On 2020-01-27 20:29, Rajneesh Bhardwaj wrote:
> >>> So far the kfd driver implemented same routines for runtime and system
> >>> wide suspend and resume (s2idle or mem). During system wide suspend the
> >>> kfd aquires an atomic lock that prevents any more user processes to
> >>> create queues and interact with kfd driver and amd gpu. This mechanism
> >>> created problem when amdgpu device is runtime suspended with BACO
> >>> enabled. Any application that relies on kfd driver fails to load
> >>> because
> >>> the driver reports a locked kfd device since gpu is runtime suspended.
> >>>
> >>> However, in an ideal case, when gpu is runtime  suspended the kfd
> >>> driver
> >>> should be able to:
> >>>
> >>>   - auto resume amdgpu driver whenever a client requests compute
> >>> service
> >>>   - prevent runtime suspend for amdgpu  while kfd is in use
> >>>
> >>> This change refactors the amdgpu and amdkfd drivers to support BACO and
> >>> runtime runtime power management.
> >>>
> >>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
> >>> ---
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 ++++-----
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 +++---
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +--
> >>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 31
> >>> +++++++++++++---------
> >>>   drivers/gpu/drm/amd/amdkfd/kfd_iommu.c     |  5 +++-
> >>>   drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 19 ++++++++++---
> >>>   6 files changed, 51 insertions(+), 28 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> >>> index 8609287620ea..314c4a2a0354 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> >>> @@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct
> >>> amdgpu_device *adev,
> >>>           kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
> >>>   }
> >>>   -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
> >>> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
> >>>   {
> >>>       if (adev->kfd.dev)
> >>> -        kgd2kfd_suspend(adev->kfd.dev);
> >>> +        kgd2kfd_suspend(adev->kfd.dev, run_pm);
> >>>   }
> >>>   -int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
> >>> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
> >>>   {
> >>>       int r = 0;
> >>>         if (adev->kfd.dev)
> >>> -        r = kgd2kfd_resume(adev->kfd.dev);
> >>> +        r = kgd2kfd_resume(adev->kfd.dev, run_pm);
> >>>         return r;
> >>>   }
> >>> @@ -713,11 +713,11 @@ void kgd2kfd_exit(void)
> >>>   {
> >>>   }
> >>>   -void kgd2kfd_suspend(struct kfd_dev *kfd)
> >>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
> >>>   {
> >>>   }
> >>>   -int kgd2kfd_resume(struct kfd_dev *kfd)
> >>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
> >>>   {
> >>>       return 0;
> >>>   }
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> >>> index 0fa898d30207..3dce4a51e522 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> >>> @@ -123,8 +123,8 @@ struct amdkfd_process_info {
> >>>   int amdgpu_amdkfd_init(void);
> >>>   void amdgpu_amdkfd_fini(void);
> >>>   -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev);
> >>> -int amdgpu_amdkfd_resume(struct amdgpu_device *adev);
> >>> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
> >>> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
> >>>   void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
> >>>               const void *ih_ring_entry);
> >>>   void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
> >>> @@ -250,8 +250,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
> >>>                struct drm_device *ddev,
> >>>                const struct kgd2kfd_shared_resources *gpu_resources);
> >>>   void kgd2kfd_device_exit(struct kfd_dev *kfd);
> >>> -void kgd2kfd_suspend(struct kfd_dev *kfd);
> >>> -int kgd2kfd_resume(struct kfd_dev *kfd);
> >>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
> >>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
> >>>   int kgd2kfd_pre_reset(struct kfd_dev *kfd);
> >>>   int kgd2kfd_post_reset(struct kfd_dev *kfd);
> >>>   void kgd2kfd_interrupt(struct kfd_dev *kfd, const void
> >>> *ih_ring_entry);
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> index d9e5eac182d3..787d49e9f4de 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> @@ -3314,7 +3314,7 @@ int amdgpu_device_suspend(struct drm_device
> >>> *dev, bool fbcon)
> >>>           }
> >>>       }
> >>>   -    amdgpu_amdkfd_suspend(adev);
> >>> +    amdgpu_amdkfd_suspend(adev, fbcon);
> >>
> >> The logic seems inverted here. There are four different pmops
> >> callbacks that use different values for this parameter:
> >>
> >> * amdgpu_pmops_suspend: true
> >> * amdgpu_pmops_freeze: true
> >> * amdgpu_pmops_poweroff: true
> >> * amdgpu_pmops_runtime_suspend: false
> >>
> >> It looks like runtime_suspend uses false, so you should pass the
> >> run_pm parameter as !fbcon.
> >
> > Ok. will fix in v2.
> >
> >
> >>
> >>
> >>>         amdgpu_ras_suspend(adev);
> >>>   @@ -3398,7 +3398,7 @@ int amdgpu_device_resume(struct drm_device
> >>> *dev, bool fbcon)
> >>>               }
> >>>           }
> >>>       }
> >>> -    r = amdgpu_amdkfd_resume(adev);
> >>> +    r = amdgpu_amdkfd_resume(adev, fbcon);
> >>>       if (r)
> >>>           return r;
> >>>   diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> >>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> >>> index 2a9e40131735..e9f00c3a067a 100644
> >>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> >>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> >>> @@ -23,6 +23,7 @@
> >>>   #include <linux/bsearch.h>
> >>>   #include <linux/pci.h>
> >>>   #include <linux/slab.h>
> >>> +#include <linux/pm_runtime.h>
> >>>   #include "kfd_priv.h"
> >>>   #include "kfd_device_queue_manager.h"
> >>>   #include "kfd_pm4_headers_vi.h"
> >>> @@ -710,7 +711,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
> >>>   void kgd2kfd_device_exit(struct kfd_dev *kfd)
> >>>   {
> >>>       if (kfd->init_complete) {
> >>> -        kgd2kfd_suspend(kfd);
> >>> +        kgd2kfd_suspend(kfd, true);
> >>>           device_queue_manager_uninit(kfd->dqm);
> >>>           kfd_interrupt_exit(kfd);
> >>>           kfd_topology_remove_device(kfd);
> >>> @@ -731,7 +732,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
> >>>         kfd->dqm->ops.pre_reset(kfd->dqm);
> >>>   -    kgd2kfd_suspend(kfd);
> >>> +    kgd2kfd_suspend(kfd, true);
> >>
> >> This should use false. This is not runtime PM.
> >>
> >>
> >>>         kfd_signal_reset_event(kfd);
> >>>       return 0;
> >>> @@ -765,21 +766,24 @@ bool kfd_is_locked(void)
> >>>       return  (atomic_read(&kfd_locked) > 0);
> >>>   }
> >>>   -void kgd2kfd_suspend(struct kfd_dev *kfd)
> >>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
> >>>   {
> >>>       if (!kfd->init_complete)
> >>>           return;
> >>>   -    /* For first KFD device suspend all the KFD processes */
> >>> -    if (atomic_inc_return(&kfd_locked) == 1)
> >>> -        kfd_suspend_all_processes();
> >>> +    /* for runtime suspend, skip locking kfd */
> >>> +    if (run_pm) {
> >>
> >> This should be done for the non-runtime PM case: if (!run_pm).
> >>
> >>
> >>> +        /* For first KFD device suspend all the KFD processes */
> >>> +        if (atomic_inc_return(&kfd_locked) == 1)
> >>> +            kfd_suspend_all_processes();
> >>> +    }
> >>>   +    pm_runtime_set_autosuspend_delay(kfd->ddev->dev, 5000);
> >>>       kfd->dqm->ops.stop(kfd->dqm);
> >>> -
> >>>       kfd_iommu_suspend(kfd);
> >>>   }
> >>>   -int kgd2kfd_resume(struct kfd_dev *kfd)
> >>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
> >>>   {
> >>>       int ret, count;
> >>>   @@ -790,10 +794,13 @@ int kgd2kfd_resume(struct kfd_dev *kfd)
> >>>       if (ret)
> >>>           return ret;
> >>>   -    count = atomic_dec_return(&kfd_locked);
> >>> -    WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
> >>> -    if (count == 0)
> >>> -        ret = kfd_resume_all_processes();
> >>> +    /* for runtime resume, skip unlocking kfd */
> >>> +    if (run_pm) {
> >>
> >> Same as above.
> >>
> >>
> >>> +        count = atomic_dec_return(&kfd_locked);
> >>> +        WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
> >>> +        if (count == 0)
> >>> +            ret = kfd_resume_all_processes();
> >>> +    }
> >>>         return ret;
> >>>   }
> >>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
> >>> b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
> >>> index 8d871514671e..6301d77ed3d6 100644
> >>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
> >>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
> >>> @@ -25,6 +25,7 @@
> >>>   #include <linux/slab.h>
> >>>   #include <linux/pci.h>
> >>>   #include <linux/amd-iommu.h>
> >>> +#include <linux/pm_runtime.h>
> >>>   #include "kfd_priv.h"
> >>>   #include "kfd_dbgmgr.h"
> >>>   #include "kfd_topology.h"
> >>> @@ -134,8 +135,10 @@ void kfd_iommu_unbind_process(struct
> >>> kfd_process *p)
> >>>       struct kfd_process_device *pdd;
> >>>         list_for_each_entry(pdd, &p->per_device_data, per_device_list)
> >>> -        if (pdd->bound == PDD_BOUND)
> >>> +        if (pdd->bound == PDD_BOUND) {
> >>>               amd_iommu_unbind_pasid(pdd->dev->pdev, p->pasid);
> >>> + pm_runtime_set_autosuspend_delay(pdd->dev->ddev->dev, 5000);
> >>
> >> I don't think this is the right place. This code should only run on
> >> APUs with IOMMUv2 enabled. Probably missing a check at the start of
> >> the function. On kernels with IOMMU disabled, this source file is not
> >> compiled at all.
> >>
> >> I think this code should go into kfd_process_destroy_pdds.
> >
> >
> > Ok. will fix in v2. So can we consider this as logical inverse of
> > bind_process_to_device?
> >
> >
> >>
> >>
> >>> +        }
> >>>   }
> >>>     /* Callback for process shutdown invoked by the IOMMU driver */
> >>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >>> index 25b90f70aecd..d19d5e97405c 100644
> >>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >>> @@ -31,6 +31,7 @@
> >>>   #include <linux/compat.h>
> >>>   #include <linux/mman.h>
> >>>   #include <linux/file.h>
> >>> +#include <linux/pm_runtime.h>
> >>>   #include "amdgpu_amdkfd.h"
> >>>   #include "amdgpu.h"
> >>>   @@ -843,15 +844,27 @@ struct kfd_process_device
> >>> *kfd_bind_process_to_device(struct kfd_dev *dev,
> >>>           return ERR_PTR(-ENOMEM);
> >>>       }
> >>>   +    err = pm_runtime_get_sync(dev->ddev->dev);
> >>> +    pm_runtime_set_autosuspend_delay(dev->ddev->dev, 60000);
> >>
> >> Why are you using a different hard-coded delay (60s?) here?
> >
> >
> > It needs to be fixed. Explained above the reason for such deviation.
> >
> >
> >>
> >> Regards,
> >>   Felix
> >>
> >
> > Thanks
> > Rajneesh
> >
> >
> >>> +    if (err < 0)
> >>> +        return ERR_PTR(err);
> >>> +
> >>>       err = kfd_iommu_bind_process_to_device(pdd);
> >>>       if (err)
> >>> -        return ERR_PTR(err);
> >>> +        goto out;
> >>>         err = kfd_process_device_init_vm(pdd, NULL);
> >>>       if (err)
> >>> -        return ERR_PTR(err);
> >>> +        goto out;
> >>>   -    return pdd;
> >>> +out:
> >>> +    pm_runtime_mark_last_busy(dev->ddev->dev);
> >>> +    pm_runtime_put_autosuspend(dev->ddev->dev);
> >>> +
> >>> +    if (!err)
> >>> +        return pdd;
> >>> +    else
> >>> +        return ERR_PTR(err);
> >>>   }
> >>>     struct kfd_process_device *kfd_get_first_process_device_data(
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco
  2020-01-30 22:11         ` Alex Deucher
@ 2020-01-30 23:24           ` Felix Kuehling
  2020-01-31  0:53             ` Zeng, Oak
  0 siblings, 1 reply; 23+ messages in thread
From: Felix Kuehling @ 2020-01-30 23:24 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Deucher, Alexander, Bhardwaj, Rajneesh, amd-gfx list


On 2020-01-30 17:11, Alex Deucher wrote:
> On Thu, Jan 30, 2020 at 4:55 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>> On 2020-01-30 14:01, Bhardwaj, Rajneesh wrote:
>>> Hello Felix,
>>>
>>> Thanks for your time to review and for your feedback.
>>>
>>> On 1/29/2020 5:52 PM, Felix Kuehling wrote:
>>>> HI Rajneesh,
>>>>
>>>> See comments inline ...
>>>>
>>>> And a general question: Why do you need to set the autosuspend_delay
>>>> in so many places? Amdgpu only has a single call to this function
>>>> during initialization.
>>>
>>> We don't. I have called this out in cover letter since i too feel its
>>> not the best way to do what we want to do. I have seen weird issues on
>>> dual GPUs with KFDTest that sometimes results in system hang and gpu
>>> hang etc.
>>>
>>> Even if i try with running kfdtest on just one node in a multi gpu
>>> system, the baco exit seems to wake up all gpus and then one goes to
>>> auto suspend again while other performs the desired kfdtest operation.
>>> I have never seen any issue with a single gpu card. So in current
>>> approch, i am making sure that the GPUs are not quickly auto-suspended
>>> while the kfd operations are ongoing but once the kfdtest is finished,
>>> the runtime suspend call with ensure to reset the timeout back to 5
>>> seconds.
>> So by setting the timeout to 60s, you can fix applications that run for
>> less than 60s without calling into KFD. What about applications that run
>> for longer without calling into KFD? This doesn't sound like a solution,
>> it's just a hack.
>>
>>
>>>
>>> I would like to avoid this and appreciate some pointers where i can
>>> put get_sync calls while unbinding. I have tried a number of places
>>> but saw some issues. So any help will be highly appreciated, to
>>> identify the proper logical inverse of kfd_bind_process_to_device.
>> kfd_bind_process_to_device is called when KFD starts using a device. It
>> only stops using that device when the process is destroyed. Therefore I
>> recommended moving the cleanup code into kfd_process_destroy_pdds, which
>> iterates over all devices (PDDs) during process termination.
>>
>> Note that kfd_bind_process_to_device can be called many times for the
>> same device. If you need to keep usage counters balanced, you need extra
>> checks to make sure you only do runtime-PM stuff the first time it's
>> called for a particular device in a particular process.
>>
>> Also, in kfd_process_destroy_pdds you may need to check whether
>> kfd_bind_process_to_device has been called for a particular PDD before
>> releasing runtime PM.
> runtimepm already has reference counting.  You can use something like
> pm_runtime_get_sync() or pm_runtime_get_noresume() depending on
> whether you want to make sure the GPU is powered up or not.  That will
> increase the usage count on the device for runtime pm.  Then when you
> want to decrement and possibly suspend the device, call
> pm_runtime_mark_last_busy() and pm_runtime_put_autosuspend().  That is
> basically what we do in amdgpu to handle device usage.

My point is, kfd_bind_process_to_device can be called many times. If we 
call pm_runtime_sync_get, we'll need to release an unpredictable number 
of usage counts later. Therefore I suggested ensuring we only do this 
the first time kfd_bind_process_to_device is called for a particular 
process-device.

Similarly, in kfd_process_destroy_pdds we should only release the usage 
count if we actually took it in kfd_bind_process_to_device.


>    We call
> pm_runtime_get_noresume() in our fence_emit function which is what
> gets called when we submit work to the GPU.  That increments the usage
> count for runpm.  Then we call pm_runtime_mark_last_busy() and
> pm_runtime_put_autosuspend() in our fence_process function.  That gets
> called as each job on the GPU is retired.  For kfd, I think you'll
> want to call pm_runtime_get_sync() in at the start of your ioctl to
> wake the device, then pm_runtime_mark_last_busy() and
> pm_runtime_put_autosuspend() at the end of your ioctl to mark to match
> the start.  Then in your actual ioctl work, you'll want to call
> pm_runtime_get_noresume() everytime a queue gets created and
> pm_runtime_mark_last_busy() and pm_runtime_put_autosuspend() each time
> a queue is destroyed.  That should keep the device awake as long as
> there are any queues around.  Due to stuff like xgmi, it's probably
> best to just do this for all GPUs on the system just in case you have
> any p2p stuff going on.

OK, that's a completely different approach from what I had in mind. 
Basically tying device usage to both ioctls and the existence of user 
mode queues. P2P adds extra complication. Not just XGMI but also PCIe 
P2P where GPU A wants to access GPU B's memory but GPU B doesn't have 
any activity to keep it awake. To simplify this, we could also associate 
usage counters with allocating memory. That way any GPU that has memory 
that may be remotely accessed would be kept awake.

Always keeping all GPUs awake seems wasteful. I think being able to 
power down unused GPUs would be a very useful feature.

I still believe my idea of taking a usage counter in 
kfd_bind_process_to_device and dropping it in kfd_process_destroy_pdds 
is simpler and just as effective.

Regards,
   Felix

>
> Alex
>
>> Regards,
>>     Felix
>>
>>
>>>
>>>> On 2020-01-27 20:29, Rajneesh Bhardwaj wrote:
>>>>> So far the kfd driver implemented same routines for runtime and system
>>>>> wide suspend and resume (s2idle or mem). During system wide suspend the
>>>>> kfd aquires an atomic lock that prevents any more user processes to
>>>>> create queues and interact with kfd driver and amd gpu. This mechanism
>>>>> created problem when amdgpu device is runtime suspended with BACO
>>>>> enabled. Any application that relies on kfd driver fails to load
>>>>> because
>>>>> the driver reports a locked kfd device since gpu is runtime suspended.
>>>>>
>>>>> However, in an ideal case, when gpu is runtime  suspended the kfd
>>>>> driver
>>>>> should be able to:
>>>>>
>>>>>    - auto resume amdgpu driver whenever a client requests compute
>>>>> service
>>>>>    - prevent runtime suspend for amdgpu  while kfd is in use
>>>>>
>>>>> This change refactors the amdgpu and amdkfd drivers to support BACO and
>>>>> runtime runtime power management.
>>>>>
>>>>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 ++++-----
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 +++---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +--
>>>>>    drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 31
>>>>> +++++++++++++---------
>>>>>    drivers/gpu/drm/amd/amdkfd/kfd_iommu.c     |  5 +++-
>>>>>    drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 19 ++++++++++---
>>>>>    6 files changed, 51 insertions(+), 28 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> index 8609287620ea..314c4a2a0354 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> @@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct
>>>>> amdgpu_device *adev,
>>>>>            kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
>>>>>    }
>>>>>    -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
>>>>> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
>>>>>    {
>>>>>        if (adev->kfd.dev)
>>>>> -        kgd2kfd_suspend(adev->kfd.dev);
>>>>> +        kgd2kfd_suspend(adev->kfd.dev, run_pm);
>>>>>    }
>>>>>    -int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
>>>>> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
>>>>>    {
>>>>>        int r = 0;
>>>>>          if (adev->kfd.dev)
>>>>> -        r = kgd2kfd_resume(adev->kfd.dev);
>>>>> +        r = kgd2kfd_resume(adev->kfd.dev, run_pm);
>>>>>          return r;
>>>>>    }
>>>>> @@ -713,11 +713,11 @@ void kgd2kfd_exit(void)
>>>>>    {
>>>>>    }
>>>>>    -void kgd2kfd_suspend(struct kfd_dev *kfd)
>>>>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>>>    {
>>>>>    }
>>>>>    -int kgd2kfd_resume(struct kfd_dev *kfd)
>>>>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>>>    {
>>>>>        return 0;
>>>>>    }
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> index 0fa898d30207..3dce4a51e522 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> @@ -123,8 +123,8 @@ struct amdkfd_process_info {
>>>>>    int amdgpu_amdkfd_init(void);
>>>>>    void amdgpu_amdkfd_fini(void);
>>>>>    -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev);
>>>>> -int amdgpu_amdkfd_resume(struct amdgpu_device *adev);
>>>>> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>>>>> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>>>>>    void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>>>>                const void *ih_ring_entry);
>>>>>    void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>>>>> @@ -250,8 +250,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>>>>                 struct drm_device *ddev,
>>>>>                 const struct kgd2kfd_shared_resources *gpu_resources);
>>>>>    void kgd2kfd_device_exit(struct kfd_dev *kfd);
>>>>> -void kgd2kfd_suspend(struct kfd_dev *kfd);
>>>>> -int kgd2kfd_resume(struct kfd_dev *kfd);
>>>>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>>>>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>>>>>    int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>>>>    int kgd2kfd_post_reset(struct kfd_dev *kfd);
>>>>>    void kgd2kfd_interrupt(struct kfd_dev *kfd, const void
>>>>> *ih_ring_entry);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index d9e5eac182d3..787d49e9f4de 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -3314,7 +3314,7 @@ int amdgpu_device_suspend(struct drm_device
>>>>> *dev, bool fbcon)
>>>>>            }
>>>>>        }
>>>>>    -    amdgpu_amdkfd_suspend(adev);
>>>>> +    amdgpu_amdkfd_suspend(adev, fbcon);
>>>> The logic seems inverted here. There are four different pmops
>>>> callbacks that use different values for this parameter:
>>>>
>>>> * amdgpu_pmops_suspend: true
>>>> * amdgpu_pmops_freeze: true
>>>> * amdgpu_pmops_poweroff: true
>>>> * amdgpu_pmops_runtime_suspend: false
>>>>
>>>> It looks like runtime_suspend uses false, so you should pass the
>>>> run_pm parameter as !fbcon.
>>> Ok. will fix in v2.
>>>
>>>
>>>>
>>>>>          amdgpu_ras_suspend(adev);
>>>>>    @@ -3398,7 +3398,7 @@ int amdgpu_device_resume(struct drm_device
>>>>> *dev, bool fbcon)
>>>>>                }
>>>>>            }
>>>>>        }
>>>>> -    r = amdgpu_amdkfd_resume(adev);
>>>>> +    r = amdgpu_amdkfd_resume(adev, fbcon);
>>>>>        if (r)
>>>>>            return r;
>>>>>    diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> index 2a9e40131735..e9f00c3a067a 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> @@ -23,6 +23,7 @@
>>>>>    #include <linux/bsearch.h>
>>>>>    #include <linux/pci.h>
>>>>>    #include <linux/slab.h>
>>>>> +#include <linux/pm_runtime.h>
>>>>>    #include "kfd_priv.h"
>>>>>    #include "kfd_device_queue_manager.h"
>>>>>    #include "kfd_pm4_headers_vi.h"
>>>>> @@ -710,7 +711,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>>>>    void kgd2kfd_device_exit(struct kfd_dev *kfd)
>>>>>    {
>>>>>        if (kfd->init_complete) {
>>>>> -        kgd2kfd_suspend(kfd);
>>>>> +        kgd2kfd_suspend(kfd, true);
>>>>>            device_queue_manager_uninit(kfd->dqm);
>>>>>            kfd_interrupt_exit(kfd);
>>>>>            kfd_topology_remove_device(kfd);
>>>>> @@ -731,7 +732,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>>>>          kfd->dqm->ops.pre_reset(kfd->dqm);
>>>>>    -    kgd2kfd_suspend(kfd);
>>>>> +    kgd2kfd_suspend(kfd, true);
>>>> This should use false. This is not runtime PM.
>>>>
>>>>
>>>>>          kfd_signal_reset_event(kfd);
>>>>>        return 0;
>>>>> @@ -765,21 +766,24 @@ bool kfd_is_locked(void)
>>>>>        return  (atomic_read(&kfd_locked) > 0);
>>>>>    }
>>>>>    -void kgd2kfd_suspend(struct kfd_dev *kfd)
>>>>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>>>    {
>>>>>        if (!kfd->init_complete)
>>>>>            return;
>>>>>    -    /* For first KFD device suspend all the KFD processes */
>>>>> -    if (atomic_inc_return(&kfd_locked) == 1)
>>>>> -        kfd_suspend_all_processes();
>>>>> +    /* for runtime suspend, skip locking kfd */
>>>>> +    if (run_pm) {
>>>> This should be done for the non-runtime PM case: if (!run_pm).
>>>>
>>>>
>>>>> +        /* For first KFD device suspend all the KFD processes */
>>>>> +        if (atomic_inc_return(&kfd_locked) == 1)
>>>>> +            kfd_suspend_all_processes();
>>>>> +    }
>>>>>    +    pm_runtime_set_autosuspend_delay(kfd->ddev->dev, 5000);
>>>>>        kfd->dqm->ops.stop(kfd->dqm);
>>>>> -
>>>>>        kfd_iommu_suspend(kfd);
>>>>>    }
>>>>>    -int kgd2kfd_resume(struct kfd_dev *kfd)
>>>>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>>>    {
>>>>>        int ret, count;
>>>>>    @@ -790,10 +794,13 @@ int kgd2kfd_resume(struct kfd_dev *kfd)
>>>>>        if (ret)
>>>>>            return ret;
>>>>>    -    count = atomic_dec_return(&kfd_locked);
>>>>> -    WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>>>> -    if (count == 0)
>>>>> -        ret = kfd_resume_all_processes();
>>>>> +    /* for runtime resume, skip unlocking kfd */
>>>>> +    if (run_pm) {
>>>> Same as above.
>>>>
>>>>
>>>>> +        count = atomic_dec_return(&kfd_locked);
>>>>> +        WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>>>> +        if (count == 0)
>>>>> +            ret = kfd_resume_all_processes();
>>>>> +    }
>>>>>          return ret;
>>>>>    }
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
>>>>> index 8d871514671e..6301d77ed3d6 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
>>>>> @@ -25,6 +25,7 @@
>>>>>    #include <linux/slab.h>
>>>>>    #include <linux/pci.h>
>>>>>    #include <linux/amd-iommu.h>
>>>>> +#include <linux/pm_runtime.h>
>>>>>    #include "kfd_priv.h"
>>>>>    #include "kfd_dbgmgr.h"
>>>>>    #include "kfd_topology.h"
>>>>> @@ -134,8 +135,10 @@ void kfd_iommu_unbind_process(struct
>>>>> kfd_process *p)
>>>>>        struct kfd_process_device *pdd;
>>>>>          list_for_each_entry(pdd, &p->per_device_data, per_device_list)
>>>>> -        if (pdd->bound == PDD_BOUND)
>>>>> +        if (pdd->bound == PDD_BOUND) {
>>>>>                amd_iommu_unbind_pasid(pdd->dev->pdev, p->pasid);
>>>>> + pm_runtime_set_autosuspend_delay(pdd->dev->ddev->dev, 5000);
>>>> I don't think this is the right place. This code should only run on
>>>> APUs with IOMMUv2 enabled. Probably missing a check at the start of
>>>> the function. On kernels with IOMMU disabled, this source file is not
>>>> compiled at all.
>>>>
>>>> I think this code should go into kfd_process_destroy_pdds.
>>>
>>> Ok. will fix in v2. So can we consider this as logical inverse of
>>> bind_process_to_device?
>>>
>>>
>>>>
>>>>> +        }
>>>>>    }
>>>>>      /* Callback for process shutdown invoked by the IOMMU driver */
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> index 25b90f70aecd..d19d5e97405c 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> @@ -31,6 +31,7 @@
>>>>>    #include <linux/compat.h>
>>>>>    #include <linux/mman.h>
>>>>>    #include <linux/file.h>
>>>>> +#include <linux/pm_runtime.h>
>>>>>    #include "amdgpu_amdkfd.h"
>>>>>    #include "amdgpu.h"
>>>>>    @@ -843,15 +844,27 @@ struct kfd_process_device
>>>>> *kfd_bind_process_to_device(struct kfd_dev *dev,
>>>>>            return ERR_PTR(-ENOMEM);
>>>>>        }
>>>>>    +    err = pm_runtime_get_sync(dev->ddev->dev);
>>>>> +    pm_runtime_set_autosuspend_delay(dev->ddev->dev, 60000);
>>>> Why are you using a different hard-coded delay (60s?) here?
>>>
>>> It needs to be fixed. Explained above the reason for such deviation.
>>>
>>>
>>>> Regards,
>>>>    Felix
>>>>
>>> Thanks
>>> Rajneesh
>>>
>>>
>>>>> +    if (err < 0)
>>>>> +        return ERR_PTR(err);
>>>>> +
>>>>>        err = kfd_iommu_bind_process_to_device(pdd);
>>>>>        if (err)
>>>>> -        return ERR_PTR(err);
>>>>> +        goto out;
>>>>>          err = kfd_process_device_init_vm(pdd, NULL);
>>>>>        if (err)
>>>>> -        return ERR_PTR(err);
>>>>> +        goto out;
>>>>>    -    return pdd;
>>>>> +out:
>>>>> +    pm_runtime_mark_last_busy(dev->ddev->dev);
>>>>> +    pm_runtime_put_autosuspend(dev->ddev->dev);
>>>>> +
>>>>> +    if (!err)
>>>>> +        return pdd;
>>>>> +    else
>>>>> +        return ERR_PTR(err);
>>>>>    }
>>>>>      struct kfd_process_device *kfd_get_first_process_device_data(
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cfelix.kuehling%40amd.com%7Ce1794fa7944b47dd055308d7a5d15f33%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637160190978215746&amp;sdata=fqELIO20vQ4%2BmGNvPavaZYos2a0iBznNRnElj8F0dQ8%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco
  2020-01-30 23:24           ` Felix Kuehling
@ 2020-01-31  0:53             ` Zeng, Oak
  0 siblings, 0 replies; 23+ messages in thread
From: Zeng, Oak @ 2020-01-31  0:53 UTC (permalink / raw)
  To: Kuehling, Felix, Alex Deucher
  Cc: Deucher, Alexander, Bhardwaj, Rajneesh, amd-gfx list

[AMD Official Use Only - Internal Distribution Only]

Hi Felix,

See one inline comment

Regards,
Oak

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Felix Kuehling
Sent: Thursday, January 30, 2020 6:24 PM
To: Alex Deucher <alexdeucher@gmail.com>
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Bhardwaj, Rajneesh <Rajneesh.Bhardwaj@amd.com>; amd-gfx list <amd-gfx@lists.freedesktop.org>
Subject: Re: [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco


On 2020-01-30 17:11, Alex Deucher wrote:
> On Thu, Jan 30, 2020 at 4:55 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>> On 2020-01-30 14:01, Bhardwaj, Rajneesh wrote:
>>> Hello Felix,
>>>
>>> Thanks for your time to review and for your feedback.
>>>
>>> On 1/29/2020 5:52 PM, Felix Kuehling wrote:
>>>> HI Rajneesh,
>>>>
>>>> See comments inline ...
>>>>
>>>> And a general question: Why do you need to set the 
>>>> autosuspend_delay in so many places? Amdgpu only has a single call 
>>>> to this function during initialization.
>>>
>>> We don't. I have called this out in cover letter since i too feel 
>>> its not the best way to do what we want to do. I have seen weird 
>>> issues on dual GPUs with KFDTest that sometimes results in system 
>>> hang and gpu hang etc.
>>>
>>> Even if i try with running kfdtest on just one node in a multi gpu 
>>> system, the baco exit seems to wake up all gpus and then one goes to 
>>> auto suspend again while other performs the desired kfdtest operation.
>>> I have never seen any issue with a single gpu card. So in current 
>>> approch, i am making sure that the GPUs are not quickly 
>>> auto-suspended while the kfd operations are ongoing but once the 
>>> kfdtest is finished, the runtime suspend call with ensure to reset 
>>> the timeout back to 5 seconds.
>> So by setting the timeout to 60s, you can fix applications that run 
>> for less than 60s without calling into KFD. What about applications 
>> that run for longer without calling into KFD? This doesn't sound like 
>> a solution, it's just a hack.
>>
>>
>>>
>>> I would like to avoid this and appreciate some pointers where i can 
>>> put get_sync calls while unbinding. I have tried a number of places 
>>> but saw some issues. So any help will be highly appreciated, to 
>>> identify the proper logical inverse of kfd_bind_process_to_device.
>> kfd_bind_process_to_device is called when KFD starts using a device. 
>> It only stops using that device when the process is destroyed. 
>> Therefore I recommended moving the cleanup code into 
>> kfd_process_destroy_pdds, which iterates over all devices (PDDs) during process termination.
>>
>> Note that kfd_bind_process_to_device can be called many times for the 
>> same device. If you need to keep usage counters balanced, you need 
>> extra checks to make sure you only do runtime-PM stuff the first time 
>> it's called for a particular device in a particular process.
>>
>> Also, in kfd_process_destroy_pdds you may need to check whether 
>> kfd_bind_process_to_device has been called for a particular PDD 
>> before releasing runtime PM.
> runtimepm already has reference counting.  You can use something like
> pm_runtime_get_sync() or pm_runtime_get_noresume() depending on 
> whether you want to make sure the GPU is powered up or not.  That will 
> increase the usage count on the device for runtime pm.  Then when you 
> want to decrement and possibly suspend the device, call
> pm_runtime_mark_last_busy() and pm_runtime_put_autosuspend().  That is 
> basically what we do in amdgpu to handle device usage.

My point is, kfd_bind_process_to_device can be called many times. If we call pm_runtime_sync_get, we'll need to release an unpredictable number of usage counts later. Therefore I suggested ensuring we only do this the first time kfd_bind_process_to_device is called for a particular process-device.

Similarly, in kfd_process_destroy_pdds we should only release the usage count if we actually took it in kfd_bind_process_to_device.


>    We call
> pm_runtime_get_noresume() in our fence_emit function which is what
> gets called when we submit work to the GPU.  That increments the usage
> count for runpm.  Then we call pm_runtime_mark_last_busy() and
> pm_runtime_put_autosuspend() in our fence_process function.  That gets
> called as each job on the GPU is retired.  For kfd, I think you'll
> want to call pm_runtime_get_sync() in at the start of your ioctl to
> wake the device, then pm_runtime_mark_last_busy() and
> pm_runtime_put_autosuspend() at the end of your ioctl to mark to match
> the start.  Then in your actual ioctl work, you'll want to call
> pm_runtime_get_noresume() everytime a queue gets created and
> pm_runtime_mark_last_busy() and pm_runtime_put_autosuspend() each time
> a queue is destroyed.  That should keep the device awake as long as
> there are any queues around.  Due to stuff like xgmi, it's probably
> best to just do this for all GPUs on the system just in case you have
> any p2p stuff going on.

OK, that's a completely different approach from what I had in mind. 
Basically tying device usage to both ioctls and the existence of user 
mode queues. P2P adds extra complication. Not just XGMI but also PCIe 
P2P where GPU A wants to access GPU B's memory but GPU B doesn't have 
any activity to keep it awake. To simplify this, we could also associate 
usage counters with allocating memory. That way any GPU that has memory 
that may be remotely accessed would be kept awake.

Always keeping all GPUs awake seems wasteful. I think being able to 
power down unused GPUs would be a very useful feature.

I still believe my idea of taking a usage counter in 
kfd_bind_process_to_device and dropping it in kfd_process_destroy_pdds 
is simpler and just as effective.

[Oak]: Kfd_bind_process_to_device is called from many ioctl (q creation, memory map/alloc etc). But destroy_pdd is called per process-device pair. So the pm usage counter is not balanced. To balance it, you can make the get/put usage counter either per process or per-pdd. For example, if you want to do it per pdd, you have 2 options:
1. to make sure only get_sync is called only at the first *actual* bind - I think in kfd_process_device_init_vm, before "return 0", is a place logically work.
2. call get sync at the pdd creation: kfd_create_prcess_device_data
In either option, call put at *each* destroy of pdd.

Regards,
   Felix

>
> Alex
>
>> Regards,
>>     Felix
>>
>>
>>>
>>>> On 2020-01-27 20:29, Rajneesh Bhardwaj wrote:
>>>>> So far the kfd driver implemented same routines for runtime and system
>>>>> wide suspend and resume (s2idle or mem). During system wide suspend the
>>>>> kfd aquires an atomic lock that prevents any more user processes to
>>>>> create queues and interact with kfd driver and amd gpu. This mechanism
>>>>> created problem when amdgpu device is runtime suspended with BACO
>>>>> enabled. Any application that relies on kfd driver fails to load
>>>>> because
>>>>> the driver reports a locked kfd device since gpu is runtime suspended.
>>>>>
>>>>> However, in an ideal case, when gpu is runtime  suspended the kfd
>>>>> driver
>>>>> should be able to:
>>>>>
>>>>>    - auto resume amdgpu driver whenever a client requests compute
>>>>> service
>>>>>    - prevent runtime suspend for amdgpu  while kfd is in use
>>>>>
>>>>> This change refactors the amdgpu and amdkfd drivers to support BACO and
>>>>> runtime runtime power management.
>>>>>
>>>>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 12 ++++-----
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  8 +++---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +--
>>>>>    drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 31
>>>>> +++++++++++++---------
>>>>>    drivers/gpu/drm/amd/amdkfd/kfd_iommu.c     |  5 +++-
>>>>>    drivers/gpu/drm/amd/amdkfd/kfd_process.c   | 19 ++++++++++---
>>>>>    6 files changed, 51 insertions(+), 28 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> index 8609287620ea..314c4a2a0354 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> @@ -178,18 +178,18 @@ void amdgpu_amdkfd_interrupt(struct
>>>>> amdgpu_device *adev,
>>>>>            kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
>>>>>    }
>>>>>    -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev)
>>>>> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
>>>>>    {
>>>>>        if (adev->kfd.dev)
>>>>> -        kgd2kfd_suspend(adev->kfd.dev);
>>>>> +        kgd2kfd_suspend(adev->kfd.dev, run_pm);
>>>>>    }
>>>>>    -int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
>>>>> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
>>>>>    {
>>>>>        int r = 0;
>>>>>          if (adev->kfd.dev)
>>>>> -        r = kgd2kfd_resume(adev->kfd.dev);
>>>>> +        r = kgd2kfd_resume(adev->kfd.dev, run_pm);
>>>>>          return r;
>>>>>    }
>>>>> @@ -713,11 +713,11 @@ void kgd2kfd_exit(void)
>>>>>    {
>>>>>    }
>>>>>    -void kgd2kfd_suspend(struct kfd_dev *kfd)
>>>>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>>>    {
>>>>>    }
>>>>>    -int kgd2kfd_resume(struct kfd_dev *kfd)
>>>>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>>>    {
>>>>>        return 0;
>>>>>    }
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> index 0fa898d30207..3dce4a51e522 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> @@ -123,8 +123,8 @@ struct amdkfd_process_info {
>>>>>    int amdgpu_amdkfd_init(void);
>>>>>    void amdgpu_amdkfd_fini(void);
>>>>>    -void amdgpu_amdkfd_suspend(struct amdgpu_device *adev);
>>>>> -int amdgpu_amdkfd_resume(struct amdgpu_device *adev);
>>>>> +void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>>>>> +int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>>>>>    void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>>>>                const void *ih_ring_entry);
>>>>>    void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>>>>> @@ -250,8 +250,8 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>>>>                 struct drm_device *ddev,
>>>>>                 const struct kgd2kfd_shared_resources *gpu_resources);
>>>>>    void kgd2kfd_device_exit(struct kfd_dev *kfd);
>>>>> -void kgd2kfd_suspend(struct kfd_dev *kfd);
>>>>> -int kgd2kfd_resume(struct kfd_dev *kfd);
>>>>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>>>>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>>>>>    int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>>>>    int kgd2kfd_post_reset(struct kfd_dev *kfd);
>>>>>    void kgd2kfd_interrupt(struct kfd_dev *kfd, const void
>>>>> *ih_ring_entry);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index d9e5eac182d3..787d49e9f4de 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -3314,7 +3314,7 @@ int amdgpu_device_suspend(struct drm_device
>>>>> *dev, bool fbcon)
>>>>>            }
>>>>>        }
>>>>>    -    amdgpu_amdkfd_suspend(adev);
>>>>> +    amdgpu_amdkfd_suspend(adev, fbcon);
>>>> The logic seems inverted here. There are four different pmops
>>>> callbacks that use different values for this parameter:
>>>>
>>>> * amdgpu_pmops_suspend: true
>>>> * amdgpu_pmops_freeze: true
>>>> * amdgpu_pmops_poweroff: true
>>>> * amdgpu_pmops_runtime_suspend: false
>>>>
>>>> It looks like runtime_suspend uses false, so you should pass the
>>>> run_pm parameter as !fbcon.
>>> Ok. will fix in v2.
>>>
>>>
>>>>
>>>>>          amdgpu_ras_suspend(adev);
>>>>>    @@ -3398,7 +3398,7 @@ int amdgpu_device_resume(struct drm_device
>>>>> *dev, bool fbcon)
>>>>>                }
>>>>>            }
>>>>>        }
>>>>> -    r = amdgpu_amdkfd_resume(adev);
>>>>> +    r = amdgpu_amdkfd_resume(adev, fbcon);
>>>>>        if (r)
>>>>>            return r;
>>>>>    diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> index 2a9e40131735..e9f00c3a067a 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> @@ -23,6 +23,7 @@
>>>>>    #include <linux/bsearch.h>
>>>>>    #include <linux/pci.h>
>>>>>    #include <linux/slab.h>
>>>>> +#include <linux/pm_runtime.h>
>>>>>    #include "kfd_priv.h"
>>>>>    #include "kfd_device_queue_manager.h"
>>>>>    #include "kfd_pm4_headers_vi.h"
>>>>> @@ -710,7 +711,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>>>>    void kgd2kfd_device_exit(struct kfd_dev *kfd)
>>>>>    {
>>>>>        if (kfd->init_complete) {
>>>>> -        kgd2kfd_suspend(kfd);
>>>>> +        kgd2kfd_suspend(kfd, true);
>>>>>            device_queue_manager_uninit(kfd->dqm);
>>>>>            kfd_interrupt_exit(kfd);
>>>>>            kfd_topology_remove_device(kfd);
>>>>> @@ -731,7 +732,7 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>>>>          kfd->dqm->ops.pre_reset(kfd->dqm);
>>>>>    -    kgd2kfd_suspend(kfd);
>>>>> +    kgd2kfd_suspend(kfd, true);
>>>> This should use false. This is not runtime PM.
>>>>
>>>>
>>>>>          kfd_signal_reset_event(kfd);
>>>>>        return 0;
>>>>> @@ -765,21 +766,24 @@ bool kfd_is_locked(void)
>>>>>        return  (atomic_read(&kfd_locked) > 0);
>>>>>    }
>>>>>    -void kgd2kfd_suspend(struct kfd_dev *kfd)
>>>>> +void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>>>    {
>>>>>        if (!kfd->init_complete)
>>>>>            return;
>>>>>    -    /* For first KFD device suspend all the KFD processes */
>>>>> -    if (atomic_inc_return(&kfd_locked) == 1)
>>>>> -        kfd_suspend_all_processes();
>>>>> +    /* for runtime suspend, skip locking kfd */
>>>>> +    if (run_pm) {
>>>> This should be done for the non-runtime PM case: if (!run_pm).
>>>>
>>>>
>>>>> +        /* For first KFD device suspend all the KFD processes */
>>>>> +        if (atomic_inc_return(&kfd_locked) == 1)
>>>>> +            kfd_suspend_all_processes();
>>>>> +    }
>>>>>    +    pm_runtime_set_autosuspend_delay(kfd->ddev->dev, 5000);
>>>>>        kfd->dqm->ops.stop(kfd->dqm);
>>>>> -
>>>>>        kfd_iommu_suspend(kfd);
>>>>>    }
>>>>>    -int kgd2kfd_resume(struct kfd_dev *kfd)
>>>>> +int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>>>    {
>>>>>        int ret, count;
>>>>>    @@ -790,10 +794,13 @@ int kgd2kfd_resume(struct kfd_dev *kfd)
>>>>>        if (ret)
>>>>>            return ret;
>>>>>    -    count = atomic_dec_return(&kfd_locked);
>>>>> -    WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>>>> -    if (count == 0)
>>>>> -        ret = kfd_resume_all_processes();
>>>>> +    /* for runtime resume, skip unlocking kfd */
>>>>> +    if (run_pm) {
>>>> Same as above.
>>>>
>>>>
>>>>> +        count = atomic_dec_return(&kfd_locked);
>>>>> +        WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>>>> +        if (count == 0)
>>>>> +            ret = kfd_resume_all_processes();
>>>>> +    }
>>>>>          return ret;
>>>>>    }
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
>>>>> index 8d871514671e..6301d77ed3d6 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
>>>>> @@ -25,6 +25,7 @@
>>>>>    #include <linux/slab.h>
>>>>>    #include <linux/pci.h>
>>>>>    #include <linux/amd-iommu.h>
>>>>> +#include <linux/pm_runtime.h>
>>>>>    #include "kfd_priv.h"
>>>>>    #include "kfd_dbgmgr.h"
>>>>>    #include "kfd_topology.h"
>>>>> @@ -134,8 +135,10 @@ void kfd_iommu_unbind_process(struct
>>>>> kfd_process *p)
>>>>>        struct kfd_process_device *pdd;
>>>>>          list_for_each_entry(pdd, &p->per_device_data, per_device_list)
>>>>> -        if (pdd->bound == PDD_BOUND)
>>>>> +        if (pdd->bound == PDD_BOUND) {
>>>>>                amd_iommu_unbind_pasid(pdd->dev->pdev, p->pasid);
>>>>> + pm_runtime_set_autosuspend_delay(pdd->dev->ddev->dev, 5000);
>>>> I don't think this is the right place. This code should only run on
>>>> APUs with IOMMUv2 enabled. Probably missing a check at the start of
>>>> the function. On kernels with IOMMU disabled, this source file is not
>>>> compiled at all.
>>>>
>>>> I think this code should go into kfd_process_destroy_pdds.
>>>
>>> Ok. will fix in v2. So can we consider this as logical inverse of
>>> bind_process_to_device?
>>>
>>>
>>>>
>>>>> +        }
>>>>>    }
>>>>>      /* Callback for process shutdown invoked by the IOMMU driver */
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> index 25b90f70aecd..d19d5e97405c 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>>> @@ -31,6 +31,7 @@
>>>>>    #include <linux/compat.h>
>>>>>    #include <linux/mman.h>
>>>>>    #include <linux/file.h>
>>>>> +#include <linux/pm_runtime.h>
>>>>>    #include "amdgpu_amdkfd.h"
>>>>>    #include "amdgpu.h"
>>>>>    @@ -843,15 +844,27 @@ struct kfd_process_device
>>>>> *kfd_bind_process_to_device(struct kfd_dev *dev,
>>>>>            return ERR_PTR(-ENOMEM);
>>>>>        }
>>>>>    +    err = pm_runtime_get_sync(dev->ddev->dev);
>>>>> +    pm_runtime_set_autosuspend_delay(dev->ddev->dev, 60000);
>>>> Why are you using a different hard-coded delay (60s?) here?
>>>
>>> It needs to be fixed. Explained above the reason for such deviation.
>>>
>>>
>>>> Regards,
>>>>    Felix
>>>>
>>> Thanks
>>> Rajneesh
>>>
>>>
>>>>> +    if (err < 0)
>>>>> +        return ERR_PTR(err);
>>>>> +
>>>>>        err = kfd_iommu_bind_process_to_device(pdd);
>>>>>        if (err)
>>>>> -        return ERR_PTR(err);
>>>>> +        goto out;
>>>>>          err = kfd_process_device_init_vm(pdd, NULL);
>>>>>        if (err)
>>>>> -        return ERR_PTR(err);
>>>>> +        goto out;
>>>>>    -    return pdd;
>>>>> +out:
>>>>> +    pm_runtime_mark_last_busy(dev->ddev->dev);
>>>>> +    pm_runtime_put_autosuspend(dev->ddev->dev);
>>>>> +
>>>>> +    if (!err)
>>>>> +        return pdd;
>>>>> +    else
>>>>> +        return ERR_PTR(err);
>>>>>    }
>>>>>      struct kfd_process_device *kfd_get_first_process_device_data(
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Coak.zeng%40amd.com%7C64abe683b4ad4902cf1208d7a5db8dba%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637160234717876492&amp;sdata=0TO3FbOJc4xJwSSE9%2B1Wk6VJxg3GsB4gfYh0tTq0n%2FY%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Coak.zeng%40amd.com%7C64abe683b4ad4902cf1208d7a5db8dba%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637160234717876492&amp;sdata=0TO3FbOJc4xJwSSE9%2B1Wk6VJxg3GsB4gfYh0tTq0n%2FY%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2020-01-31  0:53 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-28  1:29 [Patch v1 0/5] Enable BACO with KFD Rajneesh Bhardwaj
2020-01-28  1:29 ` [Patch v1 1/5] drm/amdgpu: always enable runtime power management Rajneesh Bhardwaj
2020-01-28 20:14   ` Alex Deucher
2020-01-30 19:06     ` Bhardwaj, Rajneesh
2020-01-28  1:29 ` [Patch v1 2/5] drm/amdgpu: Fix missing error check in suspend Rajneesh Bhardwaj
2020-01-28 20:15   ` Alex Deucher
2020-01-28  1:29 ` [Patch v1 3/5] drm/amdkfd: Introduce debugfs option to disable baco Rajneesh Bhardwaj
2020-01-28 20:22   ` Alex Deucher
2020-01-30 19:05     ` Bhardwaj, Rajneesh
2020-01-28  1:29 ` [Patch v1 4/5] drm/amdkfd: show warning when kfd is locked Rajneesh Bhardwaj
2020-01-28 22:42   ` Felix Kuehling
2020-01-30 19:02     ` Bhardwaj, Rajneesh
2020-01-28  1:29 ` [Patch v1 5/5] drm/amdkfd: refactor runtime pm for baco Rajneesh Bhardwaj
2020-01-28 20:09   ` Zeng, Oak
2020-01-30 19:08     ` Bhardwaj, Rajneesh
2020-01-29 22:52   ` Felix Kuehling
2020-01-30 19:01     ` Bhardwaj, Rajneesh
2020-01-30 21:55       ` Felix Kuehling
2020-01-30 22:11         ` Alex Deucher
2020-01-30 23:24           ` Felix Kuehling
2020-01-31  0:53             ` Zeng, Oak
2020-01-28 17:39 ` [Patch v1 0/5] Enable BACO with KFD Mike Lothian
2020-01-28 20:11   ` Alex Deucher

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.