All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] drm/msm/gpu: More system suspend fixes
@ 2022-03-10 23:46 ` Rob Clark
  0 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-10 23:46 UTC (permalink / raw)
  To: dri-devel
  Cc: freedreno, linux-arm-msm, Rob Clark, Abhinav Kumar,
	Akhil P Oommen, AngeloGioacchino Del Regno, Bjorn Andersson,
	Jonathan Marek, open list, Vladimir Lypak

From: Rob Clark <robdclark@chromium.org>

In particular, we want to park the scheduler threads so that suspend
is not racing with the kthread pushing more jobs to the driver.

Rob Clark (3):
  drm/msm/gpu: Rename runtime suspend/resume functions
  drm/msm/gpu: Park scheduler threads for system suspend
  drm/msm/gpu: Remove mutex from wait_event condition

 drivers/gpu/drm/msm/adreno/adreno_device.c | 79 ++++++++++++++++++----
 1 file changed, 65 insertions(+), 14 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 0/3] drm/msm/gpu: More system suspend fixes
@ 2022-03-10 23:46 ` Rob Clark
  0 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-10 23:46 UTC (permalink / raw)
  To: dri-devel
  Cc: Rob Clark, Jonathan Marek, Akhil P Oommen, linux-arm-msm,
	Vladimir Lypak, Abhinav Kumar, Bjorn Andersson, freedreno,
	open list, AngeloGioacchino Del Regno

From: Rob Clark <robdclark@chromium.org>

In particular, we want to park the scheduler threads so that suspend
is not racing with the kthread pushing more jobs to the driver.

Rob Clark (3):
  drm/msm/gpu: Rename runtime suspend/resume functions
  drm/msm/gpu: Park scheduler threads for system suspend
  drm/msm/gpu: Remove mutex from wait_event condition

 drivers/gpu/drm/msm/adreno/adreno_device.c | 79 ++++++++++++++++++----
 1 file changed, 65 insertions(+), 14 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 1/3] drm/msm/gpu: Rename runtime suspend/resume functions
  2022-03-10 23:46 ` Rob Clark
@ 2022-03-10 23:46   ` Rob Clark
  -1 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-10 23:46 UTC (permalink / raw)
  To: dri-devel
  Cc: freedreno, linux-arm-msm, Rob Clark, Rob Clark, Sean Paul,
	Abhinav Kumar, David Airlie, Daniel Vetter, Akhil P Oommen,
	AngeloGioacchino Del Regno, Bjorn Andersson, Jonathan Marek,
	Vladimir Lypak, open list

From: Rob Clark <robdclark@chromium.org>

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
index 89cfd84760d7..8859834b51b8 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -600,7 +600,7 @@ static const struct of_device_id dt_match[] = {
 };
 
 #ifdef CONFIG_PM
-static int adreno_resume(struct device *dev)
+static int adreno_runtime_resume(struct device *dev)
 {
 	struct msm_gpu *gpu = dev_to_gpu(dev);
 
@@ -616,7 +616,7 @@ static int active_submits(struct msm_gpu *gpu)
 	return active_submits;
 }
 
-static int adreno_suspend(struct device *dev)
+static int adreno_runtime_suspend(struct device *dev)
 {
 	struct msm_gpu *gpu = dev_to_gpu(dev);
 	int remaining;
@@ -635,7 +635,7 @@ static int adreno_suspend(struct device *dev)
 
 static const struct dev_pm_ops adreno_pm_ops = {
 	SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend, pm_runtime_force_resume)
-	SET_RUNTIME_PM_OPS(adreno_suspend, adreno_resume, NULL)
+	SET_RUNTIME_PM_OPS(adreno_runtime_suspend, adreno_runtime_resume, NULL)
 };
 
 static struct platform_driver adreno_driver = {
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 1/3] drm/msm/gpu: Rename runtime suspend/resume functions
@ 2022-03-10 23:46   ` Rob Clark
  0 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-10 23:46 UTC (permalink / raw)
  To: dri-devel
  Cc: Rob Clark, Jonathan Marek, Akhil P Oommen, David Airlie,
	linux-arm-msm, Vladimir Lypak, Abhinav Kumar, Bjorn Andersson,
	Sean Paul, freedreno, open list, AngeloGioacchino Del Regno

From: Rob Clark <robdclark@chromium.org>

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
index 89cfd84760d7..8859834b51b8 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -600,7 +600,7 @@ static const struct of_device_id dt_match[] = {
 };
 
 #ifdef CONFIG_PM
-static int adreno_resume(struct device *dev)
+static int adreno_runtime_resume(struct device *dev)
 {
 	struct msm_gpu *gpu = dev_to_gpu(dev);
 
@@ -616,7 +616,7 @@ static int active_submits(struct msm_gpu *gpu)
 	return active_submits;
 }
 
-static int adreno_suspend(struct device *dev)
+static int adreno_runtime_suspend(struct device *dev)
 {
 	struct msm_gpu *gpu = dev_to_gpu(dev);
 	int remaining;
@@ -635,7 +635,7 @@ static int adreno_suspend(struct device *dev)
 
 static const struct dev_pm_ops adreno_pm_ops = {
 	SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend, pm_runtime_force_resume)
-	SET_RUNTIME_PM_OPS(adreno_suspend, adreno_resume, NULL)
+	SET_RUNTIME_PM_OPS(adreno_runtime_suspend, adreno_runtime_resume, NULL)
 };
 
 static struct platform_driver adreno_driver = {
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-10 23:46 ` Rob Clark
@ 2022-03-10 23:46   ` Rob Clark
  -1 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-10 23:46 UTC (permalink / raw)
  To: dri-devel
  Cc: freedreno, linux-arm-msm, Rob Clark, Rob Clark, Sean Paul,
	Abhinav Kumar, David Airlie, Daniel Vetter, Akhil P Oommen,
	Jonathan Marek, AngeloGioacchino Del Regno, Bjorn Andersson,
	Vladimir Lypak, open list

From: Rob Clark <robdclark@chromium.org>

In the system suspend path, we don't want to be racing with the
scheduler kthreads pushing additional queued up jobs to the hw
queue (ringbuffer).  So park them first.  While we are at it,
move the wait for active jobs to complete into the new system-
suspend path.

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/adreno/adreno_device.c | 68 ++++++++++++++++++++--
 1 file changed, 64 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
index 8859834b51b8..0440a98988fc 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -619,22 +619,82 @@ static int active_submits(struct msm_gpu *gpu)
 static int adreno_runtime_suspend(struct device *dev)
 {
 	struct msm_gpu *gpu = dev_to_gpu(dev);
-	int remaining;
+
+	/*
+	 * We should be holding a runpm ref, which will prevent
+	 * runtime suspend.  In the system suspend path, we've
+	 * already waited for active jobs to complete.
+	 */
+	WARN_ON_ONCE(gpu->active_submits);
+
+	return gpu->funcs->pm_suspend(gpu);
+}
+
+static void suspend_scheduler(struct msm_gpu *gpu)
+{
+	int i;
+
+	/*
+	 * Shut down the scheduler before we force suspend, so that
+	 * suspend isn't racing with scheduler kthread feeding us
+	 * more work.
+	 *
+	 * Note, we just want to park the thread, and let any jobs
+	 * that are already on the hw queue complete normally, as
+	 * opposed to the drm_sched_stop() path used for handling
+	 * faulting/timed-out jobs.  We can't really cancel any jobs
+	 * already on the hw queue without racing with the GPU.
+	 */
+	for (i = 0; i < gpu->nr_rings; i++) {
+		struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
+		kthread_park(sched->thread);
+	}
+}
+
+static void resume_scheduler(struct msm_gpu *gpu)
+{
+	int i;
+
+	for (i = 0; i < gpu->nr_rings; i++) {
+		struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
+		kthread_unpark(sched->thread);
+	}
+}
+
+static int adreno_system_suspend(struct device *dev)
+{
+	struct msm_gpu *gpu = dev_to_gpu(dev);
+	int remaining, ret;
+
+	suspend_scheduler(gpu);
 
 	remaining = wait_event_timeout(gpu->retire_event,
 				       active_submits(gpu) == 0,
 				       msecs_to_jiffies(1000));
 	if (remaining == 0) {
 		dev_err(dev, "Timeout waiting for GPU to suspend\n");
-		return -EBUSY;
+		ret = -EBUSY;
+		goto out;
 	}
 
-	return gpu->funcs->pm_suspend(gpu);
+	ret = pm_runtime_force_suspend(dev);
+out:
+	if (ret)
+		resume_scheduler(gpu);
+
+	return ret;
 }
+
+static int adreno_system_resume(struct device *dev)
+{
+	resume_scheduler(dev_to_gpu(dev));
+	return pm_runtime_force_resume(dev);
+}
+
 #endif
 
 static const struct dev_pm_ops adreno_pm_ops = {
-	SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend, pm_runtime_force_resume)
+	SET_SYSTEM_SLEEP_PM_OPS(adreno_system_suspend, adreno_system_resume)
 	SET_RUNTIME_PM_OPS(adreno_runtime_suspend, adreno_runtime_resume, NULL)
 };
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-10 23:46   ` Rob Clark
  0 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-10 23:46 UTC (permalink / raw)
  To: dri-devel
  Cc: Rob Clark, Jonathan Marek, Akhil P Oommen, David Airlie,
	linux-arm-msm, Vladimir Lypak, Abhinav Kumar, Bjorn Andersson,
	Sean Paul, freedreno, open list, AngeloGioacchino Del Regno

From: Rob Clark <robdclark@chromium.org>

In the system suspend path, we don't want to be racing with the
scheduler kthreads pushing additional queued up jobs to the hw
queue (ringbuffer).  So park them first.  While we are at it,
move the wait for active jobs to complete into the new system-
suspend path.

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/adreno/adreno_device.c | 68 ++++++++++++++++++++--
 1 file changed, 64 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
index 8859834b51b8..0440a98988fc 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -619,22 +619,82 @@ static int active_submits(struct msm_gpu *gpu)
 static int adreno_runtime_suspend(struct device *dev)
 {
 	struct msm_gpu *gpu = dev_to_gpu(dev);
-	int remaining;
+
+	/*
+	 * We should be holding a runpm ref, which will prevent
+	 * runtime suspend.  In the system suspend path, we've
+	 * already waited for active jobs to complete.
+	 */
+	WARN_ON_ONCE(gpu->active_submits);
+
+	return gpu->funcs->pm_suspend(gpu);
+}
+
+static void suspend_scheduler(struct msm_gpu *gpu)
+{
+	int i;
+
+	/*
+	 * Shut down the scheduler before we force suspend, so that
+	 * suspend isn't racing with scheduler kthread feeding us
+	 * more work.
+	 *
+	 * Note, we just want to park the thread, and let any jobs
+	 * that are already on the hw queue complete normally, as
+	 * opposed to the drm_sched_stop() path used for handling
+	 * faulting/timed-out jobs.  We can't really cancel any jobs
+	 * already on the hw queue without racing with the GPU.
+	 */
+	for (i = 0; i < gpu->nr_rings; i++) {
+		struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
+		kthread_park(sched->thread);
+	}
+}
+
+static void resume_scheduler(struct msm_gpu *gpu)
+{
+	int i;
+
+	for (i = 0; i < gpu->nr_rings; i++) {
+		struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
+		kthread_unpark(sched->thread);
+	}
+}
+
+static int adreno_system_suspend(struct device *dev)
+{
+	struct msm_gpu *gpu = dev_to_gpu(dev);
+	int remaining, ret;
+
+	suspend_scheduler(gpu);
 
 	remaining = wait_event_timeout(gpu->retire_event,
 				       active_submits(gpu) == 0,
 				       msecs_to_jiffies(1000));
 	if (remaining == 0) {
 		dev_err(dev, "Timeout waiting for GPU to suspend\n");
-		return -EBUSY;
+		ret = -EBUSY;
+		goto out;
 	}
 
-	return gpu->funcs->pm_suspend(gpu);
+	ret = pm_runtime_force_suspend(dev);
+out:
+	if (ret)
+		resume_scheduler(gpu);
+
+	return ret;
 }
+
+static int adreno_system_resume(struct device *dev)
+{
+	resume_scheduler(dev_to_gpu(dev));
+	return pm_runtime_force_resume(dev);
+}
+
 #endif
 
 static const struct dev_pm_ops adreno_pm_ops = {
-	SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend, pm_runtime_force_resume)
+	SET_SYSTEM_SLEEP_PM_OPS(adreno_system_suspend, adreno_system_resume)
 	SET_RUNTIME_PM_OPS(adreno_runtime_suspend, adreno_runtime_resume, NULL)
 };
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 3/3] drm/msm/gpu: Remove mutex from wait_event condition
  2022-03-10 23:46 ` Rob Clark
@ 2022-03-10 23:46   ` Rob Clark
  -1 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-10 23:46 UTC (permalink / raw)
  To: dri-devel
  Cc: freedreno, linux-arm-msm, Rob Clark, Rob Clark, Sean Paul,
	Abhinav Kumar, David Airlie, Daniel Vetter, Akhil P Oommen,
	Jonathan Marek, AngeloGioacchino Del Regno, Bjorn Andersson,
	Vladimir Lypak, open list

From: Rob Clark <robdclark@chromium.org>

The mutex wasn't really protecting anything before.  Before the previous
patch we could still be racing with the scheduler's kthread, as that is
not necessarily frozen yet.  Now that we've parked the sched threads,
the only race is with jobs retiring, and that is harmless, ie.

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/adreno/adreno_device.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
index 0440a98988fc..661dfa7681fb 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -607,15 +607,6 @@ static int adreno_runtime_resume(struct device *dev)
 	return gpu->funcs->pm_resume(gpu);
 }
 
-static int active_submits(struct msm_gpu *gpu)
-{
-	int active_submits;
-	mutex_lock(&gpu->active_lock);
-	active_submits = gpu->active_submits;
-	mutex_unlock(&gpu->active_lock);
-	return active_submits;
-}
-
 static int adreno_runtime_suspend(struct device *dev)
 {
 	struct msm_gpu *gpu = dev_to_gpu(dev);
@@ -669,7 +660,7 @@ static int adreno_system_suspend(struct device *dev)
 	suspend_scheduler(gpu);
 
 	remaining = wait_event_timeout(gpu->retire_event,
-				       active_submits(gpu) == 0,
+				       gpu->active_submits == 0,
 				       msecs_to_jiffies(1000));
 	if (remaining == 0) {
 		dev_err(dev, "Timeout waiting for GPU to suspend\n");
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 3/3] drm/msm/gpu: Remove mutex from wait_event condition
@ 2022-03-10 23:46   ` Rob Clark
  0 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-10 23:46 UTC (permalink / raw)
  To: dri-devel
  Cc: Rob Clark, Jonathan Marek, Akhil P Oommen, David Airlie,
	linux-arm-msm, Vladimir Lypak, Abhinav Kumar, Bjorn Andersson,
	Sean Paul, freedreno, open list, AngeloGioacchino Del Regno

From: Rob Clark <robdclark@chromium.org>

The mutex wasn't really protecting anything before.  Before the previous
patch we could still be racing with the scheduler's kthread, as that is
not necessarily frozen yet.  Now that we've parked the sched threads,
the only race is with jobs retiring, and that is harmless, ie.

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/adreno/adreno_device.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
index 0440a98988fc..661dfa7681fb 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -607,15 +607,6 @@ static int adreno_runtime_resume(struct device *dev)
 	return gpu->funcs->pm_resume(gpu);
 }
 
-static int active_submits(struct msm_gpu *gpu)
-{
-	int active_submits;
-	mutex_lock(&gpu->active_lock);
-	active_submits = gpu->active_submits;
-	mutex_unlock(&gpu->active_lock);
-	return active_submits;
-}
-
 static int adreno_runtime_suspend(struct device *dev)
 {
 	struct msm_gpu *gpu = dev_to_gpu(dev);
@@ -669,7 +660,7 @@ static int adreno_system_suspend(struct device *dev)
 	suspend_scheduler(gpu);
 
 	remaining = wait_event_timeout(gpu->retire_event,
-				       active_submits(gpu) == 0,
+				       gpu->active_submits == 0,
 				       msecs_to_jiffies(1000));
 	if (remaining == 0) {
 		dev_err(dev, "Timeout waiting for GPU to suspend\n");
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/3] drm/msm/gpu: Rename runtime suspend/resume functions
  2022-03-10 23:46   ` Rob Clark
@ 2022-03-11  9:26     ` AngeloGioacchino Del Regno
  -1 siblings, 0 replies; 52+ messages in thread
From: AngeloGioacchino Del Regno @ 2022-03-11  9:26 UTC (permalink / raw)
  To: Rob Clark, dri-devel
  Cc: Rob Clark, freedreno, Jonathan Marek, David Airlie,
	linux-arm-msm, Vladimir Lypak, Abhinav Kumar, Bjorn Andersson,
	Akhil P Oommen, Sean Paul, open list

Il 11/03/22 00:46, Rob Clark ha scritto:
> From: Rob Clark <robdclark@chromium.org>
> 

Hey Rob,
looks like you've somehow lost the commit description on this one!

Cheers,
Angelo

> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>   drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/3] drm/msm/gpu: Rename runtime suspend/resume functions
@ 2022-03-11  9:26     ` AngeloGioacchino Del Regno
  0 siblings, 0 replies; 52+ messages in thread
From: AngeloGioacchino Del Regno @ 2022-03-11  9:26 UTC (permalink / raw)
  To: Rob Clark, dri-devel
  Cc: freedreno, linux-arm-msm, Rob Clark, Sean Paul, Abhinav Kumar,
	David Airlie, Daniel Vetter, Akhil P Oommen, Bjorn Andersson,
	Jonathan Marek, Vladimir Lypak, open list

Il 11/03/22 00:46, Rob Clark ha scritto:
> From: Rob Clark <robdclark@chromium.org>
> 

Hey Rob,
looks like you've somehow lost the commit description on this one!

Cheers,
Angelo

> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>   drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-10 23:46   ` Rob Clark
@ 2022-03-17  9:59     ` Daniel Vetter
  -1 siblings, 0 replies; 52+ messages in thread
From: Daniel Vetter @ 2022-03-17  9:59 UTC (permalink / raw)
  To: Rob Clark, Andrey Grodzovsky, Christian Koenig
  Cc: dri-devel, freedreno, linux-arm-msm, Rob Clark, Sean Paul,
	Abhinav Kumar, David Airlie, Daniel Vetter, Akhil P Oommen,
	Jonathan Marek, AngeloGioacchino Del Regno, Bjorn Andersson,
	Vladimir Lypak, open list

On Thu, Mar 10, 2022 at 03:46:05PM -0800, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> In the system suspend path, we don't want to be racing with the
> scheduler kthreads pushing additional queued up jobs to the hw
> queue (ringbuffer).  So park them first.  While we are at it,
> move the wait for active jobs to complete into the new system-
> suspend path.
> 
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/msm/adreno/adreno_device.c | 68 ++++++++++++++++++++--
>  1 file changed, 64 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> index 8859834b51b8..0440a98988fc 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> @@ -619,22 +619,82 @@ static int active_submits(struct msm_gpu *gpu)
>  static int adreno_runtime_suspend(struct device *dev)
>  {
>  	struct msm_gpu *gpu = dev_to_gpu(dev);
> -	int remaining;
> +
> +	/*
> +	 * We should be holding a runpm ref, which will prevent
> +	 * runtime suspend.  In the system suspend path, we've
> +	 * already waited for active jobs to complete.
> +	 */
> +	WARN_ON_ONCE(gpu->active_submits);
> +
> +	return gpu->funcs->pm_suspend(gpu);
> +}
> +
> +static void suspend_scheduler(struct msm_gpu *gpu)
> +{
> +	int i;
> +
> +	/*
> +	 * Shut down the scheduler before we force suspend, so that
> +	 * suspend isn't racing with scheduler kthread feeding us
> +	 * more work.
> +	 *
> +	 * Note, we just want to park the thread, and let any jobs
> +	 * that are already on the hw queue complete normally, as
> +	 * opposed to the drm_sched_stop() path used for handling
> +	 * faulting/timed-out jobs.  We can't really cancel any jobs
> +	 * already on the hw queue without racing with the GPU.
> +	 */
> +	for (i = 0; i < gpu->nr_rings; i++) {
> +		struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
> +		kthread_park(sched->thread);

Shouldn't we have some proper interfaces for this? Also I'm kinda
wondering how other drivers do this, feels like we should have a standard
way.

Finally not flushing out all in-flight requests sounds a bit like a bad
idea for system suspend/resume since that's also the hibernation path, and
that would mean your shrinker/page reclaim stops working. At least in full
generality. Which ain't good for hibernation.

Adding Christian and Andrey.
-Daniel

> +	}
> +}
> +
> +static void resume_scheduler(struct msm_gpu *gpu)
> +{
> +	int i;
> +
> +	for (i = 0; i < gpu->nr_rings; i++) {
> +		struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
> +		kthread_unpark(sched->thread);
> +	}
> +}
> +
> +static int adreno_system_suspend(struct device *dev)
> +{
> +	struct msm_gpu *gpu = dev_to_gpu(dev);
> +	int remaining, ret;
> +
> +	suspend_scheduler(gpu);
>  
>  	remaining = wait_event_timeout(gpu->retire_event,
>  				       active_submits(gpu) == 0,
>  				       msecs_to_jiffies(1000));
>  	if (remaining == 0) {
>  		dev_err(dev, "Timeout waiting for GPU to suspend\n");
> -		return -EBUSY;
> +		ret = -EBUSY;
> +		goto out;
>  	}
>  
> -	return gpu->funcs->pm_suspend(gpu);
> +	ret = pm_runtime_force_suspend(dev);
> +out:
> +	if (ret)
> +		resume_scheduler(gpu);
> +
> +	return ret;
>  }
> +
> +static int adreno_system_resume(struct device *dev)
> +{
> +	resume_scheduler(dev_to_gpu(dev));
> +	return pm_runtime_force_resume(dev);
> +}
> +
>  #endif
>  
>  static const struct dev_pm_ops adreno_pm_ops = {
> -	SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend, pm_runtime_force_resume)
> +	SET_SYSTEM_SLEEP_PM_OPS(adreno_system_suspend, adreno_system_resume)
>  	SET_RUNTIME_PM_OPS(adreno_runtime_suspend, adreno_runtime_resume, NULL)
>  };
>  
> -- 
> 2.35.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-17  9:59     ` Daniel Vetter
  0 siblings, 0 replies; 52+ messages in thread
From: Daniel Vetter @ 2022-03-17  9:59 UTC (permalink / raw)
  To: Rob Clark, Andrey Grodzovsky, Christian Koenig
  Cc: Rob Clark, freedreno, Jonathan Marek, David Airlie,
	linux-arm-msm, Vladimir Lypak, Abhinav Kumar, dri-devel,
	Bjorn Andersson, Akhil P Oommen, Sean Paul, open list,
	AngeloGioacchino Del Regno

On Thu, Mar 10, 2022 at 03:46:05PM -0800, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> In the system suspend path, we don't want to be racing with the
> scheduler kthreads pushing additional queued up jobs to the hw
> queue (ringbuffer).  So park them first.  While we are at it,
> move the wait for active jobs to complete into the new system-
> suspend path.
> 
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/msm/adreno/adreno_device.c | 68 ++++++++++++++++++++--
>  1 file changed, 64 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> index 8859834b51b8..0440a98988fc 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> @@ -619,22 +619,82 @@ static int active_submits(struct msm_gpu *gpu)
>  static int adreno_runtime_suspend(struct device *dev)
>  {
>  	struct msm_gpu *gpu = dev_to_gpu(dev);
> -	int remaining;
> +
> +	/*
> +	 * We should be holding a runpm ref, which will prevent
> +	 * runtime suspend.  In the system suspend path, we've
> +	 * already waited for active jobs to complete.
> +	 */
> +	WARN_ON_ONCE(gpu->active_submits);
> +
> +	return gpu->funcs->pm_suspend(gpu);
> +}
> +
> +static void suspend_scheduler(struct msm_gpu *gpu)
> +{
> +	int i;
> +
> +	/*
> +	 * Shut down the scheduler before we force suspend, so that
> +	 * suspend isn't racing with scheduler kthread feeding us
> +	 * more work.
> +	 *
> +	 * Note, we just want to park the thread, and let any jobs
> +	 * that are already on the hw queue complete normally, as
> +	 * opposed to the drm_sched_stop() path used for handling
> +	 * faulting/timed-out jobs.  We can't really cancel any jobs
> +	 * already on the hw queue without racing with the GPU.
> +	 */
> +	for (i = 0; i < gpu->nr_rings; i++) {
> +		struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
> +		kthread_park(sched->thread);

Shouldn't we have some proper interfaces for this? Also I'm kinda
wondering how other drivers do this, feels like we should have a standard
way.

Finally not flushing out all in-flight requests sounds a bit like a bad
idea for system suspend/resume since that's also the hibernation path, and
that would mean your shrinker/page reclaim stops working. At least in full
generality. Which ain't good for hibernation.

Adding Christian and Andrey.
-Daniel

> +	}
> +}
> +
> +static void resume_scheduler(struct msm_gpu *gpu)
> +{
> +	int i;
> +
> +	for (i = 0; i < gpu->nr_rings; i++) {
> +		struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
> +		kthread_unpark(sched->thread);
> +	}
> +}
> +
> +static int adreno_system_suspend(struct device *dev)
> +{
> +	struct msm_gpu *gpu = dev_to_gpu(dev);
> +	int remaining, ret;
> +
> +	suspend_scheduler(gpu);
>  
>  	remaining = wait_event_timeout(gpu->retire_event,
>  				       active_submits(gpu) == 0,
>  				       msecs_to_jiffies(1000));
>  	if (remaining == 0) {
>  		dev_err(dev, "Timeout waiting for GPU to suspend\n");
> -		return -EBUSY;
> +		ret = -EBUSY;
> +		goto out;
>  	}
>  
> -	return gpu->funcs->pm_suspend(gpu);
> +	ret = pm_runtime_force_suspend(dev);
> +out:
> +	if (ret)
> +		resume_scheduler(gpu);
> +
> +	return ret;
>  }
> +
> +static int adreno_system_resume(struct device *dev)
> +{
> +	resume_scheduler(dev_to_gpu(dev));
> +	return pm_runtime_force_resume(dev);
> +}
> +
>  #endif
>  
>  static const struct dev_pm_ops adreno_pm_ops = {
> -	SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend, pm_runtime_force_resume)
> +	SET_SYSTEM_SLEEP_PM_OPS(adreno_system_suspend, adreno_system_resume)
>  	SET_RUNTIME_PM_OPS(adreno_runtime_suspend, adreno_runtime_resume, NULL)
>  };
>  
> -- 
> 2.35.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-17  9:59     ` Daniel Vetter
  (?)
@ 2022-03-17 10:06     ` Christian König
  2022-03-17 14:58         ` Matthew Brost
  2022-03-17 15:10         ` Rob Clark
  -1 siblings, 2 replies; 52+ messages in thread
From: Christian König @ 2022-03-17 10:06 UTC (permalink / raw)
  To: Rob Clark, Andrey Grodzovsky, dri-devel, freedreno,
	linux-arm-msm, Rob Clark, Sean Paul, Abhinav Kumar, David Airlie,
	Akhil P Oommen, Jonathan Marek, AngeloGioacchino Del Regno,
	Bjorn Andersson, Vladimir Lypak, open list

Am 17.03.22 um 10:59 schrieb Daniel Vetter:
> On Thu, Mar 10, 2022 at 03:46:05PM -0800, Rob Clark wrote:
>> From: Rob Clark <robdclark@chromium.org>
>>
>> In the system suspend path, we don't want to be racing with the
>> scheduler kthreads pushing additional queued up jobs to the hw
>> queue (ringbuffer).  So park them first.  While we are at it,
>> move the wait for active jobs to complete into the new system-
>> suspend path.
>>
>> Signed-off-by: Rob Clark <robdclark@chromium.org>
>> ---
>>   drivers/gpu/drm/msm/adreno/adreno_device.c | 68 ++++++++++++++++++++--
>>   1 file changed, 64 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
>> index 8859834b51b8..0440a98988fc 100644
>> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
>> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
>> @@ -619,22 +619,82 @@ static int active_submits(struct msm_gpu *gpu)
>>   static int adreno_runtime_suspend(struct device *dev)
>>   {
>>   	struct msm_gpu *gpu = dev_to_gpu(dev);
>> -	int remaining;
>> +
>> +	/*
>> +	 * We should be holding a runpm ref, which will prevent
>> +	 * runtime suspend.  In the system suspend path, we've
>> +	 * already waited for active jobs to complete.
>> +	 */
>> +	WARN_ON_ONCE(gpu->active_submits);
>> +
>> +	return gpu->funcs->pm_suspend(gpu);
>> +}
>> +
>> +static void suspend_scheduler(struct msm_gpu *gpu)
>> +{
>> +	int i;
>> +
>> +	/*
>> +	 * Shut down the scheduler before we force suspend, so that
>> +	 * suspend isn't racing with scheduler kthread feeding us
>> +	 * more work.
>> +	 *
>> +	 * Note, we just want to park the thread, and let any jobs
>> +	 * that are already on the hw queue complete normally, as
>> +	 * opposed to the drm_sched_stop() path used for handling
>> +	 * faulting/timed-out jobs.  We can't really cancel any jobs
>> +	 * already on the hw queue without racing with the GPU.
>> +	 */
>> +	for (i = 0; i < gpu->nr_rings; i++) {
>> +		struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
>> +		kthread_park(sched->thread);
> Shouldn't we have some proper interfaces for this?

If I'm not completely mistaken we already should have one, yes.

> Also I'm kinda wondering how other drivers do this, feels like we should have a standard
> way.
>
> Finally not flushing out all in-flight requests sounds a bit like a bad
> idea for system suspend/resume since that's also the hibernation path, and
> that would mean your shrinker/page reclaim stops working. At least in full
> generality. Which ain't good for hibernation.

Completely agree, that looks like an incorrect workaround to me.

During suspend all userspace applications should be frozen and all f 
their hardware activity flushed out and waited for completion.

I do remember that our internal guys came up with pretty much the same 
idea and it sounded broken to me back then as well.

Regards,
Christian.

>
> Adding Christian and Andrey.
> -Daniel
>
>> +	}
>> +}
>> +
>> +static void resume_scheduler(struct msm_gpu *gpu)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < gpu->nr_rings; i++) {
>> +		struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
>> +		kthread_unpark(sched->thread);
>> +	}
>> +}
>> +
>> +static int adreno_system_suspend(struct device *dev)
>> +{
>> +	struct msm_gpu *gpu = dev_to_gpu(dev);
>> +	int remaining, ret;
>> +
>> +	suspend_scheduler(gpu);
>>   
>>   	remaining = wait_event_timeout(gpu->retire_event,
>>   				       active_submits(gpu) == 0,
>>   				       msecs_to_jiffies(1000));
>>   	if (remaining == 0) {
>>   		dev_err(dev, "Timeout waiting for GPU to suspend\n");
>> -		return -EBUSY;
>> +		ret = -EBUSY;
>> +		goto out;
>>   	}
>>   
>> -	return gpu->funcs->pm_suspend(gpu);
>> +	ret = pm_runtime_force_suspend(dev);
>> +out:
>> +	if (ret)
>> +		resume_scheduler(gpu);
>> +
>> +	return ret;
>>   }
>> +
>> +static int adreno_system_resume(struct device *dev)
>> +{
>> +	resume_scheduler(dev_to_gpu(dev));
>> +	return pm_runtime_force_resume(dev);
>> +}
>> +
>>   #endif
>>   
>>   static const struct dev_pm_ops adreno_pm_ops = {
>> -	SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend, pm_runtime_force_resume)
>> +	SET_SYSTEM_SLEEP_PM_OPS(adreno_system_suspend, adreno_system_resume)
>>   	SET_RUNTIME_PM_OPS(adreno_runtime_suspend, adreno_runtime_resume, NULL)
>>   };
>>   
>> -- 
>> 2.35.1
>>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-17 10:06     ` Christian König
@ 2022-03-17 14:58         ` Matthew Brost
  2022-03-17 15:10         ` Rob Clark
  1 sibling, 0 replies; 52+ messages in thread
From: Matthew Brost @ 2022-03-17 14:58 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, Andrey Grodzovsky, dri-devel, freedreno,
	linux-arm-msm, Rob Clark, Sean Paul, Abhinav Kumar, David Airlie,
	Akhil P Oommen, Jonathan Marek, AngeloGioacchino Del Regno,
	Bjorn Andersson, Vladimir Lypak, open list

On Thu, Mar 17, 2022 at 03:06:18AM -0700, Christian König wrote:
> Am 17.03.22 um 10:59 schrieb Daniel Vetter:
> > On Thu, Mar 10, 2022 at 03:46:05PM -0800, Rob Clark wrote:
> >> From: Rob Clark <robdclark@chromium.org>
> >>
> >> In the system suspend path, we don't want to be racing with the
> >> scheduler kthreads pushing additional queued up jobs to the hw
> >> queue (ringbuffer).  So park them first.  While we are at it,
> >> move the wait for active jobs to complete into the new system-
> >> suspend path.
> >>
> >> Signed-off-by: Rob Clark <robdclark@chromium.org>
> >> ---
> >>   drivers/gpu/drm/msm/adreno/adreno_device.c | 68 ++++++++++++++++++++--
> >>   1 file changed, 64 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> index 8859834b51b8..0440a98988fc 100644
> >> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> @@ -619,22 +619,82 @@ static int active_submits(struct msm_gpu *gpu)
> >>   static int adreno_runtime_suspend(struct device *dev)
> >>   {
> >>   	struct msm_gpu *gpu = dev_to_gpu(dev);
> >> -	int remaining;
> >> +
> >> +	/*
> >> +	 * We should be holding a runpm ref, which will prevent
> >> +	 * runtime suspend.  In the system suspend path, we've
> >> +	 * already waited for active jobs to complete.
> >> +	 */
> >> +	WARN_ON_ONCE(gpu->active_submits);
> >> +
> >> +	return gpu->funcs->pm_suspend(gpu);
> >> +}
> >> +
> >> +static void suspend_scheduler(struct msm_gpu *gpu)
> >> +{
> >> +	int i;
> >> +
> >> +	/*
> >> +	 * Shut down the scheduler before we force suspend, so that
> >> +	 * suspend isn't racing with scheduler kthread feeding us
> >> +	 * more work.
> >> +	 *
> >> +	 * Note, we just want to park the thread, and let any jobs
> >> +	 * that are already on the hw queue complete normally, as
> >> +	 * opposed to the drm_sched_stop() path used for handling
> >> +	 * faulting/timed-out jobs.  We can't really cancel any jobs
> >> +	 * already on the hw queue without racing with the GPU.
> >> +	 */
> >> +	for (i = 0; i < gpu->nr_rings; i++) {
> >> +		struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
> >> +		kthread_park(sched->thread);
> > Shouldn't we have some proper interfaces for this?
> 
> If I'm not completely mistaken we already should have one, yes.
> 
> > Also I'm kinda wondering how other drivers do this, feels like we should have a standard
> > way.
> >
> > Finally not flushing out all in-flight requests sounds a bit like a bad
> > idea for system suspend/resume since that's also the hibernation path, and
> > that would mean your shrinker/page reclaim stops working. At least in full
> > generality. Which ain't good for hibernation.
> 
> Completely agree, that looks like an incorrect workaround to me.
> 
> During suspend all userspace applications should be frozen and all f 
> their hardware activity flushed out and waited for completion.
>

Isn't that what Rob is doing?

He kills the scheduler preventing any new job from being submitted then
waits for an outstanding jobs to complete naturally complete (see the
wait_event_timeout below). If the jobs don't naturally complete the
suspend seems to be aborted? That flow makes sense to me and seems like
a novel way to avoid races.

Matt 
 
> I do remember that our internal guys came up with pretty much the same 
> idea and it sounded broken to me back then as well.
> 
> Regards,
> Christian.
> 
> >
> > Adding Christian and Andrey.
> > -Daniel
> >
> >> +	}
> >> +}
> >> +
> >> +static void resume_scheduler(struct msm_gpu *gpu)
> >> +{
> >> +	int i;
> >> +
> >> +	for (i = 0; i < gpu->nr_rings; i++) {
> >> +		struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
> >> +		kthread_unpark(sched->thread);
> >> +	}
> >> +}
> >> +
> >> +static int adreno_system_suspend(struct device *dev)
> >> +{
> >> +	struct msm_gpu *gpu = dev_to_gpu(dev);
> >> +	int remaining, ret;
> >> +
> >> +	suspend_scheduler(gpu);
> >>   
> >>   	remaining = wait_event_timeout(gpu->retire_event,
> >>   				       active_submits(gpu) == 0,
> >>   				       msecs_to_jiffies(1000));
> >>   	if (remaining == 0) {
> >>   		dev_err(dev, "Timeout waiting for GPU to suspend\n");
> >> -		return -EBUSY;
> >> +		ret = -EBUSY;
> >> +		goto out;
> >>   	}
> >>   
> >> -	return gpu->funcs->pm_suspend(gpu);
> >> +	ret = pm_runtime_force_suspend(dev);
> >> +out:
> >> +	if (ret)
> >> +		resume_scheduler(gpu);
> >> +
> >> +	return ret;
> >>   }
> >> +
> >> +static int adreno_system_resume(struct device *dev)
> >> +{
> >> +	resume_scheduler(dev_to_gpu(dev));
> >> +	return pm_runtime_force_resume(dev);
> >> +}
> >> +
> >>   #endif
> >>   
> >>   static const struct dev_pm_ops adreno_pm_ops = {
> >> -	SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend, pm_runtime_force_resume)
> >> +	SET_SYSTEM_SLEEP_PM_OPS(adreno_system_suspend, adreno_system_resume)
> >>   	SET_RUNTIME_PM_OPS(adreno_runtime_suspend, adreno_runtime_resume, NULL)
> >>   };
> >>   
> >> -- 
> >> 2.35.1
> >>
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-17 14:58         ` Matthew Brost
  0 siblings, 0 replies; 52+ messages in thread
From: Matthew Brost @ 2022-03-17 14:58 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, Jonathan Marek, David Airlie, linux-arm-msm,
	Vladimir Lypak, Abhinav Kumar, dri-devel, Bjorn Andersson,
	Sean Paul, Akhil P Oommen, freedreno, open list,
	AngeloGioacchino Del Regno

On Thu, Mar 17, 2022 at 03:06:18AM -0700, Christian König wrote:
> Am 17.03.22 um 10:59 schrieb Daniel Vetter:
> > On Thu, Mar 10, 2022 at 03:46:05PM -0800, Rob Clark wrote:
> >> From: Rob Clark <robdclark@chromium.org>
> >>
> >> In the system suspend path, we don't want to be racing with the
> >> scheduler kthreads pushing additional queued up jobs to the hw
> >> queue (ringbuffer).  So park them first.  While we are at it,
> >> move the wait for active jobs to complete into the new system-
> >> suspend path.
> >>
> >> Signed-off-by: Rob Clark <robdclark@chromium.org>
> >> ---
> >>   drivers/gpu/drm/msm/adreno/adreno_device.c | 68 ++++++++++++++++++++--
> >>   1 file changed, 64 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> index 8859834b51b8..0440a98988fc 100644
> >> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> @@ -619,22 +619,82 @@ static int active_submits(struct msm_gpu *gpu)
> >>   static int adreno_runtime_suspend(struct device *dev)
> >>   {
> >>   	struct msm_gpu *gpu = dev_to_gpu(dev);
> >> -	int remaining;
> >> +
> >> +	/*
> >> +	 * We should be holding a runpm ref, which will prevent
> >> +	 * runtime suspend.  In the system suspend path, we've
> >> +	 * already waited for active jobs to complete.
> >> +	 */
> >> +	WARN_ON_ONCE(gpu->active_submits);
> >> +
> >> +	return gpu->funcs->pm_suspend(gpu);
> >> +}
> >> +
> >> +static void suspend_scheduler(struct msm_gpu *gpu)
> >> +{
> >> +	int i;
> >> +
> >> +	/*
> >> +	 * Shut down the scheduler before we force suspend, so that
> >> +	 * suspend isn't racing with scheduler kthread feeding us
> >> +	 * more work.
> >> +	 *
> >> +	 * Note, we just want to park the thread, and let any jobs
> >> +	 * that are already on the hw queue complete normally, as
> >> +	 * opposed to the drm_sched_stop() path used for handling
> >> +	 * faulting/timed-out jobs.  We can't really cancel any jobs
> >> +	 * already on the hw queue without racing with the GPU.
> >> +	 */
> >> +	for (i = 0; i < gpu->nr_rings; i++) {
> >> +		struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
> >> +		kthread_park(sched->thread);
> > Shouldn't we have some proper interfaces for this?
> 
> If I'm not completely mistaken we already should have one, yes.
> 
> > Also I'm kinda wondering how other drivers do this, feels like we should have a standard
> > way.
> >
> > Finally not flushing out all in-flight requests sounds a bit like a bad
> > idea for system suspend/resume since that's also the hibernation path, and
> > that would mean your shrinker/page reclaim stops working. At least in full
> > generality. Which ain't good for hibernation.
> 
> Completely agree, that looks like an incorrect workaround to me.
> 
> During suspend all userspace applications should be frozen and all f 
> their hardware activity flushed out and waited for completion.
>

Isn't that what Rob is doing?

He kills the scheduler preventing any new job from being submitted then
waits for an outstanding jobs to complete naturally complete (see the
wait_event_timeout below). If the jobs don't naturally complete the
suspend seems to be aborted? That flow makes sense to me and seems like
a novel way to avoid races.

Matt 
 
> I do remember that our internal guys came up with pretty much the same 
> idea and it sounded broken to me back then as well.
> 
> Regards,
> Christian.
> 
> >
> > Adding Christian and Andrey.
> > -Daniel
> >
> >> +	}
> >> +}
> >> +
> >> +static void resume_scheduler(struct msm_gpu *gpu)
> >> +{
> >> +	int i;
> >> +
> >> +	for (i = 0; i < gpu->nr_rings; i++) {
> >> +		struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
> >> +		kthread_unpark(sched->thread);
> >> +	}
> >> +}
> >> +
> >> +static int adreno_system_suspend(struct device *dev)
> >> +{
> >> +	struct msm_gpu *gpu = dev_to_gpu(dev);
> >> +	int remaining, ret;
> >> +
> >> +	suspend_scheduler(gpu);
> >>   
> >>   	remaining = wait_event_timeout(gpu->retire_event,
> >>   				       active_submits(gpu) == 0,
> >>   				       msecs_to_jiffies(1000));
> >>   	if (remaining == 0) {
> >>   		dev_err(dev, "Timeout waiting for GPU to suspend\n");
> >> -		return -EBUSY;
> >> +		ret = -EBUSY;
> >> +		goto out;
> >>   	}
> >>   
> >> -	return gpu->funcs->pm_suspend(gpu);
> >> +	ret = pm_runtime_force_suspend(dev);
> >> +out:
> >> +	if (ret)
> >> +		resume_scheduler(gpu);
> >> +
> >> +	return ret;
> >>   }
> >> +
> >> +static int adreno_system_resume(struct device *dev)
> >> +{
> >> +	resume_scheduler(dev_to_gpu(dev));
> >> +	return pm_runtime_force_resume(dev);
> >> +}
> >> +
> >>   #endif
> >>   
> >>   static const struct dev_pm_ops adreno_pm_ops = {
> >> -	SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend, pm_runtime_force_resume)
> >> +	SET_SYSTEM_SLEEP_PM_OPS(adreno_system_suspend, adreno_system_resume)
> >>   	SET_RUNTIME_PM_OPS(adreno_runtime_suspend, adreno_runtime_resume, NULL)
> >>   };
> >>   
> >> -- 
> >> 2.35.1
> >>
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-17 10:06     ` Christian König
@ 2022-03-17 15:10         ` Rob Clark
  2022-03-17 15:10         ` Rob Clark
  1 sibling, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-17 15:10 UTC (permalink / raw)
  To: Christian König
  Cc: Andrey Grodzovsky, dri-devel, freedreno, linux-arm-msm,
	Rob Clark, Sean Paul, Abhinav Kumar, David Airlie,
	Akhil P Oommen, Jonathan Marek, AngeloGioacchino Del Regno,
	Bjorn Andersson, Vladimir Lypak, open list

On Thu, Mar 17, 2022 at 3:06 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 17.03.22 um 10:59 schrieb Daniel Vetter:
> > On Thu, Mar 10, 2022 at 03:46:05PM -0800, Rob Clark wrote:
> >> From: Rob Clark <robdclark@chromium.org>
> >>
> >> In the system suspend path, we don't want to be racing with the
> >> scheduler kthreads pushing additional queued up jobs to the hw
> >> queue (ringbuffer).  So park them first.  While we are at it,
> >> move the wait for active jobs to complete into the new system-
> >> suspend path.
> >>
> >> Signed-off-by: Rob Clark <robdclark@chromium.org>
> >> ---
> >>   drivers/gpu/drm/msm/adreno/adreno_device.c | 68 ++++++++++++++++++++--
> >>   1 file changed, 64 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> index 8859834b51b8..0440a98988fc 100644
> >> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> @@ -619,22 +619,82 @@ static int active_submits(struct msm_gpu *gpu)
> >>   static int adreno_runtime_suspend(struct device *dev)
> >>   {
> >>      struct msm_gpu *gpu = dev_to_gpu(dev);
> >> -    int remaining;
> >> +
> >> +    /*
> >> +     * We should be holding a runpm ref, which will prevent
> >> +     * runtime suspend.  In the system suspend path, we've
> >> +     * already waited for active jobs to complete.
> >> +     */
> >> +    WARN_ON_ONCE(gpu->active_submits);
> >> +
> >> +    return gpu->funcs->pm_suspend(gpu);
> >> +}
> >> +
> >> +static void suspend_scheduler(struct msm_gpu *gpu)
> >> +{
> >> +    int i;
> >> +
> >> +    /*
> >> +     * Shut down the scheduler before we force suspend, so that
> >> +     * suspend isn't racing with scheduler kthread feeding us
> >> +     * more work.
> >> +     *
> >> +     * Note, we just want to park the thread, and let any jobs
> >> +     * that are already on the hw queue complete normally, as
> >> +     * opposed to the drm_sched_stop() path used for handling
> >> +     * faulting/timed-out jobs.  We can't really cancel any jobs
> >> +     * already on the hw queue without racing with the GPU.
> >> +     */
> >> +    for (i = 0; i < gpu->nr_rings; i++) {
> >> +            struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
> >> +            kthread_park(sched->thread);
> > Shouldn't we have some proper interfaces for this?
>
> If I'm not completely mistaken we already should have one, yes.

drm_sched_stop() was my first thought, but it carries extra baggage.
Really I *just* want to park the kthread.

Note that amdgpu does (for afaict different reasons) park the kthread
directly as well.

> > Also I'm kinda wondering how other drivers do this, feels like we should have a standard
> > way.

As far as other drivers, it seems like they largely ignore it.  I
suspect other drivers also have problems in this area.

Fwiw, I have a piglit test to try to exercise this path if you want to
try it on other drivers.. might need some futzing around to make sure
enough work is queued up that there is some on hw ring and some queued
up in the scheduler when you try to suspend.

https://gitlab.freedesktop.org/mesa/piglit/-/merge_requests/643

> >
> > Finally not flushing out all in-flight requests sounds a bit like a bad
> > idea for system suspend/resume since that's also the hibernation path, and
> > that would mean your shrinker/page reclaim stops working. At least in full
> > generality. Which ain't good for hibernation.
>
> Completely agree, that looks like an incorrect workaround to me.
>
> During suspend all userspace applications should be frozen and all f
> their hardware activity flushed out and waited for completion.
>
> I do remember that our internal guys came up with pretty much the same
> idea and it sounded broken to me back then as well.

userspace frozen != kthread frozen .. that is what this patch is
trying to address, so we aren't racing between shutting down the hw
and the scheduler shoveling more jobs at us.

BR,
-R

> Regards,
> Christian.
>
> >
> > Adding Christian and Andrey.
> > -Daniel
> >
> >> +    }
> >> +}
> >> +
> >> +static void resume_scheduler(struct msm_gpu *gpu)
> >> +{
> >> +    int i;
> >> +
> >> +    for (i = 0; i < gpu->nr_rings; i++) {
> >> +            struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
> >> +            kthread_unpark(sched->thread);
> >> +    }
> >> +}
> >> +
> >> +static int adreno_system_suspend(struct device *dev)
> >> +{
> >> +    struct msm_gpu *gpu = dev_to_gpu(dev);
> >> +    int remaining, ret;
> >> +
> >> +    suspend_scheduler(gpu);
> >>
> >>      remaining = wait_event_timeout(gpu->retire_event,
> >>                                     active_submits(gpu) == 0,
> >>                                     msecs_to_jiffies(1000));
> >>      if (remaining == 0) {
> >>              dev_err(dev, "Timeout waiting for GPU to suspend\n");
> >> -            return -EBUSY;
> >> +            ret = -EBUSY;
> >> +            goto out;
> >>      }
> >>
> >> -    return gpu->funcs->pm_suspend(gpu);
> >> +    ret = pm_runtime_force_suspend(dev);
> >> +out:
> >> +    if (ret)
> >> +            resume_scheduler(gpu);
> >> +
> >> +    return ret;
> >>   }
> >> +
> >> +static int adreno_system_resume(struct device *dev)
> >> +{
> >> +    resume_scheduler(dev_to_gpu(dev));
> >> +    return pm_runtime_force_resume(dev);
> >> +}
> >> +
> >>   #endif
> >>
> >>   static const struct dev_pm_ops adreno_pm_ops = {
> >> -    SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend, pm_runtime_force_resume)
> >> +    SET_SYSTEM_SLEEP_PM_OPS(adreno_system_suspend, adreno_system_resume)
> >>      SET_RUNTIME_PM_OPS(adreno_runtime_suspend, adreno_runtime_resume, NULL)
> >>   };
> >>
> >> --
> >> 2.35.1
> >>
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-17 15:10         ` Rob Clark
  0 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-17 15:10 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, Jonathan Marek, David Airlie, freedreno,
	Vladimir Lypak, Abhinav Kumar, dri-devel, Bjorn Andersson,
	Akhil P Oommen, linux-arm-msm, Sean Paul, open list,
	AngeloGioacchino Del Regno

On Thu, Mar 17, 2022 at 3:06 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 17.03.22 um 10:59 schrieb Daniel Vetter:
> > On Thu, Mar 10, 2022 at 03:46:05PM -0800, Rob Clark wrote:
> >> From: Rob Clark <robdclark@chromium.org>
> >>
> >> In the system suspend path, we don't want to be racing with the
> >> scheduler kthreads pushing additional queued up jobs to the hw
> >> queue (ringbuffer).  So park them first.  While we are at it,
> >> move the wait for active jobs to complete into the new system-
> >> suspend path.
> >>
> >> Signed-off-by: Rob Clark <robdclark@chromium.org>
> >> ---
> >>   drivers/gpu/drm/msm/adreno/adreno_device.c | 68 ++++++++++++++++++++--
> >>   1 file changed, 64 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> index 8859834b51b8..0440a98988fc 100644
> >> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> @@ -619,22 +619,82 @@ static int active_submits(struct msm_gpu *gpu)
> >>   static int adreno_runtime_suspend(struct device *dev)
> >>   {
> >>      struct msm_gpu *gpu = dev_to_gpu(dev);
> >> -    int remaining;
> >> +
> >> +    /*
> >> +     * We should be holding a runpm ref, which will prevent
> >> +     * runtime suspend.  In the system suspend path, we've
> >> +     * already waited for active jobs to complete.
> >> +     */
> >> +    WARN_ON_ONCE(gpu->active_submits);
> >> +
> >> +    return gpu->funcs->pm_suspend(gpu);
> >> +}
> >> +
> >> +static void suspend_scheduler(struct msm_gpu *gpu)
> >> +{
> >> +    int i;
> >> +
> >> +    /*
> >> +     * Shut down the scheduler before we force suspend, so that
> >> +     * suspend isn't racing with scheduler kthread feeding us
> >> +     * more work.
> >> +     *
> >> +     * Note, we just want to park the thread, and let any jobs
> >> +     * that are already on the hw queue complete normally, as
> >> +     * opposed to the drm_sched_stop() path used for handling
> >> +     * faulting/timed-out jobs.  We can't really cancel any jobs
> >> +     * already on the hw queue without racing with the GPU.
> >> +     */
> >> +    for (i = 0; i < gpu->nr_rings; i++) {
> >> +            struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
> >> +            kthread_park(sched->thread);
> > Shouldn't we have some proper interfaces for this?
>
> If I'm not completely mistaken we already should have one, yes.

drm_sched_stop() was my first thought, but it carries extra baggage.
Really I *just* want to park the kthread.

Note that amdgpu does (for afaict different reasons) park the kthread
directly as well.

> > Also I'm kinda wondering how other drivers do this, feels like we should have a standard
> > way.

As far as other drivers, it seems like they largely ignore it.  I
suspect other drivers also have problems in this area.

Fwiw, I have a piglit test to try to exercise this path if you want to
try it on other drivers.. might need some futzing around to make sure
enough work is queued up that there is some on hw ring and some queued
up in the scheduler when you try to suspend.

https://gitlab.freedesktop.org/mesa/piglit/-/merge_requests/643

> >
> > Finally not flushing out all in-flight requests sounds a bit like a bad
> > idea for system suspend/resume since that's also the hibernation path, and
> > that would mean your shrinker/page reclaim stops working. At least in full
> > generality. Which ain't good for hibernation.
>
> Completely agree, that looks like an incorrect workaround to me.
>
> During suspend all userspace applications should be frozen and all f
> their hardware activity flushed out and waited for completion.
>
> I do remember that our internal guys came up with pretty much the same
> idea and it sounded broken to me back then as well.

userspace frozen != kthread frozen .. that is what this patch is
trying to address, so we aren't racing between shutting down the hw
and the scheduler shoveling more jobs at us.

BR,
-R

> Regards,
> Christian.
>
> >
> > Adding Christian and Andrey.
> > -Daniel
> >
> >> +    }
> >> +}
> >> +
> >> +static void resume_scheduler(struct msm_gpu *gpu)
> >> +{
> >> +    int i;
> >> +
> >> +    for (i = 0; i < gpu->nr_rings; i++) {
> >> +            struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched;
> >> +            kthread_unpark(sched->thread);
> >> +    }
> >> +}
> >> +
> >> +static int adreno_system_suspend(struct device *dev)
> >> +{
> >> +    struct msm_gpu *gpu = dev_to_gpu(dev);
> >> +    int remaining, ret;
> >> +
> >> +    suspend_scheduler(gpu);
> >>
> >>      remaining = wait_event_timeout(gpu->retire_event,
> >>                                     active_submits(gpu) == 0,
> >>                                     msecs_to_jiffies(1000));
> >>      if (remaining == 0) {
> >>              dev_err(dev, "Timeout waiting for GPU to suspend\n");
> >> -            return -EBUSY;
> >> +            ret = -EBUSY;
> >> +            goto out;
> >>      }
> >>
> >> -    return gpu->funcs->pm_suspend(gpu);
> >> +    ret = pm_runtime_force_suspend(dev);
> >> +out:
> >> +    if (ret)
> >> +            resume_scheduler(gpu);
> >> +
> >> +    return ret;
> >>   }
> >> +
> >> +static int adreno_system_resume(struct device *dev)
> >> +{
> >> +    resume_scheduler(dev_to_gpu(dev));
> >> +    return pm_runtime_force_resume(dev);
> >> +}
> >> +
> >>   #endif
> >>
> >>   static const struct dev_pm_ops adreno_pm_ops = {
> >> -    SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend, pm_runtime_force_resume)
> >> +    SET_SYSTEM_SLEEP_PM_OPS(adreno_system_suspend, adreno_system_resume)
> >>      SET_RUNTIME_PM_OPS(adreno_runtime_suspend, adreno_runtime_resume, NULL)
> >>   };
> >>
> >> --
> >> 2.35.1
> >>
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-17 15:10         ` Rob Clark
@ 2022-03-17 16:04           ` Christian König
  -1 siblings, 0 replies; 52+ messages in thread
From: Christian König @ 2022-03-17 16:04 UTC (permalink / raw)
  To: Rob Clark
  Cc: Andrey Grodzovsky, dri-devel, freedreno, linux-arm-msm,
	Rob Clark, Sean Paul, Abhinav Kumar, David Airlie,
	Akhil P Oommen, Jonathan Marek, AngeloGioacchino Del Regno,
	Bjorn Andersson, Vladimir Lypak, open list

Am 17.03.22 um 16:10 schrieb Rob Clark:
> [SNIP]
> userspace frozen != kthread frozen .. that is what this patch is
> trying to address, so we aren't racing between shutting down the hw
> and the scheduler shoveling more jobs at us.

Well exactly that's the problem. The scheduler is supposed to shoveling 
more jobs at us until it is empty.

Thinking more about it we will then keep some dma_fence instance 
unsignaled and that is and extremely bad idea since it can lead to 
deadlocks during suspend.

So this patch here is an absolute clear NAK from my side. If amdgpu is 
doing something similar that is a severe bug and needs to be addressed 
somehow.

Regards,
Christian.

>
> BR,
> -R
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-17 16:04           ` Christian König
  0 siblings, 0 replies; 52+ messages in thread
From: Christian König @ 2022-03-17 16:04 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Jonathan Marek, David Airlie, freedreno,
	Vladimir Lypak, Abhinav Kumar, dri-devel, Bjorn Andersson,
	Akhil P Oommen, linux-arm-msm, Sean Paul, open list,
	AngeloGioacchino Del Regno

Am 17.03.22 um 16:10 schrieb Rob Clark:
> [SNIP]
> userspace frozen != kthread frozen .. that is what this patch is
> trying to address, so we aren't racing between shutting down the hw
> and the scheduler shoveling more jobs at us.

Well exactly that's the problem. The scheduler is supposed to shoveling 
more jobs at us until it is empty.

Thinking more about it we will then keep some dma_fence instance 
unsignaled and that is and extremely bad idea since it can lead to 
deadlocks during suspend.

So this patch here is an absolute clear NAK from my side. If amdgpu is 
doing something similar that is a severe bug and needs to be addressed 
somehow.

Regards,
Christian.

>
> BR,
> -R
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-17 16:04           ` Christian König
@ 2022-03-17 16:18             ` Rob Clark
  -1 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-17 16:18 UTC (permalink / raw)
  To: Christian König
  Cc: Andrey Grodzovsky, dri-devel, freedreno, linux-arm-msm,
	Rob Clark, Sean Paul, Abhinav Kumar, David Airlie,
	Akhil P Oommen, Jonathan Marek, AngeloGioacchino Del Regno,
	Bjorn Andersson, Vladimir Lypak, open list

On Thu, Mar 17, 2022 at 9:04 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 17.03.22 um 16:10 schrieb Rob Clark:
> > [SNIP]
> > userspace frozen != kthread frozen .. that is what this patch is
> > trying to address, so we aren't racing between shutting down the hw
> > and the scheduler shoveling more jobs at us.
>
> Well exactly that's the problem. The scheduler is supposed to shoveling
> more jobs at us until it is empty.
>
> Thinking more about it we will then keep some dma_fence instance
> unsignaled and that is and extremely bad idea since it can lead to
> deadlocks during suspend.

Hmm, perhaps that is true if you need to migrate things out of vram?
It is at least not a problem when vram is not involved.

> So this patch here is an absolute clear NAK from my side. If amdgpu is
> doing something similar that is a severe bug and needs to be addressed
> somehow.

I think amdgpu's use of kthread_park is not related to suspend, but
didn't look too closely.

And perhaps the solution for this problem is more complex in the case
of amdgpu, I'm not super familiar with the constraints there.  But I
think it is a fine solution for integrated GPUs.

BR,
-R

> Regards,
> Christian.
>
> >
> > BR,
> > -R
> >
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-17 16:18             ` Rob Clark
  0 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-17 16:18 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, Jonathan Marek, David Airlie, freedreno,
	Vladimir Lypak, Abhinav Kumar, dri-devel, Bjorn Andersson,
	Akhil P Oommen, linux-arm-msm, Sean Paul, open list,
	AngeloGioacchino Del Regno

On Thu, Mar 17, 2022 at 9:04 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 17.03.22 um 16:10 schrieb Rob Clark:
> > [SNIP]
> > userspace frozen != kthread frozen .. that is what this patch is
> > trying to address, so we aren't racing between shutting down the hw
> > and the scheduler shoveling more jobs at us.
>
> Well exactly that's the problem. The scheduler is supposed to shoveling
> more jobs at us until it is empty.
>
> Thinking more about it we will then keep some dma_fence instance
> unsignaled and that is and extremely bad idea since it can lead to
> deadlocks during suspend.

Hmm, perhaps that is true if you need to migrate things out of vram?
It is at least not a problem when vram is not involved.

> So this patch here is an absolute clear NAK from my side. If amdgpu is
> doing something similar that is a severe bug and needs to be addressed
> somehow.

I think amdgpu's use of kthread_park is not related to suspend, but
didn't look too closely.

And perhaps the solution for this problem is more complex in the case
of amdgpu, I'm not super familiar with the constraints there.  But I
think it is a fine solution for integrated GPUs.

BR,
-R

> Regards,
> Christian.
>
> >
> > BR,
> > -R
> >
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-17 16:18             ` Rob Clark
@ 2022-03-17 16:44               ` Christian König
  -1 siblings, 0 replies; 52+ messages in thread
From: Christian König @ 2022-03-17 16:44 UTC (permalink / raw)
  To: Rob Clark
  Cc: Andrey Grodzovsky, dri-devel, freedreno, linux-arm-msm,
	Rob Clark, Sean Paul, Abhinav Kumar, David Airlie,
	Akhil P Oommen, Jonathan Marek, AngeloGioacchino Del Regno,
	Bjorn Andersson, Vladimir Lypak, open list

Am 17.03.22 um 17:18 schrieb Rob Clark:
> On Thu, Mar 17, 2022 at 9:04 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 17.03.22 um 16:10 schrieb Rob Clark:
>>> [SNIP]
>>> userspace frozen != kthread frozen .. that is what this patch is
>>> trying to address, so we aren't racing between shutting down the hw
>>> and the scheduler shoveling more jobs at us.
>> Well exactly that's the problem. The scheduler is supposed to shoveling
>> more jobs at us until it is empty.
>>
>> Thinking more about it we will then keep some dma_fence instance
>> unsignaled and that is and extremely bad idea since it can lead to
>> deadlocks during suspend.
> Hmm, perhaps that is true if you need to migrate things out of vram?
> It is at least not a problem when vram is not involved.

No, it's much wider than that.

See what can happen is that the memory management shrinkers want to wait 
for a dma_fence during suspend.

And if you stop the scheduler they will just wait forever.

What you need to do instead is to drain the scheduler, e.g. call 
drm_sched_entity_flush() with a proper timeout for each entity you have 
created.

Regards,
Christian.

>
>> So this patch here is an absolute clear NAK from my side. If amdgpu is
>> doing something similar that is a severe bug and needs to be addressed
>> somehow.
> I think amdgpu's use of kthread_park is not related to suspend, but
> didn't look too closely.
>
> And perhaps the solution for this problem is more complex in the case
> of amdgpu, I'm not super familiar with the constraints there.  But I
> think it is a fine solution for integrated GPUs.
>
> BR,
> -R
>
>> Regards,
>> Christian.
>>
>>> BR,
>>> -R
>>>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-17 16:44               ` Christian König
  0 siblings, 0 replies; 52+ messages in thread
From: Christian König @ 2022-03-17 16:44 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Jonathan Marek, David Airlie, freedreno,
	Vladimir Lypak, Abhinav Kumar, dri-devel, Bjorn Andersson,
	Akhil P Oommen, linux-arm-msm, Sean Paul, open list,
	AngeloGioacchino Del Regno

Am 17.03.22 um 17:18 schrieb Rob Clark:
> On Thu, Mar 17, 2022 at 9:04 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 17.03.22 um 16:10 schrieb Rob Clark:
>>> [SNIP]
>>> userspace frozen != kthread frozen .. that is what this patch is
>>> trying to address, so we aren't racing between shutting down the hw
>>> and the scheduler shoveling more jobs at us.
>> Well exactly that's the problem. The scheduler is supposed to shoveling
>> more jobs at us until it is empty.
>>
>> Thinking more about it we will then keep some dma_fence instance
>> unsignaled and that is and extremely bad idea since it can lead to
>> deadlocks during suspend.
> Hmm, perhaps that is true if you need to migrate things out of vram?
> It is at least not a problem when vram is not involved.

No, it's much wider than that.

See what can happen is that the memory management shrinkers want to wait 
for a dma_fence during suspend.

And if you stop the scheduler they will just wait forever.

What you need to do instead is to drain the scheduler, e.g. call 
drm_sched_entity_flush() with a proper timeout for each entity you have 
created.

Regards,
Christian.

>
>> So this patch here is an absolute clear NAK from my side. If amdgpu is
>> doing something similar that is a severe bug and needs to be addressed
>> somehow.
> I think amdgpu's use of kthread_park is not related to suspend, but
> didn't look too closely.
>
> And perhaps the solution for this problem is more complex in the case
> of amdgpu, I'm not super familiar with the constraints there.  But I
> think it is a fine solution for integrated GPUs.
>
> BR,
> -R
>
>> Regards,
>> Christian.
>>
>>> BR,
>>> -R
>>>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-17 16:44               ` Christian König
@ 2022-03-17 17:29                 ` Daniel Vetter
  -1 siblings, 0 replies; 52+ messages in thread
From: Daniel Vetter @ 2022-03-17 17:29 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, Rob Clark, Jonathan Marek, David Airlie, freedreno,
	Vladimir Lypak, Abhinav Kumar, dri-devel, Bjorn Andersson,
	Akhil P Oommen, linux-arm-msm, Sean Paul, open list,
	AngeloGioacchino Del Regno

On Thu, Mar 17, 2022 at 05:44:57PM +0100, Christian König wrote:
> Am 17.03.22 um 17:18 schrieb Rob Clark:
> > On Thu, Mar 17, 2022 at 9:04 AM Christian König
> > <christian.koenig@amd.com> wrote:
> > > Am 17.03.22 um 16:10 schrieb Rob Clark:
> > > > [SNIP]
> > > > userspace frozen != kthread frozen .. that is what this patch is
> > > > trying to address, so we aren't racing between shutting down the hw
> > > > and the scheduler shoveling more jobs at us.
> > > Well exactly that's the problem. The scheduler is supposed to shoveling
> > > more jobs at us until it is empty.
> > > 
> > > Thinking more about it we will then keep some dma_fence instance
> > > unsignaled and that is and extremely bad idea since it can lead to
> > > deadlocks during suspend.
> > Hmm, perhaps that is true if you need to migrate things out of vram?
> > It is at least not a problem when vram is not involved.
> 
> No, it's much wider than that.
> 
> See what can happen is that the memory management shrinkers want to wait for
> a dma_fence during suspend.
> 
> And if you stop the scheduler they will just wait forever.
> 
> What you need to do instead is to drain the scheduler, e.g. call
> drm_sched_entity_flush() with a proper timeout for each entity you have
> created.

Yeah I think properly flushing the scheduler and stopping it and cutting
all drivers over to that sounds like the right approach. Generally suspend
shouldn't be such a critical path that this will hurt us, all the other io
queues get flushed too afaik.

Resume is the thing that needs to go real fast.

So a patch set to move all drivers that open code the kthread_park to the
right scheduler function sounds like the right idea here to me.
-Daniel

> 
> Regards,
> Christian.
> 
> > 
> > > So this patch here is an absolute clear NAK from my side. If amdgpu is
> > > doing something similar that is a severe bug and needs to be addressed
> > > somehow.
> > I think amdgpu's use of kthread_park is not related to suspend, but
> > didn't look too closely.
> > 
> > And perhaps the solution for this problem is more complex in the case
> > of amdgpu, I'm not super familiar with the constraints there.  But I
> > think it is a fine solution for integrated GPUs.
> > 
> > BR,
> > -R
> > 
> > > Regards,
> > > Christian.
> > > 
> > > > BR,
> > > > -R
> > > > 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-17 17:29                 ` Daniel Vetter
  0 siblings, 0 replies; 52+ messages in thread
From: Daniel Vetter @ 2022-03-17 17:29 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, Jonathan Marek, David Airlie, linux-arm-msm,
	Vladimir Lypak, Abhinav Kumar, dri-devel, Bjorn Andersson,
	Akhil P Oommen, Sean Paul, freedreno, open list,
	AngeloGioacchino Del Regno

On Thu, Mar 17, 2022 at 05:44:57PM +0100, Christian König wrote:
> Am 17.03.22 um 17:18 schrieb Rob Clark:
> > On Thu, Mar 17, 2022 at 9:04 AM Christian König
> > <christian.koenig@amd.com> wrote:
> > > Am 17.03.22 um 16:10 schrieb Rob Clark:
> > > > [SNIP]
> > > > userspace frozen != kthread frozen .. that is what this patch is
> > > > trying to address, so we aren't racing between shutting down the hw
> > > > and the scheduler shoveling more jobs at us.
> > > Well exactly that's the problem. The scheduler is supposed to shoveling
> > > more jobs at us until it is empty.
> > > 
> > > Thinking more about it we will then keep some dma_fence instance
> > > unsignaled and that is and extremely bad idea since it can lead to
> > > deadlocks during suspend.
> > Hmm, perhaps that is true if you need to migrate things out of vram?
> > It is at least not a problem when vram is not involved.
> 
> No, it's much wider than that.
> 
> See what can happen is that the memory management shrinkers want to wait for
> a dma_fence during suspend.
> 
> And if you stop the scheduler they will just wait forever.
> 
> What you need to do instead is to drain the scheduler, e.g. call
> drm_sched_entity_flush() with a proper timeout for each entity you have
> created.

Yeah I think properly flushing the scheduler and stopping it and cutting
all drivers over to that sounds like the right approach. Generally suspend
shouldn't be such a critical path that this will hurt us, all the other io
queues get flushed too afaik.

Resume is the thing that needs to go real fast.

So a patch set to move all drivers that open code the kthread_park to the
right scheduler function sounds like the right idea here to me.
-Daniel

> 
> Regards,
> Christian.
> 
> > 
> > > So this patch here is an absolute clear NAK from my side. If amdgpu is
> > > doing something similar that is a severe bug and needs to be addressed
> > > somehow.
> > I think amdgpu's use of kthread_park is not related to suspend, but
> > didn't look too closely.
> > 
> > And perhaps the solution for this problem is more complex in the case
> > of amdgpu, I'm not super familiar with the constraints there.  But I
> > think it is a fine solution for integrated GPUs.
> > 
> > BR,
> > -R
> > 
> > > Regards,
> > > Christian.
> > > 
> > > > BR,
> > > > -R
> > > > 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-17 16:44               ` Christian König
@ 2022-03-17 17:35                 ` Rob Clark
  -1 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-17 17:35 UTC (permalink / raw)
  To: Christian König
  Cc: Andrey Grodzovsky, dri-devel, freedreno, linux-arm-msm,
	Rob Clark, Sean Paul, Abhinav Kumar, David Airlie,
	Akhil P Oommen, Jonathan Marek, AngeloGioacchino Del Regno,
	Bjorn Andersson, Vladimir Lypak, open list

On Thu, Mar 17, 2022 at 9:45 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 17.03.22 um 17:18 schrieb Rob Clark:
> > On Thu, Mar 17, 2022 at 9:04 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Am 17.03.22 um 16:10 schrieb Rob Clark:
> >>> [SNIP]
> >>> userspace frozen != kthread frozen .. that is what this patch is
> >>> trying to address, so we aren't racing between shutting down the hw
> >>> and the scheduler shoveling more jobs at us.
> >> Well exactly that's the problem. The scheduler is supposed to shoveling
> >> more jobs at us until it is empty.
> >>
> >> Thinking more about it we will then keep some dma_fence instance
> >> unsignaled and that is and extremely bad idea since it can lead to
> >> deadlocks during suspend.
> > Hmm, perhaps that is true if you need to migrate things out of vram?
> > It is at least not a problem when vram is not involved.
>
> No, it's much wider than that.
>
> See what can happen is that the memory management shrinkers want to wait
> for a dma_fence during suspend.

we don't wait on fences in shrinker, only purging or evicting things
that are already ready.  Actually, waiting on fences in shrinker path
sounds like a pretty bad idea.

> And if you stop the scheduler they will just wait forever.
>
> What you need to do instead is to drain the scheduler, e.g. call
> drm_sched_entity_flush() with a proper timeout for each entity you have
> created.

yeah, it would work to drain the scheduler.. I guess that might be the
more portable approach as far as generic solution for suspend.

BR,
-R

> Regards,
> Christian.
>
> >
> >> So this patch here is an absolute clear NAK from my side. If amdgpu is
> >> doing something similar that is a severe bug and needs to be addressed
> >> somehow.
> > I think amdgpu's use of kthread_park is not related to suspend, but
> > didn't look too closely.
> >
> > And perhaps the solution for this problem is more complex in the case
> > of amdgpu, I'm not super familiar with the constraints there.  But I
> > think it is a fine solution for integrated GPUs.
> >
> > BR,
> > -R
> >
> >> Regards,
> >> Christian.
> >>
> >>> BR,
> >>> -R
> >>>
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-17 17:35                 ` Rob Clark
  0 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-17 17:35 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, Jonathan Marek, David Airlie, freedreno,
	Vladimir Lypak, Abhinav Kumar, dri-devel, Bjorn Andersson,
	Akhil P Oommen, linux-arm-msm, Sean Paul, open list,
	AngeloGioacchino Del Regno

On Thu, Mar 17, 2022 at 9:45 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 17.03.22 um 17:18 schrieb Rob Clark:
> > On Thu, Mar 17, 2022 at 9:04 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Am 17.03.22 um 16:10 schrieb Rob Clark:
> >>> [SNIP]
> >>> userspace frozen != kthread frozen .. that is what this patch is
> >>> trying to address, so we aren't racing between shutting down the hw
> >>> and the scheduler shoveling more jobs at us.
> >> Well exactly that's the problem. The scheduler is supposed to shoveling
> >> more jobs at us until it is empty.
> >>
> >> Thinking more about it we will then keep some dma_fence instance
> >> unsignaled and that is and extremely bad idea since it can lead to
> >> deadlocks during suspend.
> > Hmm, perhaps that is true if you need to migrate things out of vram?
> > It is at least not a problem when vram is not involved.
>
> No, it's much wider than that.
>
> See what can happen is that the memory management shrinkers want to wait
> for a dma_fence during suspend.

we don't wait on fences in shrinker, only purging or evicting things
that are already ready.  Actually, waiting on fences in shrinker path
sounds like a pretty bad idea.

> And if you stop the scheduler they will just wait forever.
>
> What you need to do instead is to drain the scheduler, e.g. call
> drm_sched_entity_flush() with a proper timeout for each entity you have
> created.

yeah, it would work to drain the scheduler.. I guess that might be the
more portable approach as far as generic solution for suspend.

BR,
-R

> Regards,
> Christian.
>
> >
> >> So this patch here is an absolute clear NAK from my side. If amdgpu is
> >> doing something similar that is a severe bug and needs to be addressed
> >> somehow.
> > I think amdgpu's use of kthread_park is not related to suspend, but
> > didn't look too closely.
> >
> > And perhaps the solution for this problem is more complex in the case
> > of amdgpu, I'm not super familiar with the constraints there.  But I
> > think it is a fine solution for integrated GPUs.
> >
> > BR,
> > -R
> >
> >> Regards,
> >> Christian.
> >>
> >>> BR,
> >>> -R
> >>>
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-17 16:04           ` Christian König
@ 2022-03-17 17:46             ` Andrey Grodzovsky
  -1 siblings, 0 replies; 52+ messages in thread
From: Andrey Grodzovsky @ 2022-03-17 17:46 UTC (permalink / raw)
  To: Christian König, Rob Clark
  Cc: dri-devel, freedreno, linux-arm-msm, Rob Clark, Sean Paul,
	Abhinav Kumar, David Airlie, Akhil P Oommen, Jonathan Marek,
	AngeloGioacchino Del Regno, Bjorn Andersson, Vladimir Lypak,
	open list


On 2022-03-17 12:04, Christian König wrote:
> Am 17.03.22 um 16:10 schrieb Rob Clark:
>> [SNIP]
>> userspace frozen != kthread frozen .. that is what this patch is
>> trying to address, so we aren't racing between shutting down the hw
>> and the scheduler shoveling more jobs at us.
>
> Well exactly that's the problem. The scheduler is supposed to 
> shoveling more jobs at us until it is empty.
>
> Thinking more about it we will then keep some dma_fence instance 
> unsignaled and that is and extremely bad idea since it can lead to 
> deadlocks during suspend.
>
> So this patch here is an absolute clear NAK from my side. If amdgpu is 
> doing something similar that is a severe bug and needs to be addressed 
> somehow.


 From looking at latest amd-stagin-drm-next we only use directly 
kthread_park during in debugfs IB hooks.
For S3  suspend (amdgpu_pmops_suspend) we will only  flush all the HW 
fences (amdgpu_fence_wait_empty) so we don't freeze the scheduler thread 
and don't flush scheduler entities.

Andrey


>
> Regards,
> Christian.
>
>>
>> BR,
>> -R
>>
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-17 17:46             ` Andrey Grodzovsky
  0 siblings, 0 replies; 52+ messages in thread
From: Andrey Grodzovsky @ 2022-03-17 17:46 UTC (permalink / raw)
  To: Christian König, Rob Clark
  Cc: Rob Clark, freedreno, Jonathan Marek, David Airlie,
	linux-arm-msm, Vladimir Lypak, Abhinav Kumar, dri-devel,
	Bjorn Andersson, Akhil P Oommen, Sean Paul, open list,
	AngeloGioacchino Del Regno


On 2022-03-17 12:04, Christian König wrote:
> Am 17.03.22 um 16:10 schrieb Rob Clark:
>> [SNIP]
>> userspace frozen != kthread frozen .. that is what this patch is
>> trying to address, so we aren't racing between shutting down the hw
>> and the scheduler shoveling more jobs at us.
>
> Well exactly that's the problem. The scheduler is supposed to 
> shoveling more jobs at us until it is empty.
>
> Thinking more about it we will then keep some dma_fence instance 
> unsignaled and that is and extremely bad idea since it can lead to 
> deadlocks during suspend.
>
> So this patch here is an absolute clear NAK from my side. If amdgpu is 
> doing something similar that is a severe bug and needs to be addressed 
> somehow.


 From looking at latest amd-stagin-drm-next we only use directly 
kthread_park during in debugfs IB hooks.
For S3  suspend (amdgpu_pmops_suspend) we will only  flush all the HW 
fences (amdgpu_fence_wait_empty) so we don't freeze the scheduler thread 
and don't flush scheduler entities.

Andrey


>
> Regards,
> Christian.
>
>>
>> BR,
>> -R
>>
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-17 17:35                 ` Rob Clark
@ 2022-03-17 18:10                   ` Andrey Grodzovsky
  -1 siblings, 0 replies; 52+ messages in thread
From: Andrey Grodzovsky @ 2022-03-17 18:10 UTC (permalink / raw)
  To: Rob Clark, Christian König
  Cc: dri-devel, freedreno, linux-arm-msm, Rob Clark, Sean Paul,
	Abhinav Kumar, David Airlie, Akhil P Oommen, Jonathan Marek,
	AngeloGioacchino Del Regno, Bjorn Andersson, Vladimir Lypak,
	open list


On 2022-03-17 13:35, Rob Clark wrote:
> On Thu, Mar 17, 2022 at 9:45 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 17.03.22 um 17:18 schrieb Rob Clark:
>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
>>>>> [SNIP]
>>>>> userspace frozen != kthread frozen .. that is what this patch is
>>>>> trying to address, so we aren't racing between shutting down the hw
>>>>> and the scheduler shoveling more jobs at us.
>>>> Well exactly that's the problem. The scheduler is supposed to shoveling
>>>> more jobs at us until it is empty.
>>>>
>>>> Thinking more about it we will then keep some dma_fence instance
>>>> unsignaled and that is and extremely bad idea since it can lead to
>>>> deadlocks during suspend.
>>> Hmm, perhaps that is true if you need to migrate things out of vram?
>>> It is at least not a problem when vram is not involved.
>> No, it's much wider than that.
>>
>> See what can happen is that the memory management shrinkers want to wait
>> for a dma_fence during suspend.
> we don't wait on fences in shrinker, only purging or evicting things
> that are already ready.  Actually, waiting on fences in shrinker path
> sounds like a pretty bad idea.
>
>> And if you stop the scheduler they will just wait forever.
>>
>> What you need to do instead is to drain the scheduler, e.g. call
>> drm_sched_entity_flush() with a proper timeout for each entity you have
>> created.
> yeah, it would work to drain the scheduler.. I guess that might be the
> more portable approach as far as generic solution for suspend.
>
> BR,
> -R


I am not sure how this drains the scheduler ? Suppose we done the 
waiting in drm_sched_entity_flush,
what prevents someone to push right away another job into the same 
entity's queue  right after that ?
Shouldn't we first disable further pushing of jobs into entity before we 
wait for  sched->job_scheduled ?

Andrey


>
>> Regards,
>> Christian.
>>
>>>> So this patch here is an absolute clear NAK from my side. If amdgpu is
>>>> doing something similar that is a severe bug and needs to be addressed
>>>> somehow.
>>> I think amdgpu's use of kthread_park is not related to suspend, but
>>> didn't look too closely.
>>>
>>> And perhaps the solution for this problem is more complex in the case
>>> of amdgpu, I'm not super familiar with the constraints there.  But I
>>> think it is a fine solution for integrated GPUs.
>>>
>>> BR,
>>> -R
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> BR,
>>>>> -R
>>>>>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-17 18:10                   ` Andrey Grodzovsky
  0 siblings, 0 replies; 52+ messages in thread
From: Andrey Grodzovsky @ 2022-03-17 18:10 UTC (permalink / raw)
  To: Rob Clark, Christian König
  Cc: Rob Clark, freedreno, Jonathan Marek, David Airlie,
	linux-arm-msm, Vladimir Lypak, Abhinav Kumar, dri-devel,
	Bjorn Andersson, Akhil P Oommen, Sean Paul, open list,
	AngeloGioacchino Del Regno


On 2022-03-17 13:35, Rob Clark wrote:
> On Thu, Mar 17, 2022 at 9:45 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 17.03.22 um 17:18 schrieb Rob Clark:
>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
>>>>> [SNIP]
>>>>> userspace frozen != kthread frozen .. that is what this patch is
>>>>> trying to address, so we aren't racing between shutting down the hw
>>>>> and the scheduler shoveling more jobs at us.
>>>> Well exactly that's the problem. The scheduler is supposed to shoveling
>>>> more jobs at us until it is empty.
>>>>
>>>> Thinking more about it we will then keep some dma_fence instance
>>>> unsignaled and that is and extremely bad idea since it can lead to
>>>> deadlocks during suspend.
>>> Hmm, perhaps that is true if you need to migrate things out of vram?
>>> It is at least not a problem when vram is not involved.
>> No, it's much wider than that.
>>
>> See what can happen is that the memory management shrinkers want to wait
>> for a dma_fence during suspend.
> we don't wait on fences in shrinker, only purging or evicting things
> that are already ready.  Actually, waiting on fences in shrinker path
> sounds like a pretty bad idea.
>
>> And if you stop the scheduler they will just wait forever.
>>
>> What you need to do instead is to drain the scheduler, e.g. call
>> drm_sched_entity_flush() with a proper timeout for each entity you have
>> created.
> yeah, it would work to drain the scheduler.. I guess that might be the
> more portable approach as far as generic solution for suspend.
>
> BR,
> -R


I am not sure how this drains the scheduler ? Suppose we done the 
waiting in drm_sched_entity_flush,
what prevents someone to push right away another job into the same 
entity's queue  right after that ?
Shouldn't we first disable further pushing of jobs into entity before we 
wait for  sched->job_scheduled ?

Andrey


>
>> Regards,
>> Christian.
>>
>>>> So this patch here is an absolute clear NAK from my side. If amdgpu is
>>>> doing something similar that is a severe bug and needs to be addressed
>>>> somehow.
>>> I think amdgpu's use of kthread_park is not related to suspend, but
>>> didn't look too closely.
>>>
>>> And perhaps the solution for this problem is more complex in the case
>>> of amdgpu, I'm not super familiar with the constraints there.  But I
>>> think it is a fine solution for integrated GPUs.
>>>
>>> BR,
>>> -R
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> BR,
>>>>> -R
>>>>>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-17 18:10                   ` Andrey Grodzovsky
@ 2022-03-17 18:25                     ` Rob Clark
  -1 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-17 18:25 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Christian König, dri-devel, freedreno, linux-arm-msm,
	Rob Clark, Sean Paul, Abhinav Kumar, David Airlie,
	Akhil P Oommen, Jonathan Marek, AngeloGioacchino Del Regno,
	Bjorn Andersson, Vladimir Lypak, open list

On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
>
> On 2022-03-17 13:35, Rob Clark wrote:
> > On Thu, Mar 17, 2022 at 9:45 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Am 17.03.22 um 17:18 schrieb Rob Clark:
> >>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
> >>>>> [SNIP]
> >>>>> userspace frozen != kthread frozen .. that is what this patch is
> >>>>> trying to address, so we aren't racing between shutting down the hw
> >>>>> and the scheduler shoveling more jobs at us.
> >>>> Well exactly that's the problem. The scheduler is supposed to shoveling
> >>>> more jobs at us until it is empty.
> >>>>
> >>>> Thinking more about it we will then keep some dma_fence instance
> >>>> unsignaled and that is and extremely bad idea since it can lead to
> >>>> deadlocks during suspend.
> >>> Hmm, perhaps that is true if you need to migrate things out of vram?
> >>> It is at least not a problem when vram is not involved.
> >> No, it's much wider than that.
> >>
> >> See what can happen is that the memory management shrinkers want to wait
> >> for a dma_fence during suspend.
> > we don't wait on fences in shrinker, only purging or evicting things
> > that are already ready.  Actually, waiting on fences in shrinker path
> > sounds like a pretty bad idea.
> >
> >> And if you stop the scheduler they will just wait forever.
> >>
> >> What you need to do instead is to drain the scheduler, e.g. call
> >> drm_sched_entity_flush() with a proper timeout for each entity you have
> >> created.
> > yeah, it would work to drain the scheduler.. I guess that might be the
> > more portable approach as far as generic solution for suspend.
> >
> > BR,
> > -R
>
>
> I am not sure how this drains the scheduler ? Suppose we done the
> waiting in drm_sched_entity_flush,
> what prevents someone to push right away another job into the same
> entity's queue  right after that ?
> Shouldn't we first disable further pushing of jobs into entity before we
> wait for  sched->job_scheduled ?
>

In the system suspend path, userspace processes will have already been
frozen, so there should be no way to push more jobs to the scheduler,
unless they are pushed from the kernel itself.  We don't do that in
drm/msm, but maybe you need to to move things btwn vram and system
memory?  But even in that case, if the # of jobs you push is bounded I
guess that is ok?

BR,
-R

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-17 18:25                     ` Rob Clark
  0 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-17 18:25 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Rob Clark, freedreno, Jonathan Marek, David Airlie,
	linux-arm-msm, Vladimir Lypak, Abhinav Kumar, dri-devel,
	Bjorn Andersson, Akhil P Oommen, Sean Paul, Christian König,
	open list, AngeloGioacchino Del Regno

On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
>
> On 2022-03-17 13:35, Rob Clark wrote:
> > On Thu, Mar 17, 2022 at 9:45 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Am 17.03.22 um 17:18 schrieb Rob Clark:
> >>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
> >>>>> [SNIP]
> >>>>> userspace frozen != kthread frozen .. that is what this patch is
> >>>>> trying to address, so we aren't racing between shutting down the hw
> >>>>> and the scheduler shoveling more jobs at us.
> >>>> Well exactly that's the problem. The scheduler is supposed to shoveling
> >>>> more jobs at us until it is empty.
> >>>>
> >>>> Thinking more about it we will then keep some dma_fence instance
> >>>> unsignaled and that is and extremely bad idea since it can lead to
> >>>> deadlocks during suspend.
> >>> Hmm, perhaps that is true if you need to migrate things out of vram?
> >>> It is at least not a problem when vram is not involved.
> >> No, it's much wider than that.
> >>
> >> See what can happen is that the memory management shrinkers want to wait
> >> for a dma_fence during suspend.
> > we don't wait on fences in shrinker, only purging or evicting things
> > that are already ready.  Actually, waiting on fences in shrinker path
> > sounds like a pretty bad idea.
> >
> >> And if you stop the scheduler they will just wait forever.
> >>
> >> What you need to do instead is to drain the scheduler, e.g. call
> >> drm_sched_entity_flush() with a proper timeout for each entity you have
> >> created.
> > yeah, it would work to drain the scheduler.. I guess that might be the
> > more portable approach as far as generic solution for suspend.
> >
> > BR,
> > -R
>
>
> I am not sure how this drains the scheduler ? Suppose we done the
> waiting in drm_sched_entity_flush,
> what prevents someone to push right away another job into the same
> entity's queue  right after that ?
> Shouldn't we first disable further pushing of jobs into entity before we
> wait for  sched->job_scheduled ?
>

In the system suspend path, userspace processes will have already been
frozen, so there should be no way to push more jobs to the scheduler,
unless they are pushed from the kernel itself.  We don't do that in
drm/msm, but maybe you need to to move things btwn vram and system
memory?  But even in that case, if the # of jobs you push is bounded I
guess that is ok?

BR,
-R

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-17 18:25                     ` Rob Clark
@ 2022-03-17 19:49                       ` Andrey Grodzovsky
  -1 siblings, 0 replies; 52+ messages in thread
From: Andrey Grodzovsky @ 2022-03-17 19:49 UTC (permalink / raw)
  To: Rob Clark
  Cc: Christian König, dri-devel, freedreno, linux-arm-msm,
	Rob Clark, Sean Paul, Abhinav Kumar, David Airlie,
	Akhil P Oommen, Jonathan Marek, AngeloGioacchino Del Regno,
	Bjorn Andersson, Vladimir Lypak, open list


On 2022-03-17 14:25, Rob Clark wrote:
> On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>>
>> On 2022-03-17 13:35, Rob Clark wrote:
>>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
>>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
>>>>> <christian.koenig@amd.com> wrote:
>>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
>>>>>>> [SNIP]
>>>>>>> userspace frozen != kthread frozen .. that is what this patch is
>>>>>>> trying to address, so we aren't racing between shutting down the hw
>>>>>>> and the scheduler shoveling more jobs at us.
>>>>>> Well exactly that's the problem. The scheduler is supposed to shoveling
>>>>>> more jobs at us until it is empty.
>>>>>>
>>>>>> Thinking more about it we will then keep some dma_fence instance
>>>>>> unsignaled and that is and extremely bad idea since it can lead to
>>>>>> deadlocks during suspend.
>>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
>>>>> It is at least not a problem when vram is not involved.
>>>> No, it's much wider than that.
>>>>
>>>> See what can happen is that the memory management shrinkers want to wait
>>>> for a dma_fence during suspend.
>>> we don't wait on fences in shrinker, only purging or evicting things
>>> that are already ready.  Actually, waiting on fences in shrinker path
>>> sounds like a pretty bad idea.
>>>
>>>> And if you stop the scheduler they will just wait forever.
>>>>
>>>> What you need to do instead is to drain the scheduler, e.g. call
>>>> drm_sched_entity_flush() with a proper timeout for each entity you have
>>>> created.
>>> yeah, it would work to drain the scheduler.. I guess that might be the
>>> more portable approach as far as generic solution for suspend.
>>>
>>> BR,
>>> -R
>>
>> I am not sure how this drains the scheduler ? Suppose we done the
>> waiting in drm_sched_entity_flush,
>> what prevents someone to push right away another job into the same
>> entity's queue  right after that ?
>> Shouldn't we first disable further pushing of jobs into entity before we
>> wait for  sched->job_scheduled ?
>>
> In the system suspend path, userspace processes will have already been
> frozen, so there should be no way to push more jobs to the scheduler,
> unless they are pushed from the kernel itself.


It was my suspicion but I wasn't sure about it.


> We don't do that in
> drm/msm, but maybe you need to to move things btwn vram and system
> memory?


Exactly, that was my main concern - if we use this method we have to use 
it in a point in
suspend sequence when all the in kernel job submissions activity already 
suspended

> But even in that case, if the # of jobs you push is bounded I
> guess that is ok?

Submissions to scheduler entities are using unbounded queue, the bounded 
part is when
you extract next job from entity to submit to HW ring and it rejects if 
submission limit reached (drm_sched_ready)

In general - It looks to me at least that what we what we want her is 
more of a drain operation then flush (i.e.
we first want to disable any further job submission to entity's queue 
and then flush all in flight ones). As example
for this i was looking at  flush_workqueue vs. drain_workqueue

Andrey


>
> BR,
> -R

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-17 19:49                       ` Andrey Grodzovsky
  0 siblings, 0 replies; 52+ messages in thread
From: Andrey Grodzovsky @ 2022-03-17 19:49 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, freedreno, Jonathan Marek, David Airlie,
	linux-arm-msm, Vladimir Lypak, Abhinav Kumar, dri-devel,
	Bjorn Andersson, Akhil P Oommen, Sean Paul, Christian König,
	open list, AngeloGioacchino Del Regno


On 2022-03-17 14:25, Rob Clark wrote:
> On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>>
>> On 2022-03-17 13:35, Rob Clark wrote:
>>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
>>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
>>>>> <christian.koenig@amd.com> wrote:
>>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
>>>>>>> [SNIP]
>>>>>>> userspace frozen != kthread frozen .. that is what this patch is
>>>>>>> trying to address, so we aren't racing between shutting down the hw
>>>>>>> and the scheduler shoveling more jobs at us.
>>>>>> Well exactly that's the problem. The scheduler is supposed to shoveling
>>>>>> more jobs at us until it is empty.
>>>>>>
>>>>>> Thinking more about it we will then keep some dma_fence instance
>>>>>> unsignaled and that is and extremely bad idea since it can lead to
>>>>>> deadlocks during suspend.
>>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
>>>>> It is at least not a problem when vram is not involved.
>>>> No, it's much wider than that.
>>>>
>>>> See what can happen is that the memory management shrinkers want to wait
>>>> for a dma_fence during suspend.
>>> we don't wait on fences in shrinker, only purging or evicting things
>>> that are already ready.  Actually, waiting on fences in shrinker path
>>> sounds like a pretty bad idea.
>>>
>>>> And if you stop the scheduler they will just wait forever.
>>>>
>>>> What you need to do instead is to drain the scheduler, e.g. call
>>>> drm_sched_entity_flush() with a proper timeout for each entity you have
>>>> created.
>>> yeah, it would work to drain the scheduler.. I guess that might be the
>>> more portable approach as far as generic solution for suspend.
>>>
>>> BR,
>>> -R
>>
>> I am not sure how this drains the scheduler ? Suppose we done the
>> waiting in drm_sched_entity_flush,
>> what prevents someone to push right away another job into the same
>> entity's queue  right after that ?
>> Shouldn't we first disable further pushing of jobs into entity before we
>> wait for  sched->job_scheduled ?
>>
> In the system suspend path, userspace processes will have already been
> frozen, so there should be no way to push more jobs to the scheduler,
> unless they are pushed from the kernel itself.


It was my suspicion but I wasn't sure about it.


> We don't do that in
> drm/msm, but maybe you need to to move things btwn vram and system
> memory?


Exactly, that was my main concern - if we use this method we have to use 
it in a point in
suspend sequence when all the in kernel job submissions activity already 
suspended

> But even in that case, if the # of jobs you push is bounded I
> guess that is ok?

Submissions to scheduler entities are using unbounded queue, the bounded 
part is when
you extract next job from entity to submit to HW ring and it rejects if 
submission limit reached (drm_sched_ready)

In general - It looks to me at least that what we what we want her is 
more of a drain operation then flush (i.e.
we first want to disable any further job submission to entity's queue 
and then flush all in flight ones). As example
for this i was looking at  flush_workqueue vs. drain_workqueue

Andrey


>
> BR,
> -R

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-17 19:49                       ` Andrey Grodzovsky
@ 2022-03-17 20:35                         ` Rob Clark
  -1 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-17 20:35 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Christian König, dri-devel, freedreno, linux-arm-msm,
	Rob Clark, Sean Paul, Abhinav Kumar, David Airlie,
	Akhil P Oommen, Jonathan Marek, AngeloGioacchino Del Regno,
	Bjorn Andersson, Vladimir Lypak, open list

On Thu, Mar 17, 2022 at 12:50 PM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
>
> On 2022-03-17 14:25, Rob Clark wrote:
> > On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
> > <andrey.grodzovsky@amd.com> wrote:
> >>
> >> On 2022-03-17 13:35, Rob Clark wrote:
> >>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
> >>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
> >>>>> <christian.koenig@amd.com> wrote:
> >>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
> >>>>>>> [SNIP]
> >>>>>>> userspace frozen != kthread frozen .. that is what this patch is
> >>>>>>> trying to address, so we aren't racing between shutting down the hw
> >>>>>>> and the scheduler shoveling more jobs at us.
> >>>>>> Well exactly that's the problem. The scheduler is supposed to shoveling
> >>>>>> more jobs at us until it is empty.
> >>>>>>
> >>>>>> Thinking more about it we will then keep some dma_fence instance
> >>>>>> unsignaled and that is and extremely bad idea since it can lead to
> >>>>>> deadlocks during suspend.
> >>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
> >>>>> It is at least not a problem when vram is not involved.
> >>>> No, it's much wider than that.
> >>>>
> >>>> See what can happen is that the memory management shrinkers want to wait
> >>>> for a dma_fence during suspend.
> >>> we don't wait on fences in shrinker, only purging or evicting things
> >>> that are already ready.  Actually, waiting on fences in shrinker path
> >>> sounds like a pretty bad idea.
> >>>
> >>>> And if you stop the scheduler they will just wait forever.
> >>>>
> >>>> What you need to do instead is to drain the scheduler, e.g. call
> >>>> drm_sched_entity_flush() with a proper timeout for each entity you have
> >>>> created.
> >>> yeah, it would work to drain the scheduler.. I guess that might be the
> >>> more portable approach as far as generic solution for suspend.
> >>>
> >>> BR,
> >>> -R
> >>
> >> I am not sure how this drains the scheduler ? Suppose we done the
> >> waiting in drm_sched_entity_flush,
> >> what prevents someone to push right away another job into the same
> >> entity's queue  right after that ?
> >> Shouldn't we first disable further pushing of jobs into entity before we
> >> wait for  sched->job_scheduled ?
> >>
> > In the system suspend path, userspace processes will have already been
> > frozen, so there should be no way to push more jobs to the scheduler,
> > unless they are pushed from the kernel itself.
>
>
> It was my suspicion but I wasn't sure about it.
>
>
> > We don't do that in
> > drm/msm, but maybe you need to to move things btwn vram and system
> > memory?
>
>
> Exactly, that was my main concern - if we use this method we have to use
> it in a point in
> suspend sequence when all the in kernel job submissions activity already
> suspended
>
> > But even in that case, if the # of jobs you push is bounded I
> > guess that is ok?
>
> Submissions to scheduler entities are using unbounded queue, the bounded
> part is when
> you extract next job from entity to submit to HW ring and it rejects if
> submission limit reached (drm_sched_ready)
>
> In general - It looks to me at least that what we what we want her is
> more of a drain operation then flush (i.e.
> we first want to disable any further job submission to entity's queue
> and then flush all in flight ones). As example
> for this i was looking at  flush_workqueue vs. drain_workqueue

Would it be possible for amdgpu to, in the system suspend task,

1) first queue up all the jobs needed to migrate bos out of vram, and
whatever other housekeeping jobs are needed
2) then drain gpu scheduler's queues
3) and then finally wait for jobs executing on GPU to complete

BR,
-R

> Andrey
>
>
> >
> > BR,
> > -R

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-17 20:35                         ` Rob Clark
  0 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-17 20:35 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Rob Clark, freedreno, Jonathan Marek, David Airlie,
	linux-arm-msm, Vladimir Lypak, Abhinav Kumar, dri-devel,
	Bjorn Andersson, Akhil P Oommen, Sean Paul, Christian König,
	open list, AngeloGioacchino Del Regno

On Thu, Mar 17, 2022 at 12:50 PM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
>
> On 2022-03-17 14:25, Rob Clark wrote:
> > On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
> > <andrey.grodzovsky@amd.com> wrote:
> >>
> >> On 2022-03-17 13:35, Rob Clark wrote:
> >>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
> >>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
> >>>>> <christian.koenig@amd.com> wrote:
> >>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
> >>>>>>> [SNIP]
> >>>>>>> userspace frozen != kthread frozen .. that is what this patch is
> >>>>>>> trying to address, so we aren't racing between shutting down the hw
> >>>>>>> and the scheduler shoveling more jobs at us.
> >>>>>> Well exactly that's the problem. The scheduler is supposed to shoveling
> >>>>>> more jobs at us until it is empty.
> >>>>>>
> >>>>>> Thinking more about it we will then keep some dma_fence instance
> >>>>>> unsignaled and that is and extremely bad idea since it can lead to
> >>>>>> deadlocks during suspend.
> >>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
> >>>>> It is at least not a problem when vram is not involved.
> >>>> No, it's much wider than that.
> >>>>
> >>>> See what can happen is that the memory management shrinkers want to wait
> >>>> for a dma_fence during suspend.
> >>> we don't wait on fences in shrinker, only purging or evicting things
> >>> that are already ready.  Actually, waiting on fences in shrinker path
> >>> sounds like a pretty bad idea.
> >>>
> >>>> And if you stop the scheduler they will just wait forever.
> >>>>
> >>>> What you need to do instead is to drain the scheduler, e.g. call
> >>>> drm_sched_entity_flush() with a proper timeout for each entity you have
> >>>> created.
> >>> yeah, it would work to drain the scheduler.. I guess that might be the
> >>> more portable approach as far as generic solution for suspend.
> >>>
> >>> BR,
> >>> -R
> >>
> >> I am not sure how this drains the scheduler ? Suppose we done the
> >> waiting in drm_sched_entity_flush,
> >> what prevents someone to push right away another job into the same
> >> entity's queue  right after that ?
> >> Shouldn't we first disable further pushing of jobs into entity before we
> >> wait for  sched->job_scheduled ?
> >>
> > In the system suspend path, userspace processes will have already been
> > frozen, so there should be no way to push more jobs to the scheduler,
> > unless they are pushed from the kernel itself.
>
>
> It was my suspicion but I wasn't sure about it.
>
>
> > We don't do that in
> > drm/msm, but maybe you need to to move things btwn vram and system
> > memory?
>
>
> Exactly, that was my main concern - if we use this method we have to use
> it in a point in
> suspend sequence when all the in kernel job submissions activity already
> suspended
>
> > But even in that case, if the # of jobs you push is bounded I
> > guess that is ok?
>
> Submissions to scheduler entities are using unbounded queue, the bounded
> part is when
> you extract next job from entity to submit to HW ring and it rejects if
> submission limit reached (drm_sched_ready)
>
> In general - It looks to me at least that what we what we want her is
> more of a drain operation then flush (i.e.
> we first want to disable any further job submission to entity's queue
> and then flush all in flight ones). As example
> for this i was looking at  flush_workqueue vs. drain_workqueue

Would it be possible for amdgpu to, in the system suspend task,

1) first queue up all the jobs needed to migrate bos out of vram, and
whatever other housekeeping jobs are needed
2) then drain gpu scheduler's queues
3) and then finally wait for jobs executing on GPU to complete

BR,
-R

> Andrey
>
>
> >
> > BR,
> > -R

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Freedreno] [PATCH 3/3] drm/msm/gpu: Remove mutex from wait_event condition
  2022-03-10 23:46   ` Rob Clark
@ 2022-03-17 20:45     ` Akhil P Oommen
  -1 siblings, 0 replies; 52+ messages in thread
From: Akhil P Oommen @ 2022-03-17 20:45 UTC (permalink / raw)
  To: Rob Clark, dri-devel
  Cc: Rob Clark, Jonathan Marek, David Airlie, linux-arm-msm,
	Vladimir Lypak, Abhinav Kumar, Bjorn Andersson, Sean Paul,
	Daniel Vetter, freedreno, open list, AngeloGioacchino Del Regno

On 3/11/2022 5:16 AM, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
>
> The mutex wasn't really protecting anything before.  Before the previous
> patch we could still be racing with the scheduler's kthread, as that is
> not necessarily frozen yet.  Now that we've parked the sched threads,
> the only race is with jobs retiring, and that is harmless, ie.
>
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>   drivers/gpu/drm/msm/adreno/adreno_device.c | 11 +----------
>   1 file changed, 1 insertion(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> index 0440a98988fc..661dfa7681fb 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> @@ -607,15 +607,6 @@ static int adreno_runtime_resume(struct device *dev)
>   	return gpu->funcs->pm_resume(gpu);
>   }
>   
> -static int active_submits(struct msm_gpu *gpu)
> -{
> -	int active_submits;
> -	mutex_lock(&gpu->active_lock);
> -	active_submits = gpu->active_submits;
> -	mutex_unlock(&gpu->active_lock);
I assumed that this lock here was to ensure proper barriers while 
reading active_submits. Is that not required?

-Akhil.
> -	return active_submits;
> -}
> -
>   static int adreno_runtime_suspend(struct device *dev)
>   {
>   	struct msm_gpu *gpu = dev_to_gpu(dev);
> @@ -669,7 +660,7 @@ static int adreno_system_suspend(struct device *dev)
>   	suspend_scheduler(gpu);
>   
>   	remaining = wait_event_timeout(gpu->retire_event,
> -				       active_submits(gpu) == 0,
> +				       gpu->active_submits == 0,
>   				       msecs_to_jiffies(1000));
>   	if (remaining == 0) {
>   		dev_err(dev, "Timeout waiting for GPU to suspend\n");


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Freedreno] [PATCH 3/3] drm/msm/gpu: Remove mutex from wait_event condition
@ 2022-03-17 20:45     ` Akhil P Oommen
  0 siblings, 0 replies; 52+ messages in thread
From: Akhil P Oommen @ 2022-03-17 20:45 UTC (permalink / raw)
  To: Rob Clark, dri-devel
  Cc: Rob Clark, freedreno, Jonathan Marek, David Airlie,
	linux-arm-msm, Vladimir Lypak, Abhinav Kumar, Bjorn Andersson,
	Sean Paul, open list, AngeloGioacchino Del Regno

On 3/11/2022 5:16 AM, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
>
> The mutex wasn't really protecting anything before.  Before the previous
> patch we could still be racing with the scheduler's kthread, as that is
> not necessarily frozen yet.  Now that we've parked the sched threads,
> the only race is with jobs retiring, and that is harmless, ie.
>
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>   drivers/gpu/drm/msm/adreno/adreno_device.c | 11 +----------
>   1 file changed, 1 insertion(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> index 0440a98988fc..661dfa7681fb 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> @@ -607,15 +607,6 @@ static int adreno_runtime_resume(struct device *dev)
>   	return gpu->funcs->pm_resume(gpu);
>   }
>   
> -static int active_submits(struct msm_gpu *gpu)
> -{
> -	int active_submits;
> -	mutex_lock(&gpu->active_lock);
> -	active_submits = gpu->active_submits;
> -	mutex_unlock(&gpu->active_lock);
I assumed that this lock here was to ensure proper barriers while 
reading active_submits. Is that not required?

-Akhil.
> -	return active_submits;
> -}
> -
>   static int adreno_runtime_suspend(struct device *dev)
>   {
>   	struct msm_gpu *gpu = dev_to_gpu(dev);
> @@ -669,7 +660,7 @@ static int adreno_system_suspend(struct device *dev)
>   	suspend_scheduler(gpu);
>   
>   	remaining = wait_event_timeout(gpu->retire_event,
> -				       active_submits(gpu) == 0,
> +				       gpu->active_submits == 0,
>   				       msecs_to_jiffies(1000));
>   	if (remaining == 0) {
>   		dev_err(dev, "Timeout waiting for GPU to suspend\n");


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Freedreno] [PATCH 3/3] drm/msm/gpu: Remove mutex from wait_event condition
  2022-03-17 20:45     ` Akhil P Oommen
@ 2022-03-17 21:07       ` Rob Clark
  -1 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-17 21:07 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: dri-devel, Rob Clark, Jonathan Marek, David Airlie,
	linux-arm-msm, Vladimir Lypak, Abhinav Kumar, Bjorn Andersson,
	Sean Paul, Daniel Vetter, freedreno, open list,
	AngeloGioacchino Del Regno

On Thu, Mar 17, 2022 at 1:45 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> On 3/11/2022 5:16 AM, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> >
> > The mutex wasn't really protecting anything before.  Before the previous
> > patch we could still be racing with the scheduler's kthread, as that is
> > not necessarily frozen yet.  Now that we've parked the sched threads,
> > the only race is with jobs retiring, and that is harmless, ie.
> >
> > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > ---
> >   drivers/gpu/drm/msm/adreno/adreno_device.c | 11 +----------
> >   1 file changed, 1 insertion(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > index 0440a98988fc..661dfa7681fb 100644
> > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > @@ -607,15 +607,6 @@ static int adreno_runtime_resume(struct device *dev)
> >       return gpu->funcs->pm_resume(gpu);
> >   }
> >
> > -static int active_submits(struct msm_gpu *gpu)
> > -{
> > -     int active_submits;
> > -     mutex_lock(&gpu->active_lock);
> > -     active_submits = gpu->active_submits;
> > -     mutex_unlock(&gpu->active_lock);
> I assumed that this lock here was to ensure proper barriers while
> reading active_submits. Is that not required?

There is a spinlock in prepare_to_wait_event() ahead of checking the
condition, which AFAIU is a sufficient barrier

BR,
-R

>
> -Akhil.
> > -     return active_submits;
> > -}
> > -
> >   static int adreno_runtime_suspend(struct device *dev)
> >   {
> >       struct msm_gpu *gpu = dev_to_gpu(dev);
> > @@ -669,7 +660,7 @@ static int adreno_system_suspend(struct device *dev)
> >       suspend_scheduler(gpu);
> >
> >       remaining = wait_event_timeout(gpu->retire_event,
> > -                                    active_submits(gpu) == 0,
> > +                                    gpu->active_submits == 0,
> >                                      msecs_to_jiffies(1000));
> >       if (remaining == 0) {
> >               dev_err(dev, "Timeout waiting for GPU to suspend\n");
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Freedreno] [PATCH 3/3] drm/msm/gpu: Remove mutex from wait_event condition
@ 2022-03-17 21:07       ` Rob Clark
  0 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-17 21:07 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: Rob Clark, freedreno, Jonathan Marek, David Airlie,
	linux-arm-msm, Vladimir Lypak, Abhinav Kumar, dri-devel,
	Bjorn Andersson, Sean Paul, open list,
	AngeloGioacchino Del Regno

On Thu, Mar 17, 2022 at 1:45 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> On 3/11/2022 5:16 AM, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> >
> > The mutex wasn't really protecting anything before.  Before the previous
> > patch we could still be racing with the scheduler's kthread, as that is
> > not necessarily frozen yet.  Now that we've parked the sched threads,
> > the only race is with jobs retiring, and that is harmless, ie.
> >
> > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > ---
> >   drivers/gpu/drm/msm/adreno/adreno_device.c | 11 +----------
> >   1 file changed, 1 insertion(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > index 0440a98988fc..661dfa7681fb 100644
> > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > @@ -607,15 +607,6 @@ static int adreno_runtime_resume(struct device *dev)
> >       return gpu->funcs->pm_resume(gpu);
> >   }
> >
> > -static int active_submits(struct msm_gpu *gpu)
> > -{
> > -     int active_submits;
> > -     mutex_lock(&gpu->active_lock);
> > -     active_submits = gpu->active_submits;
> > -     mutex_unlock(&gpu->active_lock);
> I assumed that this lock here was to ensure proper barriers while
> reading active_submits. Is that not required?

There is a spinlock in prepare_to_wait_event() ahead of checking the
condition, which AFAIU is a sufficient barrier

BR,
-R

>
> -Akhil.
> > -     return active_submits;
> > -}
> > -
> >   static int adreno_runtime_suspend(struct device *dev)
> >   {
> >       struct msm_gpu *gpu = dev_to_gpu(dev);
> > @@ -669,7 +660,7 @@ static int adreno_system_suspend(struct device *dev)
> >       suspend_scheduler(gpu);
> >
> >       remaining = wait_event_timeout(gpu->retire_event,
> > -                                    active_submits(gpu) == 0,
> > +                                    gpu->active_submits == 0,
> >                                      msecs_to_jiffies(1000));
> >       if (remaining == 0) {
> >               dev_err(dev, "Timeout waiting for GPU to suspend\n");
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Freedreno] [PATCH 3/3] drm/msm/gpu: Remove mutex from wait_event condition
  2022-03-17 21:07       ` Rob Clark
  (?)
@ 2022-03-18  2:55       ` Hillf Danton
  -1 siblings, 0 replies; 52+ messages in thread
From: Hillf Danton @ 2022-03-18  2:55 UTC (permalink / raw)
  To: Rob Clark, Akhil P Oommen; +Cc: open list, dri-devel

On Thu, 17 Mar 2022 14:07:45 -0700 Rob Clark wrote:
> On Thu, Mar 17, 2022 at 1:45 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
> >
> > On 3/11/2022 5:16 AM, Rob Clark wrote:
> > > From: Rob Clark <robdclark@chromium.org>
> > >
> > > The mutex wasn't really protecting anything before.  Before the previous
> > > patch we could still be racing with the scheduler's kthread, as that is
> > > not necessarily frozen yet.  Now that we've parked the sched threads,
> > > the only race is with jobs retiring, and that is harmless, ie.
> > >
> > > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > > ---
> > >   drivers/gpu/drm/msm/adreno/adreno_device.c | 11 +----------
> > >   1 file changed, 1 insertion(+), 10 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > index 0440a98988fc..661dfa7681fb 100644
> > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > @@ -607,15 +607,6 @@ static int adreno_runtime_resume(struct device *dev)
> > >       return gpu->funcs->pm_resume(gpu);
> > >   }
> > >
> > > -static int active_submits(struct msm_gpu *gpu)
> > > -{
> > > -     int active_submits;
> > > -     mutex_lock(&gpu->active_lock);
> > > -     active_submits = gpu->active_submits;
> > > -     mutex_unlock(&gpu->active_lock);
> > I assumed that this lock here was to ensure proper barriers while
> > reading active_submits. Is that not required?
> 
> There is a spinlock in prepare_to_wait_event() ahead of checking the
> condition, which AFAIU is a sufficient barrier

set_current_state() is instead - feel free to grep it in <linux/wait.h>

Hillf
> 
> BR,
> -R
> 
> >
> > -Akhil.
> > > -     return active_submits;
> > > -}
> > > -
> > >   static int adreno_runtime_suspend(struct device *dev)
> > >   {
> > >       struct msm_gpu *gpu = dev_to_gpu(dev);
> > > @@ -669,7 +660,7 @@ static int adreno_system_suspend(struct device *dev)
> > >       suspend_scheduler(gpu);
> > >
> > >       remaining = wait_event_timeout(gpu->retire_event,
> > > -                                    active_submits(gpu) == 0,
> > > +                                    gpu->active_submits == 0,
> > >                                      msecs_to_jiffies(1000));
> > >       if (remaining == 0) {
> > >               dev_err(dev, "Timeout waiting for GPU to suspend\n");
> >

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-17 20:35                         ` Rob Clark
@ 2022-03-18 16:04                           ` Andrey Grodzovsky
  -1 siblings, 0 replies; 52+ messages in thread
From: Andrey Grodzovsky @ 2022-03-18 16:04 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, freedreno, Jonathan Marek, David Airlie,
	linux-arm-msm, Vladimir Lypak, Abhinav Kumar, dri-devel,
	Bjorn Andersson, Akhil P Oommen, Alexander.Deucher, Sean Paul,
	Christian König, open list, AngeloGioacchino Del Regno


On 2022-03-17 16:35, Rob Clark wrote:
> On Thu, Mar 17, 2022 at 12:50 PM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>>
>> On 2022-03-17 14:25, Rob Clark wrote:
>>> On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
>>> <andrey.grodzovsky@amd.com> wrote:
>>>> On 2022-03-17 13:35, Rob Clark wrote:
>>>>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
>>>>> <christian.koenig@amd.com> wrote:
>>>>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
>>>>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
>>>>>>> <christian.koenig@amd.com> wrote:
>>>>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
>>>>>>>>> [SNIP]
>>>>>>>>> userspace frozen != kthread frozen .. that is what this patch is
>>>>>>>>> trying to address, so we aren't racing between shutting down the hw
>>>>>>>>> and the scheduler shoveling more jobs at us.
>>>>>>>> Well exactly that's the problem. The scheduler is supposed to shoveling
>>>>>>>> more jobs at us until it is empty.
>>>>>>>>
>>>>>>>> Thinking more about it we will then keep some dma_fence instance
>>>>>>>> unsignaled and that is and extremely bad idea since it can lead to
>>>>>>>> deadlocks during suspend.
>>>>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
>>>>>>> It is at least not a problem when vram is not involved.
>>>>>> No, it's much wider than that.
>>>>>>
>>>>>> See what can happen is that the memory management shrinkers want to wait
>>>>>> for a dma_fence during suspend.
>>>>> we don't wait on fences in shrinker, only purging or evicting things
>>>>> that are already ready.  Actually, waiting on fences in shrinker path
>>>>> sounds like a pretty bad idea.
>>>>>
>>>>>> And if you stop the scheduler they will just wait forever.
>>>>>>
>>>>>> What you need to do instead is to drain the scheduler, e.g. call
>>>>>> drm_sched_entity_flush() with a proper timeout for each entity you have
>>>>>> created.
>>>>> yeah, it would work to drain the scheduler.. I guess that might be the
>>>>> more portable approach as far as generic solution for suspend.
>>>>>
>>>>> BR,
>>>>> -R
>>>> I am not sure how this drains the scheduler ? Suppose we done the
>>>> waiting in drm_sched_entity_flush,
>>>> what prevents someone to push right away another job into the same
>>>> entity's queue  right after that ?
>>>> Shouldn't we first disable further pushing of jobs into entity before we
>>>> wait for  sched->job_scheduled ?
>>>>
>>> In the system suspend path, userspace processes will have already been
>>> frozen, so there should be no way to push more jobs to the scheduler,
>>> unless they are pushed from the kernel itself.
>>> amdgpu_device_suspend
>>
>> It was my suspicion but I wasn't sure about it.
>>
>>
>>> We don't do that in
>>> drm/msm, but maybe you need to to move things btwn vram and system
>>> memory?
>>
>> Exactly, that was my main concern - if we use this method we have to use
>> it in a point in
>> suspend sequence when all the in kernel job submissions activity already
>> suspended
>>
>>> But even in that case, if the # of jobs you push is bounded I
>>> guess that is ok?
>> Submissions to scheduler entities are using unbounded queue, the bounded
>> part is when
>> you extract next job from entity to submit to HW ring and it rejects if
>> submission limit reached (drm_sched_ready)
>>
>> In general - It looks to me at least that what we what we want her is
>> more of a drain operation then flush (i.e.
>> we first want to disable any further job submission to entity's queue
>> and then flush all in flight ones). As example
>> for this i was looking at  flush_workqueue vs. drain_workqueue
> Would it be possible for amdgpu to, in the system suspend task,
>
> 1) first queue up all the jobs needed to migrate bos out of vram, and
> whatever other housekeeping jobs are needed
> 2) then drain gpu scheduler's queues
> 3) and then finally wait for jobs executing on GPU to complete


We already do most of it in amdgpu_device_suspend, 
amdgpu_device_ip_suspend_phase1
followed by amdgpu_device_evict_resources followed by 
amdgpu_fence_driver_hw_fini is
exactly steps 1 + 3. What we are missing is step 2). For this step I 
suggest adding a function
called drm_sched_entity_drain which basically sets entity->stopped = 
true and then calls drm_sched_entity_flush.
This will both reject any new insertions into entity's job queue and 
will flush all pending job submissions to HW from that entity.
One point is we need to make make drm_sched_entity_push_job return value 
so the caller knows about job enqueue
rejection.

What about runtime suspend ? I guess same issue with scheduler racing 
against HW susppend is relevant there ?

Also, could you point to a particular buggy scenario where the race 
between SW shceduler and suspend is causing a problem ?

Andrey


>
> BR,
> -R
>
>> Andrey
>>
>>
>>> BR,
>>> -R

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-18 16:04                           ` Andrey Grodzovsky
  0 siblings, 0 replies; 52+ messages in thread
From: Andrey Grodzovsky @ 2022-03-18 16:04 UTC (permalink / raw)
  To: Rob Clark
  Cc: Christian König, dri-devel, freedreno, linux-arm-msm,
	Rob Clark, Sean Paul, Abhinav Kumar, David Airlie,
	Akhil P Oommen, Jonathan Marek, AngeloGioacchino Del Regno,
	Bjorn Andersson, Vladimir Lypak, open list, Alexander.Deucher


On 2022-03-17 16:35, Rob Clark wrote:
> On Thu, Mar 17, 2022 at 12:50 PM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>>
>> On 2022-03-17 14:25, Rob Clark wrote:
>>> On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
>>> <andrey.grodzovsky@amd.com> wrote:
>>>> On 2022-03-17 13:35, Rob Clark wrote:
>>>>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
>>>>> <christian.koenig@amd.com> wrote:
>>>>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
>>>>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
>>>>>>> <christian.koenig@amd.com> wrote:
>>>>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
>>>>>>>>> [SNIP]
>>>>>>>>> userspace frozen != kthread frozen .. that is what this patch is
>>>>>>>>> trying to address, so we aren't racing between shutting down the hw
>>>>>>>>> and the scheduler shoveling more jobs at us.
>>>>>>>> Well exactly that's the problem. The scheduler is supposed to shoveling
>>>>>>>> more jobs at us until it is empty.
>>>>>>>>
>>>>>>>> Thinking more about it we will then keep some dma_fence instance
>>>>>>>> unsignaled and that is and extremely bad idea since it can lead to
>>>>>>>> deadlocks during suspend.
>>>>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
>>>>>>> It is at least not a problem when vram is not involved.
>>>>>> No, it's much wider than that.
>>>>>>
>>>>>> See what can happen is that the memory management shrinkers want to wait
>>>>>> for a dma_fence during suspend.
>>>>> we don't wait on fences in shrinker, only purging or evicting things
>>>>> that are already ready.  Actually, waiting on fences in shrinker path
>>>>> sounds like a pretty bad idea.
>>>>>
>>>>>> And if you stop the scheduler they will just wait forever.
>>>>>>
>>>>>> What you need to do instead is to drain the scheduler, e.g. call
>>>>>> drm_sched_entity_flush() with a proper timeout for each entity you have
>>>>>> created.
>>>>> yeah, it would work to drain the scheduler.. I guess that might be the
>>>>> more portable approach as far as generic solution for suspend.
>>>>>
>>>>> BR,
>>>>> -R
>>>> I am not sure how this drains the scheduler ? Suppose we done the
>>>> waiting in drm_sched_entity_flush,
>>>> what prevents someone to push right away another job into the same
>>>> entity's queue  right after that ?
>>>> Shouldn't we first disable further pushing of jobs into entity before we
>>>> wait for  sched->job_scheduled ?
>>>>
>>> In the system suspend path, userspace processes will have already been
>>> frozen, so there should be no way to push more jobs to the scheduler,
>>> unless they are pushed from the kernel itself.
>>> amdgpu_device_suspend
>>
>> It was my suspicion but I wasn't sure about it.
>>
>>
>>> We don't do that in
>>> drm/msm, but maybe you need to to move things btwn vram and system
>>> memory?
>>
>> Exactly, that was my main concern - if we use this method we have to use
>> it in a point in
>> suspend sequence when all the in kernel job submissions activity already
>> suspended
>>
>>> But even in that case, if the # of jobs you push is bounded I
>>> guess that is ok?
>> Submissions to scheduler entities are using unbounded queue, the bounded
>> part is when
>> you extract next job from entity to submit to HW ring and it rejects if
>> submission limit reached (drm_sched_ready)
>>
>> In general - It looks to me at least that what we what we want her is
>> more of a drain operation then flush (i.e.
>> we first want to disable any further job submission to entity's queue
>> and then flush all in flight ones). As example
>> for this i was looking at  flush_workqueue vs. drain_workqueue
> Would it be possible for amdgpu to, in the system suspend task,
>
> 1) first queue up all the jobs needed to migrate bos out of vram, and
> whatever other housekeeping jobs are needed
> 2) then drain gpu scheduler's queues
> 3) and then finally wait for jobs executing on GPU to complete


We already do most of it in amdgpu_device_suspend, 
amdgpu_device_ip_suspend_phase1
followed by amdgpu_device_evict_resources followed by 
amdgpu_fence_driver_hw_fini is
exactly steps 1 + 3. What we are missing is step 2). For this step I 
suggest adding a function
called drm_sched_entity_drain which basically sets entity->stopped = 
true and then calls drm_sched_entity_flush.
This will both reject any new insertions into entity's job queue and 
will flush all pending job submissions to HW from that entity.
One point is we need to make make drm_sched_entity_push_job return value 
so the caller knows about job enqueue
rejection.

What about runtime suspend ? I guess same issue with scheduler racing 
against HW susppend is relevant there ?

Also, could you point to a particular buggy scenario where the race 
between SW shceduler and suspend is causing a problem ?

Andrey


>
> BR,
> -R
>
>> Andrey
>>
>>
>>> BR,
>>> -R

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-18 16:04                           ` Andrey Grodzovsky
@ 2022-03-18 16:20                             ` Rob Clark
  -1 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-18 16:20 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Rob Clark, freedreno, Jonathan Marek, David Airlie,
	linux-arm-msm, Vladimir Lypak, Abhinav Kumar, dri-devel,
	Bjorn Andersson, Akhil P Oommen, Alexander.Deucher, Sean Paul,
	Christian König, open list, AngeloGioacchino Del Regno

On Fri, Mar 18, 2022 at 9:04 AM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
>
> On 2022-03-17 16:35, Rob Clark wrote:
> > On Thu, Mar 17, 2022 at 12:50 PM Andrey Grodzovsky
> > <andrey.grodzovsky@amd.com> wrote:
> >>
> >> On 2022-03-17 14:25, Rob Clark wrote:
> >>> On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
> >>> <andrey.grodzovsky@amd.com> wrote:
> >>>> On 2022-03-17 13:35, Rob Clark wrote:
> >>>>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
> >>>>> <christian.koenig@amd.com> wrote:
> >>>>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
> >>>>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
> >>>>>>> <christian.koenig@amd.com> wrote:
> >>>>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
> >>>>>>>>> [SNIP]
> >>>>>>>>> userspace frozen != kthread frozen .. that is what this patch is
> >>>>>>>>> trying to address, so we aren't racing between shutting down the hw
> >>>>>>>>> and the scheduler shoveling more jobs at us.
> >>>>>>>> Well exactly that's the problem. The scheduler is supposed to shoveling
> >>>>>>>> more jobs at us until it is empty.
> >>>>>>>>
> >>>>>>>> Thinking more about it we will then keep some dma_fence instance
> >>>>>>>> unsignaled and that is and extremely bad idea since it can lead to
> >>>>>>>> deadlocks during suspend.
> >>>>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
> >>>>>>> It is at least not a problem when vram is not involved.
> >>>>>> No, it's much wider than that.
> >>>>>>
> >>>>>> See what can happen is that the memory management shrinkers want to wait
> >>>>>> for a dma_fence during suspend.
> >>>>> we don't wait on fences in shrinker, only purging or evicting things
> >>>>> that are already ready.  Actually, waiting on fences in shrinker path
> >>>>> sounds like a pretty bad idea.
> >>>>>
> >>>>>> And if you stop the scheduler they will just wait forever.
> >>>>>>
> >>>>>> What you need to do instead is to drain the scheduler, e.g. call
> >>>>>> drm_sched_entity_flush() with a proper timeout for each entity you have
> >>>>>> created.
> >>>>> yeah, it would work to drain the scheduler.. I guess that might be the
> >>>>> more portable approach as far as generic solution for suspend.
> >>>>>
> >>>>> BR,
> >>>>> -R
> >>>> I am not sure how this drains the scheduler ? Suppose we done the
> >>>> waiting in drm_sched_entity_flush,
> >>>> what prevents someone to push right away another job into the same
> >>>> entity's queue  right after that ?
> >>>> Shouldn't we first disable further pushing of jobs into entity before we
> >>>> wait for  sched->job_scheduled ?
> >>>>
> >>> In the system suspend path, userspace processes will have already been
> >>> frozen, so there should be no way to push more jobs to the scheduler,
> >>> unless they are pushed from the kernel itself.
> >>> amdgpu_device_suspend
> >>
> >> It was my suspicion but I wasn't sure about it.
> >>
> >>
> >>> We don't do that in
> >>> drm/msm, but maybe you need to to move things btwn vram and system
> >>> memory?
> >>
> >> Exactly, that was my main concern - if we use this method we have to use
> >> it in a point in
> >> suspend sequence when all the in kernel job submissions activity already
> >> suspended
> >>
> >>> But even in that case, if the # of jobs you push is bounded I
> >>> guess that is ok?
> >> Submissions to scheduler entities are using unbounded queue, the bounded
> >> part is when
> >> you extract next job from entity to submit to HW ring and it rejects if
> >> submission limit reached (drm_sched_ready)
> >>
> >> In general - It looks to me at least that what we what we want her is
> >> more of a drain operation then flush (i.e.
> >> we first want to disable any further job submission to entity's queue
> >> and then flush all in flight ones). As example
> >> for this i was looking at  flush_workqueue vs. drain_workqueue
> > Would it be possible for amdgpu to, in the system suspend task,
> >
> > 1) first queue up all the jobs needed to migrate bos out of vram, and
> > whatever other housekeeping jobs are needed
> > 2) then drain gpu scheduler's queues
> > 3) and then finally wait for jobs executing on GPU to complete
>
>
> We already do most of it in amdgpu_device_suspend,
> amdgpu_device_ip_suspend_phase1
> followed by amdgpu_device_evict_resources followed by
> amdgpu_fence_driver_hw_fini is
> exactly steps 1 + 3. What we are missing is step 2). For this step I
> suggest adding a function
> called drm_sched_entity_drain which basically sets entity->stopped =
> true and then calls drm_sched_entity_flush.
> This will both reject any new insertions into entity's job queue and
> will flush all pending job submissions to HW from that entity.
> One point is we need to make make drm_sched_entity_push_job return value
> so the caller knows about job enqueue
> rejection.

Hmm, seems like job enqueue that is rejected because we are in the
process of suspending should be more of a WARN_ON() sort of thing?
Not sure if there is something sensible to do for the caller at that
point?

>
> What about runtime suspend ? I guess same issue with scheduler racing
> against HW susppend is relevant there ?

Runtime suspend should be ok, as long as the driver holds a runpm
reference whenever the hw needs to be awake.  The problem with system
suspend (at least if you are using pm_runtime_force_suspend() or doing
something equivalent) is that it bypasses the runpm reference.
(Which, IMO, seems like a bad design..)

> Also, could you point to a particular buggy scenario where the race
> between SW shceduler and suspend is causing a problem ?

I wrote a piglit test[1] to try to trigger this scenario.. it isn't
really that easy to hit

BR,
-R

[1] https://gitlab.freedesktop.org/mesa/piglit/-/merge_requests/643

> Andrey
>
>
> >
> > BR,
> > -R
> >
> >> Andrey
> >>
> >>
> >>> BR,
> >>> -R

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-18 16:20                             ` Rob Clark
  0 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-18 16:20 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Christian König, dri-devel, freedreno, linux-arm-msm,
	Rob Clark, Sean Paul, Abhinav Kumar, David Airlie,
	Akhil P Oommen, Jonathan Marek, AngeloGioacchino Del Regno,
	Bjorn Andersson, Vladimir Lypak, open list, Alexander.Deucher

On Fri, Mar 18, 2022 at 9:04 AM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
>
> On 2022-03-17 16:35, Rob Clark wrote:
> > On Thu, Mar 17, 2022 at 12:50 PM Andrey Grodzovsky
> > <andrey.grodzovsky@amd.com> wrote:
> >>
> >> On 2022-03-17 14:25, Rob Clark wrote:
> >>> On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
> >>> <andrey.grodzovsky@amd.com> wrote:
> >>>> On 2022-03-17 13:35, Rob Clark wrote:
> >>>>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
> >>>>> <christian.koenig@amd.com> wrote:
> >>>>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
> >>>>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
> >>>>>>> <christian.koenig@amd.com> wrote:
> >>>>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
> >>>>>>>>> [SNIP]
> >>>>>>>>> userspace frozen != kthread frozen .. that is what this patch is
> >>>>>>>>> trying to address, so we aren't racing between shutting down the hw
> >>>>>>>>> and the scheduler shoveling more jobs at us.
> >>>>>>>> Well exactly that's the problem. The scheduler is supposed to shoveling
> >>>>>>>> more jobs at us until it is empty.
> >>>>>>>>
> >>>>>>>> Thinking more about it we will then keep some dma_fence instance
> >>>>>>>> unsignaled and that is and extremely bad idea since it can lead to
> >>>>>>>> deadlocks during suspend.
> >>>>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
> >>>>>>> It is at least not a problem when vram is not involved.
> >>>>>> No, it's much wider than that.
> >>>>>>
> >>>>>> See what can happen is that the memory management shrinkers want to wait
> >>>>>> for a dma_fence during suspend.
> >>>>> we don't wait on fences in shrinker, only purging or evicting things
> >>>>> that are already ready.  Actually, waiting on fences in shrinker path
> >>>>> sounds like a pretty bad idea.
> >>>>>
> >>>>>> And if you stop the scheduler they will just wait forever.
> >>>>>>
> >>>>>> What you need to do instead is to drain the scheduler, e.g. call
> >>>>>> drm_sched_entity_flush() with a proper timeout for each entity you have
> >>>>>> created.
> >>>>> yeah, it would work to drain the scheduler.. I guess that might be the
> >>>>> more portable approach as far as generic solution for suspend.
> >>>>>
> >>>>> BR,
> >>>>> -R
> >>>> I am not sure how this drains the scheduler ? Suppose we done the
> >>>> waiting in drm_sched_entity_flush,
> >>>> what prevents someone to push right away another job into the same
> >>>> entity's queue  right after that ?
> >>>> Shouldn't we first disable further pushing of jobs into entity before we
> >>>> wait for  sched->job_scheduled ?
> >>>>
> >>> In the system suspend path, userspace processes will have already been
> >>> frozen, so there should be no way to push more jobs to the scheduler,
> >>> unless they are pushed from the kernel itself.
> >>> amdgpu_device_suspend
> >>
> >> It was my suspicion but I wasn't sure about it.
> >>
> >>
> >>> We don't do that in
> >>> drm/msm, but maybe you need to to move things btwn vram and system
> >>> memory?
> >>
> >> Exactly, that was my main concern - if we use this method we have to use
> >> it in a point in
> >> suspend sequence when all the in kernel job submissions activity already
> >> suspended
> >>
> >>> But even in that case, if the # of jobs you push is bounded I
> >>> guess that is ok?
> >> Submissions to scheduler entities are using unbounded queue, the bounded
> >> part is when
> >> you extract next job from entity to submit to HW ring and it rejects if
> >> submission limit reached (drm_sched_ready)
> >>
> >> In general - It looks to me at least that what we what we want her is
> >> more of a drain operation then flush (i.e.
> >> we first want to disable any further job submission to entity's queue
> >> and then flush all in flight ones). As example
> >> for this i was looking at  flush_workqueue vs. drain_workqueue
> > Would it be possible for amdgpu to, in the system suspend task,
> >
> > 1) first queue up all the jobs needed to migrate bos out of vram, and
> > whatever other housekeeping jobs are needed
> > 2) then drain gpu scheduler's queues
> > 3) and then finally wait for jobs executing on GPU to complete
>
>
> We already do most of it in amdgpu_device_suspend,
> amdgpu_device_ip_suspend_phase1
> followed by amdgpu_device_evict_resources followed by
> amdgpu_fence_driver_hw_fini is
> exactly steps 1 + 3. What we are missing is step 2). For this step I
> suggest adding a function
> called drm_sched_entity_drain which basically sets entity->stopped =
> true and then calls drm_sched_entity_flush.
> This will both reject any new insertions into entity's job queue and
> will flush all pending job submissions to HW from that entity.
> One point is we need to make make drm_sched_entity_push_job return value
> so the caller knows about job enqueue
> rejection.

Hmm, seems like job enqueue that is rejected because we are in the
process of suspending should be more of a WARN_ON() sort of thing?
Not sure if there is something sensible to do for the caller at that
point?

>
> What about runtime suspend ? I guess same issue with scheduler racing
> against HW susppend is relevant there ?

Runtime suspend should be ok, as long as the driver holds a runpm
reference whenever the hw needs to be awake.  The problem with system
suspend (at least if you are using pm_runtime_force_suspend() or doing
something equivalent) is that it bypasses the runpm reference.
(Which, IMO, seems like a bad design..)

> Also, could you point to a particular buggy scenario where the race
> between SW shceduler and suspend is causing a problem ?

I wrote a piglit test[1] to try to trigger this scenario.. it isn't
really that easy to hit

BR,
-R

[1] https://gitlab.freedesktop.org/mesa/piglit/-/merge_requests/643

> Andrey
>
>
> >
> > BR,
> > -R
> >
> >> Andrey
> >>
> >>
> >>> BR,
> >>> -R

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-18 16:20                             ` Rob Clark
@ 2022-03-18 16:27                               ` Andrey Grodzovsky
  -1 siblings, 0 replies; 52+ messages in thread
From: Andrey Grodzovsky @ 2022-03-18 16:27 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, freedreno, Jonathan Marek, David Airlie,
	linux-arm-msm, Vladimir Lypak, Abhinav Kumar, dri-devel,
	Bjorn Andersson, Akhil P Oommen, Alexander.Deucher, Sean Paul,
	Christian König, open list, AngeloGioacchino Del Regno


On 2022-03-18 12:20, Rob Clark wrote:
> On Fri, Mar 18, 2022 at 9:04 AM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>>
>> On 2022-03-17 16:35, Rob Clark wrote:
>>> On Thu, Mar 17, 2022 at 12:50 PM Andrey Grodzovsky
>>> <andrey.grodzovsky@amd.com> wrote:
>>>> On 2022-03-17 14:25, Rob Clark wrote:
>>>>> On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>> On 2022-03-17 13:35, Rob Clark wrote:
>>>>>>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
>>>>>>> <christian.koenig@amd.com> wrote:
>>>>>>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
>>>>>>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
>>>>>>>>> <christian.koenig@amd.com> wrote:
>>>>>>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
>>>>>>>>>>> [SNIP]
>>>>>>>>>>> userspace frozen != kthread frozen .. that is what this patch is
>>>>>>>>>>> trying to address, so we aren't racing between shutting down the hw
>>>>>>>>>>> and the scheduler shoveling more jobs at us.
>>>>>>>>>> Well exactly that's the problem. The scheduler is supposed to shoveling
>>>>>>>>>> more jobs at us until it is empty.
>>>>>>>>>>
>>>>>>>>>> Thinking more about it we will then keep some dma_fence instance
>>>>>>>>>> unsignaled and that is and extremely bad idea since it can lead to
>>>>>>>>>> deadlocks during suspend.
>>>>>>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
>>>>>>>>> It is at least not a problem when vram is not involved.
>>>>>>>> No, it's much wider than that.
>>>>>>>>
>>>>>>>> See what can happen is that the memory management shrinkers want to wait
>>>>>>>> for a dma_fence during suspend.
>>>>>>> we don't wait on fences in shrinker, only purging or evicting things
>>>>>>> that are already ready.  Actually, waiting on fences in shrinker path
>>>>>>> sounds like a pretty bad idea.
>>>>>>>
>>>>>>>> And if you stop the scheduler they will just wait forever.
>>>>>>>>
>>>>>>>> What you need to do instead is to drain the scheduler, e.g. call
>>>>>>>> drm_sched_entity_flush() with a proper timeout for each entity you have
>>>>>>>> created.
>>>>>>> yeah, it would work to drain the scheduler.. I guess that might be the
>>>>>>> more portable approach as far as generic solution for suspend.
>>>>>>>
>>>>>>> BR,
>>>>>>> -R
>>>>>> I am not sure how this drains the scheduler ? Suppose we done the
>>>>>> waiting in drm_sched_entity_flush,
>>>>>> what prevents someone to push right away another job into the same
>>>>>> entity's queue  right after that ?
>>>>>> Shouldn't we first disable further pushing of jobs into entity before we
>>>>>> wait for  sched->job_scheduled ?
>>>>>>
>>>>> In the system suspend path, userspace processes will have already been
>>>>> frozen, so there should be no way to push more jobs to the scheduler,
>>>>> unless they are pushed from the kernel itself.
>>>>> amdgpu_device_suspend
>>>> It was my suspicion but I wasn't sure about it.
>>>>
>>>>
>>>>> We don't do that in
>>>>> drm/msm, but maybe you need to to move things btwn vram and system
>>>>> memory?
>>>> Exactly, that was my main concern - if we use this method we have to use
>>>> it in a point in
>>>> suspend sequence when all the in kernel job submissions activity already
>>>> suspended
>>>>
>>>>> But even in that case, if the # of jobs you push is bounded I
>>>>> guess that is ok?
>>>> Submissions to scheduler entities are using unbounded queue, the bounded
>>>> part is when
>>>> you extract next job from entity to submit to HW ring and it rejects if
>>>> submission limit reached (drm_sched_ready)
>>>>
>>>> In general - It looks to me at least that what we what we want her is
>>>> more of a drain operation then flush (i.e.
>>>> we first want to disable any further job submission to entity's queue
>>>> and then flush all in flight ones). As example
>>>> for this i was looking at  flush_workqueue vs. drain_workqueue
>>> Would it be possible for amdgpu to, in the system suspend task,
>>>
>>> 1) first queue up all the jobs needed to migrate bos out of vram, and
>>> whatever other housekeeping jobs are needed
>>> 2) then drain gpu scheduler's queues
>>> 3) and then finally wait for jobs executing on GPU to complete
>>
>> We already do most of it in amdgpu_device_suspend,
>> amdgpu_device_ip_suspend_phase1
>> followed by amdgpu_device_evict_resources followed by
>> amdgpu_fence_driver_hw_fini is
>> exactly steps 1 + 3. What we are missing is step 2). For this step I
>> suggest adding a function
>> called drm_sched_entity_drain which basically sets entity->stopped =
>> true and then calls drm_sched_entity_flush.
>> This will both reject any new insertions into entity's job queue and
>> will flush all pending job submissions to HW from that entity.
>> One point is we need to make make drm_sched_entity_push_job return value
>> so the caller knows about job enqueue
>> rejection.
> Hmm, seems like job enqueue that is rejected because we are in the
> process of suspending should be more of a WARN_ON() sort of thing?
> Not sure if there is something sensible to do for the caller at that
> point?


What about the job's fence the caller is waiting on ? If we rejected
job submission the caller must know about it to not get stuck waiting
on that fence.


>
>> What about runtime suspend ? I guess same issue with scheduler racing
>> against HW susppend is relevant there ?
> Runtime suspend should be ok, as long as the driver holds a runpm
> reference whenever the hw needs to be awake.  The problem with system
> suspend (at least if you are using pm_runtime_force_suspend() or doing
> something equivalent) is that it bypasses the runpm reference.
> (Which, IMO, seems like a bad design..)


I am not totally clear  yet - can you expand a bit why one case is ok 
but the other
problematic ?

Andrey


>
>> Also, could you point to a particular buggy scenario where the race
>> between SW shceduler and suspend is causing a problem ?
> I wrote a piglit test[1] to try to trigger this scenario.. it isn't
> really that easy to hit
>
> BR,
> -R
>
> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fmesa%2Fpiglit%2F-%2Fmerge_requests%2F643&amp;data=04%7C01%7Candrey.grodzovsky%40amd.com%7C502ac8db4fb94b3b0e9d08da08fb270e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637832172051790527%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=u2Fqq%2BZpmjFHQFK77xwxEA5092O3Nc%2FdCMllfejgnvU%3D&amp;reserved=0
>
>> Andrey
>>
>>
>>> BR,
>>> -R
>>>
>>>> Andrey
>>>>
>>>>
>>>>> BR,
>>>>> -R

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-18 16:27                               ` Andrey Grodzovsky
  0 siblings, 0 replies; 52+ messages in thread
From: Andrey Grodzovsky @ 2022-03-18 16:27 UTC (permalink / raw)
  To: Rob Clark
  Cc: Christian König, dri-devel, freedreno, linux-arm-msm,
	Rob Clark, Sean Paul, Abhinav Kumar, David Airlie,
	Akhil P Oommen, Jonathan Marek, AngeloGioacchino Del Regno,
	Bjorn Andersson, Vladimir Lypak, open list, Alexander.Deucher


On 2022-03-18 12:20, Rob Clark wrote:
> On Fri, Mar 18, 2022 at 9:04 AM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>>
>> On 2022-03-17 16:35, Rob Clark wrote:
>>> On Thu, Mar 17, 2022 at 12:50 PM Andrey Grodzovsky
>>> <andrey.grodzovsky@amd.com> wrote:
>>>> On 2022-03-17 14:25, Rob Clark wrote:
>>>>> On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>> On 2022-03-17 13:35, Rob Clark wrote:
>>>>>>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
>>>>>>> <christian.koenig@amd.com> wrote:
>>>>>>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
>>>>>>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
>>>>>>>>> <christian.koenig@amd.com> wrote:
>>>>>>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
>>>>>>>>>>> [SNIP]
>>>>>>>>>>> userspace frozen != kthread frozen .. that is what this patch is
>>>>>>>>>>> trying to address, so we aren't racing between shutting down the hw
>>>>>>>>>>> and the scheduler shoveling more jobs at us.
>>>>>>>>>> Well exactly that's the problem. The scheduler is supposed to shoveling
>>>>>>>>>> more jobs at us until it is empty.
>>>>>>>>>>
>>>>>>>>>> Thinking more about it we will then keep some dma_fence instance
>>>>>>>>>> unsignaled and that is and extremely bad idea since it can lead to
>>>>>>>>>> deadlocks during suspend.
>>>>>>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
>>>>>>>>> It is at least not a problem when vram is not involved.
>>>>>>>> No, it's much wider than that.
>>>>>>>>
>>>>>>>> See what can happen is that the memory management shrinkers want to wait
>>>>>>>> for a dma_fence during suspend.
>>>>>>> we don't wait on fences in shrinker, only purging or evicting things
>>>>>>> that are already ready.  Actually, waiting on fences in shrinker path
>>>>>>> sounds like a pretty bad idea.
>>>>>>>
>>>>>>>> And if you stop the scheduler they will just wait forever.
>>>>>>>>
>>>>>>>> What you need to do instead is to drain the scheduler, e.g. call
>>>>>>>> drm_sched_entity_flush() with a proper timeout for each entity you have
>>>>>>>> created.
>>>>>>> yeah, it would work to drain the scheduler.. I guess that might be the
>>>>>>> more portable approach as far as generic solution for suspend.
>>>>>>>
>>>>>>> BR,
>>>>>>> -R
>>>>>> I am not sure how this drains the scheduler ? Suppose we done the
>>>>>> waiting in drm_sched_entity_flush,
>>>>>> what prevents someone to push right away another job into the same
>>>>>> entity's queue  right after that ?
>>>>>> Shouldn't we first disable further pushing of jobs into entity before we
>>>>>> wait for  sched->job_scheduled ?
>>>>>>
>>>>> In the system suspend path, userspace processes will have already been
>>>>> frozen, so there should be no way to push more jobs to the scheduler,
>>>>> unless they are pushed from the kernel itself.
>>>>> amdgpu_device_suspend
>>>> It was my suspicion but I wasn't sure about it.
>>>>
>>>>
>>>>> We don't do that in
>>>>> drm/msm, but maybe you need to to move things btwn vram and system
>>>>> memory?
>>>> Exactly, that was my main concern - if we use this method we have to use
>>>> it in a point in
>>>> suspend sequence when all the in kernel job submissions activity already
>>>> suspended
>>>>
>>>>> But even in that case, if the # of jobs you push is bounded I
>>>>> guess that is ok?
>>>> Submissions to scheduler entities are using unbounded queue, the bounded
>>>> part is when
>>>> you extract next job from entity to submit to HW ring and it rejects if
>>>> submission limit reached (drm_sched_ready)
>>>>
>>>> In general - It looks to me at least that what we what we want her is
>>>> more of a drain operation then flush (i.e.
>>>> we first want to disable any further job submission to entity's queue
>>>> and then flush all in flight ones). As example
>>>> for this i was looking at  flush_workqueue vs. drain_workqueue
>>> Would it be possible for amdgpu to, in the system suspend task,
>>>
>>> 1) first queue up all the jobs needed to migrate bos out of vram, and
>>> whatever other housekeeping jobs are needed
>>> 2) then drain gpu scheduler's queues
>>> 3) and then finally wait for jobs executing on GPU to complete
>>
>> We already do most of it in amdgpu_device_suspend,
>> amdgpu_device_ip_suspend_phase1
>> followed by amdgpu_device_evict_resources followed by
>> amdgpu_fence_driver_hw_fini is
>> exactly steps 1 + 3. What we are missing is step 2). For this step I
>> suggest adding a function
>> called drm_sched_entity_drain which basically sets entity->stopped =
>> true and then calls drm_sched_entity_flush.
>> This will both reject any new insertions into entity's job queue and
>> will flush all pending job submissions to HW from that entity.
>> One point is we need to make make drm_sched_entity_push_job return value
>> so the caller knows about job enqueue
>> rejection.
> Hmm, seems like job enqueue that is rejected because we are in the
> process of suspending should be more of a WARN_ON() sort of thing?
> Not sure if there is something sensible to do for the caller at that
> point?


What about the job's fence the caller is waiting on ? If we rejected
job submission the caller must know about it to not get stuck waiting
on that fence.


>
>> What about runtime suspend ? I guess same issue with scheduler racing
>> against HW susppend is relevant there ?
> Runtime suspend should be ok, as long as the driver holds a runpm
> reference whenever the hw needs to be awake.  The problem with system
> suspend (at least if you are using pm_runtime_force_suspend() or doing
> something equivalent) is that it bypasses the runpm reference.
> (Which, IMO, seems like a bad design..)


I am not totally clear  yet - can you expand a bit why one case is ok 
but the other
problematic ?

Andrey


>
>> Also, could you point to a particular buggy scenario where the race
>> between SW shceduler and suspend is causing a problem ?
> I wrote a piglit test[1] to try to trigger this scenario.. it isn't
> really that easy to hit
>
> BR,
> -R
>
> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fmesa%2Fpiglit%2F-%2Fmerge_requests%2F643&amp;data=04%7C01%7Candrey.grodzovsky%40amd.com%7C502ac8db4fb94b3b0e9d08da08fb270e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637832172051790527%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=u2Fqq%2BZpmjFHQFK77xwxEA5092O3Nc%2FdCMllfejgnvU%3D&amp;reserved=0
>
>> Andrey
>>
>>
>>> BR,
>>> -R
>>>
>>>> Andrey
>>>>
>>>>
>>>>> BR,
>>>>> -R

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-18 16:27                               ` Andrey Grodzovsky
@ 2022-03-18 17:22                                 ` Rob Clark
  -1 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-18 17:22 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Christian König, dri-devel, freedreno, linux-arm-msm,
	Rob Clark, Sean Paul, Abhinav Kumar, David Airlie,
	Akhil P Oommen, Jonathan Marek, AngeloGioacchino Del Regno,
	Bjorn Andersson, Vladimir Lypak, open list, Alexander.Deucher

On Fri, Mar 18, 2022 at 9:27 AM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
>
> On 2022-03-18 12:20, Rob Clark wrote:
> > On Fri, Mar 18, 2022 at 9:04 AM Andrey Grodzovsky
> > <andrey.grodzovsky@amd.com> wrote:
> >>
> >> On 2022-03-17 16:35, Rob Clark wrote:
> >>> On Thu, Mar 17, 2022 at 12:50 PM Andrey Grodzovsky
> >>> <andrey.grodzovsky@amd.com> wrote:
> >>>> On 2022-03-17 14:25, Rob Clark wrote:
> >>>>> On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
> >>>>> <andrey.grodzovsky@amd.com> wrote:
> >>>>>> On 2022-03-17 13:35, Rob Clark wrote:
> >>>>>>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
> >>>>>>> <christian.koenig@amd.com> wrote:
> >>>>>>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
> >>>>>>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
> >>>>>>>>> <christian.koenig@amd.com> wrote:
> >>>>>>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
> >>>>>>>>>>> [SNIP]
> >>>>>>>>>>> userspace frozen != kthread frozen .. that is what this patch is
> >>>>>>>>>>> trying to address, so we aren't racing between shutting down the hw
> >>>>>>>>>>> and the scheduler shoveling more jobs at us.
> >>>>>>>>>> Well exactly that's the problem. The scheduler is supposed to shoveling
> >>>>>>>>>> more jobs at us until it is empty.
> >>>>>>>>>>
> >>>>>>>>>> Thinking more about it we will then keep some dma_fence instance
> >>>>>>>>>> unsignaled and that is and extremely bad idea since it can lead to
> >>>>>>>>>> deadlocks during suspend.
> >>>>>>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
> >>>>>>>>> It is at least not a problem when vram is not involved.
> >>>>>>>> No, it's much wider than that.
> >>>>>>>>
> >>>>>>>> See what can happen is that the memory management shrinkers want to wait
> >>>>>>>> for a dma_fence during suspend.
> >>>>>>> we don't wait on fences in shrinker, only purging or evicting things
> >>>>>>> that are already ready.  Actually, waiting on fences in shrinker path
> >>>>>>> sounds like a pretty bad idea.
> >>>>>>>
> >>>>>>>> And if you stop the scheduler they will just wait forever.
> >>>>>>>>
> >>>>>>>> What you need to do instead is to drain the scheduler, e.g. call
> >>>>>>>> drm_sched_entity_flush() with a proper timeout for each entity you have
> >>>>>>>> created.
> >>>>>>> yeah, it would work to drain the scheduler.. I guess that might be the
> >>>>>>> more portable approach as far as generic solution for suspend.
> >>>>>>>
> >>>>>>> BR,
> >>>>>>> -R
> >>>>>> I am not sure how this drains the scheduler ? Suppose we done the
> >>>>>> waiting in drm_sched_entity_flush,
> >>>>>> what prevents someone to push right away another job into the same
> >>>>>> entity's queue  right after that ?
> >>>>>> Shouldn't we first disable further pushing of jobs into entity before we
> >>>>>> wait for  sched->job_scheduled ?
> >>>>>>
> >>>>> In the system suspend path, userspace processes will have already been
> >>>>> frozen, so there should be no way to push more jobs to the scheduler,
> >>>>> unless they are pushed from the kernel itself.
> >>>>> amdgpu_device_suspend
> >>>> It was my suspicion but I wasn't sure about it.
> >>>>
> >>>>
> >>>>> We don't do that in
> >>>>> drm/msm, but maybe you need to to move things btwn vram and system
> >>>>> memory?
> >>>> Exactly, that was my main concern - if we use this method we have to use
> >>>> it in a point in
> >>>> suspend sequence when all the in kernel job submissions activity already
> >>>> suspended
> >>>>
> >>>>> But even in that case, if the # of jobs you push is bounded I
> >>>>> guess that is ok?
> >>>> Submissions to scheduler entities are using unbounded queue, the bounded
> >>>> part is when
> >>>> you extract next job from entity to submit to HW ring and it rejects if
> >>>> submission limit reached (drm_sched_ready)
> >>>>
> >>>> In general - It looks to me at least that what we what we want her is
> >>>> more of a drain operation then flush (i.e.
> >>>> we first want to disable any further job submission to entity's queue
> >>>> and then flush all in flight ones). As example
> >>>> for this i was looking at  flush_workqueue vs. drain_workqueue
> >>> Would it be possible for amdgpu to, in the system suspend task,
> >>>
> >>> 1) first queue up all the jobs needed to migrate bos out of vram, and
> >>> whatever other housekeeping jobs are needed
> >>> 2) then drain gpu scheduler's queues
> >>> 3) and then finally wait for jobs executing on GPU to complete
> >>
> >> We already do most of it in amdgpu_device_suspend,
> >> amdgpu_device_ip_suspend_phase1
> >> followed by amdgpu_device_evict_resources followed by
> >> amdgpu_fence_driver_hw_fini is
> >> exactly steps 1 + 3. What we are missing is step 2). For this step I
> >> suggest adding a function
> >> called drm_sched_entity_drain which basically sets entity->stopped =
> >> true and then calls drm_sched_entity_flush.
> >> This will both reject any new insertions into entity's job queue and
> >> will flush all pending job submissions to HW from that entity.
> >> One point is we need to make make drm_sched_entity_push_job return value
> >> so the caller knows about job enqueue
> >> rejection.
> > Hmm, seems like job enqueue that is rejected because we are in the
> > process of suspending should be more of a WARN_ON() sort of thing?
> > Not sure if there is something sensible to do for the caller at that
> > point?
>
>
> What about the job's fence the caller is waiting on ? If we rejected
> job submission the caller must know about it to not get stuck waiting
> on that fence.
>

Hmm, perhaps I'm not being imaginative enough, but this sort of
scenario seems like it should only arise from a bug in the driver's
suspend path, Ie. not doing all the job submission before shutting
down the scheduler.  I don't think anything good is going to result
either way, which is why I was thinking you'd want a WARN_ON() to help
debug/fix that case.

>
> >
> >> What about runtime suspend ? I guess same issue with scheduler racing
> >> against HW susppend is relevant there ?
> > Runtime suspend should be ok, as long as the driver holds a runpm
> > reference whenever the hw needs to be awake.  The problem with system
> > suspend (at least if you are using pm_runtime_force_suspend() or doing
> > something equivalent) is that it bypasses the runpm reference.
> > (Which, IMO, seems like a bad design..)
>
>
> I am not totally clear  yet - can you expand a bit why one case is ok
> but the other
> problematic ?
>

Sure, normally pm_runtime_get/put increment a reference count, as long
as there have been more get's than puts, the device won't runtime
suspend.  So, for ex, msm's run_job fxn does a pm_runtime_get_sync().
And retire_submit() which runs after job completes on GPU does a
pm_runtime_put_autosuspend().

System suspend, OTOH, bypasses this reference counting.  Which is why
extra care is needed.

BR,
-R


> Andrey
>
>
> >
> >> Also, could you point to a particular buggy scenario where the race
> >> between SW shceduler and suspend is causing a problem ?
> > I wrote a piglit test[1] to try to trigger this scenario.. it isn't
> > really that easy to hit
> >
> > BR,
> > -R
> >
> > [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fmesa%2Fpiglit%2F-%2Fmerge_requests%2F643&amp;data=04%7C01%7Candrey.grodzovsky%40amd.com%7C502ac8db4fb94b3b0e9d08da08fb270e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637832172051790527%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=u2Fqq%2BZpmjFHQFK77xwxEA5092O3Nc%2FdCMllfejgnvU%3D&amp;reserved=0
> >
> >> Andrey
> >>
> >>
> >>> BR,
> >>> -R
> >>>
> >>>> Andrey
> >>>>
> >>>>
> >>>>> BR,
> >>>>> -R

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-18 17:22                                 ` Rob Clark
  0 siblings, 0 replies; 52+ messages in thread
From: Rob Clark @ 2022-03-18 17:22 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Rob Clark, freedreno, Jonathan Marek, David Airlie,
	linux-arm-msm, Vladimir Lypak, Abhinav Kumar, dri-devel,
	Bjorn Andersson, Akhil P Oommen, Alexander.Deucher, Sean Paul,
	Christian König, open list, AngeloGioacchino Del Regno

On Fri, Mar 18, 2022 at 9:27 AM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
>
> On 2022-03-18 12:20, Rob Clark wrote:
> > On Fri, Mar 18, 2022 at 9:04 AM Andrey Grodzovsky
> > <andrey.grodzovsky@amd.com> wrote:
> >>
> >> On 2022-03-17 16:35, Rob Clark wrote:
> >>> On Thu, Mar 17, 2022 at 12:50 PM Andrey Grodzovsky
> >>> <andrey.grodzovsky@amd.com> wrote:
> >>>> On 2022-03-17 14:25, Rob Clark wrote:
> >>>>> On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
> >>>>> <andrey.grodzovsky@amd.com> wrote:
> >>>>>> On 2022-03-17 13:35, Rob Clark wrote:
> >>>>>>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
> >>>>>>> <christian.koenig@amd.com> wrote:
> >>>>>>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
> >>>>>>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
> >>>>>>>>> <christian.koenig@amd.com> wrote:
> >>>>>>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
> >>>>>>>>>>> [SNIP]
> >>>>>>>>>>> userspace frozen != kthread frozen .. that is what this patch is
> >>>>>>>>>>> trying to address, so we aren't racing between shutting down the hw
> >>>>>>>>>>> and the scheduler shoveling more jobs at us.
> >>>>>>>>>> Well exactly that's the problem. The scheduler is supposed to shoveling
> >>>>>>>>>> more jobs at us until it is empty.
> >>>>>>>>>>
> >>>>>>>>>> Thinking more about it we will then keep some dma_fence instance
> >>>>>>>>>> unsignaled and that is and extremely bad idea since it can lead to
> >>>>>>>>>> deadlocks during suspend.
> >>>>>>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
> >>>>>>>>> It is at least not a problem when vram is not involved.
> >>>>>>>> No, it's much wider than that.
> >>>>>>>>
> >>>>>>>> See what can happen is that the memory management shrinkers want to wait
> >>>>>>>> for a dma_fence during suspend.
> >>>>>>> we don't wait on fences in shrinker, only purging or evicting things
> >>>>>>> that are already ready.  Actually, waiting on fences in shrinker path
> >>>>>>> sounds like a pretty bad idea.
> >>>>>>>
> >>>>>>>> And if you stop the scheduler they will just wait forever.
> >>>>>>>>
> >>>>>>>> What you need to do instead is to drain the scheduler, e.g. call
> >>>>>>>> drm_sched_entity_flush() with a proper timeout for each entity you have
> >>>>>>>> created.
> >>>>>>> yeah, it would work to drain the scheduler.. I guess that might be the
> >>>>>>> more portable approach as far as generic solution for suspend.
> >>>>>>>
> >>>>>>> BR,
> >>>>>>> -R
> >>>>>> I am not sure how this drains the scheduler ? Suppose we done the
> >>>>>> waiting in drm_sched_entity_flush,
> >>>>>> what prevents someone to push right away another job into the same
> >>>>>> entity's queue  right after that ?
> >>>>>> Shouldn't we first disable further pushing of jobs into entity before we
> >>>>>> wait for  sched->job_scheduled ?
> >>>>>>
> >>>>> In the system suspend path, userspace processes will have already been
> >>>>> frozen, so there should be no way to push more jobs to the scheduler,
> >>>>> unless they are pushed from the kernel itself.
> >>>>> amdgpu_device_suspend
> >>>> It was my suspicion but I wasn't sure about it.
> >>>>
> >>>>
> >>>>> We don't do that in
> >>>>> drm/msm, but maybe you need to to move things btwn vram and system
> >>>>> memory?
> >>>> Exactly, that was my main concern - if we use this method we have to use
> >>>> it in a point in
> >>>> suspend sequence when all the in kernel job submissions activity already
> >>>> suspended
> >>>>
> >>>>> But even in that case, if the # of jobs you push is bounded I
> >>>>> guess that is ok?
> >>>> Submissions to scheduler entities are using unbounded queue, the bounded
> >>>> part is when
> >>>> you extract next job from entity to submit to HW ring and it rejects if
> >>>> submission limit reached (drm_sched_ready)
> >>>>
> >>>> In general - It looks to me at least that what we what we want her is
> >>>> more of a drain operation then flush (i.e.
> >>>> we first want to disable any further job submission to entity's queue
> >>>> and then flush all in flight ones). As example
> >>>> for this i was looking at  flush_workqueue vs. drain_workqueue
> >>> Would it be possible for amdgpu to, in the system suspend task,
> >>>
> >>> 1) first queue up all the jobs needed to migrate bos out of vram, and
> >>> whatever other housekeeping jobs are needed
> >>> 2) then drain gpu scheduler's queues
> >>> 3) and then finally wait for jobs executing on GPU to complete
> >>
> >> We already do most of it in amdgpu_device_suspend,
> >> amdgpu_device_ip_suspend_phase1
> >> followed by amdgpu_device_evict_resources followed by
> >> amdgpu_fence_driver_hw_fini is
> >> exactly steps 1 + 3. What we are missing is step 2). For this step I
> >> suggest adding a function
> >> called drm_sched_entity_drain which basically sets entity->stopped =
> >> true and then calls drm_sched_entity_flush.
> >> This will both reject any new insertions into entity's job queue and
> >> will flush all pending job submissions to HW from that entity.
> >> One point is we need to make make drm_sched_entity_push_job return value
> >> so the caller knows about job enqueue
> >> rejection.
> > Hmm, seems like job enqueue that is rejected because we are in the
> > process of suspending should be more of a WARN_ON() sort of thing?
> > Not sure if there is something sensible to do for the caller at that
> > point?
>
>
> What about the job's fence the caller is waiting on ? If we rejected
> job submission the caller must know about it to not get stuck waiting
> on that fence.
>

Hmm, perhaps I'm not being imaginative enough, but this sort of
scenario seems like it should only arise from a bug in the driver's
suspend path, Ie. not doing all the job submission before shutting
down the scheduler.  I don't think anything good is going to result
either way, which is why I was thinking you'd want a WARN_ON() to help
debug/fix that case.

>
> >
> >> What about runtime suspend ? I guess same issue with scheduler racing
> >> against HW susppend is relevant there ?
> > Runtime suspend should be ok, as long as the driver holds a runpm
> > reference whenever the hw needs to be awake.  The problem with system
> > suspend (at least if you are using pm_runtime_force_suspend() or doing
> > something equivalent) is that it bypasses the runpm reference.
> > (Which, IMO, seems like a bad design..)
>
>
> I am not totally clear  yet - can you expand a bit why one case is ok
> but the other
> problematic ?
>

Sure, normally pm_runtime_get/put increment a reference count, as long
as there have been more get's than puts, the device won't runtime
suspend.  So, for ex, msm's run_job fxn does a pm_runtime_get_sync().
And retire_submit() which runs after job completes on GPU does a
pm_runtime_put_autosuspend().

System suspend, OTOH, bypasses this reference counting.  Which is why
extra care is needed.

BR,
-R


> Andrey
>
>
> >
> >> Also, could you point to a particular buggy scenario where the race
> >> between SW shceduler and suspend is causing a problem ?
> > I wrote a piglit test[1] to try to trigger this scenario.. it isn't
> > really that easy to hit
> >
> > BR,
> > -R
> >
> > [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fmesa%2Fpiglit%2F-%2Fmerge_requests%2F643&amp;data=04%7C01%7Candrey.grodzovsky%40amd.com%7C502ac8db4fb94b3b0e9d08da08fb270e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637832172051790527%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=u2Fqq%2BZpmjFHQFK77xwxEA5092O3Nc%2FdCMllfejgnvU%3D&amp;reserved=0
> >
> >> Andrey
> >>
> >>
> >>> BR,
> >>> -R
> >>>
> >>>> Andrey
> >>>>
> >>>>
> >>>>> BR,
> >>>>> -R

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
  2022-03-18 17:22                                 ` Rob Clark
@ 2022-03-18 20:14                                   ` Andrey Grodzovsky
  -1 siblings, 0 replies; 52+ messages in thread
From: Andrey Grodzovsky @ 2022-03-18 20:14 UTC (permalink / raw)
  To: Rob Clark
  Cc: Christian König, dri-devel, freedreno, linux-arm-msm,
	Rob Clark, Sean Paul, Abhinav Kumar, David Airlie,
	Akhil P Oommen, Jonathan Marek, AngeloGioacchino Del Regno,
	Bjorn Andersson, Vladimir Lypak, open list, Alexander.Deucher


On 2022-03-18 13:22, Rob Clark wrote:
> On Fri, Mar 18, 2022 at 9:27 AM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>>
>> On 2022-03-18 12:20, Rob Clark wrote:
>>> On Fri, Mar 18, 2022 at 9:04 AM Andrey Grodzovsky
>>> <andrey.grodzovsky@amd.com> wrote:
>>>> On 2022-03-17 16:35, Rob Clark wrote:
>>>>> On Thu, Mar 17, 2022 at 12:50 PM Andrey Grodzovsky
>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>> On 2022-03-17 14:25, Rob Clark wrote:
>>>>>>> On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
>>>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>>>> On 2022-03-17 13:35, Rob Clark wrote:
>>>>>>>>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
>>>>>>>>> <christian.koenig@amd.com> wrote:
>>>>>>>>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
>>>>>>>>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
>>>>>>>>>>> <christian.koenig@amd.com> wrote:
>>>>>>>>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
>>>>>>>>>>>>> [SNIP]
>>>>>>>>>>>>> userspace frozen != kthread frozen .. that is what this patch is
>>>>>>>>>>>>> trying to address, so we aren't racing between shutting down the hw
>>>>>>>>>>>>> and the scheduler shoveling more jobs at us.
>>>>>>>>>>>> Well exactly that's the problem. The scheduler is supposed to shoveling
>>>>>>>>>>>> more jobs at us until it is empty.
>>>>>>>>>>>>
>>>>>>>>>>>> Thinking more about it we will then keep some dma_fence instance
>>>>>>>>>>>> unsignaled and that is and extremely bad idea since it can lead to
>>>>>>>>>>>> deadlocks during suspend.
>>>>>>>>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
>>>>>>>>>>> It is at least not a problem when vram is not involved.
>>>>>>>>>> No, it's much wider than that.
>>>>>>>>>>
>>>>>>>>>> See what can happen is that the memory management shrinkers want to wait
>>>>>>>>>> for a dma_fence during suspend.
>>>>>>>>> we don't wait on fences in shrinker, only purging or evicting things
>>>>>>>>> that are already ready.  Actually, waiting on fences in shrinker path
>>>>>>>>> sounds like a pretty bad idea.
>>>>>>>>>
>>>>>>>>>> And if you stop the scheduler they will just wait forever.
>>>>>>>>>>
>>>>>>>>>> What you need to do instead is to drain the scheduler, e.g. call
>>>>>>>>>> drm_sched_entity_flush() with a proper timeout for each entity you have
>>>>>>>>>> created.
>>>>>>>>> yeah, it would work to drain the scheduler.. I guess that might be the
>>>>>>>>> more portable approach as far as generic solution for suspend.
>>>>>>>>>
>>>>>>>>> BR,
>>>>>>>>> -R
>>>>>>>> I am not sure how this drains the scheduler ? Suppose we done the
>>>>>>>> waiting in drm_sched_entity_flush,
>>>>>>>> what prevents someone to push right away another job into the same
>>>>>>>> entity's queue  right after that ?
>>>>>>>> Shouldn't we first disable further pushing of jobs into entity before we
>>>>>>>> wait for  sched->job_scheduled ?
>>>>>>>>
>>>>>>> In the system suspend path, userspace processes will have already been
>>>>>>> frozen, so there should be no way to push more jobs to the scheduler,
>>>>>>> unless they are pushed from the kernel itself.
>>>>>>> amdgpu_device_suspend
>>>>>> It was my suspicion but I wasn't sure about it.
>>>>>>
>>>>>>
>>>>>>> We don't do that in
>>>>>>> drm/msm, but maybe you need to to move things btwn vram and system
>>>>>>> memory?
>>>>>> Exactly, that was my main concern - if we use this method we have to use
>>>>>> it in a point in
>>>>>> suspend sequence when all the in kernel job submissions activity already
>>>>>> suspended
>>>>>>
>>>>>>> But even in that case, if the # of jobs you push is bounded I
>>>>>>> guess that is ok?
>>>>>> Submissions to scheduler entities are using unbounded queue, the bounded
>>>>>> part is when
>>>>>> you extract next job from entity to submit to HW ring and it rejects if
>>>>>> submission limit reached (drm_sched_ready)
>>>>>>
>>>>>> In general - It looks to me at least that what we what we want her is
>>>>>> more of a drain operation then flush (i.e.
>>>>>> we first want to disable any further job submission to entity's queue
>>>>>> and then flush all in flight ones). As example
>>>>>> for this i was looking at  flush_workqueue vs. drain_workqueue
>>>>> Would it be possible for amdgpu to, in the system suspend task,
>>>>>
>>>>> 1) first queue up all the jobs needed to migrate bos out of vram, and
>>>>> whatever other housekeeping jobs are needed
>>>>> 2) then drain gpu scheduler's queues
>>>>> 3) and then finally wait for jobs executing on GPU to complete
>>>> We already do most of it in amdgpu_device_suspend,
>>>> amdgpu_device_ip_suspend_phase1
>>>> followed by amdgpu_device_evict_resources followed by
>>>> amdgpu_fence_driver_hw_fini is
>>>> exactly steps 1 + 3. What we are missing is step 2). For this step I
>>>> suggest adding a function
>>>> called drm_sched_entity_drain which basically sets entity->stopped =
>>>> true and then calls drm_sched_entity_flush.
>>>> This will both reject any new insertions into entity's job queue and
>>>> will flush all pending job submissions to HW from that entity.
>>>> One point is we need to make make drm_sched_entity_push_job return value
>>>> so the caller knows about job enqueue
>>>> rejection.
>>> Hmm, seems like job enqueue that is rejected because we are in the
>>> process of suspending should be more of a WARN_ON() sort of thing?
>>> Not sure if there is something sensible to do for the caller at that
>>> point?
>>
>> What about the job's fence the caller is waiting on ? If we rejected
>> job submission the caller must know about it to not get stuck waiting
>> on that fence.
>>
> Hmm, perhaps I'm not being imaginative enough, but this sort of
> scenario seems like it should only arise from a bug in the driver's
> suspend path, Ie. not doing all the job submission before shutting
> down the scheduler.  I don't think anything good is going to result
> either way, which is why I was thinking you'd want a WARN_ON() to help
> debug/fix that case.


Yes, I just wanted the code to not allow such bugs to go through 
unnoticed. I guess
WARN_ON should give laud enough warning anyway

Andrey


>>>> What about runtime suspend ? I guess same issue with scheduler racing
>>>> against HW susppend is relevant there ?
>>> Runtime suspend should be ok, as long as the driver holds a runpm
>>> reference whenever the hw needs to be awake.  The problem with system
>>> suspend (at least if you are using pm_runtime_force_suspend() or doing
>>> something equivalent) is that it bypasses the runpm reference.
>>> (Which, IMO, seems like a bad design..)
>>
>> I am not totally clear  yet - can you expand a bit why one case is ok
>> but the other
>> problematic ?
>>
> Sure, normally pm_runtime_get/put increment a reference count, as long
> as there have been more get's than puts, the device won't runtime
> suspend.  So, for ex, msm's run_job fxn does a pm_runtime_get_sync().
> And retire_submit() which runs after job completes on GPU does a
> pm_runtime_put_autosuspend().
>
> System suspend, OTOH, bypasses this reference counting.  Which is why
> extra care is needed.
>
> BR,
> -R
>
>
>> Andrey
>>
>>
>>>> Also, could you point to a particular buggy scenario where the race
>>>> between SW shceduler and suspend is causing a problem ?
>>> I wrote a piglit test[1] to try to trigger this scenario.. it isn't
>>> really that easy to hit
>>>
>>> BR,
>>> -R
>>>
>>> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fmesa%2Fpiglit%2F-%2Fmerge_requests%2F643&amp;data=04%7C01%7Candrey.grodzovsky%40amd.com%7C35f0d7d9282044651c9708da0903d4f4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637832209324217553%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=dwjPEVAYgCI%2BtEyzBirfAQkJjZax2NdiLQfNeFfImtU%3D&amp;reserved=0
>>>
>>>> Andrey
>>>>
>>>>
>>>>> BR,
>>>>> -R
>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>> BR,
>>>>>>> -R

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend
@ 2022-03-18 20:14                                   ` Andrey Grodzovsky
  0 siblings, 0 replies; 52+ messages in thread
From: Andrey Grodzovsky @ 2022-03-18 20:14 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, freedreno, Jonathan Marek, David Airlie,
	linux-arm-msm, Vladimir Lypak, Abhinav Kumar, dri-devel,
	Bjorn Andersson, Akhil P Oommen, Alexander.Deucher, Sean Paul,
	Christian König, open list, AngeloGioacchino Del Regno


On 2022-03-18 13:22, Rob Clark wrote:
> On Fri, Mar 18, 2022 at 9:27 AM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>>
>> On 2022-03-18 12:20, Rob Clark wrote:
>>> On Fri, Mar 18, 2022 at 9:04 AM Andrey Grodzovsky
>>> <andrey.grodzovsky@amd.com> wrote:
>>>> On 2022-03-17 16:35, Rob Clark wrote:
>>>>> On Thu, Mar 17, 2022 at 12:50 PM Andrey Grodzovsky
>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>> On 2022-03-17 14:25, Rob Clark wrote:
>>>>>>> On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
>>>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>>>> On 2022-03-17 13:35, Rob Clark wrote:
>>>>>>>>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
>>>>>>>>> <christian.koenig@amd.com> wrote:
>>>>>>>>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
>>>>>>>>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
>>>>>>>>>>> <christian.koenig@amd.com> wrote:
>>>>>>>>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
>>>>>>>>>>>>> [SNIP]
>>>>>>>>>>>>> userspace frozen != kthread frozen .. that is what this patch is
>>>>>>>>>>>>> trying to address, so we aren't racing between shutting down the hw
>>>>>>>>>>>>> and the scheduler shoveling more jobs at us.
>>>>>>>>>>>> Well exactly that's the problem. The scheduler is supposed to shoveling
>>>>>>>>>>>> more jobs at us until it is empty.
>>>>>>>>>>>>
>>>>>>>>>>>> Thinking more about it we will then keep some dma_fence instance
>>>>>>>>>>>> unsignaled and that is and extremely bad idea since it can lead to
>>>>>>>>>>>> deadlocks during suspend.
>>>>>>>>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
>>>>>>>>>>> It is at least not a problem when vram is not involved.
>>>>>>>>>> No, it's much wider than that.
>>>>>>>>>>
>>>>>>>>>> See what can happen is that the memory management shrinkers want to wait
>>>>>>>>>> for a dma_fence during suspend.
>>>>>>>>> we don't wait on fences in shrinker, only purging or evicting things
>>>>>>>>> that are already ready.  Actually, waiting on fences in shrinker path
>>>>>>>>> sounds like a pretty bad idea.
>>>>>>>>>
>>>>>>>>>> And if you stop the scheduler they will just wait forever.
>>>>>>>>>>
>>>>>>>>>> What you need to do instead is to drain the scheduler, e.g. call
>>>>>>>>>> drm_sched_entity_flush() with a proper timeout for each entity you have
>>>>>>>>>> created.
>>>>>>>>> yeah, it would work to drain the scheduler.. I guess that might be the
>>>>>>>>> more portable approach as far as generic solution for suspend.
>>>>>>>>>
>>>>>>>>> BR,
>>>>>>>>> -R
>>>>>>>> I am not sure how this drains the scheduler ? Suppose we done the
>>>>>>>> waiting in drm_sched_entity_flush,
>>>>>>>> what prevents someone to push right away another job into the same
>>>>>>>> entity's queue  right after that ?
>>>>>>>> Shouldn't we first disable further pushing of jobs into entity before we
>>>>>>>> wait for  sched->job_scheduled ?
>>>>>>>>
>>>>>>> In the system suspend path, userspace processes will have already been
>>>>>>> frozen, so there should be no way to push more jobs to the scheduler,
>>>>>>> unless they are pushed from the kernel itself.
>>>>>>> amdgpu_device_suspend
>>>>>> It was my suspicion but I wasn't sure about it.
>>>>>>
>>>>>>
>>>>>>> We don't do that in
>>>>>>> drm/msm, but maybe you need to to move things btwn vram and system
>>>>>>> memory?
>>>>>> Exactly, that was my main concern - if we use this method we have to use
>>>>>> it in a point in
>>>>>> suspend sequence when all the in kernel job submissions activity already
>>>>>> suspended
>>>>>>
>>>>>>> But even in that case, if the # of jobs you push is bounded I
>>>>>>> guess that is ok?
>>>>>> Submissions to scheduler entities are using unbounded queue, the bounded
>>>>>> part is when
>>>>>> you extract next job from entity to submit to HW ring and it rejects if
>>>>>> submission limit reached (drm_sched_ready)
>>>>>>
>>>>>> In general - It looks to me at least that what we what we want her is
>>>>>> more of a drain operation then flush (i.e.
>>>>>> we first want to disable any further job submission to entity's queue
>>>>>> and then flush all in flight ones). As example
>>>>>> for this i was looking at  flush_workqueue vs. drain_workqueue
>>>>> Would it be possible for amdgpu to, in the system suspend task,
>>>>>
>>>>> 1) first queue up all the jobs needed to migrate bos out of vram, and
>>>>> whatever other housekeeping jobs are needed
>>>>> 2) then drain gpu scheduler's queues
>>>>> 3) and then finally wait for jobs executing on GPU to complete
>>>> We already do most of it in amdgpu_device_suspend,
>>>> amdgpu_device_ip_suspend_phase1
>>>> followed by amdgpu_device_evict_resources followed by
>>>> amdgpu_fence_driver_hw_fini is
>>>> exactly steps 1 + 3. What we are missing is step 2). For this step I
>>>> suggest adding a function
>>>> called drm_sched_entity_drain which basically sets entity->stopped =
>>>> true and then calls drm_sched_entity_flush.
>>>> This will both reject any new insertions into entity's job queue and
>>>> will flush all pending job submissions to HW from that entity.
>>>> One point is we need to make make drm_sched_entity_push_job return value
>>>> so the caller knows about job enqueue
>>>> rejection.
>>> Hmm, seems like job enqueue that is rejected because we are in the
>>> process of suspending should be more of a WARN_ON() sort of thing?
>>> Not sure if there is something sensible to do for the caller at that
>>> point?
>>
>> What about the job's fence the caller is waiting on ? If we rejected
>> job submission the caller must know about it to not get stuck waiting
>> on that fence.
>>
> Hmm, perhaps I'm not being imaginative enough, but this sort of
> scenario seems like it should only arise from a bug in the driver's
> suspend path, Ie. not doing all the job submission before shutting
> down the scheduler.  I don't think anything good is going to result
> either way, which is why I was thinking you'd want a WARN_ON() to help
> debug/fix that case.


Yes, I just wanted the code to not allow such bugs to go through 
unnoticed. I guess
WARN_ON should give laud enough warning anyway

Andrey


>>>> What about runtime suspend ? I guess same issue with scheduler racing
>>>> against HW susppend is relevant there ?
>>> Runtime suspend should be ok, as long as the driver holds a runpm
>>> reference whenever the hw needs to be awake.  The problem with system
>>> suspend (at least if you are using pm_runtime_force_suspend() or doing
>>> something equivalent) is that it bypasses the runpm reference.
>>> (Which, IMO, seems like a bad design..)
>>
>> I am not totally clear  yet - can you expand a bit why one case is ok
>> but the other
>> problematic ?
>>
> Sure, normally pm_runtime_get/put increment a reference count, as long
> as there have been more get's than puts, the device won't runtime
> suspend.  So, for ex, msm's run_job fxn does a pm_runtime_get_sync().
> And retire_submit() which runs after job completes on GPU does a
> pm_runtime_put_autosuspend().
>
> System suspend, OTOH, bypasses this reference counting.  Which is why
> extra care is needed.
>
> BR,
> -R
>
>
>> Andrey
>>
>>
>>>> Also, could you point to a particular buggy scenario where the race
>>>> between SW shceduler and suspend is causing a problem ?
>>> I wrote a piglit test[1] to try to trigger this scenario.. it isn't
>>> really that easy to hit
>>>
>>> BR,
>>> -R
>>>
>>> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fmesa%2Fpiglit%2F-%2Fmerge_requests%2F643&amp;data=04%7C01%7Candrey.grodzovsky%40amd.com%7C35f0d7d9282044651c9708da0903d4f4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637832209324217553%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=dwjPEVAYgCI%2BtEyzBirfAQkJjZax2NdiLQfNeFfImtU%3D&amp;reserved=0
>>>
>>>> Andrey
>>>>
>>>>
>>>>> BR,
>>>>> -R
>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>> BR,
>>>>>>> -R

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2022-03-18 20:15 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-10 23:46 [PATCH 0/3] drm/msm/gpu: More system suspend fixes Rob Clark
2022-03-10 23:46 ` Rob Clark
2022-03-10 23:46 ` [PATCH 1/3] drm/msm/gpu: Rename runtime suspend/resume functions Rob Clark
2022-03-10 23:46   ` Rob Clark
2022-03-11  9:26   ` AngeloGioacchino Del Regno
2022-03-11  9:26     ` AngeloGioacchino Del Regno
2022-03-10 23:46 ` [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend Rob Clark
2022-03-10 23:46   ` Rob Clark
2022-03-17  9:59   ` Daniel Vetter
2022-03-17  9:59     ` Daniel Vetter
2022-03-17 10:06     ` Christian König
2022-03-17 14:58       ` Matthew Brost
2022-03-17 14:58         ` Matthew Brost
2022-03-17 15:10       ` Rob Clark
2022-03-17 15:10         ` Rob Clark
2022-03-17 16:04         ` Christian König
2022-03-17 16:04           ` Christian König
2022-03-17 16:18           ` Rob Clark
2022-03-17 16:18             ` Rob Clark
2022-03-17 16:44             ` Christian König
2022-03-17 16:44               ` Christian König
2022-03-17 17:29               ` Daniel Vetter
2022-03-17 17:29                 ` Daniel Vetter
2022-03-17 17:35               ` Rob Clark
2022-03-17 17:35                 ` Rob Clark
2022-03-17 18:10                 ` Andrey Grodzovsky
2022-03-17 18:10                   ` Andrey Grodzovsky
2022-03-17 18:25                   ` Rob Clark
2022-03-17 18:25                     ` Rob Clark
2022-03-17 19:49                     ` Andrey Grodzovsky
2022-03-17 19:49                       ` Andrey Grodzovsky
2022-03-17 20:35                       ` Rob Clark
2022-03-17 20:35                         ` Rob Clark
2022-03-18 16:04                         ` Andrey Grodzovsky
2022-03-18 16:04                           ` Andrey Grodzovsky
2022-03-18 16:20                           ` Rob Clark
2022-03-18 16:20                             ` Rob Clark
2022-03-18 16:27                             ` Andrey Grodzovsky
2022-03-18 16:27                               ` Andrey Grodzovsky
2022-03-18 17:22                               ` Rob Clark
2022-03-18 17:22                                 ` Rob Clark
2022-03-18 20:14                                 ` Andrey Grodzovsky
2022-03-18 20:14                                   ` Andrey Grodzovsky
2022-03-17 17:46           ` Andrey Grodzovsky
2022-03-17 17:46             ` Andrey Grodzovsky
2022-03-10 23:46 ` [PATCH 3/3] drm/msm/gpu: Remove mutex from wait_event condition Rob Clark
2022-03-10 23:46   ` Rob Clark
2022-03-17 20:45   ` [Freedreno] " Akhil P Oommen
2022-03-17 20:45     ` Akhil P Oommen
2022-03-17 21:07     ` Rob Clark
2022-03-17 21:07       ` Rob Clark
2022-03-18  2:55       ` Hillf Danton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.