All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] adreno: Shutdown the GPU properly
@ 2022-11-11 19:49 ` Joel Fernandes (Google)
  0 siblings, 0 replies; 27+ messages in thread
From: Joel Fernandes (Google) @ 2022-11-11 19:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Rob Clark, Steven Rostedt, Ricardo Ribalda, Ross Zwisler,
	Abhinav Kumar, Akhil P Oommen, Daniel Vetter, David Airlie,
	Dmitry Baryshkov, dri-devel, Emma Anholt, freedreno,
	linux-arm-msm, Rob Clark, Sean Paul, Vladimir Lypak

During kexec on ARM device, we notice that device_shutdown() only calls
pm_runtime_force_suspend() while shutting down the GPU. This means the GPU
kthread is still running and further, there maybe active submits.

This causes all kinds of issues during a kexec reboot:

Warning from shutdown path:

[  292.509662] WARNING: CPU: 0 PID: 6304 at [...] adreno_runtime_suspend+0x3c/0x44
[  292.509863] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
[  292.509872] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  292.509881] pc : adreno_runtime_suspend+0x3c/0x44
[  292.509891] lr : pm_generic_runtime_suspend+0x30/0x44
[  292.509905] sp : ffffffc014473bf0
[...]
[  292.510043] Call trace:
[  292.510051]  adreno_runtime_suspend+0x3c/0x44
[  292.510061]  pm_generic_runtime_suspend+0x30/0x44
[  292.510071]  pm_runtime_force_suspend+0x54/0xc8
[  292.510081]  adreno_shutdown+0x1c/0x28
[  292.510090]  platform_shutdown+0x2c/0x38
[  292.510104]  device_shutdown+0x158/0x210
[  292.510119]  kernel_restart_prepare+0x40/0x4c

And here from GPU kthread, an SError OOPs:

[  192.648789]  el1h_64_error+0x7c/0x80
[  192.648812]  el1_interrupt+0x20/0x58
[  192.648833]  el1h_64_irq_handler+0x18/0x24
[  192.648854]  el1h_64_irq+0x7c/0x80
[  192.648873]  local_daif_inherit+0x10/0x18
[  192.648900]  el1h_64_sync_handler+0x48/0xb4
[  192.648921]  el1h_64_sync+0x7c/0x80
[  192.648941]  a6xx_gmu_set_oob+0xbc/0x1fc
[  192.648968]  a6xx_hw_init+0x44/0xe38
[  192.648991]  msm_gpu_hw_init+0x48/0x80
[  192.649013]  msm_gpu_submit+0x5c/0x1a8
[  192.649034]  msm_job_run+0xb0/0x11c
[  192.649058]  drm_sched_main+0x170/0x434
[  192.649086]  kthread+0x134/0x300
[  192.649114]  ret_from_fork+0x10/0x20

Fix by calling adreno_system_suspend() in the device_shutdown() path.

Cc: Rob Clark <robdclark@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ricardo Ribalda <ribalda@chromium.org>
Cc: Ross Zwisler <zwisler@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 drivers/gpu/drm/msm/adreno/adreno_device.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
index 24b489b6129a..f0cff62812c3 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -607,9 +607,12 @@ static int adreno_remove(struct platform_device *pdev)
 	return 0;
 }
 
+static int adreno_system_suspend(struct device *dev);
 static void adreno_shutdown(struct platform_device *pdev)
 {
-	pm_runtime_force_suspend(&pdev->dev);
+	struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
+
+	WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
 }
 
 static const struct of_device_id dt_match[] = {
-- 
2.38.1.493.g58b659f92b-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 1/2] adreno: Shutdown the GPU properly
@ 2022-11-11 19:49 ` Joel Fernandes (Google)
  0 siblings, 0 replies; 27+ messages in thread
From: Joel Fernandes (Google) @ 2022-11-11 19:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rob Clark, Emma Anholt, Akhil P Oommen, freedreno, linux-arm-msm,
	Ricardo Ribalda, Vladimir Lypak, Abhinav Kumar, Steven Rostedt,
	Dmitry Baryshkov, dri-devel, Ross Zwisler,
	Joel Fernandes (Google),
	Sean Paul

During kexec on ARM device, we notice that device_shutdown() only calls
pm_runtime_force_suspend() while shutting down the GPU. This means the GPU
kthread is still running and further, there maybe active submits.

This causes all kinds of issues during a kexec reboot:

Warning from shutdown path:

[  292.509662] WARNING: CPU: 0 PID: 6304 at [...] adreno_runtime_suspend+0x3c/0x44
[  292.509863] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
[  292.509872] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  292.509881] pc : adreno_runtime_suspend+0x3c/0x44
[  292.509891] lr : pm_generic_runtime_suspend+0x30/0x44
[  292.509905] sp : ffffffc014473bf0
[...]
[  292.510043] Call trace:
[  292.510051]  adreno_runtime_suspend+0x3c/0x44
[  292.510061]  pm_generic_runtime_suspend+0x30/0x44
[  292.510071]  pm_runtime_force_suspend+0x54/0xc8
[  292.510081]  adreno_shutdown+0x1c/0x28
[  292.510090]  platform_shutdown+0x2c/0x38
[  292.510104]  device_shutdown+0x158/0x210
[  292.510119]  kernel_restart_prepare+0x40/0x4c

And here from GPU kthread, an SError OOPs:

[  192.648789]  el1h_64_error+0x7c/0x80
[  192.648812]  el1_interrupt+0x20/0x58
[  192.648833]  el1h_64_irq_handler+0x18/0x24
[  192.648854]  el1h_64_irq+0x7c/0x80
[  192.648873]  local_daif_inherit+0x10/0x18
[  192.648900]  el1h_64_sync_handler+0x48/0xb4
[  192.648921]  el1h_64_sync+0x7c/0x80
[  192.648941]  a6xx_gmu_set_oob+0xbc/0x1fc
[  192.648968]  a6xx_hw_init+0x44/0xe38
[  192.648991]  msm_gpu_hw_init+0x48/0x80
[  192.649013]  msm_gpu_submit+0x5c/0x1a8
[  192.649034]  msm_job_run+0xb0/0x11c
[  192.649058]  drm_sched_main+0x170/0x434
[  192.649086]  kthread+0x134/0x300
[  192.649114]  ret_from_fork+0x10/0x20

Fix by calling adreno_system_suspend() in the device_shutdown() path.

Cc: Rob Clark <robdclark@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ricardo Ribalda <ribalda@chromium.org>
Cc: Ross Zwisler <zwisler@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 drivers/gpu/drm/msm/adreno/adreno_device.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
index 24b489b6129a..f0cff62812c3 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -607,9 +607,12 @@ static int adreno_remove(struct platform_device *pdev)
 	return 0;
 }
 
+static int adreno_system_suspend(struct device *dev);
 static void adreno_shutdown(struct platform_device *pdev)
 {
-	pm_runtime_force_suspend(&pdev->dev);
+	struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
+
+	WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
 }
 
 static const struct of_device_id dt_match[] = {
-- 
2.38.1.493.g58b659f92b-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/2] adreno: Detect shutdown during get_param()
  2022-11-11 19:49 ` Joel Fernandes (Google)
@ 2022-11-11 19:49   ` Joel Fernandes (Google)
  -1 siblings, 0 replies; 27+ messages in thread
From: Joel Fernandes (Google) @ 2022-11-11 19:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Rob Clark, Steven Rostedt, Ricardo Ribalda, Ross Zwisler,
	Abhinav Kumar, Akhil P Oommen, Daniel Vetter, David Airlie,
	Dmitry Baryshkov, dri-devel, Emma Anholt, freedreno,
	linux-arm-msm, Rob Clark, Sean Paul, Vladimir Lypak

Even though the GPU is shut down, during kexec reboot we can have userspace
still running. This is especially true if KEXEC_JUMP is not enabled, because we
do not freeze userspace in this case.

To prevent crashes, track that the GPU is shutdown and prevent get_param() from
accessing GPU resources if we find it shutdown.

This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU:

[  292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt
[  292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
[  292.534326] Call trace:
[  292.534328]  dump_backtrace+0x0/0x1d4
[  292.534337]  show_stack+0x20/0x2c
[  292.534342]  dump_stack_lvl+0x60/0x78
[  292.534347]  dump_stack+0x18/0x38
[  292.534352]  panic+0x148/0x3b0
[  292.534357]  nmi_panic+0x80/0x94
[  292.534364]  arm64_serror_panic+0x70/0x7c
[  292.534369]  do_serror+0x0/0x7c
[  292.534372]  do_serror+0x54/0x7c
[  292.534377]  el1h_64_error_handler+0x34/0x4c
[  292.534381]  el1h_64_error+0x7c/0x80
[  292.534386]  el1_interrupt+0x20/0x58
[  292.534389]  el1h_64_irq_handler+0x18/0x24
[  292.534395]  el1h_64_irq+0x7c/0x80
[  292.534399]  local_daif_inherit+0x10/0x18
[  292.534405]  el1h_64_sync_handler+0x48/0xb4
[  292.534410]  el1h_64_sync+0x7c/0x80
[  292.534414]  a6xx_gmu_set_oob+0xbc/0x1fc
[  292.534422]  a6xx_get_timestamp+0x40/0xb4
[  292.534426]  adreno_get_param+0x12c/0x1e0
[  292.534433]  msm_ioctl_get_param+0x64/0x70
[  292.534440]  drm_ioctl_kernel+0xe8/0x158
[  292.534448]  drm_ioctl+0x208/0x320
[  292.534453]  __arm64_sys_ioctl+0x98/0xd0
[  292.534461]  invoke_syscall+0x4c/0x118
[  292.534467]  el0_svc_common+0x98/0x104
[  292.534473]  do_el0_svc+0x30/0x80
[  292.534478]  el0_svc+0x20/0x50
[  292.534481]  el0t_64_sync_handler+0x78/0x108
[  292.534485]  el0t_64_sync+0x1a4/0x1a8
[  292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000
[  292.534635] PHYS_OFFSET: 0x80000000
[  292.534638] CPU features: 0x40018541,a3300e42
[  292.534644] Memory Limit: none

Cc: Rob Clark <robdclark@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ricardo Ribalda <ribalda@chromium.org>
Cc: Ross Zwisler <zwisler@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 drivers/gpu/drm/msm/adreno/adreno_device.c | 1 +
 drivers/gpu/drm/msm/adreno/adreno_gpu.c    | 2 +-
 drivers/gpu/drm/msm/msm_gpu.h              | 3 +++
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
index f0cff62812c3..03d912dc0130 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev)
 {
 	struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
 
+	gpu->is_shutdown = true;
 	WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
 }
 
diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
index 382fb7f9e497..6903c6892469 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
@@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
 	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
 
 	/* No pointer params yet */
-	if (*len != 0)
+	if (*len != 0 || gpu->is_shutdown)
 		return -EINVAL;
 
 	switch (param) {
diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index ff911e7305ce..f18b0a91442b 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -214,6 +214,9 @@ struct msm_gpu {
 	/* does gpu need hw_init? */
 	bool needs_hw_init;
 
+	/* is the GPU shutdown? */
+	bool is_shutdown;
+
 	/**
 	 * global_faults: number of GPU hangs not attributed to a particular
 	 * address space
-- 
2.38.1.493.g58b659f92b-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/2] adreno: Detect shutdown during get_param()
@ 2022-11-11 19:49   ` Joel Fernandes (Google)
  0 siblings, 0 replies; 27+ messages in thread
From: Joel Fernandes (Google) @ 2022-11-11 19:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rob Clark, Emma Anholt, Akhil P Oommen, freedreno, linux-arm-msm,
	Ricardo Ribalda, Vladimir Lypak, Abhinav Kumar, Steven Rostedt,
	Dmitry Baryshkov, dri-devel, Ross Zwisler,
	Joel Fernandes (Google),
	Sean Paul

Even though the GPU is shut down, during kexec reboot we can have userspace
still running. This is especially true if KEXEC_JUMP is not enabled, because we
do not freeze userspace in this case.

To prevent crashes, track that the GPU is shutdown and prevent get_param() from
accessing GPU resources if we find it shutdown.

This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU:

[  292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt
[  292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
[  292.534326] Call trace:
[  292.534328]  dump_backtrace+0x0/0x1d4
[  292.534337]  show_stack+0x20/0x2c
[  292.534342]  dump_stack_lvl+0x60/0x78
[  292.534347]  dump_stack+0x18/0x38
[  292.534352]  panic+0x148/0x3b0
[  292.534357]  nmi_panic+0x80/0x94
[  292.534364]  arm64_serror_panic+0x70/0x7c
[  292.534369]  do_serror+0x0/0x7c
[  292.534372]  do_serror+0x54/0x7c
[  292.534377]  el1h_64_error_handler+0x34/0x4c
[  292.534381]  el1h_64_error+0x7c/0x80
[  292.534386]  el1_interrupt+0x20/0x58
[  292.534389]  el1h_64_irq_handler+0x18/0x24
[  292.534395]  el1h_64_irq+0x7c/0x80
[  292.534399]  local_daif_inherit+0x10/0x18
[  292.534405]  el1h_64_sync_handler+0x48/0xb4
[  292.534410]  el1h_64_sync+0x7c/0x80
[  292.534414]  a6xx_gmu_set_oob+0xbc/0x1fc
[  292.534422]  a6xx_get_timestamp+0x40/0xb4
[  292.534426]  adreno_get_param+0x12c/0x1e0
[  292.534433]  msm_ioctl_get_param+0x64/0x70
[  292.534440]  drm_ioctl_kernel+0xe8/0x158
[  292.534448]  drm_ioctl+0x208/0x320
[  292.534453]  __arm64_sys_ioctl+0x98/0xd0
[  292.534461]  invoke_syscall+0x4c/0x118
[  292.534467]  el0_svc_common+0x98/0x104
[  292.534473]  do_el0_svc+0x30/0x80
[  292.534478]  el0_svc+0x20/0x50
[  292.534481]  el0t_64_sync_handler+0x78/0x108
[  292.534485]  el0t_64_sync+0x1a4/0x1a8
[  292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000
[  292.534635] PHYS_OFFSET: 0x80000000
[  292.534638] CPU features: 0x40018541,a3300e42
[  292.534644] Memory Limit: none

Cc: Rob Clark <robdclark@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ricardo Ribalda <ribalda@chromium.org>
Cc: Ross Zwisler <zwisler@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 drivers/gpu/drm/msm/adreno/adreno_device.c | 1 +
 drivers/gpu/drm/msm/adreno/adreno_gpu.c    | 2 +-
 drivers/gpu/drm/msm/msm_gpu.h              | 3 +++
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
index f0cff62812c3..03d912dc0130 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev)
 {
 	struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
 
+	gpu->is_shutdown = true;
 	WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
 }
 
diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
index 382fb7f9e497..6903c6892469 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
@@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
 	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
 
 	/* No pointer params yet */
-	if (*len != 0)
+	if (*len != 0 || gpu->is_shutdown)
 		return -EINVAL;
 
 	switch (param) {
diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index ff911e7305ce..f18b0a91442b 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -214,6 +214,9 @@ struct msm_gpu {
 	/* does gpu need hw_init? */
 	bool needs_hw_init;
 
+	/* is the GPU shutdown? */
+	bool is_shutdown;
+
 	/**
 	 * global_faults: number of GPU hangs not attributed to a particular
 	 * address space
-- 
2.38.1.493.g58b659f92b-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] adreno: Shutdown the GPU properly
  2022-11-11 19:49 ` Joel Fernandes (Google)
@ 2022-11-11 21:08   ` Joel Fernandes
  -1 siblings, 0 replies; 27+ messages in thread
From: Joel Fernandes @ 2022-11-11 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rob Clark, Steven Rostedt, Ricardo Ribalda, Ross Zwisler,
	Abhinav Kumar, Akhil P Oommen, Daniel Vetter, David Airlie,
	Dmitry Baryshkov, dri-devel, Emma Anholt, freedreno,
	linux-arm-msm, Rob Clark, Sean Paul, Vladimir Lypak



> On Nov 11, 2022, at 2:50 PM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> 
> During kexec on ARM device, we notice that device_shutdown() only calls
> pm_runtime_force_suspend() while shutting down the GPU. This means the GPU
> kthread is still running and further, there maybe active submits.
> 
> This causes all kinds of issues during a kexec reboot:
> 
> Warning from shutdown path:
> 
> [  292.509662] WARNING: CPU: 0 PID: 6304 at [...] adreno_runtime_suspend+0x3c/0x44
> [  292.509863] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> [  292.509872] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [  292.509881] pc : adreno_runtime_suspend+0x3c/0x44
> [  292.509891] lr : pm_generic_runtime_suspend+0x30/0x44
> [  292.509905] sp : ffffffc014473bf0
> [...]
> [  292.510043] Call trace:
> [  292.510051]  adreno_runtime_suspend+0x3c/0x44
> [  292.510061]  pm_generic_runtime_suspend+0x30/0x44
> [  292.510071]  pm_runtime_force_suspend+0x54/0xc8
> [  292.510081]  adreno_shutdown+0x1c/0x28
> [  292.510090]  platform_shutdown+0x2c/0x38
> [  292.510104]  device_shutdown+0x158/0x210
> [  292.510119]  kernel_restart_prepare+0x40/0x4c
> 
> And here from GPU kthread, an SError OOPs:
> 
> [  192.648789]  el1h_64_error+0x7c/0x80
> [  192.648812]  el1_interrupt+0x20/0x58
> [  192.648833]  el1h_64_irq_handler+0x18/0x24
> [  192.648854]  el1h_64_irq+0x7c/0x80
> [  192.648873]  local_daif_inherit+0x10/0x18
> [  192.648900]  el1h_64_sync_handler+0x48/0xb4
> [  192.648921]  el1h_64_sync+0x7c/0x80
> [  192.648941]  a6xx_gmu_set_oob+0xbc/0x1fc
> [  192.648968]  a6xx_hw_init+0x44/0xe38
> [  192.648991]  msm_gpu_hw_init+0x48/0x80
> [  192.649013]  msm_gpu_submit+0x5c/0x1a8
> [  192.649034]  msm_job_run+0xb0/0x11c
> [  192.649058]  drm_sched_main+0x170/0x434
> [  192.649086]  kthread+0x134/0x300
> [  192.649114]  ret_from_fork+0x10/0x20
> 
> Fix by calling adreno_system_suspend() in the device_shutdown() path.
> 
> Cc: Rob Clark <robdclark@chromium.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ricardo Ribalda <ribalda@chromium.org>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
> drivers/gpu/drm/msm/adreno/adreno_device.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> index 24b489b6129a..f0cff62812c3 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> @@ -607,9 +607,12 @@ static int adreno_remove(struct platform_device *pdev)
>    return 0;
> }
> 
> +static int adreno_system_suspend(struct device *dev);
> static void adreno_shutdown(struct platform_device *pdev)
> {
> -    pm_runtime_force_suspend(&pdev->dev);
> +    struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> +

This local variable definition should go to patch 2/2. Will fix in v2.

Thanks,

 - Joel


> +    WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> }
> 
> static const struct of_device_id dt_match[] = {
> -- 
> 2.38.1.493.g58b659f92b-goog
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] adreno: Shutdown the GPU properly
@ 2022-11-11 21:08   ` Joel Fernandes
  0 siblings, 0 replies; 27+ messages in thread
From: Joel Fernandes @ 2022-11-11 21:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rob Clark, Emma Anholt, Akhil P Oommen, freedreno, linux-arm-msm,
	Ricardo Ribalda, Vladimir Lypak, Abhinav Kumar, Steven Rostedt,
	Sean Paul, dri-devel, Ross Zwisler, Dmitry Baryshkov



> On Nov 11, 2022, at 2:50 PM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> 
> During kexec on ARM device, we notice that device_shutdown() only calls
> pm_runtime_force_suspend() while shutting down the GPU. This means the GPU
> kthread is still running and further, there maybe active submits.
> 
> This causes all kinds of issues during a kexec reboot:
> 
> Warning from shutdown path:
> 
> [  292.509662] WARNING: CPU: 0 PID: 6304 at [...] adreno_runtime_suspend+0x3c/0x44
> [  292.509863] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> [  292.509872] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [  292.509881] pc : adreno_runtime_suspend+0x3c/0x44
> [  292.509891] lr : pm_generic_runtime_suspend+0x30/0x44
> [  292.509905] sp : ffffffc014473bf0
> [...]
> [  292.510043] Call trace:
> [  292.510051]  adreno_runtime_suspend+0x3c/0x44
> [  292.510061]  pm_generic_runtime_suspend+0x30/0x44
> [  292.510071]  pm_runtime_force_suspend+0x54/0xc8
> [  292.510081]  adreno_shutdown+0x1c/0x28
> [  292.510090]  platform_shutdown+0x2c/0x38
> [  292.510104]  device_shutdown+0x158/0x210
> [  292.510119]  kernel_restart_prepare+0x40/0x4c
> 
> And here from GPU kthread, an SError OOPs:
> 
> [  192.648789]  el1h_64_error+0x7c/0x80
> [  192.648812]  el1_interrupt+0x20/0x58
> [  192.648833]  el1h_64_irq_handler+0x18/0x24
> [  192.648854]  el1h_64_irq+0x7c/0x80
> [  192.648873]  local_daif_inherit+0x10/0x18
> [  192.648900]  el1h_64_sync_handler+0x48/0xb4
> [  192.648921]  el1h_64_sync+0x7c/0x80
> [  192.648941]  a6xx_gmu_set_oob+0xbc/0x1fc
> [  192.648968]  a6xx_hw_init+0x44/0xe38
> [  192.648991]  msm_gpu_hw_init+0x48/0x80
> [  192.649013]  msm_gpu_submit+0x5c/0x1a8
> [  192.649034]  msm_job_run+0xb0/0x11c
> [  192.649058]  drm_sched_main+0x170/0x434
> [  192.649086]  kthread+0x134/0x300
> [  192.649114]  ret_from_fork+0x10/0x20
> 
> Fix by calling adreno_system_suspend() in the device_shutdown() path.
> 
> Cc: Rob Clark <robdclark@chromium.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ricardo Ribalda <ribalda@chromium.org>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
> drivers/gpu/drm/msm/adreno/adreno_device.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> index 24b489b6129a..f0cff62812c3 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> @@ -607,9 +607,12 @@ static int adreno_remove(struct platform_device *pdev)
>    return 0;
> }
> 
> +static int adreno_system_suspend(struct device *dev);
> static void adreno_shutdown(struct platform_device *pdev)
> {
> -    pm_runtime_force_suspend(&pdev->dev);
> +    struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> +

This local variable definition should go to patch 2/2. Will fix in v2.

Thanks,

 - Joel


> +    WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> }
> 
> static const struct of_device_id dt_match[] = {
> -- 
> 2.38.1.493.g58b659f92b-goog
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] adreno: Detect shutdown during get_param()
  2022-11-11 19:49   ` Joel Fernandes (Google)
@ 2022-11-11 21:28     ` Akhil P Oommen
  -1 siblings, 0 replies; 27+ messages in thread
From: Akhil P Oommen @ 2022-11-11 21:28 UTC (permalink / raw)
  To: Joel Fernandes (Google), linux-kernel
  Cc: Rob Clark, Steven Rostedt, Ricardo Ribalda, Ross Zwisler,
	Abhinav Kumar, Daniel Vetter, David Airlie, Dmitry Baryshkov,
	dri-devel, Emma Anholt, freedreno, linux-arm-msm, Rob Clark,
	Sean Paul, Vladimir Lypak

On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote:
> Even though the GPU is shut down, during kexec reboot we can have userspace
> still running. This is especially true if KEXEC_JUMP is not enabled, because we
> do not freeze userspace in this case.
>
> To prevent crashes, track that the GPU is shutdown and prevent get_param() from
> accessing GPU resources if we find it shutdown.
>
> This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU:
>
> [  292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt
> [  292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> [  292.534326] Call trace:
> [  292.534328]  dump_backtrace+0x0/0x1d4
> [  292.534337]  show_stack+0x20/0x2c
> [  292.534342]  dump_stack_lvl+0x60/0x78
> [  292.534347]  dump_stack+0x18/0x38
> [  292.534352]  panic+0x148/0x3b0
> [  292.534357]  nmi_panic+0x80/0x94
> [  292.534364]  arm64_serror_panic+0x70/0x7c
> [  292.534369]  do_serror+0x0/0x7c
> [  292.534372]  do_serror+0x54/0x7c
> [  292.534377]  el1h_64_error_handler+0x34/0x4c
> [  292.534381]  el1h_64_error+0x7c/0x80
> [  292.534386]  el1_interrupt+0x20/0x58
> [  292.534389]  el1h_64_irq_handler+0x18/0x24
> [  292.534395]  el1h_64_irq+0x7c/0x80
> [  292.534399]  local_daif_inherit+0x10/0x18
> [  292.534405]  el1h_64_sync_handler+0x48/0xb4
> [  292.534410]  el1h_64_sync+0x7c/0x80
> [  292.534414]  a6xx_gmu_set_oob+0xbc/0x1fc
> [  292.534422]  a6xx_get_timestamp+0x40/0xb4
> [  292.534426]  adreno_get_param+0x12c/0x1e0
> [  292.534433]  msm_ioctl_get_param+0x64/0x70
> [  292.534440]  drm_ioctl_kernel+0xe8/0x158
> [  292.534448]  drm_ioctl+0x208/0x320
> [  292.534453]  __arm64_sys_ioctl+0x98/0xd0
> [  292.534461]  invoke_syscall+0x4c/0x118
> [  292.534467]  el0_svc_common+0x98/0x104
> [  292.534473]  do_el0_svc+0x30/0x80
> [  292.534478]  el0_svc+0x20/0x50
> [  292.534481]  el0t_64_sync_handler+0x78/0x108
> [  292.534485]  el0t_64_sync+0x1a4/0x1a8
> [  292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000
> [  292.534635] PHYS_OFFSET: 0x80000000
> [  292.534638] CPU features: 0x40018541,a3300e42
> [  292.534644] Memory Limit: none
>
> Cc: Rob Clark <robdclark@chromium.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ricardo Ribalda <ribalda@chromium.org>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>   drivers/gpu/drm/msm/adreno/adreno_device.c | 1 +
>   drivers/gpu/drm/msm/adreno/adreno_gpu.c    | 2 +-
>   drivers/gpu/drm/msm/msm_gpu.h              | 3 +++
>   3 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> index f0cff62812c3..03d912dc0130 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev)
>   {
>   	struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
>   
> +	gpu->is_shutdown = true;
>   	WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
>   }
>   
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> index 382fb7f9e497..6903c6892469 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
>   	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
>   
>   	/* No pointer params yet */
> -	if (*len != 0)
> +	if (*len != 0 || gpu->is_shutdown)
>   		return -EINVAL;
This will race with shutdown. Probably, propagating back the return 
value of pm_runtime_get() in every possible ioctl call path is the right 
thing to do.

I have never thought about this scenario. Do you know why userspace is 
not freezed before kexec?

-Akhil.
>   
>   	switch (param) {
> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> index ff911e7305ce..f18b0a91442b 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.h
> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> @@ -214,6 +214,9 @@ struct msm_gpu {
>   	/* does gpu need hw_init? */
>   	bool needs_hw_init;
>   
> +	/* is the GPU shutdown? */
> +	bool is_shutdown;
> +
>   	/**
>   	 * global_faults: number of GPU hangs not attributed to a particular
>   	 * address space


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] adreno: Detect shutdown during get_param()
@ 2022-11-11 21:28     ` Akhil P Oommen
  0 siblings, 0 replies; 27+ messages in thread
From: Akhil P Oommen @ 2022-11-11 21:28 UTC (permalink / raw)
  To: Joel Fernandes (Google), linux-kernel
  Cc: Rob Clark, Emma Anholt, freedreno, linux-arm-msm,
	Ricardo Ribalda, Vladimir Lypak, Abhinav Kumar, Steven Rostedt,
	Sean Paul, dri-devel, Ross Zwisler, Dmitry Baryshkov

On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote:
> Even though the GPU is shut down, during kexec reboot we can have userspace
> still running. This is especially true if KEXEC_JUMP is not enabled, because we
> do not freeze userspace in this case.
>
> To prevent crashes, track that the GPU is shutdown and prevent get_param() from
> accessing GPU resources if we find it shutdown.
>
> This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU:
>
> [  292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt
> [  292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> [  292.534326] Call trace:
> [  292.534328]  dump_backtrace+0x0/0x1d4
> [  292.534337]  show_stack+0x20/0x2c
> [  292.534342]  dump_stack_lvl+0x60/0x78
> [  292.534347]  dump_stack+0x18/0x38
> [  292.534352]  panic+0x148/0x3b0
> [  292.534357]  nmi_panic+0x80/0x94
> [  292.534364]  arm64_serror_panic+0x70/0x7c
> [  292.534369]  do_serror+0x0/0x7c
> [  292.534372]  do_serror+0x54/0x7c
> [  292.534377]  el1h_64_error_handler+0x34/0x4c
> [  292.534381]  el1h_64_error+0x7c/0x80
> [  292.534386]  el1_interrupt+0x20/0x58
> [  292.534389]  el1h_64_irq_handler+0x18/0x24
> [  292.534395]  el1h_64_irq+0x7c/0x80
> [  292.534399]  local_daif_inherit+0x10/0x18
> [  292.534405]  el1h_64_sync_handler+0x48/0xb4
> [  292.534410]  el1h_64_sync+0x7c/0x80
> [  292.534414]  a6xx_gmu_set_oob+0xbc/0x1fc
> [  292.534422]  a6xx_get_timestamp+0x40/0xb4
> [  292.534426]  adreno_get_param+0x12c/0x1e0
> [  292.534433]  msm_ioctl_get_param+0x64/0x70
> [  292.534440]  drm_ioctl_kernel+0xe8/0x158
> [  292.534448]  drm_ioctl+0x208/0x320
> [  292.534453]  __arm64_sys_ioctl+0x98/0xd0
> [  292.534461]  invoke_syscall+0x4c/0x118
> [  292.534467]  el0_svc_common+0x98/0x104
> [  292.534473]  do_el0_svc+0x30/0x80
> [  292.534478]  el0_svc+0x20/0x50
> [  292.534481]  el0t_64_sync_handler+0x78/0x108
> [  292.534485]  el0t_64_sync+0x1a4/0x1a8
> [  292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000
> [  292.534635] PHYS_OFFSET: 0x80000000
> [  292.534638] CPU features: 0x40018541,a3300e42
> [  292.534644] Memory Limit: none
>
> Cc: Rob Clark <robdclark@chromium.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ricardo Ribalda <ribalda@chromium.org>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>   drivers/gpu/drm/msm/adreno/adreno_device.c | 1 +
>   drivers/gpu/drm/msm/adreno/adreno_gpu.c    | 2 +-
>   drivers/gpu/drm/msm/msm_gpu.h              | 3 +++
>   3 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> index f0cff62812c3..03d912dc0130 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev)
>   {
>   	struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
>   
> +	gpu->is_shutdown = true;
>   	WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
>   }
>   
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> index 382fb7f9e497..6903c6892469 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
>   	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
>   
>   	/* No pointer params yet */
> -	if (*len != 0)
> +	if (*len != 0 || gpu->is_shutdown)
>   		return -EINVAL;
This will race with shutdown. Probably, propagating back the return 
value of pm_runtime_get() in every possible ioctl call path is the right 
thing to do.

I have never thought about this scenario. Do you know why userspace is 
not freezed before kexec?

-Akhil.
>   
>   	switch (param) {
> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> index ff911e7305ce..f18b0a91442b 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.h
> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> @@ -214,6 +214,9 @@ struct msm_gpu {
>   	/* does gpu need hw_init? */
>   	bool needs_hw_init;
>   
> +	/* is the GPU shutdown? */
> +	bool is_shutdown;
> +
>   	/**
>   	 * global_faults: number of GPU hangs not attributed to a particular
>   	 * address space


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] adreno: Detect shutdown during get_param()
  2022-11-11 21:28     ` Akhil P Oommen
@ 2022-11-11 21:37       ` Joel Fernandes
  -1 siblings, 0 replies; 27+ messages in thread
From: Joel Fernandes @ 2022-11-11 21:37 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: linux-kernel, Rob Clark, Steven Rostedt, Ricardo Ribalda,
	Ross Zwisler, Abhinav Kumar, Daniel Vetter, David Airlie,
	Dmitry Baryshkov, dri-devel, Emma Anholt, freedreno,
	linux-arm-msm, Rob Clark, Sean Paul, Vladimir Lypak



> On Nov 11, 2022, at 4:28 PM, Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
> 
> On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote:
>> Even though the GPU is shut down, during kexec reboot we can have userspace
>> still running. This is especially true if KEXEC_JUMP is not enabled, because we
>> do not freeze userspace in this case.
>> 
>> To prevent crashes, track that the GPU is shutdown and prevent get_param() from
>> accessing GPU resources if we find it shutdown.
>> 
>> This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU:
>> 
>> [  292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt
>> [  292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
>> [  292.534326] Call trace:
>> [  292.534328]  dump_backtrace+0x0/0x1d4
>> [  292.534337]  show_stack+0x20/0x2c
>> [  292.534342]  dump_stack_lvl+0x60/0x78
>> [  292.534347]  dump_stack+0x18/0x38
>> [  292.534352]  panic+0x148/0x3b0
>> [  292.534357]  nmi_panic+0x80/0x94
>> [  292.534364]  arm64_serror_panic+0x70/0x7c
>> [  292.534369]  do_serror+0x0/0x7c
>> [  292.534372]  do_serror+0x54/0x7c
>> [  292.534377]  el1h_64_error_handler+0x34/0x4c
>> [  292.534381]  el1h_64_error+0x7c/0x80
>> [  292.534386]  el1_interrupt+0x20/0x58
>> [  292.534389]  el1h_64_irq_handler+0x18/0x24
>> [  292.534395]  el1h_64_irq+0x7c/0x80
>> [  292.534399]  local_daif_inherit+0x10/0x18
>> [  292.534405]  el1h_64_sync_handler+0x48/0xb4
>> [  292.534410]  el1h_64_sync+0x7c/0x80
>> [  292.534414]  a6xx_gmu_set_oob+0xbc/0x1fc
>> [  292.534422]  a6xx_get_timestamp+0x40/0xb4
>> [  292.534426]  adreno_get_param+0x12c/0x1e0
>> [  292.534433]  msm_ioctl_get_param+0x64/0x70
>> [  292.534440]  drm_ioctl_kernel+0xe8/0x158
>> [  292.534448]  drm_ioctl+0x208/0x320
>> [  292.534453]  __arm64_sys_ioctl+0x98/0xd0
>> [  292.534461]  invoke_syscall+0x4c/0x118
>> [  292.534467]  el0_svc_common+0x98/0x104
>> [  292.534473]  do_el0_svc+0x30/0x80
>> [  292.534478]  el0_svc+0x20/0x50
>> [  292.534481]  el0t_64_sync_handler+0x78/0x108
>> [  292.534485]  el0t_64_sync+0x1a4/0x1a8
>> [  292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000
>> [  292.534635] PHYS_OFFSET: 0x80000000
>> [  292.534638] CPU features: 0x40018541,a3300e42
>> [  292.534644] Memory Limit: none
>> 
>> Cc: Rob Clark <robdclark@chromium.org>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Ricardo Ribalda <ribalda@chromium.org>
>> Cc: Ross Zwisler <zwisler@kernel.org>
>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> ---
>>  drivers/gpu/drm/msm/adreno/adreno_device.c | 1 +
>>  drivers/gpu/drm/msm/adreno/adreno_gpu.c    | 2 +-
>>  drivers/gpu/drm/msm/msm_gpu.h              | 3 +++
>>  3 files changed, 5 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
>> index f0cff62812c3..03d912dc0130 100644
>> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
>> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
>> @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev)
>>  {
>>      struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
>>  +    gpu->is_shutdown = true;
>>      WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
>>  }
>>  diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>> index 382fb7f9e497..6903c6892469 100644
>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>> @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
>>      struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
>>        /* No pointer params yet */
>> -    if (*len != 0)
>> +    if (*len != 0 || gpu->is_shutdown)
>>          return -EINVAL;
> This will race with shutdown.

Could you clarify what you mean? At this point in the code, the shutdown is completed and it crashes here.

> Probably, propagating back the return value of pm_runtime_get() in every possible ioctl call path is the right thing to do.

Ok I’ll look into that. But the patch I posted works reliably and fixes all crashes we could reproduce.

> I have never thought about this scenario. Do you know why userspace is not freezed before kexec?

I am not sure. It depends on how kexec is used. The userspace freeze happens only when kexec is called to switch back and forth between different kernels (persistence mode). In such scenario I believe the userspace has to be frozen and unfrozen. However for normal kexec, that does not happen.

Thanks.


> 
> -Akhil.
>>        switch (param) {
>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
>> index ff911e7305ce..f18b0a91442b 100644
>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>> @@ -214,6 +214,9 @@ struct msm_gpu {
>>      /* does gpu need hw_init? */
>>      bool needs_hw_init;
>>  +    /* is the GPU shutdown? */
>> +    bool is_shutdown;
>> +
>>      /**
>>       * global_faults: number of GPU hangs not attributed to a particular
>>       * address space
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] adreno: Detect shutdown during get_param()
@ 2022-11-11 21:37       ` Joel Fernandes
  0 siblings, 0 replies; 27+ messages in thread
From: Joel Fernandes @ 2022-11-11 21:37 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: Rob Clark, Emma Anholt, freedreno, linux-arm-msm,
	Ricardo Ribalda, Vladimir Lypak, Abhinav Kumar, Steven Rostedt,
	linux-kernel, Sean Paul, dri-devel, Ross Zwisler,
	Dmitry Baryshkov



> On Nov 11, 2022, at 4:28 PM, Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
> 
> On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote:
>> Even though the GPU is shut down, during kexec reboot we can have userspace
>> still running. This is especially true if KEXEC_JUMP is not enabled, because we
>> do not freeze userspace in this case.
>> 
>> To prevent crashes, track that the GPU is shutdown and prevent get_param() from
>> accessing GPU resources if we find it shutdown.
>> 
>> This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU:
>> 
>> [  292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt
>> [  292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
>> [  292.534326] Call trace:
>> [  292.534328]  dump_backtrace+0x0/0x1d4
>> [  292.534337]  show_stack+0x20/0x2c
>> [  292.534342]  dump_stack_lvl+0x60/0x78
>> [  292.534347]  dump_stack+0x18/0x38
>> [  292.534352]  panic+0x148/0x3b0
>> [  292.534357]  nmi_panic+0x80/0x94
>> [  292.534364]  arm64_serror_panic+0x70/0x7c
>> [  292.534369]  do_serror+0x0/0x7c
>> [  292.534372]  do_serror+0x54/0x7c
>> [  292.534377]  el1h_64_error_handler+0x34/0x4c
>> [  292.534381]  el1h_64_error+0x7c/0x80
>> [  292.534386]  el1_interrupt+0x20/0x58
>> [  292.534389]  el1h_64_irq_handler+0x18/0x24
>> [  292.534395]  el1h_64_irq+0x7c/0x80
>> [  292.534399]  local_daif_inherit+0x10/0x18
>> [  292.534405]  el1h_64_sync_handler+0x48/0xb4
>> [  292.534410]  el1h_64_sync+0x7c/0x80
>> [  292.534414]  a6xx_gmu_set_oob+0xbc/0x1fc
>> [  292.534422]  a6xx_get_timestamp+0x40/0xb4
>> [  292.534426]  adreno_get_param+0x12c/0x1e0
>> [  292.534433]  msm_ioctl_get_param+0x64/0x70
>> [  292.534440]  drm_ioctl_kernel+0xe8/0x158
>> [  292.534448]  drm_ioctl+0x208/0x320
>> [  292.534453]  __arm64_sys_ioctl+0x98/0xd0
>> [  292.534461]  invoke_syscall+0x4c/0x118
>> [  292.534467]  el0_svc_common+0x98/0x104
>> [  292.534473]  do_el0_svc+0x30/0x80
>> [  292.534478]  el0_svc+0x20/0x50
>> [  292.534481]  el0t_64_sync_handler+0x78/0x108
>> [  292.534485]  el0t_64_sync+0x1a4/0x1a8
>> [  292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000
>> [  292.534635] PHYS_OFFSET: 0x80000000
>> [  292.534638] CPU features: 0x40018541,a3300e42
>> [  292.534644] Memory Limit: none
>> 
>> Cc: Rob Clark <robdclark@chromium.org>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Ricardo Ribalda <ribalda@chromium.org>
>> Cc: Ross Zwisler <zwisler@kernel.org>
>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> ---
>>  drivers/gpu/drm/msm/adreno/adreno_device.c | 1 +
>>  drivers/gpu/drm/msm/adreno/adreno_gpu.c    | 2 +-
>>  drivers/gpu/drm/msm/msm_gpu.h              | 3 +++
>>  3 files changed, 5 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
>> index f0cff62812c3..03d912dc0130 100644
>> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
>> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
>> @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev)
>>  {
>>      struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
>>  +    gpu->is_shutdown = true;
>>      WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
>>  }
>>  diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>> index 382fb7f9e497..6903c6892469 100644
>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>> @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
>>      struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
>>        /* No pointer params yet */
>> -    if (*len != 0)
>> +    if (*len != 0 || gpu->is_shutdown)
>>          return -EINVAL;
> This will race with shutdown.

Could you clarify what you mean? At this point in the code, the shutdown is completed and it crashes here.

> Probably, propagating back the return value of pm_runtime_get() in every possible ioctl call path is the right thing to do.

Ok I’ll look into that. But the patch I posted works reliably and fixes all crashes we could reproduce.

> I have never thought about this scenario. Do you know why userspace is not freezed before kexec?

I am not sure. It depends on how kexec is used. The userspace freeze happens only when kexec is called to switch back and forth between different kernels (persistence mode). In such scenario I believe the userspace has to be frozen and unfrozen. However for normal kexec, that does not happen.

Thanks.


> 
> -Akhil.
>>        switch (param) {
>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
>> index ff911e7305ce..f18b0a91442b 100644
>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>> @@ -214,6 +214,9 @@ struct msm_gpu {
>>      /* does gpu need hw_init? */
>>      bool needs_hw_init;
>>  +    /* is the GPU shutdown? */
>> +    bool is_shutdown;
>> +
>>      /**
>>       * global_faults: number of GPU hangs not attributed to a particular
>>       * address space
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] adreno: Detect shutdown during get_param()
  2022-11-11 21:37       ` Joel Fernandes
  (?)
@ 2022-11-11 22:19       ` Joel Fernandes
  -1 siblings, 0 replies; 27+ messages in thread
From: Joel Fernandes @ 2022-11-11 22:19 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: Rob Clark, freedreno, Emma Anholt, Sean Paul, linux-arm-msm,
	Ross Zwisler, Vladimir Lypak, Abhinav Kumar, Steven Rostedt,
	linux-kernel, dri-devel, Ricardo Ribalda, Dmitry Baryshkov

[-- Attachment #1: Type: text/plain, Size: 5876 bytes --]

On Fri, Nov 11, 2022 at 4:37 PM Joel Fernandes <joel@joelfernandes.org>
wrote:

>
>
> > On Nov 11, 2022, at 4:28 PM, Akhil P Oommen <quic_akhilpo@quicinc.com>
> wrote:
> >
> > On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote:
> >> Even though the GPU is shut down, during kexec reboot we can have
> userspace
> >> still running. This is especially true if KEXEC_JUMP is not enabled,
> because we
> >> do not freeze userspace in this case.
> >>
> >> To prevent crashes, track that the GPU is shutdown and prevent
> get_param() from
> >> accessing GPU resources if we find it shutdown.
> >>
> >> This fixes the following crash during kexec reboot on an ARM64 device
> with adreno GPU:
> >>
> >> [  292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt
> >> [  292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> >> [  292.534326] Call trace:
> >> [  292.534328]  dump_backtrace+0x0/0x1d4
> >> [  292.534337]  show_stack+0x20/0x2c
> >> [  292.534342]  dump_stack_lvl+0x60/0x78
> >> [  292.534347]  dump_stack+0x18/0x38
> >> [  292.534352]  panic+0x148/0x3b0
> >> [  292.534357]  nmi_panic+0x80/0x94
> >> [  292.534364]  arm64_serror_panic+0x70/0x7c
> >> [  292.534369]  do_serror+0x0/0x7c
> >> [  292.534372]  do_serror+0x54/0x7c
> >> [  292.534377]  el1h_64_error_handler+0x34/0x4c
> >> [  292.534381]  el1h_64_error+0x7c/0x80
> >> [  292.534386]  el1_interrupt+0x20/0x58
> >> [  292.534389]  el1h_64_irq_handler+0x18/0x24
> >> [  292.534395]  el1h_64_irq+0x7c/0x80
> >> [  292.534399]  local_daif_inherit+0x10/0x18
> >> [  292.534405]  el1h_64_sync_handler+0x48/0xb4
> >> [  292.534410]  el1h_64_sync+0x7c/0x80
> >> [  292.534414]  a6xx_gmu_set_oob+0xbc/0x1fc
> >> [  292.534422]  a6xx_get_timestamp+0x40/0xb4
> >> [  292.534426]  adreno_get_param+0x12c/0x1e0
> >> [  292.534433]  msm_ioctl_get_param+0x64/0x70
> >> [  292.534440]  drm_ioctl_kernel+0xe8/0x158
> >> [  292.534448]  drm_ioctl+0x208/0x320
> >> [  292.534453]  __arm64_sys_ioctl+0x98/0xd0
> >> [  292.534461]  invoke_syscall+0x4c/0x118
> >> [  292.534467]  el0_svc_common+0x98/0x104
> >> [  292.534473]  do_el0_svc+0x30/0x80
> >> [  292.534478]  el0_svc+0x20/0x50
> >> [  292.534481]  el0t_64_sync_handler+0x78/0x108
> >> [  292.534485]  el0t_64_sync+0x1a4/0x1a8
> >> [  292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000
> >> [  292.534635] PHYS_OFFSET: 0x80000000
> >> [  292.534638] CPU features: 0x40018541,a3300e42
> >> [  292.534644] Memory Limit: none
> >>
> >> Cc: Rob Clark <robdclark@chromium.org>
> >> Cc: Steven Rostedt <rostedt@goodmis.org>
> >> Cc: Ricardo Ribalda <ribalda@chromium.org>
> >> Cc: Ross Zwisler <zwisler@kernel.org>
> >> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >> ---
> >>  drivers/gpu/drm/msm/adreno/adreno_device.c | 1 +
> >>  drivers/gpu/drm/msm/adreno/adreno_gpu.c    | 2 +-
> >>  drivers/gpu/drm/msm/msm_gpu.h              | 3 +++
> >>  3 files changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c
> b/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> index f0cff62812c3..03d912dc0130 100644
> >> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device
> *pdev)
> >>  {
> >>      struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> >>  +    gpu->is_shutdown = true;
> >>      WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> >>  }
> >>  diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >> index 382fb7f9e497..6903c6892469 100644
> >> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >> @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct
> msm_file_private *ctx,
> >>      struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> >>        /* No pointer params yet */
> >> -    if (*len != 0)
> >> +    if (*len != 0 || gpu->is_shutdown)
> >>          return -EINVAL;
> > This will race with shutdown.
>
> Could you clarify what you mean? At this point in the code, the shutdown
> is completed and it crashes here.
>


Ok so I think you meant that if the shut down happens after we sample the
is_shutdown, then we run into the same issue.

I can’t reproduce that but I’ll look into that. Another way might be to
synchronize using a mutex. Though maybe the shutdown path can wait for
active pm_runtime references?

Thanks.




> > Probably, propagating back the return value of pm_runtime_get() in every
> possible ioctl call path is the right thing to do.
>
> Ok I’ll look into that. But the patch I posted works reliably and fixes
> all crashes we could reproduce.
>
> > I have never thought about this scenario. Do you know why userspace is
> not freezed before kexec?
>
> I am not sure. It depends on how kexec is used. The userspace freeze
> happens only when kexec is called to switch back and forth between
> different kernels (persistence mode). In such scenario I believe the
> userspace has to be frozen and unfrozen. However for normal kexec, that
> does not happen.
>
> Thanks.
>
>
> >
> > -Akhil.
> >>        switch (param) {
> >> diff --git a/drivers/gpu/drm/msm/msm_gpu.h
> b/drivers/gpu/drm/msm/msm_gpu.h
> >> index ff911e7305ce..f18b0a91442b 100644
> >> --- a/drivers/gpu/drm/msm/msm_gpu.h
> >> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> >> @@ -214,6 +214,9 @@ struct msm_gpu {
> >>      /* does gpu need hw_init? */
> >>      bool needs_hw_init;
> >>  +    /* is the GPU shutdown? */
> >> +    bool is_shutdown;
> >> +
> >>      /**
> >>       * global_faults: number of GPU hangs not attributed to a
> particular
> >>       * address space
> >
>

[-- Attachment #2: Type: text/html, Size: 7791 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] adreno: Detect shutdown during get_param()
  2022-11-11 21:28     ` Akhil P Oommen
@ 2022-11-12 18:35       ` Rob Clark
  -1 siblings, 0 replies; 27+ messages in thread
From: Rob Clark @ 2022-11-12 18:35 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: Joel Fernandes (Google),
	linux-kernel, Rob Clark, Steven Rostedt, Ricardo Ribalda,
	Ross Zwisler, Abhinav Kumar, Daniel Vetter, David Airlie,
	Dmitry Baryshkov, dri-devel, Emma Anholt, freedreno,
	linux-arm-msm, Sean Paul, Vladimir Lypak

On Fri, Nov 11, 2022 at 1:28 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote:
> > Even though the GPU is shut down, during kexec reboot we can have userspace
> > still running. This is especially true if KEXEC_JUMP is not enabled, because we
> > do not freeze userspace in this case.
> >
> > To prevent crashes, track that the GPU is shutdown and prevent get_param() from
> > accessing GPU resources if we find it shutdown.
> >
> > This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU:
> >
> > [  292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt
> > [  292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > [  292.534326] Call trace:
> > [  292.534328]  dump_backtrace+0x0/0x1d4
> > [  292.534337]  show_stack+0x20/0x2c
> > [  292.534342]  dump_stack_lvl+0x60/0x78
> > [  292.534347]  dump_stack+0x18/0x38
> > [  292.534352]  panic+0x148/0x3b0
> > [  292.534357]  nmi_panic+0x80/0x94
> > [  292.534364]  arm64_serror_panic+0x70/0x7c
> > [  292.534369]  do_serror+0x0/0x7c
> > [  292.534372]  do_serror+0x54/0x7c
> > [  292.534377]  el1h_64_error_handler+0x34/0x4c
> > [  292.534381]  el1h_64_error+0x7c/0x80
> > [  292.534386]  el1_interrupt+0x20/0x58
> > [  292.534389]  el1h_64_irq_handler+0x18/0x24
> > [  292.534395]  el1h_64_irq+0x7c/0x80
> > [  292.534399]  local_daif_inherit+0x10/0x18
> > [  292.534405]  el1h_64_sync_handler+0x48/0xb4
> > [  292.534410]  el1h_64_sync+0x7c/0x80
> > [  292.534414]  a6xx_gmu_set_oob+0xbc/0x1fc
> > [  292.534422]  a6xx_get_timestamp+0x40/0xb4
> > [  292.534426]  adreno_get_param+0x12c/0x1e0
> > [  292.534433]  msm_ioctl_get_param+0x64/0x70
> > [  292.534440]  drm_ioctl_kernel+0xe8/0x158
> > [  292.534448]  drm_ioctl+0x208/0x320
> > [  292.534453]  __arm64_sys_ioctl+0x98/0xd0
> > [  292.534461]  invoke_syscall+0x4c/0x118
> > [  292.534467]  el0_svc_common+0x98/0x104
> > [  292.534473]  do_el0_svc+0x30/0x80
> > [  292.534478]  el0_svc+0x20/0x50
> > [  292.534481]  el0t_64_sync_handler+0x78/0x108
> > [  292.534485]  el0t_64_sync+0x1a4/0x1a8
> > [  292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000
> > [  292.534635] PHYS_OFFSET: 0x80000000
> > [  292.534638] CPU features: 0x40018541,a3300e42
> > [  292.534644] Memory Limit: none
> >
> > Cc: Rob Clark <robdclark@chromium.org>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > Cc: Ross Zwisler <zwisler@kernel.org>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >   drivers/gpu/drm/msm/adreno/adreno_device.c | 1 +
> >   drivers/gpu/drm/msm/adreno/adreno_gpu.c    | 2 +-
> >   drivers/gpu/drm/msm/msm_gpu.h              | 3 +++
> >   3 files changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > index f0cff62812c3..03d912dc0130 100644
> > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev)
> >   {
> >       struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> >
> > +     gpu->is_shutdown = true;
> >       WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> >   }
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > index 382fb7f9e497..6903c6892469 100644
> > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
> >       struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> >
> >       /* No pointer params yet */
> > -     if (*len != 0)
> > +     if (*len != 0 || gpu->is_shutdown)
> >               return -EINVAL;
> This will race with shutdown. Probably, propagating back the return
> value of pm_runtime_get() in every possible ioctl call path is the right
> thing to do.
>
> I have never thought about this scenario. Do you know why userspace is
> not freezed before kexec?

So userspace not being frozen seems like the root issue, and is likely
to cause all sorts of other whack-a-mole problems.  I guess I'd like
to know if this is the expected behavior?

If so, we should probably look at
drm_dev_is_unplugged()/drm_dev_unplug()/etc rather than trying to NIH
that mechanism.  We would need to sprinkle drm_dev_is_unplugged()
checks more widely, and also ensure that the scheduler kthread(s) gets
parked.  But it would be nice to know first if we are just trying to
paper over a kexec bug.

BR,
-R

>
> -Akhil.
> >
> >       switch (param) {
> > diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> > index ff911e7305ce..f18b0a91442b 100644
> > --- a/drivers/gpu/drm/msm/msm_gpu.h
> > +++ b/drivers/gpu/drm/msm/msm_gpu.h
> > @@ -214,6 +214,9 @@ struct msm_gpu {
> >       /* does gpu need hw_init? */
> >       bool needs_hw_init;
> >
> > +     /* is the GPU shutdown? */
> > +     bool is_shutdown;
> > +
> >       /**
> >        * global_faults: number of GPU hangs not attributed to a particular
> >        * address space
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] adreno: Detect shutdown during get_param()
@ 2022-11-12 18:35       ` Rob Clark
  0 siblings, 0 replies; 27+ messages in thread
From: Rob Clark @ 2022-11-12 18:35 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: Rob Clark, Emma Anholt, freedreno, linux-arm-msm,
	Ricardo Ribalda, Vladimir Lypak, linux-kernel, Steven Rostedt,
	Abhinav Kumar, Dmitry Baryshkov, dri-devel, Ross Zwisler,
	Joel Fernandes (Google),
	Sean Paul

On Fri, Nov 11, 2022 at 1:28 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote:
> > Even though the GPU is shut down, during kexec reboot we can have userspace
> > still running. This is especially true if KEXEC_JUMP is not enabled, because we
> > do not freeze userspace in this case.
> >
> > To prevent crashes, track that the GPU is shutdown and prevent get_param() from
> > accessing GPU resources if we find it shutdown.
> >
> > This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU:
> >
> > [  292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt
> > [  292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > [  292.534326] Call trace:
> > [  292.534328]  dump_backtrace+0x0/0x1d4
> > [  292.534337]  show_stack+0x20/0x2c
> > [  292.534342]  dump_stack_lvl+0x60/0x78
> > [  292.534347]  dump_stack+0x18/0x38
> > [  292.534352]  panic+0x148/0x3b0
> > [  292.534357]  nmi_panic+0x80/0x94
> > [  292.534364]  arm64_serror_panic+0x70/0x7c
> > [  292.534369]  do_serror+0x0/0x7c
> > [  292.534372]  do_serror+0x54/0x7c
> > [  292.534377]  el1h_64_error_handler+0x34/0x4c
> > [  292.534381]  el1h_64_error+0x7c/0x80
> > [  292.534386]  el1_interrupt+0x20/0x58
> > [  292.534389]  el1h_64_irq_handler+0x18/0x24
> > [  292.534395]  el1h_64_irq+0x7c/0x80
> > [  292.534399]  local_daif_inherit+0x10/0x18
> > [  292.534405]  el1h_64_sync_handler+0x48/0xb4
> > [  292.534410]  el1h_64_sync+0x7c/0x80
> > [  292.534414]  a6xx_gmu_set_oob+0xbc/0x1fc
> > [  292.534422]  a6xx_get_timestamp+0x40/0xb4
> > [  292.534426]  adreno_get_param+0x12c/0x1e0
> > [  292.534433]  msm_ioctl_get_param+0x64/0x70
> > [  292.534440]  drm_ioctl_kernel+0xe8/0x158
> > [  292.534448]  drm_ioctl+0x208/0x320
> > [  292.534453]  __arm64_sys_ioctl+0x98/0xd0
> > [  292.534461]  invoke_syscall+0x4c/0x118
> > [  292.534467]  el0_svc_common+0x98/0x104
> > [  292.534473]  do_el0_svc+0x30/0x80
> > [  292.534478]  el0_svc+0x20/0x50
> > [  292.534481]  el0t_64_sync_handler+0x78/0x108
> > [  292.534485]  el0t_64_sync+0x1a4/0x1a8
> > [  292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000
> > [  292.534635] PHYS_OFFSET: 0x80000000
> > [  292.534638] CPU features: 0x40018541,a3300e42
> > [  292.534644] Memory Limit: none
> >
> > Cc: Rob Clark <robdclark@chromium.org>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > Cc: Ross Zwisler <zwisler@kernel.org>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >   drivers/gpu/drm/msm/adreno/adreno_device.c | 1 +
> >   drivers/gpu/drm/msm/adreno/adreno_gpu.c    | 2 +-
> >   drivers/gpu/drm/msm/msm_gpu.h              | 3 +++
> >   3 files changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > index f0cff62812c3..03d912dc0130 100644
> > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev)
> >   {
> >       struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> >
> > +     gpu->is_shutdown = true;
> >       WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> >   }
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > index 382fb7f9e497..6903c6892469 100644
> > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
> >       struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> >
> >       /* No pointer params yet */
> > -     if (*len != 0)
> > +     if (*len != 0 || gpu->is_shutdown)
> >               return -EINVAL;
> This will race with shutdown. Probably, propagating back the return
> value of pm_runtime_get() in every possible ioctl call path is the right
> thing to do.
>
> I have never thought about this scenario. Do you know why userspace is
> not freezed before kexec?

So userspace not being frozen seems like the root issue, and is likely
to cause all sorts of other whack-a-mole problems.  I guess I'd like
to know if this is the expected behavior?

If so, we should probably look at
drm_dev_is_unplugged()/drm_dev_unplug()/etc rather than trying to NIH
that mechanism.  We would need to sprinkle drm_dev_is_unplugged()
checks more widely, and also ensure that the scheduler kthread(s) gets
parked.  But it would be nice to know first if we are just trying to
paper over a kexec bug.

BR,
-R

>
> -Akhil.
> >
> >       switch (param) {
> > diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> > index ff911e7305ce..f18b0a91442b 100644
> > --- a/drivers/gpu/drm/msm/msm_gpu.h
> > +++ b/drivers/gpu/drm/msm/msm_gpu.h
> > @@ -214,6 +214,9 @@ struct msm_gpu {
> >       /* does gpu need hw_init? */
> >       bool needs_hw_init;
> >
> > +     /* is the GPU shutdown? */
> > +     bool is_shutdown;
> > +
> >       /**
> >        * global_faults: number of GPU hangs not attributed to a particular
> >        * address space
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] adreno: Shutdown the GPU properly
  2022-11-11 21:08   ` Joel Fernandes
@ 2022-11-12 18:44     ` Rob Clark
  -1 siblings, 0 replies; 27+ messages in thread
From: Rob Clark @ 2022-11-12 18:44 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Rob Clark, Steven Rostedt, Ricardo Ribalda,
	Ross Zwisler, Abhinav Kumar, Akhil P Oommen, Daniel Vetter,
	David Airlie, Dmitry Baryshkov, dri-devel, Emma Anholt,
	freedreno, linux-arm-msm, Sean Paul, Vladimir Lypak

On Fri, Nov 11, 2022 at 1:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
>
>
> > On Nov 11, 2022, at 2:50 PM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> >
> > During kexec on ARM device, we notice that device_shutdown() only calls
> > pm_runtime_force_suspend() while shutting down the GPU. This means the GPU
> > kthread is still running and further, there maybe active submits.
> >
> > This causes all kinds of issues during a kexec reboot:
> >
> > Warning from shutdown path:
> >
> > [  292.509662] WARNING: CPU: 0 PID: 6304 at [...] adreno_runtime_suspend+0x3c/0x44
> > [  292.509863] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > [  292.509872] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > [  292.509881] pc : adreno_runtime_suspend+0x3c/0x44
> > [  292.509891] lr : pm_generic_runtime_suspend+0x30/0x44
> > [  292.509905] sp : ffffffc014473bf0
> > [...]
> > [  292.510043] Call trace:
> > [  292.510051]  adreno_runtime_suspend+0x3c/0x44
> > [  292.510061]  pm_generic_runtime_suspend+0x30/0x44
> > [  292.510071]  pm_runtime_force_suspend+0x54/0xc8
> > [  292.510081]  adreno_shutdown+0x1c/0x28
> > [  292.510090]  platform_shutdown+0x2c/0x38
> > [  292.510104]  device_shutdown+0x158/0x210
> > [  292.510119]  kernel_restart_prepare+0x40/0x4c
> >
> > And here from GPU kthread, an SError OOPs:
> >
> > [  192.648789]  el1h_64_error+0x7c/0x80
> > [  192.648812]  el1_interrupt+0x20/0x58
> > [  192.648833]  el1h_64_irq_handler+0x18/0x24
> > [  192.648854]  el1h_64_irq+0x7c/0x80
> > [  192.648873]  local_daif_inherit+0x10/0x18
> > [  192.648900]  el1h_64_sync_handler+0x48/0xb4
> > [  192.648921]  el1h_64_sync+0x7c/0x80
> > [  192.648941]  a6xx_gmu_set_oob+0xbc/0x1fc
> > [  192.648968]  a6xx_hw_init+0x44/0xe38
> > [  192.648991]  msm_gpu_hw_init+0x48/0x80
> > [  192.649013]  msm_gpu_submit+0x5c/0x1a8
> > [  192.649034]  msm_job_run+0xb0/0x11c
> > [  192.649058]  drm_sched_main+0x170/0x434
> > [  192.649086]  kthread+0x134/0x300
> > [  192.649114]  ret_from_fork+0x10/0x20
> >
> > Fix by calling adreno_system_suspend() in the device_shutdown() path.
> >
> > Cc: Rob Clark <robdclark@chromium.org>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > Cc: Ross Zwisler <zwisler@kernel.org>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> > drivers/gpu/drm/msm/adreno/adreno_device.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > index 24b489b6129a..f0cff62812c3 100644
> > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > @@ -607,9 +607,12 @@ static int adreno_remove(struct platform_device *pdev)
> >    return 0;
> > }
> >
> > +static int adreno_system_suspend(struct device *dev);
> > static void adreno_shutdown(struct platform_device *pdev)
> > {
> > -    pm_runtime_force_suspend(&pdev->dev);
> > +    struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> > +
>
> This local variable definition should go to patch 2/2. Will fix in v2.
>
> Thanks,
>
>  - Joel
>
>
> > +    WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));

I think maybe adreno_unbind() needs the same treatment?  Any path
where we yank out the power cord without ensuring the scheduler is
parked means we'd be racing with jobs in the scheduler queue.  Ie.
userspace could queue a job before it is frozen, but the drm/scheduler
kthread hasn't yet called the msm_job_run() callback (which does
various touching of the now powered off hw).  So I think we need to
ensure that the scheduler is parked in all paths that call
pm_runtime_force_suspend() (as that bypasses the runpm reference that
would otherwise unsure the hw is powered before msm_job_run pokes at
registers)

BR,
-R

> > }
> >
> > static const struct of_device_id dt_match[] = {
> > --
> > 2.38.1.493.g58b659f92b-goog
> >

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] adreno: Shutdown the GPU properly
@ 2022-11-12 18:44     ` Rob Clark
  0 siblings, 0 replies; 27+ messages in thread
From: Rob Clark @ 2022-11-12 18:44 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Rob Clark, Emma Anholt, Akhil P Oommen, freedreno, linux-arm-msm,
	Ricardo Ribalda, Vladimir Lypak, Abhinav Kumar, Steven Rostedt,
	linux-kernel, Sean Paul, dri-devel, Ross Zwisler,
	Dmitry Baryshkov

On Fri, Nov 11, 2022 at 1:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
>
>
> > On Nov 11, 2022, at 2:50 PM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> >
> > During kexec on ARM device, we notice that device_shutdown() only calls
> > pm_runtime_force_suspend() while shutting down the GPU. This means the GPU
> > kthread is still running and further, there maybe active submits.
> >
> > This causes all kinds of issues during a kexec reboot:
> >
> > Warning from shutdown path:
> >
> > [  292.509662] WARNING: CPU: 0 PID: 6304 at [...] adreno_runtime_suspend+0x3c/0x44
> > [  292.509863] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > [  292.509872] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > [  292.509881] pc : adreno_runtime_suspend+0x3c/0x44
> > [  292.509891] lr : pm_generic_runtime_suspend+0x30/0x44
> > [  292.509905] sp : ffffffc014473bf0
> > [...]
> > [  292.510043] Call trace:
> > [  292.510051]  adreno_runtime_suspend+0x3c/0x44
> > [  292.510061]  pm_generic_runtime_suspend+0x30/0x44
> > [  292.510071]  pm_runtime_force_suspend+0x54/0xc8
> > [  292.510081]  adreno_shutdown+0x1c/0x28
> > [  292.510090]  platform_shutdown+0x2c/0x38
> > [  292.510104]  device_shutdown+0x158/0x210
> > [  292.510119]  kernel_restart_prepare+0x40/0x4c
> >
> > And here from GPU kthread, an SError OOPs:
> >
> > [  192.648789]  el1h_64_error+0x7c/0x80
> > [  192.648812]  el1_interrupt+0x20/0x58
> > [  192.648833]  el1h_64_irq_handler+0x18/0x24
> > [  192.648854]  el1h_64_irq+0x7c/0x80
> > [  192.648873]  local_daif_inherit+0x10/0x18
> > [  192.648900]  el1h_64_sync_handler+0x48/0xb4
> > [  192.648921]  el1h_64_sync+0x7c/0x80
> > [  192.648941]  a6xx_gmu_set_oob+0xbc/0x1fc
> > [  192.648968]  a6xx_hw_init+0x44/0xe38
> > [  192.648991]  msm_gpu_hw_init+0x48/0x80
> > [  192.649013]  msm_gpu_submit+0x5c/0x1a8
> > [  192.649034]  msm_job_run+0xb0/0x11c
> > [  192.649058]  drm_sched_main+0x170/0x434
> > [  192.649086]  kthread+0x134/0x300
> > [  192.649114]  ret_from_fork+0x10/0x20
> >
> > Fix by calling adreno_system_suspend() in the device_shutdown() path.
> >
> > Cc: Rob Clark <robdclark@chromium.org>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > Cc: Ross Zwisler <zwisler@kernel.org>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> > drivers/gpu/drm/msm/adreno/adreno_device.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > index 24b489b6129a..f0cff62812c3 100644
> > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > @@ -607,9 +607,12 @@ static int adreno_remove(struct platform_device *pdev)
> >    return 0;
> > }
> >
> > +static int adreno_system_suspend(struct device *dev);
> > static void adreno_shutdown(struct platform_device *pdev)
> > {
> > -    pm_runtime_force_suspend(&pdev->dev);
> > +    struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> > +
>
> This local variable definition should go to patch 2/2. Will fix in v2.
>
> Thanks,
>
>  - Joel
>
>
> > +    WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));

I think maybe adreno_unbind() needs the same treatment?  Any path
where we yank out the power cord without ensuring the scheduler is
parked means we'd be racing with jobs in the scheduler queue.  Ie.
userspace could queue a job before it is frozen, but the drm/scheduler
kthread hasn't yet called the msm_job_run() callback (which does
various touching of the now powered off hw).  So I think we need to
ensure that the scheduler is parked in all paths that call
pm_runtime_force_suspend() (as that bypasses the runpm reference that
would otherwise unsure the hw is powered before msm_job_run pokes at
registers)

BR,
-R

> > }
> >
> > static const struct of_device_id dt_match[] = {
> > --
> > 2.38.1.493.g58b659f92b-goog
> >

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] adreno: Detect shutdown during get_param()
  2022-11-12 18:35       ` Rob Clark
@ 2022-12-01 18:42         ` Joel Fernandes
  -1 siblings, 0 replies; 27+ messages in thread
From: Joel Fernandes @ 2022-12-01 18:42 UTC (permalink / raw)
  To: Rob Clark
  Cc: Akhil P Oommen, linux-kernel, Rob Clark, Steven Rostedt,
	Ricardo Ribalda, Ross Zwisler, Abhinav Kumar, Daniel Vetter,
	David Airlie, Dmitry Baryshkov, dri-devel, Emma Anholt,
	freedreno, linux-arm-msm, Sean Paul, Vladimir Lypak

On Sat, Nov 12, 2022 at 6:35 PM Rob Clark <robdclark@gmail.com> wrote:
>
> On Fri, Nov 11, 2022 at 1:28 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
> >
> > On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote:
> > > Even though the GPU is shut down, during kexec reboot we can have userspace
> > > still running. This is especially true if KEXEC_JUMP is not enabled, because we
> > > do not freeze userspace in this case.
> > >
> > > To prevent crashes, track that the GPU is shutdown and prevent get_param() from
> > > accessing GPU resources if we find it shutdown.
> > >
> > > This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU:
> > >
> > > [  292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt
> > > [  292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > > [  292.534326] Call trace:
> > > [  292.534328]  dump_backtrace+0x0/0x1d4
> > > [  292.534337]  show_stack+0x20/0x2c
> > > [  292.534342]  dump_stack_lvl+0x60/0x78
> > > [  292.534347]  dump_stack+0x18/0x38
> > > [  292.534352]  panic+0x148/0x3b0
> > > [  292.534357]  nmi_panic+0x80/0x94
> > > [  292.534364]  arm64_serror_panic+0x70/0x7c
> > > [  292.534369]  do_serror+0x0/0x7c
> > > [  292.534372]  do_serror+0x54/0x7c
> > > [  292.534377]  el1h_64_error_handler+0x34/0x4c
> > > [  292.534381]  el1h_64_error+0x7c/0x80
> > > [  292.534386]  el1_interrupt+0x20/0x58
> > > [  292.534389]  el1h_64_irq_handler+0x18/0x24
> > > [  292.534395]  el1h_64_irq+0x7c/0x80
> > > [  292.534399]  local_daif_inherit+0x10/0x18
> > > [  292.534405]  el1h_64_sync_handler+0x48/0xb4
> > > [  292.534410]  el1h_64_sync+0x7c/0x80
> > > [  292.534414]  a6xx_gmu_set_oob+0xbc/0x1fc
> > > [  292.534422]  a6xx_get_timestamp+0x40/0xb4
> > > [  292.534426]  adreno_get_param+0x12c/0x1e0
> > > [  292.534433]  msm_ioctl_get_param+0x64/0x70
> > > [  292.534440]  drm_ioctl_kernel+0xe8/0x158
> > > [  292.534448]  drm_ioctl+0x208/0x320
> > > [  292.534453]  __arm64_sys_ioctl+0x98/0xd0
> > > [  292.534461]  invoke_syscall+0x4c/0x118
> > > [  292.534467]  el0_svc_common+0x98/0x104
> > > [  292.534473]  do_el0_svc+0x30/0x80
> > > [  292.534478]  el0_svc+0x20/0x50
> > > [  292.534481]  el0t_64_sync_handler+0x78/0x108
> > > [  292.534485]  el0t_64_sync+0x1a4/0x1a8
> > > [  292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000
> > > [  292.534635] PHYS_OFFSET: 0x80000000
> > > [  292.534638] CPU features: 0x40018541,a3300e42
> > > [  292.534644] Memory Limit: none
> > >
> > > Cc: Rob Clark <robdclark@chromium.org>
> > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > > Cc: Ross Zwisler <zwisler@kernel.org>
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > ---
> > >   drivers/gpu/drm/msm/adreno/adreno_device.c | 1 +
> > >   drivers/gpu/drm/msm/adreno/adreno_gpu.c    | 2 +-
> > >   drivers/gpu/drm/msm/msm_gpu.h              | 3 +++
> > >   3 files changed, 5 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > index f0cff62812c3..03d912dc0130 100644
> > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev)
> > >   {
> > >       struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> > >
> > > +     gpu->is_shutdown = true;
> > >       WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> > >   }
> > >
> > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > index 382fb7f9e497..6903c6892469 100644
> > > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
> > >       struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> > >
> > >       /* No pointer params yet */
> > > -     if (*len != 0)
> > > +     if (*len != 0 || gpu->is_shutdown)
> > >               return -EINVAL;
> > This will race with shutdown. Probably, propagating back the return
> > value of pm_runtime_get() in every possible ioctl call path is the right
> > thing to do.
> >
> > I have never thought about this scenario. Do you know why userspace is
> > not freezed before kexec?
>
> So userspace not being frozen seems like the root issue, and is likely
> to cause all sorts of other whack-a-mole problems.  I guess I'd like
> to know if this is the expected behavior?

We tried that. Freezing before kexec seems to cause issues for ALSA as
Ricardo found:
https://lore.kernel.org/lkml/202211281209.mnBLzQ2I-lkp@intel.com/T/

That patch is still TBD due to disagreement on the right approach to
fix, so I don't think freezing before shutting down devices is viable
at the moment.

I am checking Ricardo if we can do something like util-linux's
shutdown code which sends SIGTERM to all processes:
https://kernel.googlesource.com/pub/scm/utils/util-linux/util-linux/+/v2.8/login-utils/shutdown.c#273
, before issuing the kexec-reboot.

Maybe, a more graceful shutdown from kexec-lite, will prevent the
kexec-reboot it does from crashing? Though in my view that would still
be a small copout instead of fixing the real issue, which is the
kernel crashing for any reason.

> If so, we should probably look at
> drm_dev_is_unplugged()/drm_dev_unplug()/etc rather than trying to NIH
> that mechanism.  We would need to sprinkle drm_dev_is_unplugged()
> checks more widely, and also ensure that the scheduler kthread(s) gets
> parked.  But it would be nice to know first if we are just trying to
> paper over a kexec bug.

Agreed. I think we still patch 1/2 whether the SIGTERM trick mentioned
above, works or not. I will await discussions with Ricardo before
reposting that one.

Cheers,

 -- Joel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] adreno: Detect shutdown during get_param()
@ 2022-12-01 18:42         ` Joel Fernandes
  0 siblings, 0 replies; 27+ messages in thread
From: Joel Fernandes @ 2022-12-01 18:42 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Emma Anholt, Akhil P Oommen, freedreno, linux-arm-msm,
	Ricardo Ribalda, Vladimir Lypak, linux-kernel, Steven Rostedt,
	Abhinav Kumar, Sean Paul, dri-devel, Ross Zwisler,
	Dmitry Baryshkov

On Sat, Nov 12, 2022 at 6:35 PM Rob Clark <robdclark@gmail.com> wrote:
>
> On Fri, Nov 11, 2022 at 1:28 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
> >
> > On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote:
> > > Even though the GPU is shut down, during kexec reboot we can have userspace
> > > still running. This is especially true if KEXEC_JUMP is not enabled, because we
> > > do not freeze userspace in this case.
> > >
> > > To prevent crashes, track that the GPU is shutdown and prevent get_param() from
> > > accessing GPU resources if we find it shutdown.
> > >
> > > This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU:
> > >
> > > [  292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt
> > > [  292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > > [  292.534326] Call trace:
> > > [  292.534328]  dump_backtrace+0x0/0x1d4
> > > [  292.534337]  show_stack+0x20/0x2c
> > > [  292.534342]  dump_stack_lvl+0x60/0x78
> > > [  292.534347]  dump_stack+0x18/0x38
> > > [  292.534352]  panic+0x148/0x3b0
> > > [  292.534357]  nmi_panic+0x80/0x94
> > > [  292.534364]  arm64_serror_panic+0x70/0x7c
> > > [  292.534369]  do_serror+0x0/0x7c
> > > [  292.534372]  do_serror+0x54/0x7c
> > > [  292.534377]  el1h_64_error_handler+0x34/0x4c
> > > [  292.534381]  el1h_64_error+0x7c/0x80
> > > [  292.534386]  el1_interrupt+0x20/0x58
> > > [  292.534389]  el1h_64_irq_handler+0x18/0x24
> > > [  292.534395]  el1h_64_irq+0x7c/0x80
> > > [  292.534399]  local_daif_inherit+0x10/0x18
> > > [  292.534405]  el1h_64_sync_handler+0x48/0xb4
> > > [  292.534410]  el1h_64_sync+0x7c/0x80
> > > [  292.534414]  a6xx_gmu_set_oob+0xbc/0x1fc
> > > [  292.534422]  a6xx_get_timestamp+0x40/0xb4
> > > [  292.534426]  adreno_get_param+0x12c/0x1e0
> > > [  292.534433]  msm_ioctl_get_param+0x64/0x70
> > > [  292.534440]  drm_ioctl_kernel+0xe8/0x158
> > > [  292.534448]  drm_ioctl+0x208/0x320
> > > [  292.534453]  __arm64_sys_ioctl+0x98/0xd0
> > > [  292.534461]  invoke_syscall+0x4c/0x118
> > > [  292.534467]  el0_svc_common+0x98/0x104
> > > [  292.534473]  do_el0_svc+0x30/0x80
> > > [  292.534478]  el0_svc+0x20/0x50
> > > [  292.534481]  el0t_64_sync_handler+0x78/0x108
> > > [  292.534485]  el0t_64_sync+0x1a4/0x1a8
> > > [  292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000
> > > [  292.534635] PHYS_OFFSET: 0x80000000
> > > [  292.534638] CPU features: 0x40018541,a3300e42
> > > [  292.534644] Memory Limit: none
> > >
> > > Cc: Rob Clark <robdclark@chromium.org>
> > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > > Cc: Ross Zwisler <zwisler@kernel.org>
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > ---
> > >   drivers/gpu/drm/msm/adreno/adreno_device.c | 1 +
> > >   drivers/gpu/drm/msm/adreno/adreno_gpu.c    | 2 +-
> > >   drivers/gpu/drm/msm/msm_gpu.h              | 3 +++
> > >   3 files changed, 5 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > index f0cff62812c3..03d912dc0130 100644
> > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev)
> > >   {
> > >       struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> > >
> > > +     gpu->is_shutdown = true;
> > >       WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> > >   }
> > >
> > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > index 382fb7f9e497..6903c6892469 100644
> > > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
> > >       struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> > >
> > >       /* No pointer params yet */
> > > -     if (*len != 0)
> > > +     if (*len != 0 || gpu->is_shutdown)
> > >               return -EINVAL;
> > This will race with shutdown. Probably, propagating back the return
> > value of pm_runtime_get() in every possible ioctl call path is the right
> > thing to do.
> >
> > I have never thought about this scenario. Do you know why userspace is
> > not freezed before kexec?
>
> So userspace not being frozen seems like the root issue, and is likely
> to cause all sorts of other whack-a-mole problems.  I guess I'd like
> to know if this is the expected behavior?

We tried that. Freezing before kexec seems to cause issues for ALSA as
Ricardo found:
https://lore.kernel.org/lkml/202211281209.mnBLzQ2I-lkp@intel.com/T/

That patch is still TBD due to disagreement on the right approach to
fix, so I don't think freezing before shutting down devices is viable
at the moment.

I am checking Ricardo if we can do something like util-linux's
shutdown code which sends SIGTERM to all processes:
https://kernel.googlesource.com/pub/scm/utils/util-linux/util-linux/+/v2.8/login-utils/shutdown.c#273
, before issuing the kexec-reboot.

Maybe, a more graceful shutdown from kexec-lite, will prevent the
kexec-reboot it does from crashing? Though in my view that would still
be a small copout instead of fixing the real issue, which is the
kernel crashing for any reason.

> If so, we should probably look at
> drm_dev_is_unplugged()/drm_dev_unplug()/etc rather than trying to NIH
> that mechanism.  We would need to sprinkle drm_dev_is_unplugged()
> checks more widely, and also ensure that the scheduler kthread(s) gets
> parked.  But it would be nice to know first if we are just trying to
> paper over a kexec bug.

Agreed. I think we still patch 1/2 whether the SIGTERM trick mentioned
above, works or not. I will await discussions with Ricardo before
reposting that one.

Cheers,

 -- Joel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] adreno: Detect shutdown during get_param()
  2022-12-01 18:42         ` Joel Fernandes
@ 2022-12-01 19:33           ` Rob Clark
  -1 siblings, 0 replies; 27+ messages in thread
From: Rob Clark @ 2022-12-01 19:33 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Rob Clark, Emma Anholt, Akhil P Oommen, freedreno, linux-arm-msm,
	Ricardo Ribalda, Vladimir Lypak, linux-kernel, Steven Rostedt,
	Abhinav Kumar, Sean Paul, dri-devel, Ross Zwisler,
	Dmitry Baryshkov

On Thu, Dec 1, 2022 at 10:42 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Sat, Nov 12, 2022 at 6:35 PM Rob Clark <robdclark@gmail.com> wrote:
> >
> > On Fri, Nov 11, 2022 at 1:28 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
> > >
> > > On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote:
> > > > Even though the GPU is shut down, during kexec reboot we can have userspace
> > > > still running. This is especially true if KEXEC_JUMP is not enabled, because we
> > > > do not freeze userspace in this case.
> > > >
> > > > To prevent crashes, track that the GPU is shutdown and prevent get_param() from
> > > > accessing GPU resources if we find it shutdown.
> > > >
> > > > This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU:
> > > >
> > > > [  292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt
> > > > [  292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > > > [  292.534326] Call trace:
> > > > [  292.534328]  dump_backtrace+0x0/0x1d4
> > > > [  292.534337]  show_stack+0x20/0x2c
> > > > [  292.534342]  dump_stack_lvl+0x60/0x78
> > > > [  292.534347]  dump_stack+0x18/0x38
> > > > [  292.534352]  panic+0x148/0x3b0
> > > > [  292.534357]  nmi_panic+0x80/0x94
> > > > [  292.534364]  arm64_serror_panic+0x70/0x7c
> > > > [  292.534369]  do_serror+0x0/0x7c
> > > > [  292.534372]  do_serror+0x54/0x7c
> > > > [  292.534377]  el1h_64_error_handler+0x34/0x4c
> > > > [  292.534381]  el1h_64_error+0x7c/0x80
> > > > [  292.534386]  el1_interrupt+0x20/0x58
> > > > [  292.534389]  el1h_64_irq_handler+0x18/0x24
> > > > [  292.534395]  el1h_64_irq+0x7c/0x80
> > > > [  292.534399]  local_daif_inherit+0x10/0x18
> > > > [  292.534405]  el1h_64_sync_handler+0x48/0xb4
> > > > [  292.534410]  el1h_64_sync+0x7c/0x80
> > > > [  292.534414]  a6xx_gmu_set_oob+0xbc/0x1fc
> > > > [  292.534422]  a6xx_get_timestamp+0x40/0xb4
> > > > [  292.534426]  adreno_get_param+0x12c/0x1e0
> > > > [  292.534433]  msm_ioctl_get_param+0x64/0x70
> > > > [  292.534440]  drm_ioctl_kernel+0xe8/0x158
> > > > [  292.534448]  drm_ioctl+0x208/0x320
> > > > [  292.534453]  __arm64_sys_ioctl+0x98/0xd0
> > > > [  292.534461]  invoke_syscall+0x4c/0x118
> > > > [  292.534467]  el0_svc_common+0x98/0x104
> > > > [  292.534473]  do_el0_svc+0x30/0x80
> > > > [  292.534478]  el0_svc+0x20/0x50
> > > > [  292.534481]  el0t_64_sync_handler+0x78/0x108
> > > > [  292.534485]  el0t_64_sync+0x1a4/0x1a8
> > > > [  292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000
> > > > [  292.534635] PHYS_OFFSET: 0x80000000
> > > > [  292.534638] CPU features: 0x40018541,a3300e42
> > > > [  292.534644] Memory Limit: none
> > > >
> > > > Cc: Rob Clark <robdclark@chromium.org>
> > > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > > > Cc: Ross Zwisler <zwisler@kernel.org>
> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > ---
> > > >   drivers/gpu/drm/msm/adreno/adreno_device.c | 1 +
> > > >   drivers/gpu/drm/msm/adreno/adreno_gpu.c    | 2 +-
> > > >   drivers/gpu/drm/msm/msm_gpu.h              | 3 +++
> > > >   3 files changed, 5 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > index f0cff62812c3..03d912dc0130 100644
> > > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev)
> > > >   {
> > > >       struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> > > >
> > > > +     gpu->is_shutdown = true;
> > > >       WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> > > >   }
> > > >
> > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > > index 382fb7f9e497..6903c6892469 100644
> > > > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > > @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
> > > >       struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> > > >
> > > >       /* No pointer params yet */
> > > > -     if (*len != 0)
> > > > +     if (*len != 0 || gpu->is_shutdown)
> > > >               return -EINVAL;
> > > This will race with shutdown. Probably, propagating back the return
> > > value of pm_runtime_get() in every possible ioctl call path is the right
> > > thing to do.
> > >
> > > I have never thought about this scenario. Do you know why userspace is
> > > not freezed before kexec?
> >
> > So userspace not being frozen seems like the root issue, and is likely
> > to cause all sorts of other whack-a-mole problems.  I guess I'd like
> > to know if this is the expected behavior?
>
> We tried that. Freezing before kexec seems to cause issues for ALSA as
> Ricardo found:
> https://lore.kernel.org/lkml/202211281209.mnBLzQ2I-lkp@intel.com/T/
>
> That patch is still TBD due to disagreement on the right approach to
> fix, so I don't think freezing before shutting down devices is viable
> at the moment.
>
> I am checking Ricardo if we can do something like util-linux's
> shutdown code which sends SIGTERM to all processes:
> https://kernel.googlesource.com/pub/scm/utils/util-linux/util-linux/+/v2.8/login-utils/shutdown.c#273
> , before issuing the kexec-reboot.
>
> Maybe, a more graceful shutdown from kexec-lite, will prevent the
> kexec-reboot it does from crashing? Though in my view that would still
> be a small copout instead of fixing the real issue, which is the
> kernel crashing for any reason.

The problem is that pm_runtime_force_suspend() yanks the rug out from
under the driver from a runpm PoV.. generally this is ok (as long as
scheduler kthread is parked) because we don't have to worry about
userspace ;-)

> > If so, we should probably look at
> > drm_dev_is_unplugged()/drm_dev_unplug()/etc rather than trying to NIH
> > that mechanism.  We would need to sprinkle drm_dev_is_unplugged()
> > checks more widely, and also ensure that the scheduler kthread(s) gets
> > parked.  But it would be nice to know first if we are just trying to
> > paper over a kexec bug.
>
> Agreed. I think we still patch 1/2 whether the SIGTERM trick mentioned
> above, works or not. I will await discussions with Ricardo before
> reposting that one.

Yeah, I think I'm waiting on a v2 of that one ;-)

BR,
-R

> Cheers,
>
>  -- Joel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] adreno: Detect shutdown during get_param()
@ 2022-12-01 19:33           ` Rob Clark
  0 siblings, 0 replies; 27+ messages in thread
From: Rob Clark @ 2022-12-01 19:33 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Akhil P Oommen, linux-kernel, Rob Clark, Steven Rostedt,
	Ricardo Ribalda, Ross Zwisler, Abhinav Kumar, Daniel Vetter,
	David Airlie, Dmitry Baryshkov, dri-devel, Emma Anholt,
	freedreno, linux-arm-msm, Sean Paul, Vladimir Lypak

On Thu, Dec 1, 2022 at 10:42 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Sat, Nov 12, 2022 at 6:35 PM Rob Clark <robdclark@gmail.com> wrote:
> >
> > On Fri, Nov 11, 2022 at 1:28 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
> > >
> > > On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote:
> > > > Even though the GPU is shut down, during kexec reboot we can have userspace
> > > > still running. This is especially true if KEXEC_JUMP is not enabled, because we
> > > > do not freeze userspace in this case.
> > > >
> > > > To prevent crashes, track that the GPU is shutdown and prevent get_param() from
> > > > accessing GPU resources if we find it shutdown.
> > > >
> > > > This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU:
> > > >
> > > > [  292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt
> > > > [  292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > > > [  292.534326] Call trace:
> > > > [  292.534328]  dump_backtrace+0x0/0x1d4
> > > > [  292.534337]  show_stack+0x20/0x2c
> > > > [  292.534342]  dump_stack_lvl+0x60/0x78
> > > > [  292.534347]  dump_stack+0x18/0x38
> > > > [  292.534352]  panic+0x148/0x3b0
> > > > [  292.534357]  nmi_panic+0x80/0x94
> > > > [  292.534364]  arm64_serror_panic+0x70/0x7c
> > > > [  292.534369]  do_serror+0x0/0x7c
> > > > [  292.534372]  do_serror+0x54/0x7c
> > > > [  292.534377]  el1h_64_error_handler+0x34/0x4c
> > > > [  292.534381]  el1h_64_error+0x7c/0x80
> > > > [  292.534386]  el1_interrupt+0x20/0x58
> > > > [  292.534389]  el1h_64_irq_handler+0x18/0x24
> > > > [  292.534395]  el1h_64_irq+0x7c/0x80
> > > > [  292.534399]  local_daif_inherit+0x10/0x18
> > > > [  292.534405]  el1h_64_sync_handler+0x48/0xb4
> > > > [  292.534410]  el1h_64_sync+0x7c/0x80
> > > > [  292.534414]  a6xx_gmu_set_oob+0xbc/0x1fc
> > > > [  292.534422]  a6xx_get_timestamp+0x40/0xb4
> > > > [  292.534426]  adreno_get_param+0x12c/0x1e0
> > > > [  292.534433]  msm_ioctl_get_param+0x64/0x70
> > > > [  292.534440]  drm_ioctl_kernel+0xe8/0x158
> > > > [  292.534448]  drm_ioctl+0x208/0x320
> > > > [  292.534453]  __arm64_sys_ioctl+0x98/0xd0
> > > > [  292.534461]  invoke_syscall+0x4c/0x118
> > > > [  292.534467]  el0_svc_common+0x98/0x104
> > > > [  292.534473]  do_el0_svc+0x30/0x80
> > > > [  292.534478]  el0_svc+0x20/0x50
> > > > [  292.534481]  el0t_64_sync_handler+0x78/0x108
> > > > [  292.534485]  el0t_64_sync+0x1a4/0x1a8
> > > > [  292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000
> > > > [  292.534635] PHYS_OFFSET: 0x80000000
> > > > [  292.534638] CPU features: 0x40018541,a3300e42
> > > > [  292.534644] Memory Limit: none
> > > >
> > > > Cc: Rob Clark <robdclark@chromium.org>
> > > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > > > Cc: Ross Zwisler <zwisler@kernel.org>
> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > ---
> > > >   drivers/gpu/drm/msm/adreno/adreno_device.c | 1 +
> > > >   drivers/gpu/drm/msm/adreno/adreno_gpu.c    | 2 +-
> > > >   drivers/gpu/drm/msm/msm_gpu.h              | 3 +++
> > > >   3 files changed, 5 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > index f0cff62812c3..03d912dc0130 100644
> > > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev)
> > > >   {
> > > >       struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> > > >
> > > > +     gpu->is_shutdown = true;
> > > >       WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> > > >   }
> > > >
> > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > > index 382fb7f9e497..6903c6892469 100644
> > > > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > > @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
> > > >       struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> > > >
> > > >       /* No pointer params yet */
> > > > -     if (*len != 0)
> > > > +     if (*len != 0 || gpu->is_shutdown)
> > > >               return -EINVAL;
> > > This will race with shutdown. Probably, propagating back the return
> > > value of pm_runtime_get() in every possible ioctl call path is the right
> > > thing to do.
> > >
> > > I have never thought about this scenario. Do you know why userspace is
> > > not freezed before kexec?
> >
> > So userspace not being frozen seems like the root issue, and is likely
> > to cause all sorts of other whack-a-mole problems.  I guess I'd like
> > to know if this is the expected behavior?
>
> We tried that. Freezing before kexec seems to cause issues for ALSA as
> Ricardo found:
> https://lore.kernel.org/lkml/202211281209.mnBLzQ2I-lkp@intel.com/T/
>
> That patch is still TBD due to disagreement on the right approach to
> fix, so I don't think freezing before shutting down devices is viable
> at the moment.
>
> I am checking Ricardo if we can do something like util-linux's
> shutdown code which sends SIGTERM to all processes:
> https://kernel.googlesource.com/pub/scm/utils/util-linux/util-linux/+/v2.8/login-utils/shutdown.c#273
> , before issuing the kexec-reboot.
>
> Maybe, a more graceful shutdown from kexec-lite, will prevent the
> kexec-reboot it does from crashing? Though in my view that would still
> be a small copout instead of fixing the real issue, which is the
> kernel crashing for any reason.

The problem is that pm_runtime_force_suspend() yanks the rug out from
under the driver from a runpm PoV.. generally this is ok (as long as
scheduler kthread is parked) because we don't have to worry about
userspace ;-)

> > If so, we should probably look at
> > drm_dev_is_unplugged()/drm_dev_unplug()/etc rather than trying to NIH
> > that mechanism.  We would need to sprinkle drm_dev_is_unplugged()
> > checks more widely, and also ensure that the scheduler kthread(s) gets
> > parked.  But it would be nice to know first if we are just trying to
> > paper over a kexec bug.
>
> Agreed. I think we still patch 1/2 whether the SIGTERM trick mentioned
> above, works or not. I will await discussions with Ricardo before
> reposting that one.

Yeah, I think I'm waiting on a v2 of that one ;-)

BR,
-R

> Cheers,
>
>  -- Joel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] adreno: Detect shutdown during get_param()
  2022-12-01 19:33           ` Rob Clark
@ 2022-12-01 20:01             ` Joel Fernandes
  -1 siblings, 0 replies; 27+ messages in thread
From: Joel Fernandes @ 2022-12-01 20:01 UTC (permalink / raw)
  To: Rob Clark
  Cc: Akhil P Oommen, linux-kernel, Rob Clark, Steven Rostedt,
	Ricardo Ribalda, Ross Zwisler, Abhinav Kumar, Daniel Vetter,
	David Airlie, Dmitry Baryshkov, dri-devel, Emma Anholt,
	freedreno, linux-arm-msm, Sean Paul, Vladimir Lypak

On Thu, Dec 1, 2022 at 7:33 PM Rob Clark <robdclark@gmail.com> wrote:
>
> On Thu, Dec 1, 2022 at 10:42 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Sat, Nov 12, 2022 at 6:35 PM Rob Clark <robdclark@gmail.com> wrote:
> > >
> > > On Fri, Nov 11, 2022 at 1:28 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
> > > >
> > > > On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote:
> > > > > Even though the GPU is shut down, during kexec reboot we can have userspace
> > > > > still running. This is especially true if KEXEC_JUMP is not enabled, because we
> > > > > do not freeze userspace in this case.
> > > > >
> > > > > To prevent crashes, track that the GPU is shutdown and prevent get_param() from
> > > > > accessing GPU resources if we find it shutdown.
> > > > >
> > > > > This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU:
> > > > >
> > > > > [  292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt
> > > > > [  292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > > > > [  292.534326] Call trace:
> > > > > [  292.534328]  dump_backtrace+0x0/0x1d4
> > > > > [  292.534337]  show_stack+0x20/0x2c
> > > > > [  292.534342]  dump_stack_lvl+0x60/0x78
> > > > > [  292.534347]  dump_stack+0x18/0x38
> > > > > [  292.534352]  panic+0x148/0x3b0
> > > > > [  292.534357]  nmi_panic+0x80/0x94
> > > > > [  292.534364]  arm64_serror_panic+0x70/0x7c
> > > > > [  292.534369]  do_serror+0x0/0x7c
> > > > > [  292.534372]  do_serror+0x54/0x7c
> > > > > [  292.534377]  el1h_64_error_handler+0x34/0x4c
> > > > > [  292.534381]  el1h_64_error+0x7c/0x80
> > > > > [  292.534386]  el1_interrupt+0x20/0x58
> > > > > [  292.534389]  el1h_64_irq_handler+0x18/0x24
> > > > > [  292.534395]  el1h_64_irq+0x7c/0x80
> > > > > [  292.534399]  local_daif_inherit+0x10/0x18
> > > > > [  292.534405]  el1h_64_sync_handler+0x48/0xb4
> > > > > [  292.534410]  el1h_64_sync+0x7c/0x80
> > > > > [  292.534414]  a6xx_gmu_set_oob+0xbc/0x1fc
> > > > > [  292.534422]  a6xx_get_timestamp+0x40/0xb4
> > > > > [  292.534426]  adreno_get_param+0x12c/0x1e0
> > > > > [  292.534433]  msm_ioctl_get_param+0x64/0x70
> > > > > [  292.534440]  drm_ioctl_kernel+0xe8/0x158
> > > > > [  292.534448]  drm_ioctl+0x208/0x320
> > > > > [  292.534453]  __arm64_sys_ioctl+0x98/0xd0
> > > > > [  292.534461]  invoke_syscall+0x4c/0x118
> > > > > [  292.534467]  el0_svc_common+0x98/0x104
> > > > > [  292.534473]  do_el0_svc+0x30/0x80
> > > > > [  292.534478]  el0_svc+0x20/0x50
> > > > > [  292.534481]  el0t_64_sync_handler+0x78/0x108
> > > > > [  292.534485]  el0t_64_sync+0x1a4/0x1a8
> > > > > [  292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000
> > > > > [  292.534635] PHYS_OFFSET: 0x80000000
> > > > > [  292.534638] CPU features: 0x40018541,a3300e42
> > > > > [  292.534644] Memory Limit: none
> > > > >
> > > > > Cc: Rob Clark <robdclark@chromium.org>
> > > > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > > > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > > > > Cc: Ross Zwisler <zwisler@kernel.org>
> > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > ---
> > > > >   drivers/gpu/drm/msm/adreno/adreno_device.c | 1 +
> > > > >   drivers/gpu/drm/msm/adreno/adreno_gpu.c    | 2 +-
> > > > >   drivers/gpu/drm/msm/msm_gpu.h              | 3 +++
> > > > >   3 files changed, 5 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > > index f0cff62812c3..03d912dc0130 100644
> > > > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > > @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev)
> > > > >   {
> > > > >       struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> > > > >
> > > > > +     gpu->is_shutdown = true;
> > > > >       WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> > > > >   }
> > > > >
> > > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > > > index 382fb7f9e497..6903c6892469 100644
> > > > > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > > > @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
> > > > >       struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> > > > >
> > > > >       /* No pointer params yet */
> > > > > -     if (*len != 0)
> > > > > +     if (*len != 0 || gpu->is_shutdown)
> > > > >               return -EINVAL;
> > > > This will race with shutdown. Probably, propagating back the return
> > > > value of pm_runtime_get() in every possible ioctl call path is the right
> > > > thing to do.
> > > >
> > > > I have never thought about this scenario. Do you know why userspace is
> > > > not freezed before kexec?
> > >
> > > So userspace not being frozen seems like the root issue, and is likely
> > > to cause all sorts of other whack-a-mole problems.  I guess I'd like
> > > to know if this is the expected behavior?
> >
> > We tried that. Freezing before kexec seems to cause issues for ALSA as
> > Ricardo found:
> > https://lore.kernel.org/lkml/202211281209.mnBLzQ2I-lkp@intel.com/T/
> >
> > That patch is still TBD due to disagreement on the right approach to
> > fix, so I don't think freezing before shutting down devices is viable
> > at the moment.
> >
> > I am checking Ricardo if we can do something like util-linux's
> > shutdown code which sends SIGTERM to all processes:
> > https://kernel.googlesource.com/pub/scm/utils/util-linux/util-linux/+/v2.8/login-utils/shutdown.c#273
> > , before issuing the kexec-reboot.
> >
> > Maybe, a more graceful shutdown from kexec-lite, will prevent the
> > kexec-reboot it does from crashing? Though in my view that would still
> > be a small copout instead of fixing the real issue, which is the
> > kernel crashing for any reason.
>
> The problem is that pm_runtime_force_suspend() yanks the rug out from
> under the driver from a runpm PoV.. generally this is ok (as long as
> scheduler kthread is parked) because we don't have to worry about
> userspace ;-)
>
> > > If so, we should probably look at
> > > drm_dev_is_unplugged()/drm_dev_unplug()/etc rather than trying to NIH
> > > that mechanism.  We would need to sprinkle drm_dev_is_unplugged()
> > > checks more widely, and also ensure that the scheduler kthread(s) gets
> > > parked.  But it would be nice to know first if we are just trying to
> > > paper over a kexec bug.
> >
> > Agreed. I think we still patch 1/2 whether the SIGTERM trick mentioned
> > above, works or not. I will await discussions with Ricardo before
> > reposting that one.
>
> Yeah, I think I'm waiting on a v2 of that one ;-)
>

Cool, I will send you a v2 on that shortly then, I think it is good to
do the patch 1/2 change anyway.

:-) Thanks,

 - Joel

> > Cheers,
> >
> >  -- Joel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] adreno: Detect shutdown during get_param()
@ 2022-12-01 20:01             ` Joel Fernandes
  0 siblings, 0 replies; 27+ messages in thread
From: Joel Fernandes @ 2022-12-01 20:01 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Emma Anholt, Akhil P Oommen, freedreno, linux-arm-msm,
	Ricardo Ribalda, Vladimir Lypak, linux-kernel, Steven Rostedt,
	Abhinav Kumar, Sean Paul, dri-devel, Ross Zwisler,
	Dmitry Baryshkov

On Thu, Dec 1, 2022 at 7:33 PM Rob Clark <robdclark@gmail.com> wrote:
>
> On Thu, Dec 1, 2022 at 10:42 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Sat, Nov 12, 2022 at 6:35 PM Rob Clark <robdclark@gmail.com> wrote:
> > >
> > > On Fri, Nov 11, 2022 at 1:28 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
> > > >
> > > > On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote:
> > > > > Even though the GPU is shut down, during kexec reboot we can have userspace
> > > > > still running. This is especially true if KEXEC_JUMP is not enabled, because we
> > > > > do not freeze userspace in this case.
> > > > >
> > > > > To prevent crashes, track that the GPU is shutdown and prevent get_param() from
> > > > > accessing GPU resources if we find it shutdown.
> > > > >
> > > > > This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU:
> > > > >
> > > > > [  292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt
> > > > > [  292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > > > > [  292.534326] Call trace:
> > > > > [  292.534328]  dump_backtrace+0x0/0x1d4
> > > > > [  292.534337]  show_stack+0x20/0x2c
> > > > > [  292.534342]  dump_stack_lvl+0x60/0x78
> > > > > [  292.534347]  dump_stack+0x18/0x38
> > > > > [  292.534352]  panic+0x148/0x3b0
> > > > > [  292.534357]  nmi_panic+0x80/0x94
> > > > > [  292.534364]  arm64_serror_panic+0x70/0x7c
> > > > > [  292.534369]  do_serror+0x0/0x7c
> > > > > [  292.534372]  do_serror+0x54/0x7c
> > > > > [  292.534377]  el1h_64_error_handler+0x34/0x4c
> > > > > [  292.534381]  el1h_64_error+0x7c/0x80
> > > > > [  292.534386]  el1_interrupt+0x20/0x58
> > > > > [  292.534389]  el1h_64_irq_handler+0x18/0x24
> > > > > [  292.534395]  el1h_64_irq+0x7c/0x80
> > > > > [  292.534399]  local_daif_inherit+0x10/0x18
> > > > > [  292.534405]  el1h_64_sync_handler+0x48/0xb4
> > > > > [  292.534410]  el1h_64_sync+0x7c/0x80
> > > > > [  292.534414]  a6xx_gmu_set_oob+0xbc/0x1fc
> > > > > [  292.534422]  a6xx_get_timestamp+0x40/0xb4
> > > > > [  292.534426]  adreno_get_param+0x12c/0x1e0
> > > > > [  292.534433]  msm_ioctl_get_param+0x64/0x70
> > > > > [  292.534440]  drm_ioctl_kernel+0xe8/0x158
> > > > > [  292.534448]  drm_ioctl+0x208/0x320
> > > > > [  292.534453]  __arm64_sys_ioctl+0x98/0xd0
> > > > > [  292.534461]  invoke_syscall+0x4c/0x118
> > > > > [  292.534467]  el0_svc_common+0x98/0x104
> > > > > [  292.534473]  do_el0_svc+0x30/0x80
> > > > > [  292.534478]  el0_svc+0x20/0x50
> > > > > [  292.534481]  el0t_64_sync_handler+0x78/0x108
> > > > > [  292.534485]  el0t_64_sync+0x1a4/0x1a8
> > > > > [  292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000
> > > > > [  292.534635] PHYS_OFFSET: 0x80000000
> > > > > [  292.534638] CPU features: 0x40018541,a3300e42
> > > > > [  292.534644] Memory Limit: none
> > > > >
> > > > > Cc: Rob Clark <robdclark@chromium.org>
> > > > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > > > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > > > > Cc: Ross Zwisler <zwisler@kernel.org>
> > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > ---
> > > > >   drivers/gpu/drm/msm/adreno/adreno_device.c | 1 +
> > > > >   drivers/gpu/drm/msm/adreno/adreno_gpu.c    | 2 +-
> > > > >   drivers/gpu/drm/msm/msm_gpu.h              | 3 +++
> > > > >   3 files changed, 5 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > > index f0cff62812c3..03d912dc0130 100644
> > > > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > > @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev)
> > > > >   {
> > > > >       struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> > > > >
> > > > > +     gpu->is_shutdown = true;
> > > > >       WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> > > > >   }
> > > > >
> > > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > > > index 382fb7f9e497..6903c6892469 100644
> > > > > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > > > @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
> > > > >       struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> > > > >
> > > > >       /* No pointer params yet */
> > > > > -     if (*len != 0)
> > > > > +     if (*len != 0 || gpu->is_shutdown)
> > > > >               return -EINVAL;
> > > > This will race with shutdown. Probably, propagating back the return
> > > > value of pm_runtime_get() in every possible ioctl call path is the right
> > > > thing to do.
> > > >
> > > > I have never thought about this scenario. Do you know why userspace is
> > > > not freezed before kexec?
> > >
> > > So userspace not being frozen seems like the root issue, and is likely
> > > to cause all sorts of other whack-a-mole problems.  I guess I'd like
> > > to know if this is the expected behavior?
> >
> > We tried that. Freezing before kexec seems to cause issues for ALSA as
> > Ricardo found:
> > https://lore.kernel.org/lkml/202211281209.mnBLzQ2I-lkp@intel.com/T/
> >
> > That patch is still TBD due to disagreement on the right approach to
> > fix, so I don't think freezing before shutting down devices is viable
> > at the moment.
> >
> > I am checking Ricardo if we can do something like util-linux's
> > shutdown code which sends SIGTERM to all processes:
> > https://kernel.googlesource.com/pub/scm/utils/util-linux/util-linux/+/v2.8/login-utils/shutdown.c#273
> > , before issuing the kexec-reboot.
> >
> > Maybe, a more graceful shutdown from kexec-lite, will prevent the
> > kexec-reboot it does from crashing? Though in my view that would still
> > be a small copout instead of fixing the real issue, which is the
> > kernel crashing for any reason.
>
> The problem is that pm_runtime_force_suspend() yanks the rug out from
> under the driver from a runpm PoV.. generally this is ok (as long as
> scheduler kthread is parked) because we don't have to worry about
> userspace ;-)
>
> > > If so, we should probably look at
> > > drm_dev_is_unplugged()/drm_dev_unplug()/etc rather than trying to NIH
> > > that mechanism.  We would need to sprinkle drm_dev_is_unplugged()
> > > checks more widely, and also ensure that the scheduler kthread(s) gets
> > > parked.  But it would be nice to know first if we are just trying to
> > > paper over a kexec bug.
> >
> > Agreed. I think we still patch 1/2 whether the SIGTERM trick mentioned
> > above, works or not. I will await discussions with Ricardo before
> > reposting that one.
>
> Yeah, I think I'm waiting on a v2 of that one ;-)
>

Cool, I will send you a v2 on that shortly then, I think it is good to
do the patch 1/2 change anyway.

:-) Thanks,

 - Joel

> > Cheers,
> >
> >  -- Joel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] adreno: Shutdown the GPU properly
  2022-11-12 18:44     ` Rob Clark
@ 2022-12-01 20:08       ` Joel Fernandes
  -1 siblings, 0 replies; 27+ messages in thread
From: Joel Fernandes @ 2022-12-01 20:08 UTC (permalink / raw)
  To: Rob Clark
  Cc: linux-kernel, Rob Clark, Steven Rostedt, Ricardo Ribalda,
	Ross Zwisler, Abhinav Kumar, Akhil P Oommen, Daniel Vetter,
	David Airlie, Dmitry Baryshkov, dri-devel, Emma Anholt,
	freedreno, linux-arm-msm, Sean Paul, Vladimir Lypak

On Sat, Nov 12, 2022 at 6:44 PM Rob Clark <robdclark@gmail.com> wrote:
>
> On Fri, Nov 11, 2022 at 1:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> >
> >
> > > On Nov 11, 2022, at 2:50 PM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> > >
> > > During kexec on ARM device, we notice that device_shutdown() only calls
> > > pm_runtime_force_suspend() while shutting down the GPU. This means the GPU
> > > kthread is still running and further, there maybe active submits.
> > >
> > > This causes all kinds of issues during a kexec reboot:
> > >
> > > Warning from shutdown path:
> > >
> > > [  292.509662] WARNING: CPU: 0 PID: 6304 at [...] adreno_runtime_suspend+0x3c/0x44
> > > [  292.509863] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > > [  292.509872] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > [  292.509881] pc : adreno_runtime_suspend+0x3c/0x44
> > > [  292.509891] lr : pm_generic_runtime_suspend+0x30/0x44
> > > [  292.509905] sp : ffffffc014473bf0
> > > [...]
> > > [  292.510043] Call trace:
> > > [  292.510051]  adreno_runtime_suspend+0x3c/0x44
> > > [  292.510061]  pm_generic_runtime_suspend+0x30/0x44
> > > [  292.510071]  pm_runtime_force_suspend+0x54/0xc8
> > > [  292.510081]  adreno_shutdown+0x1c/0x28
> > > [  292.510090]  platform_shutdown+0x2c/0x38
> > > [  292.510104]  device_shutdown+0x158/0x210
> > > [  292.510119]  kernel_restart_prepare+0x40/0x4c
> > >
> > > And here from GPU kthread, an SError OOPs:
> > >
> > > [  192.648789]  el1h_64_error+0x7c/0x80
> > > [  192.648812]  el1_interrupt+0x20/0x58
> > > [  192.648833]  el1h_64_irq_handler+0x18/0x24
> > > [  192.648854]  el1h_64_irq+0x7c/0x80
> > > [  192.648873]  local_daif_inherit+0x10/0x18
> > > [  192.648900]  el1h_64_sync_handler+0x48/0xb4
> > > [  192.648921]  el1h_64_sync+0x7c/0x80
> > > [  192.648941]  a6xx_gmu_set_oob+0xbc/0x1fc
> > > [  192.648968]  a6xx_hw_init+0x44/0xe38
> > > [  192.648991]  msm_gpu_hw_init+0x48/0x80
> > > [  192.649013]  msm_gpu_submit+0x5c/0x1a8
> > > [  192.649034]  msm_job_run+0xb0/0x11c
> > > [  192.649058]  drm_sched_main+0x170/0x434
> > > [  192.649086]  kthread+0x134/0x300
> > > [  192.649114]  ret_from_fork+0x10/0x20
> > >
> > > Fix by calling adreno_system_suspend() in the device_shutdown() path.
> > >
> > > Cc: Rob Clark <robdclark@chromium.org>
> > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > > Cc: Ross Zwisler <zwisler@kernel.org>
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > ---
> > > drivers/gpu/drm/msm/adreno/adreno_device.c | 5 ++++-
> > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > index 24b489b6129a..f0cff62812c3 100644
> > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > @@ -607,9 +607,12 @@ static int adreno_remove(struct platform_device *pdev)
> > >    return 0;
> > > }
> > >
> > > +static int adreno_system_suspend(struct device *dev);
> > > static void adreno_shutdown(struct platform_device *pdev)
> > > {
> > > -    pm_runtime_force_suspend(&pdev->dev);
> > > +    struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> > > +
> >
> > This local variable definition should go to patch 2/2. Will fix in v2.
> >
> > Thanks,
> >
> >  - Joel
> >
> >
> > > +    WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
>
> I think maybe adreno_unbind() needs the same treatment?  Any path
> where we yank out the power cord without ensuring the scheduler is
> parked means we'd be racing with jobs in the scheduler queue.  Ie.
> userspace could queue a job before it is frozen, but the drm/scheduler
> kthread hasn't yet called the msm_job_run() callback (which does
> various touching of the now powered off hw).  So I think we need to
> ensure that the scheduler is parked in all paths that call
> pm_runtime_force_suspend() (as that bypasses the runpm reference that
> would otherwise unsure the hw is powered before msm_job_run pokes at
> registers)

a6xx_gmu_remove() calls pm_runtime_force_suspend() , would that need a
treatment too?

Though, adreno_system_suspend() is a static function in adreno_device.cc

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] adreno: Shutdown the GPU properly
@ 2022-12-01 20:08       ` Joel Fernandes
  0 siblings, 0 replies; 27+ messages in thread
From: Joel Fernandes @ 2022-12-01 20:08 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Emma Anholt, Akhil P Oommen, freedreno, linux-arm-msm,
	Ricardo Ribalda, Vladimir Lypak, Abhinav Kumar, Steven Rostedt,
	linux-kernel, Sean Paul, dri-devel, Ross Zwisler,
	Dmitry Baryshkov

On Sat, Nov 12, 2022 at 6:44 PM Rob Clark <robdclark@gmail.com> wrote:
>
> On Fri, Nov 11, 2022 at 1:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> >
> >
> > > On Nov 11, 2022, at 2:50 PM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> > >
> > > During kexec on ARM device, we notice that device_shutdown() only calls
> > > pm_runtime_force_suspend() while shutting down the GPU. This means the GPU
> > > kthread is still running and further, there maybe active submits.
> > >
> > > This causes all kinds of issues during a kexec reboot:
> > >
> > > Warning from shutdown path:
> > >
> > > [  292.509662] WARNING: CPU: 0 PID: 6304 at [...] adreno_runtime_suspend+0x3c/0x44
> > > [  292.509863] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > > [  292.509872] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > [  292.509881] pc : adreno_runtime_suspend+0x3c/0x44
> > > [  292.509891] lr : pm_generic_runtime_suspend+0x30/0x44
> > > [  292.509905] sp : ffffffc014473bf0
> > > [...]
> > > [  292.510043] Call trace:
> > > [  292.510051]  adreno_runtime_suspend+0x3c/0x44
> > > [  292.510061]  pm_generic_runtime_suspend+0x30/0x44
> > > [  292.510071]  pm_runtime_force_suspend+0x54/0xc8
> > > [  292.510081]  adreno_shutdown+0x1c/0x28
> > > [  292.510090]  platform_shutdown+0x2c/0x38
> > > [  292.510104]  device_shutdown+0x158/0x210
> > > [  292.510119]  kernel_restart_prepare+0x40/0x4c
> > >
> > > And here from GPU kthread, an SError OOPs:
> > >
> > > [  192.648789]  el1h_64_error+0x7c/0x80
> > > [  192.648812]  el1_interrupt+0x20/0x58
> > > [  192.648833]  el1h_64_irq_handler+0x18/0x24
> > > [  192.648854]  el1h_64_irq+0x7c/0x80
> > > [  192.648873]  local_daif_inherit+0x10/0x18
> > > [  192.648900]  el1h_64_sync_handler+0x48/0xb4
> > > [  192.648921]  el1h_64_sync+0x7c/0x80
> > > [  192.648941]  a6xx_gmu_set_oob+0xbc/0x1fc
> > > [  192.648968]  a6xx_hw_init+0x44/0xe38
> > > [  192.648991]  msm_gpu_hw_init+0x48/0x80
> > > [  192.649013]  msm_gpu_submit+0x5c/0x1a8
> > > [  192.649034]  msm_job_run+0xb0/0x11c
> > > [  192.649058]  drm_sched_main+0x170/0x434
> > > [  192.649086]  kthread+0x134/0x300
> > > [  192.649114]  ret_from_fork+0x10/0x20
> > >
> > > Fix by calling adreno_system_suspend() in the device_shutdown() path.
> > >
> > > Cc: Rob Clark <robdclark@chromium.org>
> > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > > Cc: Ross Zwisler <zwisler@kernel.org>
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > ---
> > > drivers/gpu/drm/msm/adreno/adreno_device.c | 5 ++++-
> > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > index 24b489b6129a..f0cff62812c3 100644
> > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > @@ -607,9 +607,12 @@ static int adreno_remove(struct platform_device *pdev)
> > >    return 0;
> > > }
> > >
> > > +static int adreno_system_suspend(struct device *dev);
> > > static void adreno_shutdown(struct platform_device *pdev)
> > > {
> > > -    pm_runtime_force_suspend(&pdev->dev);
> > > +    struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> > > +
> >
> > This local variable definition should go to patch 2/2. Will fix in v2.
> >
> > Thanks,
> >
> >  - Joel
> >
> >
> > > +    WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
>
> I think maybe adreno_unbind() needs the same treatment?  Any path
> where we yank out the power cord without ensuring the scheduler is
> parked means we'd be racing with jobs in the scheduler queue.  Ie.
> userspace could queue a job before it is frozen, but the drm/scheduler
> kthread hasn't yet called the msm_job_run() callback (which does
> various touching of the now powered off hw).  So I think we need to
> ensure that the scheduler is parked in all paths that call
> pm_runtime_force_suspend() (as that bypasses the runpm reference that
> would otherwise unsure the hw is powered before msm_job_run pokes at
> registers)

a6xx_gmu_remove() calls pm_runtime_force_suspend() , would that need a
treatment too?

Though, adreno_system_suspend() is a static function in adreno_device.cc

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] adreno: Shutdown the GPU properly
  2022-12-01 20:08       ` Joel Fernandes
@ 2022-12-01 22:06         ` Rob Clark
  -1 siblings, 0 replies; 27+ messages in thread
From: Rob Clark @ 2022-12-01 22:06 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Rob Clark, Steven Rostedt, Ricardo Ribalda,
	Ross Zwisler, Abhinav Kumar, Akhil P Oommen, Daniel Vetter,
	David Airlie, Dmitry Baryshkov, dri-devel, Emma Anholt,
	freedreno, linux-arm-msm, Sean Paul, Vladimir Lypak

On Thu, Dec 1, 2022 at 12:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Sat, Nov 12, 2022 at 6:44 PM Rob Clark <robdclark@gmail.com> wrote:
> >
> > On Fri, Nov 11, 2022 at 1:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > >
> > >
> > > > On Nov 11, 2022, at 2:50 PM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> > > >
> > > > During kexec on ARM device, we notice that device_shutdown() only calls
> > > > pm_runtime_force_suspend() while shutting down the GPU. This means the GPU
> > > > kthread is still running and further, there maybe active submits.
> > > >
> > > > This causes all kinds of issues during a kexec reboot:
> > > >
> > > > Warning from shutdown path:
> > > >
> > > > [  292.509662] WARNING: CPU: 0 PID: 6304 at [...] adreno_runtime_suspend+0x3c/0x44
> > > > [  292.509863] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > > > [  292.509872] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > [  292.509881] pc : adreno_runtime_suspend+0x3c/0x44
> > > > [  292.509891] lr : pm_generic_runtime_suspend+0x30/0x44
> > > > [  292.509905] sp : ffffffc014473bf0
> > > > [...]
> > > > [  292.510043] Call trace:
> > > > [  292.510051]  adreno_runtime_suspend+0x3c/0x44
> > > > [  292.510061]  pm_generic_runtime_suspend+0x30/0x44
> > > > [  292.510071]  pm_runtime_force_suspend+0x54/0xc8
> > > > [  292.510081]  adreno_shutdown+0x1c/0x28
> > > > [  292.510090]  platform_shutdown+0x2c/0x38
> > > > [  292.510104]  device_shutdown+0x158/0x210
> > > > [  292.510119]  kernel_restart_prepare+0x40/0x4c
> > > >
> > > > And here from GPU kthread, an SError OOPs:
> > > >
> > > > [  192.648789]  el1h_64_error+0x7c/0x80
> > > > [  192.648812]  el1_interrupt+0x20/0x58
> > > > [  192.648833]  el1h_64_irq_handler+0x18/0x24
> > > > [  192.648854]  el1h_64_irq+0x7c/0x80
> > > > [  192.648873]  local_daif_inherit+0x10/0x18
> > > > [  192.648900]  el1h_64_sync_handler+0x48/0xb4
> > > > [  192.648921]  el1h_64_sync+0x7c/0x80
> > > > [  192.648941]  a6xx_gmu_set_oob+0xbc/0x1fc
> > > > [  192.648968]  a6xx_hw_init+0x44/0xe38
> > > > [  192.648991]  msm_gpu_hw_init+0x48/0x80
> > > > [  192.649013]  msm_gpu_submit+0x5c/0x1a8
> > > > [  192.649034]  msm_job_run+0xb0/0x11c
> > > > [  192.649058]  drm_sched_main+0x170/0x434
> > > > [  192.649086]  kthread+0x134/0x300
> > > > [  192.649114]  ret_from_fork+0x10/0x20
> > > >
> > > > Fix by calling adreno_system_suspend() in the device_shutdown() path.
> > > >
> > > > Cc: Rob Clark <robdclark@chromium.org>
> > > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > > > Cc: Ross Zwisler <zwisler@kernel.org>
> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > ---
> > > > drivers/gpu/drm/msm/adreno/adreno_device.c | 5 ++++-
> > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > index 24b489b6129a..f0cff62812c3 100644
> > > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > @@ -607,9 +607,12 @@ static int adreno_remove(struct platform_device *pdev)
> > > >    return 0;
> > > > }
> > > >
> > > > +static int adreno_system_suspend(struct device *dev);
> > > > static void adreno_shutdown(struct platform_device *pdev)
> > > > {
> > > > -    pm_runtime_force_suspend(&pdev->dev);
> > > > +    struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> > > > +
> > >
> > > This local variable definition should go to patch 2/2. Will fix in v2.
> > >
> > > Thanks,
> > >
> > >  - Joel
> > >
> > >
> > > > +    WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> >
> > I think maybe adreno_unbind() needs the same treatment?  Any path
> > where we yank out the power cord without ensuring the scheduler is
> > parked means we'd be racing with jobs in the scheduler queue.  Ie.
> > userspace could queue a job before it is frozen, but the drm/scheduler
> > kthread hasn't yet called the msm_job_run() callback (which does
> > various touching of the now powered off hw).  So I think we need to
> > ensure that the scheduler is parked in all paths that call
> > pm_runtime_force_suspend() (as that bypasses the runpm reference that
> > would otherwise unsure the hw is powered before msm_job_run pokes at
> > registers)
>
> a6xx_gmu_remove() calls pm_runtime_force_suspend() , would that need a
> treatment too?
>
> Though, adreno_system_suspend() is a static function in adreno_device.cc

Naw, you get there indirectly from adreno_unbind()

BR,
-R

> Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] adreno: Shutdown the GPU properly
@ 2022-12-01 22:06         ` Rob Clark
  0 siblings, 0 replies; 27+ messages in thread
From: Rob Clark @ 2022-12-01 22:06 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Rob Clark, Emma Anholt, Akhil P Oommen, freedreno, linux-arm-msm,
	Ricardo Ribalda, Vladimir Lypak, Abhinav Kumar, Steven Rostedt,
	linux-kernel, Sean Paul, dri-devel, Ross Zwisler,
	Dmitry Baryshkov

On Thu, Dec 1, 2022 at 12:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Sat, Nov 12, 2022 at 6:44 PM Rob Clark <robdclark@gmail.com> wrote:
> >
> > On Fri, Nov 11, 2022 at 1:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > >
> > >
> > > > On Nov 11, 2022, at 2:50 PM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> > > >
> > > > During kexec on ARM device, we notice that device_shutdown() only calls
> > > > pm_runtime_force_suspend() while shutting down the GPU. This means the GPU
> > > > kthread is still running and further, there maybe active submits.
> > > >
> > > > This causes all kinds of issues during a kexec reboot:
> > > >
> > > > Warning from shutdown path:
> > > >
> > > > [  292.509662] WARNING: CPU: 0 PID: 6304 at [...] adreno_runtime_suspend+0x3c/0x44
> > > > [  292.509863] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > > > [  292.509872] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > [  292.509881] pc : adreno_runtime_suspend+0x3c/0x44
> > > > [  292.509891] lr : pm_generic_runtime_suspend+0x30/0x44
> > > > [  292.509905] sp : ffffffc014473bf0
> > > > [...]
> > > > [  292.510043] Call trace:
> > > > [  292.510051]  adreno_runtime_suspend+0x3c/0x44
> > > > [  292.510061]  pm_generic_runtime_suspend+0x30/0x44
> > > > [  292.510071]  pm_runtime_force_suspend+0x54/0xc8
> > > > [  292.510081]  adreno_shutdown+0x1c/0x28
> > > > [  292.510090]  platform_shutdown+0x2c/0x38
> > > > [  292.510104]  device_shutdown+0x158/0x210
> > > > [  292.510119]  kernel_restart_prepare+0x40/0x4c
> > > >
> > > > And here from GPU kthread, an SError OOPs:
> > > >
> > > > [  192.648789]  el1h_64_error+0x7c/0x80
> > > > [  192.648812]  el1_interrupt+0x20/0x58
> > > > [  192.648833]  el1h_64_irq_handler+0x18/0x24
> > > > [  192.648854]  el1h_64_irq+0x7c/0x80
> > > > [  192.648873]  local_daif_inherit+0x10/0x18
> > > > [  192.648900]  el1h_64_sync_handler+0x48/0xb4
> > > > [  192.648921]  el1h_64_sync+0x7c/0x80
> > > > [  192.648941]  a6xx_gmu_set_oob+0xbc/0x1fc
> > > > [  192.648968]  a6xx_hw_init+0x44/0xe38
> > > > [  192.648991]  msm_gpu_hw_init+0x48/0x80
> > > > [  192.649013]  msm_gpu_submit+0x5c/0x1a8
> > > > [  192.649034]  msm_job_run+0xb0/0x11c
> > > > [  192.649058]  drm_sched_main+0x170/0x434
> > > > [  192.649086]  kthread+0x134/0x300
> > > > [  192.649114]  ret_from_fork+0x10/0x20
> > > >
> > > > Fix by calling adreno_system_suspend() in the device_shutdown() path.
> > > >
> > > > Cc: Rob Clark <robdclark@chromium.org>
> > > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > > > Cc: Ross Zwisler <zwisler@kernel.org>
> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > ---
> > > > drivers/gpu/drm/msm/adreno/adreno_device.c | 5 ++++-
> > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > index 24b489b6129a..f0cff62812c3 100644
> > > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > @@ -607,9 +607,12 @@ static int adreno_remove(struct platform_device *pdev)
> > > >    return 0;
> > > > }
> > > >
> > > > +static int adreno_system_suspend(struct device *dev);
> > > > static void adreno_shutdown(struct platform_device *pdev)
> > > > {
> > > > -    pm_runtime_force_suspend(&pdev->dev);
> > > > +    struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> > > > +
> > >
> > > This local variable definition should go to patch 2/2. Will fix in v2.
> > >
> > > Thanks,
> > >
> > >  - Joel
> > >
> > >
> > > > +    WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> >
> > I think maybe adreno_unbind() needs the same treatment?  Any path
> > where we yank out the power cord without ensuring the scheduler is
> > parked means we'd be racing with jobs in the scheduler queue.  Ie.
> > userspace could queue a job before it is frozen, but the drm/scheduler
> > kthread hasn't yet called the msm_job_run() callback (which does
> > various touching of the now powered off hw).  So I think we need to
> > ensure that the scheduler is parked in all paths that call
> > pm_runtime_force_suspend() (as that bypasses the runpm reference that
> > would otherwise unsure the hw is powered before msm_job_run pokes at
> > registers)
>
> a6xx_gmu_remove() calls pm_runtime_force_suspend() , would that need a
> treatment too?
>
> Though, adreno_system_suspend() is a static function in adreno_device.cc

Naw, you get there indirectly from adreno_unbind()

BR,
-R

> Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] adreno: Shutdown the GPU properly
  2022-12-01 22:06         ` Rob Clark
@ 2022-12-01 22:13           ` Joel Fernandes
  -1 siblings, 0 replies; 27+ messages in thread
From: Joel Fernandes @ 2022-12-01 22:13 UTC (permalink / raw)
  To: Rob Clark
  Cc: linux-kernel, Rob Clark, Steven Rostedt, Ricardo Ribalda,
	Ross Zwisler, Abhinav Kumar, Akhil P Oommen, Daniel Vetter,
	David Airlie, Dmitry Baryshkov, dri-devel, Emma Anholt,
	freedreno, linux-arm-msm, Sean Paul, Vladimir Lypak

On Thu, Dec 1, 2022 at 10:06 PM Rob Clark <robdclark@gmail.com> wrote:
>
> On Thu, Dec 1, 2022 at 12:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Sat, Nov 12, 2022 at 6:44 PM Rob Clark <robdclark@gmail.com> wrote:
> > >
> > > On Fri, Nov 11, 2022 at 1:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >
> > > >
> > > >
> > > > > On Nov 11, 2022, at 2:50 PM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> > > > >
> > > > > During kexec on ARM device, we notice that device_shutdown() only calls
> > > > > pm_runtime_force_suspend() while shutting down the GPU. This means the GPU
> > > > > kthread is still running and further, there maybe active submits.
> > > > >
> > > > > This causes all kinds of issues during a kexec reboot:
> > > > >
> > > > > Warning from shutdown path:
> > > > >
> > > > > [  292.509662] WARNING: CPU: 0 PID: 6304 at [...] adreno_runtime_suspend+0x3c/0x44
> > > > > [  292.509863] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > > > > [  292.509872] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > [  292.509881] pc : adreno_runtime_suspend+0x3c/0x44
> > > > > [  292.509891] lr : pm_generic_runtime_suspend+0x30/0x44
> > > > > [  292.509905] sp : ffffffc014473bf0
> > > > > [...]
> > > > > [  292.510043] Call trace:
> > > > > [  292.510051]  adreno_runtime_suspend+0x3c/0x44
> > > > > [  292.510061]  pm_generic_runtime_suspend+0x30/0x44
> > > > > [  292.510071]  pm_runtime_force_suspend+0x54/0xc8
> > > > > [  292.510081]  adreno_shutdown+0x1c/0x28
> > > > > [  292.510090]  platform_shutdown+0x2c/0x38
> > > > > [  292.510104]  device_shutdown+0x158/0x210
> > > > > [  292.510119]  kernel_restart_prepare+0x40/0x4c
> > > > >
> > > > > And here from GPU kthread, an SError OOPs:
> > > > >
> > > > > [  192.648789]  el1h_64_error+0x7c/0x80
> > > > > [  192.648812]  el1_interrupt+0x20/0x58
> > > > > [  192.648833]  el1h_64_irq_handler+0x18/0x24
> > > > > [  192.648854]  el1h_64_irq+0x7c/0x80
> > > > > [  192.648873]  local_daif_inherit+0x10/0x18
> > > > > [  192.648900]  el1h_64_sync_handler+0x48/0xb4
> > > > > [  192.648921]  el1h_64_sync+0x7c/0x80
> > > > > [  192.648941]  a6xx_gmu_set_oob+0xbc/0x1fc
> > > > > [  192.648968]  a6xx_hw_init+0x44/0xe38
> > > > > [  192.648991]  msm_gpu_hw_init+0x48/0x80
> > > > > [  192.649013]  msm_gpu_submit+0x5c/0x1a8
> > > > > [  192.649034]  msm_job_run+0xb0/0x11c
> > > > > [  192.649058]  drm_sched_main+0x170/0x434
> > > > > [  192.649086]  kthread+0x134/0x300
> > > > > [  192.649114]  ret_from_fork+0x10/0x20
> > > > >
> > > > > Fix by calling adreno_system_suspend() in the device_shutdown() path.
> > > > >
> > > > > Cc: Rob Clark <robdclark@chromium.org>
> > > > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > > > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > > > > Cc: Ross Zwisler <zwisler@kernel.org>
> > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > ---
> > > > > drivers/gpu/drm/msm/adreno/adreno_device.c | 5 ++++-
> > > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > > index 24b489b6129a..f0cff62812c3 100644
> > > > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > > @@ -607,9 +607,12 @@ static int adreno_remove(struct platform_device *pdev)
> > > > >    return 0;
> > > > > }
> > > > >
> > > > > +static int adreno_system_suspend(struct device *dev);
> > > > > static void adreno_shutdown(struct platform_device *pdev)
> > > > > {
> > > > > -    pm_runtime_force_suspend(&pdev->dev);
> > > > > +    struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> > > > > +
> > > >
> > > > This local variable definition should go to patch 2/2. Will fix in v2.
> > > >
> > > > Thanks,
> > > >
> > > >  - Joel
> > > >
> > > >
> > > > > +    WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> > >
> > > I think maybe adreno_unbind() needs the same treatment?  Any path
> > > where we yank out the power cord without ensuring the scheduler is
> > > parked means we'd be racing with jobs in the scheduler queue.  Ie.
> > > userspace could queue a job before it is frozen, but the drm/scheduler
> > > kthread hasn't yet called the msm_job_run() callback (which does
> > > various touching of the now powered off hw).  So I think we need to
> > > ensure that the scheduler is parked in all paths that call
> > > pm_runtime_force_suspend() (as that bypasses the runpm reference that
> > > would otherwise unsure the hw is powered before msm_job_run pokes at
> > > registers)
> >
> > a6xx_gmu_remove() calls pm_runtime_force_suspend() , would that need a
> > treatment too?
> >
> > Though, adreno_system_suspend() is a static function in adreno_device.cc
>
> Naw, you get there indirectly from adreno_unbind()

Ah gotcha, thanks.

 - Joel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] adreno: Shutdown the GPU properly
@ 2022-12-01 22:13           ` Joel Fernandes
  0 siblings, 0 replies; 27+ messages in thread
From: Joel Fernandes @ 2022-12-01 22:13 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Emma Anholt, Akhil P Oommen, freedreno, linux-arm-msm,
	Ricardo Ribalda, Vladimir Lypak, Abhinav Kumar, Steven Rostedt,
	linux-kernel, Sean Paul, dri-devel, Ross Zwisler,
	Dmitry Baryshkov

On Thu, Dec 1, 2022 at 10:06 PM Rob Clark <robdclark@gmail.com> wrote:
>
> On Thu, Dec 1, 2022 at 12:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Sat, Nov 12, 2022 at 6:44 PM Rob Clark <robdclark@gmail.com> wrote:
> > >
> > > On Fri, Nov 11, 2022 at 1:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >
> > > >
> > > >
> > > > > On Nov 11, 2022, at 2:50 PM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> > > > >
> > > > > During kexec on ARM device, we notice that device_shutdown() only calls
> > > > > pm_runtime_force_suspend() while shutting down the GPU. This means the GPU
> > > > > kthread is still running and further, there maybe active submits.
> > > > >
> > > > > This causes all kinds of issues during a kexec reboot:
> > > > >
> > > > > Warning from shutdown path:
> > > > >
> > > > > [  292.509662] WARNING: CPU: 0 PID: 6304 at [...] adreno_runtime_suspend+0x3c/0x44
> > > > > [  292.509863] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > > > > [  292.509872] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > [  292.509881] pc : adreno_runtime_suspend+0x3c/0x44
> > > > > [  292.509891] lr : pm_generic_runtime_suspend+0x30/0x44
> > > > > [  292.509905] sp : ffffffc014473bf0
> > > > > [...]
> > > > > [  292.510043] Call trace:
> > > > > [  292.510051]  adreno_runtime_suspend+0x3c/0x44
> > > > > [  292.510061]  pm_generic_runtime_suspend+0x30/0x44
> > > > > [  292.510071]  pm_runtime_force_suspend+0x54/0xc8
> > > > > [  292.510081]  adreno_shutdown+0x1c/0x28
> > > > > [  292.510090]  platform_shutdown+0x2c/0x38
> > > > > [  292.510104]  device_shutdown+0x158/0x210
> > > > > [  292.510119]  kernel_restart_prepare+0x40/0x4c
> > > > >
> > > > > And here from GPU kthread, an SError OOPs:
> > > > >
> > > > > [  192.648789]  el1h_64_error+0x7c/0x80
> > > > > [  192.648812]  el1_interrupt+0x20/0x58
> > > > > [  192.648833]  el1h_64_irq_handler+0x18/0x24
> > > > > [  192.648854]  el1h_64_irq+0x7c/0x80
> > > > > [  192.648873]  local_daif_inherit+0x10/0x18
> > > > > [  192.648900]  el1h_64_sync_handler+0x48/0xb4
> > > > > [  192.648921]  el1h_64_sync+0x7c/0x80
> > > > > [  192.648941]  a6xx_gmu_set_oob+0xbc/0x1fc
> > > > > [  192.648968]  a6xx_hw_init+0x44/0xe38
> > > > > [  192.648991]  msm_gpu_hw_init+0x48/0x80
> > > > > [  192.649013]  msm_gpu_submit+0x5c/0x1a8
> > > > > [  192.649034]  msm_job_run+0xb0/0x11c
> > > > > [  192.649058]  drm_sched_main+0x170/0x434
> > > > > [  192.649086]  kthread+0x134/0x300
> > > > > [  192.649114]  ret_from_fork+0x10/0x20
> > > > >
> > > > > Fix by calling adreno_system_suspend() in the device_shutdown() path.
> > > > >
> > > > > Cc: Rob Clark <robdclark@chromium.org>
> > > > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > > > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > > > > Cc: Ross Zwisler <zwisler@kernel.org>
> > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > ---
> > > > > drivers/gpu/drm/msm/adreno/adreno_device.c | 5 ++++-
> > > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > > index 24b489b6129a..f0cff62812c3 100644
> > > > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > > @@ -607,9 +607,12 @@ static int adreno_remove(struct platform_device *pdev)
> > > > >    return 0;
> > > > > }
> > > > >
> > > > > +static int adreno_system_suspend(struct device *dev);
> > > > > static void adreno_shutdown(struct platform_device *pdev)
> > > > > {
> > > > > -    pm_runtime_force_suspend(&pdev->dev);
> > > > > +    struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> > > > > +
> > > >
> > > > This local variable definition should go to patch 2/2. Will fix in v2.
> > > >
> > > > Thanks,
> > > >
> > > >  - Joel
> > > >
> > > >
> > > > > +    WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> > >
> > > I think maybe adreno_unbind() needs the same treatment?  Any path
> > > where we yank out the power cord without ensuring the scheduler is
> > > parked means we'd be racing with jobs in the scheduler queue.  Ie.
> > > userspace could queue a job before it is frozen, but the drm/scheduler
> > > kthread hasn't yet called the msm_job_run() callback (which does
> > > various touching of the now powered off hw).  So I think we need to
> > > ensure that the scheduler is parked in all paths that call
> > > pm_runtime_force_suspend() (as that bypasses the runpm reference that
> > > would otherwise unsure the hw is powered before msm_job_run pokes at
> > > registers)
> >
> > a6xx_gmu_remove() calls pm_runtime_force_suspend() , would that need a
> > treatment too?
> >
> > Though, adreno_system_suspend() is a static function in adreno_device.cc
>
> Naw, you get there indirectly from adreno_unbind()

Ah gotcha, thanks.

 - Joel

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2022-12-01 22:14 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-11 19:49 [PATCH 1/2] adreno: Shutdown the GPU properly Joel Fernandes (Google)
2022-11-11 19:49 ` Joel Fernandes (Google)
2022-11-11 19:49 ` [PATCH 2/2] adreno: Detect shutdown during get_param() Joel Fernandes (Google)
2022-11-11 19:49   ` Joel Fernandes (Google)
2022-11-11 21:28   ` Akhil P Oommen
2022-11-11 21:28     ` Akhil P Oommen
2022-11-11 21:37     ` Joel Fernandes
2022-11-11 21:37       ` Joel Fernandes
2022-11-11 22:19       ` Joel Fernandes
2022-11-12 18:35     ` Rob Clark
2022-11-12 18:35       ` Rob Clark
2022-12-01 18:42       ` Joel Fernandes
2022-12-01 18:42         ` Joel Fernandes
2022-12-01 19:33         ` Rob Clark
2022-12-01 19:33           ` Rob Clark
2022-12-01 20:01           ` Joel Fernandes
2022-12-01 20:01             ` Joel Fernandes
2022-11-11 21:08 ` [PATCH 1/2] adreno: Shutdown the GPU properly Joel Fernandes
2022-11-11 21:08   ` Joel Fernandes
2022-11-12 18:44   ` Rob Clark
2022-11-12 18:44     ` Rob Clark
2022-12-01 20:08     ` Joel Fernandes
2022-12-01 20:08       ` Joel Fernandes
2022-12-01 22:06       ` Rob Clark
2022-12-01 22:06         ` Rob Clark
2022-12-01 22:13         ` Joel Fernandes
2022-12-01 22:13           ` Joel Fernandes

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.