On Fri, Nov 11, 2022 at 4:37 PM Joel Fernandes wrote:
>
> On Nov 11, 2022, at 4:28 PM, Akhil P Oommen wrote:
>
> > On 11/12/2022 1:19 AM, Joel Fernandes (Google) wrote:
> >> Even though the GPU is shut down, during kexec reboot we can have userspace
> >> still running. This is especially true if KEXEC_JUMP is not enabled, because we
> >> do not freeze userspace in this case.
> >>
> >> To prevent crashes, track that the GPU is shutdown and prevent get_param() from
> >> accessing GPU resources if we find it shutdown.
> >>
> >> This fixes the following crash during kexec reboot on an ARM64 device with adreno GPU:
> >>
> >> [ 292.534314] Kernel panic - not syncing: Asynchronous SError Interrupt
> >> [ 292.534323] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> >> [ 292.534326] Call trace:
> >> [ 292.534328]  dump_backtrace+0x0/0x1d4
> >> [ 292.534337]  show_stack+0x20/0x2c
> >> [ 292.534342]  dump_stack_lvl+0x60/0x78
> >> [ 292.534347]  dump_stack+0x18/0x38
> >> [ 292.534352]  panic+0x148/0x3b0
> >> [ 292.534357]  nmi_panic+0x80/0x94
> >> [ 292.534364]  arm64_serror_panic+0x70/0x7c
> >> [ 292.534369]  do_serror+0x0/0x7c
> >> [ 292.534372]  do_serror+0x54/0x7c
> >> [ 292.534377]  el1h_64_error_handler+0x34/0x4c
> >> [ 292.534381]  el1h_64_error+0x7c/0x80
> >> [ 292.534386]  el1_interrupt+0x20/0x58
> >> [ 292.534389]  el1h_64_irq_handler+0x18/0x24
> >> [ 292.534395]  el1h_64_irq+0x7c/0x80
> >> [ 292.534399]  local_daif_inherit+0x10/0x18
> >> [ 292.534405]  el1h_64_sync_handler+0x48/0xb4
> >> [ 292.534410]  el1h_64_sync+0x7c/0x80
> >> [ 292.534414]  a6xx_gmu_set_oob+0xbc/0x1fc
> >> [ 292.534422]  a6xx_get_timestamp+0x40/0xb4
> >> [ 292.534426]  adreno_get_param+0x12c/0x1e0
> >> [ 292.534433]  msm_ioctl_get_param+0x64/0x70
> >> [ 292.534440]  drm_ioctl_kernel+0xe8/0x158
> >> [ 292.534448]  drm_ioctl+0x208/0x320
> >> [ 292.534453]  __arm64_sys_ioctl+0x98/0xd0
> >> [ 292.534461]  invoke_syscall+0x4c/0x118
> >> [ 292.534467]  el0_svc_common+0x98/0x104
> >> [ 292.534473]  do_el0_svc+0x30/0x80
> >> [ 292.534478]  el0_svc+0x20/0x50
> >> [ 292.534481]  el0t_64_sync_handler+0x78/0x108
> >> [ 292.534485]  el0t_64_sync+0x1a4/0x1a8
> >> [ 292.534632] Kernel Offset: 0x1a5f800000 from 0xffffffc008000000
> >> [ 292.534635] PHYS_OFFSET: 0x80000000
> >> [ 292.534638] CPU features: 0x40018541,a3300e42
> >> [ 292.534644] Memory Limit: none
> >>
> >> Cc: Rob Clark
> >> Cc: Steven Rostedt
> >> Cc: Ricardo Ribalda
> >> Cc: Ross Zwisler
> >> Signed-off-by: Joel Fernandes (Google)
> >> ---
> >>  drivers/gpu/drm/msm/adreno/adreno_device.c | 1 +
> >>  drivers/gpu/drm/msm/adreno/adreno_gpu.c    | 2 +-
> >>  drivers/gpu/drm/msm/msm_gpu.h              | 3 +++
> >>  3 files changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> index f0cff62812c3..03d912dc0130 100644
> >> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> @@ -612,6 +612,7 @@ static void adreno_shutdown(struct platform_device *pdev)
> >>  {
> >>  	struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> >> +	gpu->is_shutdown = true;
> >>  	WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> >>  }
> >> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >> index 382fb7f9e497..6903c6892469 100644
> >> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >> @@ -251,7 +251,7 @@ int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
> >>  	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> >>  	/* No pointer params yet */
> >> -	if (*len != 0)
> >> +	if (*len != 0 || gpu->is_shutdown)
> >>  		return -EINVAL;
> >
> > This will race with shutdown.
>
> Could you clarify what you mean? At this point in the code, the shutdown
> is completed and it crashes here.

Ok, so I think you meant that if the shutdown happens after we sample
is_shutdown, then we run into the same issue. I can't reproduce that, but
I'll look into it. Another way might be to synchronize using a mutex.
Though maybe the shutdown path can wait for active pm_runtime references?

Thanks.

> > Probably, propagating back the return value of pm_runtime_get() in every
> > possible ioctl call path is the right thing to do.
>
> Ok, I'll look into that. But the patch I posted works reliably and fixes
> all the crashes we could reproduce.
>
> > I have never thought about this scenario. Do you know why userspace is
> > not frozen before kexec?
>
> I am not sure. It depends on how kexec is used. The userspace freeze
> happens only when kexec is called to switch back and forth between
> different kernels (persistence mode). In that scenario, I believe
> userspace has to be frozen and unfrozen. However, for a normal kexec,
> that does not happen.
>
> Thanks.
>
> > -Akhil.
> >>  	switch (param) {
> >> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> >> index ff911e7305ce..f18b0a91442b 100644
> >> --- a/drivers/gpu/drm/msm/msm_gpu.h
> >> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> >> @@ -214,6 +214,9 @@ struct msm_gpu {
> >>  	/* does gpu need hw_init? */
> >>  	bool needs_hw_init;
> >> +	/* is the GPU shutdown? */
> >> +	bool is_shutdown;
> >> +
> >>  	/**
> >>  	 * global_faults: number of GPU hangs not attributed to a particular
> >>  	 * address space
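
For illustration, a minimal sketch of the mutex idea mentioned above could
look like the following. It is only a sketch, not the posted patch: the
shutdown_lock field (a new struct mutex member of struct msm_gpu,
initialized with mutex_init() at GPU creation) and the
adreno_get_param_locked() helper are hypothetical names introduced here.

/*
 * Hypothetical sketch only, not the posted patch. Holding shutdown_lock
 * across both the is_shutdown check and the hardware access keeps
 * adreno_shutdown() from completing in between, which is the race
 * discussed above.
 */

static void adreno_shutdown(struct platform_device *pdev)
{
	struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);

	mutex_lock(&gpu->shutdown_lock);
	gpu->is_shutdown = true;
	WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
	mutex_unlock(&gpu->shutdown_lock);
}

int adreno_get_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
		     uint32_t param, uint64_t *value, uint32_t *len)
{
	int ret;

	/* No pointer params yet */
	if (*len != 0)
		return -EINVAL;

	mutex_lock(&gpu->shutdown_lock);
	if (gpu->is_shutdown) {
		mutex_unlock(&gpu->shutdown_lock);
		return -EINVAL;
	}

	/*
	 * adreno_get_param_locked() is a hypothetical helper holding the
	 * existing switch (param) body; shutdown is held off while it runs.
	 */
	ret = adreno_get_param_locked(gpu, ctx, param, value, len);
	mutex_unlock(&gpu->shutdown_lock);

	return ret;
}

The alternative Akhil suggests would instead propagate the return value of
pm_runtime_get() back through every ioctl call path, so userspace gets an
error once the device can no longer be powered up, without needing the new
flag at all.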