On 2022-04-26 22:31, Hangyu Hua wrote: > On 2022/4/26 22:55, Andrey Grodzovsky wrote: >> >> On 2022-04-25 22:54, Hangyu Hua wrote: >>> On 2022/4/25 23:42, Andrey Grodzovsky wrote: >>>> On 2022-04-25 04:36, Hangyu Hua wrote: >>>> >>>>> When drm_sched_job_add_dependency() fails, dma_fence_put() will be >>>>> called >>>>> internally. Calling it again after drm_sched_job_add_dependency() >>>>> finishes >>>>> may result in a dangling pointer. >>>>> >>>>> Fix this by removing redundant dma_fence_put(). >>>>> >>>>> Signed-off-by: Hangyu Hua >>>>> --- >>>>>   drivers/gpu/drm/lima/lima_gem.c        | 1 - >>>>>   drivers/gpu/drm/scheduler/sched_main.c | 1 - >>>>>   2 files changed, 2 deletions(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/lima/lima_gem.c >>>>> b/drivers/gpu/drm/lima/lima_gem.c >>>>> index 55bb1ec3c4f7..99c8e7f6bb1c 100644 >>>>> --- a/drivers/gpu/drm/lima/lima_gem.c >>>>> +++ b/drivers/gpu/drm/lima/lima_gem.c >>>>> @@ -291,7 +291,6 @@ static int lima_gem_add_deps(struct drm_file >>>>> *file, struct lima_submit *submit) >>>>>           err = drm_sched_job_add_dependency(&submit->task->base, >>>>> fence); >>>>>           if (err) { >>>>> -            dma_fence_put(fence); >>>>>               return err; >>>> >>>> >>>> Makes sense here >>>> >>>> >>>>>           } >>>>>       } >>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c >>>>> b/drivers/gpu/drm/scheduler/sched_main.c >>>>> index b81fceb0b8a2..ebab9eca37a8 100644 >>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c >>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c >>>>> @@ -708,7 +708,6 @@ int >>>>> drm_sched_job_add_implicit_dependencies(struct drm_sched_job *job, >>>>>           dma_fence_get(fence); >>>>>           ret = drm_sched_job_add_dependency(job, fence); >>>>>           if (ret) { >>>>> -            dma_fence_put(fence); >>>> >>>> >>>> >>>> Not sure about this one since if you look at the relevant commits - >>>> 'drm/scheduler: fix drm_sched_job_add_implicit_dependencies' and >>>> 'drm/scheduler: fix drm_sched_job_add_implicit_dependencies harder' >>>> You will see that the dma_fence_put here balances the extra >>>> dma_fence_get >>>> above >>>> >>>> Andrey >>>> >>> >>> I don't think so. I checked the call chain and found no additional >>> dma_fence_get(). But dma_fence_get() needs to be called before >>> drm_sched_job_add_dependency() to keep the counter balanced. >> >> >> I don't say there is an additional get, I just say that >> drm_sched_job_add_dependency doesn't grab an extra reference to the >> fences it stores so this needs to be done outside and for that >> drm_sched_job_add_implicit_dependencies->dma_fence_get is called and, >> if this addition fails you just call dma_fence_put to keep the >> counter balanced. >> > > drm_sched_job_add_implicit_dependencies() will call > drm_sched_job_add_dependency(). And drm_sched_job_add_dependency() > already call dma_fence_put() when it fails. Calling dma_fence_put() > twice doesn't make sense. > > dma_fence_get() is in [2]. But dma_fence_put() will be called in [1] > and [3] when xa_alloc() fails. The way I see it, [2] and [3] are mat matching *get* and *put* respectively. [1] *put* is against the original dma_fence_init->kref_init of the fence which always set the refcount at 1. Also in support of this see commit 'drm/scheduler: fix drm_sched_job_add_implicit_dependencies harder' - it says there "drm_sched_job_add_dependency() could drop the last ref"  - this last ref is the original refcount set by dma_fence_init->kref Andrey > > > int drm_sched_job_add_dependency(struct drm_sched_job *job, >                  struct dma_fence *fence) > { >     ... >     ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b, > GFP_KERNEL); >     if (ret != 0) >         dma_fence_put(fence);    <--- [1] > >     return ret; > } > EXPORT_SYMBOL(drm_sched_job_add_dependency); > > > int drm_sched_job_add_implicit_dependencies(struct drm_sched_job *job, >                         struct drm_gem_object *obj, >                         bool write) > { >     struct dma_resv_iter cursor; >     struct dma_fence *fence; >     int ret; > >     dma_resv_for_each_fence(&cursor, obj->resv, write, fence) { >         /* Make sure to grab an additional ref on the added fence */ >         dma_fence_get(fence);    <--- [2] >         ret = drm_sched_job_add_dependency(job, fence); >         if (ret) { >             dma_fence_put(fence);    <--- [3] >             return ret; >         } >     } >     return 0; > } > > >> >>> On the other hand, dma_fence_get() and dma_fence_put() are >>> meaningless here if threre is an extra dma_fence_get() beacause >>> counter will not decrease to 0 during drm_sched_job_add_dependency(). >>> >>> I check the call chain as follows: >>> >>> msm_ioctl_gem_submit() >>> -> submit_fence_sync() >>> -> drm_sched_job_add_implicit_dependencies() >> >> >> Can you maybe trace or print one such example of problematic refcount >> that you are trying to fix ? I still don't see where is the problem. >> >> Andrey >> > > I also wish I could. System logs can make this easy. But i don't have > a corresponding GPU physical device. > drm_sched_job_add_implicit_dependencies is only used in a few devices. > > Thanks. >> >>> >>> Thanks, >>> Hangyu >>> >>>> >>>>>               return ret; >>>>>           } >>>>>       }