Ah! So we have one RCU grace period caused by the scheduler fence and another one from the hardware fence.

Thanks for explaining that once more, that was really not obvious from reading the code.

But this means we could also fix it by moving the "dma_fence_put(fence->parent);" from drm_sched_fence_free() into drm_sched_fence_release_scheduled(), directly before the call_rcu(), couldn't we?

I think that would be cleaner, because otherwise every driver using the GPU scheduler would need that workaround.

And drm_sched_fence_release_scheduled() is only called when the reference count becomes zero, so anybody accessing the structure after that would be a bug anyway.
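
Something like this untested sketch, just to illustrate the idea (names from memory of sched_fence.c, the exact code may differ):

static void drm_sched_fence_release_scheduled(struct dma_fence *f)
{
        struct drm_sched_fence *fence = to_drm_sched_fence(f);

        /* Drop the hardware fence before queueing our own RCU callback,
         * so the two grace periods are no longer chained behind each other.
         */
        dma_fence_put(fence->parent);
        call_rcu(&fence->finished.rcu, drm_sched_fence_free);
}

static void drm_sched_fence_free(struct rcu_head *rcu)
{
        struct dma_fence *f = container_of(rcu, struct dma_fence, rcu);
        struct drm_sched_fence *fence = to_drm_sched_fence(f);

        /* The dma_fence_put(fence->parent) would be gone from here. */
        kmem_cache_free(sched_fence_slab, fence);
}

Assuming the last reference to fence->parent is usually dropped there, the amdgpu call_rcu() would already be queued before the scheduler fence's grace period starts, and the single rcu_barrier() in amdgpu_fence_slab_fini() should be sufficient again.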

Thanks,
Christian.

Am 18.05.2018 um 11:56 schrieb Deng, Emily:
Hi Christian,
        When we free an IB fence, we first call call_rcu() in drm_sched_fence_release_scheduled(), as shown in call trace one; after that callback has run,
we call call_rcu() a second time in amdgpu_fence_release(), as shown in call trace two below.
 
The kmem_cache_free(amdgpu_fence_slab, fence)'s call trace as below:
    1.drm_sched_entity_fini ->
    drm_sched_entity_cleanup ->
    dma_fence_put(entity->last_scheduled) ->
    drm_sched_fence_release_finished ->
    drm_sched_fence_release_scheduled ->
    call_rcu(&fence->finished.rcu, drm_sched_fence_free)
 
    2.drm_sched_fence_free ->
    dma_fence_put(fence->parent) ->
    amdgpu_fence_release ->
    call_rcu(&f->rcu, amdgpu_fence_free) ->
    kmem_cache_free(amdgpu_fence_slab, fence);
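
For reference, the amdgpu side of call trace two looks roughly like this (paraphrased from amdgpu_fence.c; the exact tree may differ):

static void amdgpu_fence_release(struct dma_fence *f)
{
        /* This queues the second RCU callback, which can only run a full
         * grace period after drm_sched_fence_free() dropped the reference.
         */
        call_rcu(&f->rcu, amdgpu_fence_free);
}

static void amdgpu_fence_free(struct rcu_head *rcu)
{
        struct dma_fence *f = container_of(rcu, struct dma_fence, rcu);
        struct amdgpu_fence *fence = to_amdgpu_fence(f);

        kmem_cache_free(amdgpu_fence_slab, fence);
}

So the kmem_cache_free() only happens two full grace periods after the last reference to the scheduler fence is dropped.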
 
> -----Original Message-----
> From: Koenig, Christian
> Sent: Friday, May 18, 2018 5:46 PM
> To: Deng, Emily <Emily.Deng-5C7GfCeVMHo@public.gmane.org>; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> Subject: Re: [PATCH] drm/amdgpu: add rcu_barrier after entity fini
>
> Ok, I'm lost. Where do we use call_rcu() twice? Because that sounds incorrect in
> the first place.
>
> Christian.
>
> Am 18.05.2018 um 11:41 schrieb Deng, Emily:
> > Ping......
> >
> > Best Wishes,
> > Emily Deng
> >
> >> -----Original Message-----
> >> From: amd-gfx [mailto:amd-gfx-bounces-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org] On
> >> Behalf Of Deng, Emily
> >> Sent: Friday, May 18, 2018 11:20 AM
> >> To: Koenig, Christian <Christian.Koenig-5C7GfCeVMHo@public.gmane.org>; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> >> Subject: RE: [PATCH] drm/amdgpu: add rcu_barrier after entity fini
> >>
> >> Hi Christian,
> >>       Yes, there is already one rcu_barrier(), but call_rcu() is called
> >> twice, so the single rcu_barrier() may only wait for the first call_rcu().
> >>      After I added another rcu_barrier(), the kernel issue disappeared.
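> >>
> >>      In other words, the patched teardown would roughly look like this:
> >>
> >> void amdgpu_fence_slab_fini(void)
> >> {
> >>         rcu_barrier();  /* waits for drm_sched_fence_free(), which
> >>                          * queues the second call_rcu() */
> >>         rcu_barrier();  /* waits for amdgpu_fence_free(), which does
> >>                          * the kmem_cache_free() */
> >>         kmem_cache_destroy(amdgpu_fence_slab);
> >> }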
> >>
> >> Best Wishes,
> >> Emily Deng
> >>
> >>> -----Original Message-----
> >>> From: Christian König [mailto:ckoenig.leichtzumerken-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org]
> >>> Sent: Thursday, May 17, 2018 7:08 PM
> >>> To: Deng, Emily <Emily.Deng-5C7GfCeVMHo@public.gmane.org>; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> >>> Subject: Re: [PATCH] drm/amdgpu: add rcu_barrier after entity fini
> >>>
> >>> Am 17.05.2018 um 12:03 schrieb Emily Deng:
> >>>> Freeing a fence from the amdgpu_fence_slab needs two chained
> >>>> call_rcu() invocations. To avoid amdgpu_fence_slab_fini() calling
> >>>> kmem_cache_destroy(amdgpu_fence_slab) before
> >>>> kmem_cache_free(amdgpu_fence_slab, fence) has run, add rcu_barrier
> >>>> after drm_sched_entity_fini.
> >>>> The kmem_cache_free(amdgpu_fence_slab, fence)'s call trace as below:
> >>>> 1.drm_sched_entity_fini ->
> >>>> drm_sched_entity_cleanup ->
> >>>> dma_fence_put(entity->last_scheduled) ->
> >>>> drm_sched_fence_release_finished ->
> >>>> drm_sched_fence_release_scheduled ->
> >>>> call_rcu(&fence->finished.rcu, drm_sched_fence_free)
> >>>>
> >>>> 2.drm_sched_fence_free ->
> >>>> dma_fence_put(fence->parent) ->
> >>>> amdgpu_fence_release ->
> >>>> call_rcu(&f->rcu, amdgpu_fence_free) ->
> >>>> kmem_cache_free(amdgpu_fence_slab, fence);
> >>>>
> >>>> v2: put the barrier before the kmem_cache_destroy
> >>>>
> >>>> Change-Id: I8dcadd3372f97e72461bf46b41cc26d90f09b8df
> >>>> Signed-off-by: Emily Deng <Emily.Deng-5C7GfCeVMHo@public.gmane.org>
> >>>> ---
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 1 +
> >>>>    1 file changed, 1 insertion(+)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >>>> index 39ec6b8..42be65b 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >>>> @@ -69,6 +69,7 @@ int amdgpu_fence_slab_init(void)
> >>>>    void amdgpu_fence_slab_fini(void)
> >>>>    {
> >>>>           rcu_barrier();
> >>>> +        rcu_barrier();
> >>> Well, you should have noted that there is already an rcu_barrier()
> >>> here and that adding another one shouldn't have any additional effect.
> >>> So your explanation and the proposed solution don't make too much
> >>> sense.
> >>>
> >>> I think the problem you run into is rather that the fence is
> >>> reference counted and might live longer than the module that created it.
> >>>
> >>> Complicated issue. One possible solution would be to release
> >>> fence->parent earlier in the scheduler fence, but that doesn't sound
> >>> like a general purpose solution.
> >>>
> >>> Christian.
> >>>
> >>>>           kmem_cache_destroy(amdgpu_fence_slab);
> >>>>    }
> >>>>    /*
> >> _______________________________________________
> >> amd-gfx mailing list
> >> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx