1. Drm_sched_entity_fini(): it exit right after entity->job_queue empty, [ but that time scheduler is not fast enough to deal with this entity now ]

That should never happen.

The last job from the entity->job_queue is only removed after the scheduler is done with the entity (at least that was the original idea, not sure if that still works as expected).

[ML] no that’s not true and we already catch the kernel NULL pointer issue with a entity->last_scheduled fence get double put , exactly like the way I described in the scenario …

Pixel already fixed it by moving the put/get pair on entity->last_scheduled prior to spsc_queue_pop() and the race issue is therefore avoided

/Monk

From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
Sent: 2018年4月18日 16:36
To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Deng, Emily <Emily.Deng@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Subject: Re: force app kill patch

See that in “sched_entity_fini”, we only call dma_fence_put(entity->last_scheduled” under the condition of “If (entity->fini_status)”, so

This way there is memory leak for the case of “entity->fini_stats ==0”

Good catch, we indeed should fix that.

1. Drm_sched_entity_fini(): it exit right after entity->job_queue empty, [ but that time scheduler is not fast enough to deal with this entity now ]

*Correctio for the scenario *

After we move fence_put(entity->last_sched) out of the fini_status check:

A potential race issue for the scenario:

1.        Drm_sched_entity_fini(): it exit right after entity->job_queue empty, [ but that time scheduler is not fast enough to deal with this entity now ]

2.        Drm_sched_entity_cleanup() : it call dma_fence_put(entity->last_scheduled) [ but this time entity->last_scheduled actually points to the fence prior to the real last one ]

3.        Scheduler_main() now dealing with this entity: it call dma_fence_put(entity->last_scheduled)    [   Now this fence get double put !!! ]

4.        Scheduler_main() now call dma_fence_get() on the *real* last one !

So eventually the real last one fence triggers memory leak and more critical the double put fence cause NULL pointer access

/Monk

From: Liu, Monk
Sent: 2018年4月18日 15:11
To: Koenig, Christian <Christian.Koenig@amd.com>; Deng, Emily <Emily.Deng@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Subject: force app kill patch

Hi Christian & Emily

I think the v4 fix for “fix force app kill hang” is still not good enough:

First:

See that in “sched_entity_fini”, we only call dma_fence_put(entity->last_scheduled” under the condition of “If (entity->fini_status)”, so

This way there is memory leak for the case of “entity->fini_stats ==0”

Second:

If we move dma_fence_put(entity->last_scheduled) out of the condition of “if (entity->fini_status)”, the memory leak issue can be fixed

But there will be kernel NULL pointer access, I think the time you call dma_fence_put(entity->last_scheduled”) may actually executed *not*

On the last scheduled fence of this entity, because it is run without “thread_park/unpark” pair which to make sure scheduler not dealing this entity

So with certain race issue, here is the scenario:

1.        scheduler is doing the dma_fence_put() on the 1^st fence,

2.        scheduler set entity->last_scheduled to 1^st fence

3.        now sched_entity_fini() run, and it call dma_fence_put() on entity->last_scheduled

4.        now this 1^st fence is actually put double time and the real last fence won’t get put by expected

any idea?

/Monk
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx