* force app kill patch
@ 2018-04-18  7:11 Liu, Monk
       [not found] ` <BLUPR12MB044915BDA633308DF65967BD84B60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Liu, Monk @ 2018-04-18  7:11 UTC (permalink / raw)
  To: Koenig, Christian, Deng, Emily; +Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



Hi Christian & Emily

I think the v4 fix for "fix force app kill hang" is still not good enough:


First:
In "sched_entity_fini" we only call dma_fence_put(entity->last_scheduled) under the condition "if (entity->fini_status)", so
this way there is a memory leak for the case of "entity->fini_status == 0".
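For reference, a minimal sketch of the shape I mean (not the actual v4 patch, just my reading of the cleanup path; names approximate):

        /* in sched_entity_fini(), roughly: */
        if (entity->fini_status) {
                /* kill the remaining jobs, signal their fences, ... */
                dma_fence_put(entity->last_scheduled);
                entity->last_scheduled = NULL;
        }
        /*
         * When fini_status == 0 the reference held on
         * entity->last_scheduled is never dropped, so that fence leaks.
         */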


Second:
If we move dma_fence_put(entity->last_scheduled) out of the "if (entity->fini_status)" condition, the memory leak can be fixed,
but then there can be a kernel NULL pointer access: by the time you call dma_fence_put(entity->last_scheduled), it may actually *not* be operating
on the last scheduled fence of this entity, because it runs without the "thread_park/unpark" pair that would make sure the scheduler is not dealing with this entity.

So with a certain race, here is the scenario (a rough sketch in comment form follows the list):


1.        the scheduler is doing the dma_fence_put() on the 1st fence,

2.        the scheduler sets entity->last_scheduled to the 1st fence,

3.        now sched_entity_fini() runs, and it calls dma_fence_put() on entity->last_scheduled,

4.        now this 1st fence is actually put twice, and the real last fence won't get put as expected.
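A rough sketch of the interleaving, in comment form (fence1 = the 1st scheduled fence, fence2 = the real last one; only an illustration of the ordering above, not actual kernel code):

        /*
         *   scheduler thread:      dma_fence_put(entity->last_scheduled);
         *   scheduler thread:      entity->last_scheduled = dma_fence_get(fence1);
         *   sched_entity_fini():   dma_fence_put(entity->last_scheduled);   // puts fence1
         *   scheduler thread:      dma_fence_put(entity->last_scheduled);   // puts fence1 again!
         *   scheduler thread:      entity->last_scheduled = dma_fence_get(fence2);
         *                          // nobody ever puts fence2 -> it leaks
         */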


any idea?


/Monk


* RE: force app kill patch
       [not found] ` <BLUPR12MB044915BDA633308DF65967BD84B60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-04-18  7:20   ` Liu, Monk
       [not found]     ` <BLUPR12MB0449CE1BE8367DCA8B7AA3BB84B60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Liu, Monk @ 2018-04-18  7:20 UTC (permalink / raw)
  To: Koenig, Christian, Deng, Emily; +Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



*Correction to the scenario*

After we move dma_fence_put(entity->last_scheduled) out of the fini_status check:


A potential race scenario:


1.        drm_sched_entity_fini(): it exits right after entity->job_queue becomes empty [but at that time the scheduler has not yet finished dealing with this entity],

2.        drm_sched_entity_cleanup(): it calls dma_fence_put(entity->last_scheduled)  [but at this point entity->last_scheduled actually points to the fence prior to the real last one],

3.        the scheduler thread, now dealing with this entity, calls dma_fence_put(entity->last_scheduled)  [now this fence gets a double put!],

4.        the scheduler thread now calls dma_fence_get() on the *real* last one!

So eventually the real last fence triggers a memory leak and, more critically, the double-put fence causes a NULL pointer access.
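To make the window concrete, here is a rough sketch of the job-pop path before the fix, reconstructed only from the description in this mail (simplified, not the actual kernel source):

static struct drm_sched_job *
drm_sched_entity_pop_job(struct drm_sched_entity *entity)
{
        struct drm_sched_job *sched_job;

        sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
        if (!sched_job)
                return NULL;

        /* ... dependency handling ... */

        spsc_queue_pop(&entity->job_queue);
        /*
         * Window: the queue is already empty here, so
         * drm_sched_entity_fini()/_cleanup() on another thread can return
         * and put entity->last_scheduled, which still points to the
         * previous fence ...
         */
        dma_fence_put(entity->last_scheduled);
        entity->last_scheduled =
                dma_fence_get(&sched_job->s_fence->finished);
        /* ... and that same previous fence is then put a second time here. */

        return sched_job;
}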

/Monk

From: Liu, Monk
Sent: April 18, 2018 15:11
To: Koenig, Christian <Christian.Koenig@amd.com>; Deng, Emily <Emily.Deng@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Subject: force app kill patch

Hi Christian & Emily

I think the v4 fix for “fix force app kill hang” is still not good enough:


First:
In “sched_entity_fini” we only call dma_fence_put(entity->last_scheduled) under the condition “if (entity->fini_status)”, so
this way there is a memory leak for the case of “entity->fini_status == 0”.


Second:
If we move dma_fence_put(entity->last_scheduled) out of the “if (entity->fini_status)” condition, the memory leak can be fixed,
but then there can be a kernel NULL pointer access: by the time you call dma_fence_put(entity->last_scheduled), it may actually *not* be operating
on the last scheduled fence of this entity, because it runs without the “thread_park/unpark” pair that would make sure the scheduler is not dealing with this entity.

So with a certain race, here is the scenario:


1.        the scheduler is doing the dma_fence_put() on the 1st fence,

2.        the scheduler sets entity->last_scheduled to the 1st fence,

3.        now sched_entity_fini() runs, and it calls dma_fence_put() on entity->last_scheduled,

4.        now this 1st fence is actually put twice, and the real last fence won't get put as expected.


any idea?


/Monk


* Re: force app kill patch
       [not found]     ` <BLUPR12MB0449CE1BE8367DCA8B7AA3BB84B60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-04-18  8:36       ` Christian König
       [not found]         ` <6f4369eb-6064-3115-b9ae-9d5e0b3440a2-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Christian König @ 2018-04-18  8:36 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, Deng, Emily
  Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



> See that in “sched_entity_fini”, we only call
> dma_fence_put(entity->last_scheduled) under the condition of “if
> (entity->fini_status)”, so
>
> this way there is a memory leak for the case of “entity->fini_status == 0”
>
Good catch, we indeed should fix that.

> 1. drm_sched_entity_fini(): it exits right after entity->job_queue
> becomes empty [but at that time the scheduler has not yet finished
> dealing with this entity]
That should never happen.

The last job from the entity->job_queue is only removed after the 
scheduler is done with the entity (at least that was the original idea, 
not sure if that still works as expected).

Regards,
Christian.

On 18.04.2018 at 09:20, Liu, Monk wrote:
>
> *Correction to the scenario*
>
> After we move dma_fence_put(entity->last_scheduled) out of the fini_status check:
>
> A potential race scenario:
>
> 1. drm_sched_entity_fini(): it exits right after entity->job_queue
> becomes empty [but at that time the scheduler has not yet finished
> dealing with this entity]
>
> 2. drm_sched_entity_cleanup(): it calls
> dma_fence_put(entity->last_scheduled) [but at this point
> entity->last_scheduled actually points to the fence prior to the real
> last one]
>
> 3. the scheduler thread, now dealing with this entity, calls
> dma_fence_put(entity->last_scheduled) [now this fence gets a double
> put!]
>
> 4. the scheduler thread now calls dma_fence_get() on the *real* last one!
>
> So eventually the real last fence triggers a memory leak and, more
> critically, the double-put fence causes a NULL pointer access
>
> /Monk
>
> *From:* Liu, Monk
> *Sent:* April 18, 2018 15:11
> *To:* Koenig, Christian <Christian.Koenig-5C7GfCeVMHo@public.gmane.org>; Deng, Emily 
> <Emily.Deng-5C7GfCeVMHo@public.gmane.org>
> *Cc:* amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> *Subject:* force app kill patch
>
> Hi Christian & Emily
>
> I think the v4 fix for “fix force app kill hang” is still not good enough:
>
> First:
>
> See that in “sched_entity_fini”, we only call
> dma_fence_put(entity->last_scheduled) under the condition of “if
> (entity->fini_status)”, so
>
> this way there is a memory leak for the case of “entity->fini_status == 0”
>
> Second:
>
> If we move dma_fence_put(entity->last_scheduled) out of the condition
> of “if (entity->fini_status)”, the memory leak can be fixed,
>
> but then there can be a kernel NULL pointer access: by the time you
> call dma_fence_put(entity->last_scheduled), it may actually *not* be operating
>
> on the last scheduled fence of this entity, because it runs without the
> “thread_park/unpark” pair that would make sure the scheduler is not
> dealing with this entity
>
> So with a certain race, here is the scenario:
>
> 1. the scheduler is doing the dma_fence_put() on the 1st fence,
>
> 2. the scheduler sets entity->last_scheduled to the 1st fence,
>
> 3. now sched_entity_fini() runs and calls dma_fence_put() on
> entity->last_scheduled,
>
> 4. now this 1st fence is actually put twice and the real last
> fence won't get put as expected
>
> any idea?
>
> /Monk
>
>
>



* RE: force app kill patch
       [not found]         ` <6f4369eb-6064-3115-b9ae-9d5e0b3440a2-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-04-18  9:00           ` Liu, Monk
       [not found]             ` <BLUPR12MB0449F857ED9EFA931F1FC66084B60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Liu, Monk @ 2018-04-18  9:00 UTC (permalink / raw)
  To: Koenig, Christian, Deng, Emily; +Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



1.        drm_sched_entity_fini(): it exits right after entity->job_queue becomes empty [but at that time the scheduler has not yet finished dealing with this entity]
That should never happen.

The last job from the entity->job_queue is only removed after the scheduler is done with the entity (at least that was the original idea, not sure if that still works as expected).

[ML] No, that's not true, and we already hit the kernel NULL pointer issue with an entity->last_scheduled fence getting a double put, exactly the way I described in the scenario …

Pixel already fixed it by moving the put/get pair on entity->last_scheduled prior to spsc_queue_pop() and the race issue is therefore avoided
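For clarity, the shape of that fix as I understand it (sketch only, reconstructed from the description above, not the exact patch):

        /* in drm_sched_entity_pop_job(), roughly: */
        dma_fence_put(entity->last_scheduled);
        entity->last_scheduled =
                dma_fence_get(&sched_job->s_fence->finished);

        /*
         * Only now make the queue observably empty: by the time
         * drm_sched_entity_fini()/_cleanup() can return and put
         * entity->last_scheduled, it already points at the real last fence.
         */
        spsc_queue_pop(&entity->job_queue);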

/Monk

From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
Sent: April 18, 2018 16:36
To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Deng, Emily <Emily.Deng@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Subject: Re: force app kill patch

See that in “sched_entity_fini”, we only call dma_fence_put(entity->last_scheduled) under the condition of “if (entity->fini_status)”, so
this way there is a memory leak for the case of “entity->fini_status == 0”
Good catch, we indeed should fix that.


1.        drm_sched_entity_fini(): it exits right after entity->job_queue becomes empty [but at that time the scheduler has not yet finished dealing with this entity]
That should never happen.

The last job from the entity->job_queue is only removed after the scheduler is done with the entity (at least that was the original idea, not sure if that still works as expected).

Regards,
Christian.

On 18.04.2018 at 09:20, Liu, Monk wrote:
*Correction to the scenario*

After we move dma_fence_put(entity->last_scheduled) out of the fini_status check:


A potential race scenario:


1.        drm_sched_entity_fini(): it exits right after entity->job_queue becomes empty [but at that time the scheduler has not yet finished dealing with this entity],

2.        drm_sched_entity_cleanup(): it calls dma_fence_put(entity->last_scheduled)  [but at this point entity->last_scheduled actually points to the fence prior to the real last one],

3.        the scheduler thread, now dealing with this entity, calls dma_fence_put(entity->last_scheduled)  [now this fence gets a double put!],

4.        the scheduler thread now calls dma_fence_get() on the *real* last one!

So eventually the real last fence triggers a memory leak and, more critically, the double-put fence causes a NULL pointer access.

/Monk

From: Liu, Monk
Sent: April 18, 2018 15:11
To: Koenig, Christian <Christian.Koenig@amd.com>; Deng, Emily <Emily.Deng@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Subject: force app kill patch

Hi Christian & Emily

I think the v4 fix for “fix force app kill hang” is still not good enough:


First:
See that in “sched_entity_fini”, we only call dma_fence_put(entity->last_scheduled) under the condition of “if (entity->fini_status)”, so
this way there is a memory leak for the case of “entity->fini_status == 0”.


Second:
If we move dma_fence_put(entity->last_scheduled) out of the “if (entity->fini_status)” condition, the memory leak can be fixed,
but then there can be a kernel NULL pointer access: by the time you call dma_fence_put(entity->last_scheduled), it may actually *not* be operating
on the last scheduled fence of this entity, because it runs without the “thread_park/unpark” pair that would make sure the scheduler is not dealing with this entity.

So with a certain race, here is the scenario:


1.        the scheduler is doing the dma_fence_put() on the 1st fence,

2.        the scheduler sets entity->last_scheduled to the 1st fence,

3.        now sched_entity_fini() runs, and it calls dma_fence_put() on entity->last_scheduled,

4.        now this 1st fence is actually put twice, and the real last fence won't get put as expected.


any idea?


/Monk







* Re: force app kill patch
       [not found]             ` <BLUPR12MB0449F857ED9EFA931F1FC66084B60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-04-18  9:03               ` Christian König
  0 siblings, 0 replies; 5+ messages in thread
From: Christian König @ 2018-04-18  9:03 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, Deng, Emily
  Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



On 18.04.2018 at 11:00, Liu, Monk wrote:
>
> 1. drm_sched_entity_fini(): it exits right after entity->job_queue
> becomes empty [but at that time the scheduler has not yet finished
> dealing with this entity]
>
> That should never happen.
>
> The last job from the entity->job_queue is only removed after the 
> scheduler is done with the entity (at least that was the original 
> idea, not sure if that still works as expected).
>
> [ML] No, that's not true, and we already hit the kernel NULL pointer
> issue with an entity->last_scheduled fence getting a double put, exactly
> the way I described in the scenario …
>
> Pixel already fixed it by moving the put/get pair on 
> entity->last_scheduled prior to spsc_queue_pop() and the race issue is 
> therefore avoided
>

Yeah, already seen and reviewed that. That's a good catch, please make 
sure that this gets pushed to amd-staging-drm-next ASAP.

Christian.

> /Monk
>
> *From:* Christian König [mailto:ckoenig.leichtzumerken-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org]
> *Sent:* April 18, 2018 16:36
> *To:* Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org>; Koenig, Christian 
> <Christian.Koenig-5C7GfCeVMHo@public.gmane.org>; Deng, Emily <Emily.Deng-5C7GfCeVMHo@public.gmane.org>
> *Cc:* amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> *Subject:* Re: force app kill patch
>
>     See that in “sched_entity_fini”, we only call
>     dma_fence_put(entity->last_scheduled) under the condition of “if
>     (entity->fini_status)”, so
>
>     this way there is a memory leak for the case of “entity->fini_status == 0”
>
> Good catch, we indeed should fix that.
>
>
>     1. drm_sched_entity_fini(): it exits right after entity->job_queue
>     becomes empty [but at that time the scheduler has not yet finished
>     dealing with this entity]
>
> That should never happen.
>
> The last job from the entity->job_queue is only removed after the 
> scheduler is done with the entity (at least that was the original 
> idea, not sure if that still works as expected).
>
> Regards,
> Christian.
>
> On 18.04.2018 at 09:20, Liu, Monk wrote:
>
>     *Correction to the scenario*
>
>     After we move dma_fence_put(entity->last_scheduled) out of the
>     fini_status check:
>
>     A potential race scenario:
>
>     1. drm_sched_entity_fini(): it exits right after entity->job_queue
>     becomes empty [but at that time the scheduler has not yet finished
>     dealing with this entity]
>
>     2. drm_sched_entity_cleanup(): it calls
>     dma_fence_put(entity->last_scheduled) [but at this point
>     entity->last_scheduled actually points to the fence prior to the
>     real last one]
>
>     3. the scheduler thread, now dealing with this entity, calls
>     dma_fence_put(entity->last_scheduled) [now this fence gets a
>     double put!]
>
>     4. the scheduler thread now calls dma_fence_get() on the *real* last one!
>
>     So eventually the real last fence triggers a memory leak and,
>     more critically, the double-put fence causes a NULL pointer access
>
>     /Monk
>
>     *From:* Liu, Monk
>     *Sent:* April 18, 2018 15:11
>     *To:* Koenig, Christian <Christian.Koenig-5C7GfCeVMHo@public.gmane.org>; Deng, Emily
>     <Emily.Deng-5C7GfCeVMHo@public.gmane.org>
>     *Cc:* amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>     *Subject:* force app kill patch
>
>     Hi Christian & Emily
>
>     I think the v4 fix for “fix force app kill hang” is still not good
>     enough:
>
>     First:
>
>     See that in “sched_entity_fini”, we only call
>     dma_fence_put(entity->last_scheduled) under the condition of “if
>     (entity->fini_status)”, so
>
>     this way there is a memory leak for the case of “entity->fini_status == 0”
>
>     Second:
>
>     If we move dma_fence_put(entity->last_scheduled) out of the
>     condition of “if (entity->fini_status)”, the memory leak can
>     be fixed,
>
>     but then there can be a kernel NULL pointer access: by the time you
>     call dma_fence_put(entity->last_scheduled), it may actually *not* be
>     operating
>
>     on the last scheduled fence of this entity, because it runs
>     without the “thread_park/unpark” pair that would make sure the
>     scheduler is not dealing with this entity
>
>     So with a certain race, here is the scenario:
>
>     1. the scheduler is doing the dma_fence_put() on the 1st fence,
>
>     2. the scheduler sets entity->last_scheduled to the 1st fence,
>
>     3. now sched_entity_fini() runs and calls dma_fence_put() on
>     entity->last_scheduled,
>
>     4. now this 1st fence is actually put twice and the real last
>     fence won't get put as expected
>
>     any idea?
>
>     /Monk
>
>
>
>
>
>
>




Thread overview: 5+ messages
2018-04-18  7:11 force app kill patch Liu, Monk
     [not found] ` <BLUPR12MB044915BDA633308DF65967BD84B60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-04-18  7:20   ` Liu, Monk
     [not found]     ` <BLUPR12MB0449CE1BE8367DCA8B7AA3BB84B60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-04-18  8:36       ` Christian König
     [not found]         ` <6f4369eb-6064-3115-b9ae-9d5e0b3440a2-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-04-18  9:00           ` Liu, Monk
     [not found]             ` <BLUPR12MB0449F857ED9EFA931F1FC66084B60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-04-18  9:03               ` Christian König
