All of lore.kernel.org
 help / color / mirror / Atom feed
* regression with d6c650c0a8f6f671e49553725e1db541376d95f2
@ 2017-10-13  7:24 Liu, Monk
       [not found] ` <BLUPR12MB044941343D6D1A35996C7B4284480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Liu, Monk @ 2017-10-13  7:24 UTC (permalink / raw)
  To: Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Koenig, Christian


[-- Attachment #1.1: Type: text/plain, Size: 1369 bytes --]

commit d6c650c0a8f6f671e49553725e1db541376d95f2
Author: Nicolai Hähnle <nicolai.haehnle-5C7GfCeVMHo@public.gmane.org>
@@ -611,6 +611,10 @@ static int amd_sched_main(void *param)

                fence = sched->ops->run_job(sched_job);
                amd_sched_fence_scheduled(s_fence);
+
+               /* amd_sched_process_job drops the job's reference of the fence. */
+               sched_job->s_fence = NULL;
+
                if (fence) {
                        s_fence->parent = dma_fence_get(fence);
                        r = dma_fence_add_callback(fence, &s_fence->cb,


Hi Nicolai


with this patch, you will break "amdgpu_sched_hw_job_reset()"routine:

void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched)
{
    struct amd_sched_job *s_job;

    spin_lock(&sched->job_list_lock);
    list_for_each_entry_reverse(s_job, &sched->ring_mirror_list, node) {
        if (s_job->s_fence->parent &&
         fence_remove_callback(s_job->s_fence->parent,
                     &s_job->s_fence->cb)) {
            fence_put(s_job->s_fence->parent);
            s_job->s_fence->parent = NULL;
            atomic_dec(&sched->hw_rq_count);
        }
    }
    spin_unlock(&sched->job_list_lock);
}


see that without sched_job->s_fence, you cannot remove the callback from its hw fence,


any idea??


BR Monk



[-- Attachment #1.2: Type: text/html, Size: 6639 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
       [not found] ` <BLUPR12MB044941343D6D1A35996C7B4284480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-13  8:16   ` Christian König
       [not found]     ` <b0f57d8a-de01-addd-c17d-7af25e15f36d-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2017-10-13  8:16 UTC (permalink / raw)
  To: Liu, Monk, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 1614 bytes --]

Yeah, that change is actually incorrect and should be reverted.

What we really need to do is remove dropping sched_job->s_fence from 
amd_sched_process_job() into amd_sched_job_finish() directly before the 
call to free_job().

Regards,
Christian.

Am 13.10.2017 um 09:24 schrieb Liu, Monk:
> commit d6c650c0a8f6f671e49553725e1db541376d95f2
> Author: Nicolai Hähnle <nicolai.haehnle-5C7GfCeVMHo@public.gmane.org>
> @@ -611,6 +611,10 @@ static int amd_sched_main(void *param)
>
>                 fence = sched->ops->run_job(sched_job);
>                 amd_sched_fence_scheduled(s_fence);
> +
> +               /* amd_sched_process_job drops the job's reference of 
> the fence. */
> +               sched_job->s_fence = NULL;
> +
>                 if (fence) {
>                         s_fence->parent = dma_fence_get(fence);
>                         r = dma_fence_add_callback(fence, &s_fence->cb,
>
> Hi Nicolai
>
>
> with this patch, you will break "amdgpu_sched_hw_job_reset()"routine:
>
> voidamd_sched_hw_job_reset(structamd_gpu_scheduler *sched)
> {
> structamd_sched_job *s_job;
> spin_lock(&sched->job_list_lock);
> list_for_each_entry_reverse(s_job, &sched->ring_mirror_list, node) {
> if(s_job->s_fence->parent&&
> fence_remove_callback(s_job->s_fence->parent,
> &s_job->s_fence->cb)) {
> fence_put(s_job->s_fence->parent);
>             s_job->s_fence->parent=NULL;
> atomic_dec(&sched->hw_rq_count);
>         }
>     }
> spin_unlock(&sched->job_list_lock);
> }
>
>
> see that without sched_job->s_fence, you cannot remove the callback 
> from its hw fence,
>
>
> any idea??
>
>
> BR Monk
>
>
>


[-- Attachment #1.2: Type: text/html, Size: 6815 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
       [not found]     ` <b0f57d8a-de01-addd-c17d-7af25e15f36d-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-13  8:39       ` Liu, Monk
       [not found]         ` <BLUPR12MB04490BB257118A34DE346E0984480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Liu, Monk @ 2017-10-13  8:39 UTC (permalink / raw)
  To: Koenig, Christian, Nicolai Hähnle,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 2372 bytes --]

I doubt it would always work fine…

First, we have FENCE_TRACE reference s_fence->finished after “fence_signal(&fence->finished)”
Second, we have trace_amd_sched_proess_job(s_fence) after “amd_sched_fence_finished()”,

If you put the finished before free_job() and by coincidence the job_finish() get very soon executed you’ll have odds to hit wild pointer on above two cases

BR Monk

From: Koenig, Christian
Sent: 2017年10月13日 16:17
To: Liu, Monk <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org
Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

Yeah, that change is actually incorrect and should be reverted.

What we really need to do is remove dropping sched_job->s_fence from amd_sched_process_job() into amd_sched_job_finish() directly before the call to free_job().

Regards,
Christian.

Am 13.10.2017 um 09:24 schrieb Liu, Monk:
commit d6c650c0a8f6f671e49553725e1db541376d95f2
Author: Nicolai Hähnle <nicolai.haehnle@amd.com><mailto:nicolai.haehnle@amd.com>
@@ -611,6 +611,10 @@ static int amd_sched_main(void *param)

                fence = sched->ops->run_job(sched_job);
                amd_sched_fence_scheduled(s_fence);
+
+               /* amd_sched_process_job drops the job's reference of the fence. */
+               sched_job->s_fence = NULL;
+
                if (fence) {
                        s_fence->parent = dma_fence_get(fence);
                        r = dma_fence_add_callback(fence, &s_fence->cb,

Hi Nicolai



with this patch, you will break "amdgpu_sched_hw_job_reset()"routine:

void
amd_sched_hw_job_reset(struct amd_gpu_scheduler
*sched)

{

    struct amd_sched_job
*s_job;



    spin_lock(&sched->job_list_lock);

    list_for_each_entry_reverse(s_job,
&sched->ring_mirror_list, node) {

        if (s_job->s_fence->parent
&&

         fence_remove_callback(s_job->s_fence->parent,

                     &s_job->s_fence->cb))
{

            fence_put(s_job->s_fence->parent);

            s_job->s_fence->parent
=
NULL;

            atomic_dec(&sched->hw_rq_count);

        }

    }

    spin_unlock(&sched->job_list_lock);

}




see that without sched_job->s_fence, you cannot remove the callback from its hw fence,



any idea??



BR Monk







[-- Attachment #1.2: Type: text/html, Size: 22177 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
       [not found]         ` <BLUPR12MB04490BB257118A34DE346E0984480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-13  8:42           ` Christian König
       [not found]             ` <6cf3017b-d694-1bfd-1f5a-861fd824c014-5C7GfCeVMHo@public.gmane.org>
  2017-10-13  8:46           ` Liu, Monk
  1 sibling, 1 reply; 15+ messages in thread
From: Christian König @ 2017-10-13  8:42 UTC (permalink / raw)
  To: Liu, Monk, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 2851 bytes --]

The free_job() callback is called only way after the job has finished.

That is one change actually made by you to the code :)

Christian.

Am 13.10.2017 um 10:39 schrieb Liu, Monk:
>
> I doubt it would always work fine…
>
> First, we have FENCE_TRACE reference s_fence->finished after 
> “fence_signal(&fence->finished)”
>
> Second, we have trace_amd_sched_proess_job(s_fence) after 
> “amd_sched_fence_finished()”,
>
> If you put the finished before free_job() and by coincidence the 
> job_finish() get very soon executed you’ll have odds to hit wild 
> pointer on above two cases
>
> BR Monk
>
> *From:*Koenig, Christian
> *Sent:* 2017年10月13日16:17
> *To:* Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org>; Nicolai Hähnle 
> <nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>
> Yeah, that change is actually incorrect and should be reverted.
>
> What we really need to do is remove dropping sched_job->s_fence from 
> amd_sched_process_job() into amd_sched_job_finish() directly before 
> the call to free_job().
>
> Regards,
> Christian.
>
> Am 13.10.2017 um 09:24 schrieb Liu, Monk:
>
>     commit d6c650c0a8f6f671e49553725e1db541376d95f2
>     Author: Nicolai Hähnle <nicolai.haehnle-5C7GfCeVMHo@public.gmane.org>
>     <mailto:nicolai.haehnle-5C7GfCeVMHo@public.gmane.org>
>     @@ -611,6 +611,10 @@ static int amd_sched_main(void *param)
>
>                     fence = sched->ops->run_job(sched_job);
>                     amd_sched_fence_scheduled(s_fence);
>     +
>     +               /* amd_sched_process_job drops the job's reference
>     of the fence. */
>     +               sched_job->s_fence = NULL;
>     +
>                     if (fence) {
>                             s_fence->parent = dma_fence_get(fence);
>                             r = dma_fence_add_callback(fence,
>     &s_fence->cb,
>
>     Hi Nicolai
>
>     with this patch, you will break "amdgpu_sched_hw_job_reset()"routine:
>
>     void
>
>     amd_sched_hw_job_reset(structamd_gpu_scheduler
>
>     *sched)
>
>     {
>
>     structamd_sched_job
>
>     *s_job;
>
>     spin_lock(&sched->job_list_lock);
>
>     list_for_each_entry_reverse(s_job,
>
>     &sched->ring_mirror_list, node) {
>
>     if(s_job->s_fence->parent
>
>     &&
>
>     fence_remove_callback(s_job->s_fence->parent,
>
>     &s_job->s_fence->cb))
>
>     {
>
>     fence_put(s_job->s_fence->parent);
>
>                 s_job->s_fence->parent
>
>     =
>
>     NULL;
>
>     atomic_dec(&sched->hw_rq_count);
>
>             }
>
>         }
>
>     spin_unlock(&sched->job_list_lock);
>
>     }
>
>     see that without sched_job->s_fence, you cannot remove the
>     callback from its hw fence,
>
>     any idea??
>
>     BR Monk
>


[-- Attachment #1.2: Type: text/html, Size: 28784 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
       [not found]         ` <BLUPR12MB04490BB257118A34DE346E0984480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2017-10-13  8:42           ` Christian König
@ 2017-10-13  8:46           ` Liu, Monk
       [not found]             ` <BLUPR12MB0449FE32E68AC10504D8E7A684480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  1 sibling, 1 reply; 15+ messages in thread
From: Liu, Monk @ 2017-10-13  8:46 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, Nicolai Hähnle,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 3126 bytes --]

Just revert Nicolai’s patch,if other routine want to reference s_fence, it should get the finished fence in the first place/time,

For gpu_reset routine, it refers to s_fence only on those unfinished job in sched_hw_job_reset, so totally safe to refer to s_fence pointer

I wonder what issue Nicolai met with and submitted this patch

BR Monk


From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
Sent: 2017年10月13日 16:40
To: Koenig, Christian <Christian.Koenig@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org
Subject: RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

I doubt it would always work fine…

First, we have FENCE_TRACE reference s_fence->finished after “fence_signal(&fence->finished)”
Second, we have trace_amd_sched_proess_job(s_fence) after “amd_sched_fence_finished()”,

If you put the finished before free_job() and by coincidence the job_finish() get very soon executed you’ll have odds to hit wild pointer on above two cases

BR Monk

From: Koenig, Christian
Sent: 2017年10月13日 16:17
To: Liu, Monk <Monk.Liu@amd.com<mailto:Monk.Liu@amd.com>>; Nicolai Hähnle <nhaehnle@gmail.com<mailto:nhaehnle@gmail.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

Yeah, that change is actually incorrect and should be reverted.

What we really need to do is remove dropping sched_job->s_fence from amd_sched_process_job() into amd_sched_job_finish() directly before the call to free_job().

Regards,
Christian.

Am 13.10.2017 um 09:24 schrieb Liu, Monk:
commit d6c650c0a8f6f671e49553725e1db541376d95f2
Author: Nicolai Hähnle <nicolai.haehnle@amd.com><mailto:nicolai.haehnle@amd.com>
@@ -611,6 +611,10 @@ static int amd_sched_main(void *param)

                fence = sched->ops->run_job(sched_job);
                amd_sched_fence_scheduled(s_fence);
+
+               /* amd_sched_process_job drops the job's reference of the fence. */
+               sched_job->s_fence = NULL;
+
                if (fence) {
                        s_fence->parent = dma_fence_get(fence);
                        r = dma_fence_add_callback(fence, &s_fence->cb,

Hi Nicolai



with this patch, you will break "amdgpu_sched_hw_job_reset()"routine:

void
amd_sched_hw_job_reset(struct amd_gpu_scheduler
*sched)

{

    struct amd_sched_job
*s_job;



    spin_lock(&sched->job_list_lock);

    list_for_each_entry_reverse(s_job,
&sched->ring_mirror_list, node) {

        if (s_job->s_fence->parent
&&

         fence_remove_callback(s_job->s_fence->parent,

                     &s_job->s_fence->cb))
{

            fence_put(s_job->s_fence->parent);

            s_job->s_fence->parent
=
NULL;

            atomic_dec(&sched->hw_rq_count);

        }

    }

    spin_unlock(&sched->job_list_lock);

}




see that without sched_job->s_fence, you cannot remove the callback from its hw fence,



any idea??



BR Monk







[-- Attachment #1.2: Type: text/html, Size: 25649 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
       [not found]             ` <6cf3017b-d694-1bfd-1f5a-861fd824c014-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-13  8:49               ` Liu, Monk
       [not found]                 ` <BLUPR12MB044988754711ABD7BF66B19184480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Liu, Monk @ 2017-10-13  8:49 UTC (permalink / raw)
  To: Koenig, Christian, Nicolai Hähnle,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 3204 bytes --]

No that’s not true

The free_job() is called in sched_job_finish() which is queued on a WORK and scheduled from that “amd_sched_fence_finished()”
So the finishing timing of free_job() is asynchronized with sched_process_job()

How can you sure free_job() must before “trace_amd_sched_process_job” ?

From: Koenig, Christian
Sent: 2017年10月13日 16:43
To: Liu, Monk <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org
Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

The free_job() callback is called only way after the job has finished.

That is one change actually made by you to the code :)

Christian.

Am 13.10.2017 um 10:39 schrieb Liu, Monk:
I doubt it would always work fine…

First, we have FENCE_TRACE reference s_fence->finished after “fence_signal(&fence->finished)”
Second, we have trace_amd_sched_proess_job(s_fence) after “amd_sched_fence_finished()”,

If you put the finished before free_job() and by coincidence the job_finish() get very soon executed you’ll have odds to hit wild pointer on above two cases

BR Monk

From: Koenig, Christian
Sent: 2017年10月13日 16:17
To: Liu, Monk <Monk.Liu@amd.com><mailto:Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com><mailto:nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

Yeah, that change is actually incorrect and should be reverted.

What we really need to do is remove dropping sched_job->s_fence from amd_sched_process_job() into amd_sched_job_finish() directly before the call to free_job().

Regards,
Christian.

Am 13.10.2017 um 09:24 schrieb Liu, Monk:
commit d6c650c0a8f6f671e49553725e1db541376d95f2
Author: Nicolai Hähnle <nicolai.haehnle@amd.com><mailto:nicolai.haehnle@amd.com>
@@ -611,6 +611,10 @@ static int amd_sched_main(void *param)

                fence = sched->ops->run_job(sched_job);
                amd_sched_fence_scheduled(s_fence);
+
+               /* amd_sched_process_job drops the job's reference of the fence. */
+               sched_job->s_fence = NULL;
+
                if (fence) {
                        s_fence->parent = dma_fence_get(fence);
                        r = dma_fence_add_callback(fence, &s_fence->cb,

Hi Nicolai



with this patch, you will break "amdgpu_sched_hw_job_reset()"routine:

void
amd_sched_hw_job_reset(struct amd_gpu_scheduler
*sched)

{

    struct amd_sched_job
*s_job;



    spin_lock(&sched->job_list_lock);

    list_for_each_entry_reverse(s_job,
&sched->ring_mirror_list, node) {

        if (s_job->s_fence->parent
&&

         fence_remove_callback(s_job->s_fence->parent,

                     &s_job->s_fence->cb))
{

            fence_put(s_job->s_fence->parent);

            s_job->s_fence->parent
=
NULL;

            atomic_dec(&sched->hw_rq_count);

        }

    }

    spin_unlock(&sched->job_list_lock);

}




see that without sched_job->s_fence, you cannot remove the callback from its hw fence,



any idea??



BR Monk









[-- Attachment #1.2: Type: text/html, Size: 28151 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
       [not found]                 ` <BLUPR12MB044988754711ABD7BF66B19184480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-13  8:51                   ` Liu, Monk
       [not found]                     ` <BLUPR12MB044908B9AECE89D7E942EC7F84480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Liu, Monk @ 2017-10-13  8:51 UTC (permalink / raw)
  To: Koenig, Christian, Nicolai Hähnle,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 3921 bytes --]

The free_job() is called in sched_job_finish() which is queued on a WORK and scheduled from that “amd_sched_fence_finished()”
So the finishing timing of free_job() is asynchronized with sched_process_job()

There is chance that free_job() called before that  “trace_amd_sched_process_job”, correct?
And if so the s_fence referred by it maybe a wild pointer


BR Monk


From: Liu, Monk
Sent: 2017年10月13日 16:49
To: Koenig, Christian <Christian.Koenig@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org
Subject: RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

No that’s not true

The free_job() is called in sched_job_finish() which is queued on a WORK and scheduled from that “amd_sched_fence_finished()”
So the finishing timing of free_job() is asynchronized with sched_process_job()

How can you sure free_job() must before “trace_amd_sched_process_job” ?

From: Koenig, Christian
Sent: 2017年10月13日 16:43
To: Liu, Monk <Monk.Liu@amd.com<mailto:Monk.Liu@amd.com>>; Nicolai Hähnle <nhaehnle@gmail.com<mailto:nhaehnle@gmail.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

The free_job() callback is called only way after the job has finished.

That is one change actually made by you to the code :)

Christian.

Am 13.10.2017 um 10:39 schrieb Liu, Monk:
I doubt it would always work fine…

First, we have FENCE_TRACE reference s_fence->finished after “fence_signal(&fence->finished)”
Second, we have trace_amd_sched_proess_job(s_fence) after “amd_sched_fence_finished()”,

If you put the finished before free_job() and by coincidence the job_finish() get very soon executed you’ll have odds to hit wild pointer on above two cases

BR Monk

From: Koenig, Christian
Sent: 2017年10月13日 16:17
To: Liu, Monk <Monk.Liu@amd.com><mailto:Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com><mailto:nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

Yeah, that change is actually incorrect and should be reverted.

What we really need to do is remove dropping sched_job->s_fence from amd_sched_process_job() into amd_sched_job_finish() directly before the call to free_job().

Regards,
Christian.

Am 13.10.2017 um 09:24 schrieb Liu, Monk:
commit d6c650c0a8f6f671e49553725e1db541376d95f2
Author: Nicolai Hähnle <nicolai.haehnle@amd.com><mailto:nicolai.haehnle@amd.com>
@@ -611,6 +611,10 @@ static int amd_sched_main(void *param)

                fence = sched->ops->run_job(sched_job);
                amd_sched_fence_scheduled(s_fence);
+
+               /* amd_sched_process_job drops the job's reference of the fence. */
+               sched_job->s_fence = NULL;
+
                if (fence) {
                        s_fence->parent = dma_fence_get(fence);
                        r = dma_fence_add_callback(fence, &s_fence->cb,

Hi Nicolai



with this patch, you will break "amdgpu_sched_hw_job_reset()"routine:

void
amd_sched_hw_job_reset(struct amd_gpu_scheduler
*sched)

{

    struct amd_sched_job
*s_job;



    spin_lock(&sched->job_list_lock);

    list_for_each_entry_reverse(s_job,
&sched->ring_mirror_list, node) {

        if (s_job->s_fence->parent
&&

         fence_remove_callback(s_job->s_fence->parent,

                     &s_job->s_fence->cb))
{

            fence_put(s_job->s_fence->parent);

            s_job->s_fence->parent
=
NULL;

            atomic_dec(&sched->hw_rq_count);

        }

    }

    spin_unlock(&sched->job_list_lock);

}




see that without sched_job->s_fence, you cannot remove the callback from its hw fence,



any idea??



BR Monk









[-- Attachment #1.2: Type: text/html, Size: 31861 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
       [not found]                     ` <BLUPR12MB044908B9AECE89D7E942EC7F84480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-13  9:00                       ` Christian König
       [not found]                         ` <53f57508-6669-82ed-1b95-011f30610e05-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2017-10-13  9:00 UTC (permalink / raw)
  To: Liu, Monk, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 5330 bytes --]

> There is chance that free_job() called before that 
> “trace_amd_sched_process_job”, correct?
>
Correct, but that is harmless.

Take a look what trace_amd_sched_process_job actually does, it just 
prints the pointer of the fence structure (but the pointer might be 
stale at this point).

Nevertheless you are right that this is really ugly.

How about the attached patch to fix the issue?

Regards,
Christian.

Am 13.10.2017 um 10:51 schrieb Liu, Monk:
>
> The free_job() is called in sched_job_finish() which is queued on a 
> WORK and scheduled from that “amd_sched_fence_finished()”
>
> So the finishing timing of free_job() is asynchronized with 
> sched_process_job()
>
> There is chance that free_job() called before that 
> “trace_amd_sched_process_job”, correct?
>
> And if so the s_fence referred by it maybe a wild pointer
>
> BR Monk
>
> *From:*Liu, Monk
> *Sent:* 2017年10月13日16:49
> *To:* Koenig, Christian <Christian.Koenig-5C7GfCeVMHo@public.gmane.org>; Nicolai Hähnle 
> <nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> *Subject:* RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>
> No that’s not true
>
> The free_job() is called in sched_job_finish() which is queued on a 
> WORK and scheduled from that “amd_sched_fence_finished()”
>
> So the finishing timing of free_job() is asynchronized with 
> sched_process_job()
>
> How can you sure free_job() must before “trace_amd_sched_process_job”?
>
> *From:*Koenig, Christian
> *Sent:* 2017年10月13日16:43
> *To:* Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org <mailto:Monk.Liu-5C7GfCeVMHo@public.gmane.org>>; Nicolai 
> Hähnle <nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org <mailto:nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>>; 
> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org <mailto:amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
> *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>
> The free_job() callback is called only way after the job has finished.
>
> That is one change actually made by you to the code :)
>
> Christian.
>
> Am 13.10.2017 um 10:39 schrieb Liu, Monk:
>
>     I doubt it would always work fine…
>
>     First, we have FENCE_TRACE reference s_fence->finished after
>     “fence_signal(&fence->finished)”
>
>     Second, we have trace_amd_sched_proess_job(s_fence) after
>     “amd_sched_fence_finished()”,
>
>     If you put the finished before free_job() and by coincidence the
>     job_finish() get very soon executed you’ll have odds to hit wild
>     pointer on above two cases
>
>     BR Monk
>
>     *From:*Koenig, Christian
>     *Sent:* 2017年10月13日16:17
>     *To:* Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org> <mailto:Monk.Liu-5C7GfCeVMHo@public.gmane.org>;
>     Nicolai Hähnle <nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> <mailto:nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>;
>     amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org <mailto:amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
>     *Subject:* Re: regression with
>     d6c650c0a8f6f671e49553725e1db541376d95f2
>
>     Yeah, that change is actually incorrect and should be reverted.
>
>     What we really need to do is remove dropping sched_job->s_fence
>     from amd_sched_process_job() into amd_sched_job_finish() directly
>     before the call to free_job().
>
>     Regards,
>     Christian.
>
>     Am 13.10.2017 um 09:24 schrieb Liu, Monk:
>
>         commit d6c650c0a8f6f671e49553725e1db541376d95f2
>         Author: Nicolai Hähnle <nicolai.haehnle-5C7GfCeVMHo@public.gmane.org>
>         <mailto:nicolai.haehnle-5C7GfCeVMHo@public.gmane.org>
>         @@ -611,6 +611,10 @@ static int amd_sched_main(void *param)
>
>                         fence = sched->ops->run_job(sched_job);
>                         amd_sched_fence_scheduled(s_fence);
>         +
>         +               /* amd_sched_process_job drops the job's
>         reference of the fence. */
>         +               sched_job->s_fence = NULL;
>         +
>                         if (fence) {
>                                 s_fence->parent = dma_fence_get(fence);
>                                 r = dma_fence_add_callback(fence,
>         &s_fence->cb,
>
>         Hi Nicolai
>
>         with this patch, you will break
>         "amdgpu_sched_hw_job_reset()"routine:
>
>         void
>
>         amd_sched_hw_job_reset(structamd_gpu_scheduler
>
>         *sched)
>
>         {
>
>         structamd_sched_job
>
>         *s_job;
>
>         spin_lock(&sched->job_list_lock);
>
>         list_for_each_entry_reverse(s_job,
>
>         &sched->ring_mirror_list, node) {
>
>         if(s_job->s_fence->parent
>
>         &&
>
>         fence_remove_callback(s_job->s_fence->parent,
>
>         &s_job->s_fence->cb))
>
>         {
>
>         fence_put(s_job->s_fence->parent);
>
>                     s_job->s_fence->parent
>
>         =
>
>         NULL;
>
>         atomic_dec(&sched->hw_rq_count);
>
>                 }
>
>             }
>
>         spin_unlock(&sched->job_list_lock);
>
>         }
>
>         see that without sched_job->s_fence, you cannot remove the
>         callback from its hw fence,
>
>         any idea??
>
>         BR Monk
>


[-- Attachment #1.2: Type: text/html, Size: 43392 bytes --]

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-drm-amd-sched-fix-job-tear-down-order.patch --]
[-- Type: text/x-patch; name="0001-drm-amd-sched-fix-job-tear-down-order.patch", Size: 1946 bytes --]

>From 43723de3f71d49ead63619920a5ed63f3efdfee9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Christian=20K=C3=B6nig?= <christian.koenig-5C7GfCeVMHo@public.gmane.org>
Date: Fri, 13 Oct 2017 10:58:15 +0200
Subject: [PATCH] drm/amd/sched: fix job tear down order
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Move the trace before we signal the scheduler fence and drop the
scheduler fence reference directly before we free the job.

Signed-off-by: Christian König <christian.koenig-5C7GfCeVMHo@public.gmane.org>
---
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
index 59f1325d975c..04ac783fb954 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
@@ -406,6 +406,7 @@ static void amd_sched_job_finish(struct work_struct *work)
 			schedule_delayed_work(&next->work_tdr, sched->timeout);
 	}
 	spin_unlock(&sched->job_list_lock);
+	dma_fence_put(&s_job->s_fence->finished);
 	sched->ops->free_job(s_job);
 }
 
@@ -587,10 +588,9 @@ static void amd_sched_process_job(struct dma_fence *f, struct dma_fence_cb *cb)
 	struct amd_gpu_scheduler *sched = s_fence->sched;
 
 	atomic_dec(&sched->hw_rq_count);
+	trace_amd_sched_process_job(s_fence);
 	amd_sched_fence_finished(s_fence);
 
-	trace_amd_sched_process_job(s_fence);
-	dma_fence_put(&s_fence->finished);
 	wake_up_interruptible(&sched->wake_up_worker);
 }
 
@@ -638,9 +638,6 @@ static int amd_sched_main(void *param)
 		fence = sched->ops->run_job(sched_job);
 		amd_sched_fence_scheduled(s_fence);
 
-		/* amd_sched_process_job drops the job's reference of the fence. */
-		sched_job->s_fence = NULL;
-
 		if (fence) {
 			s_fence->parent = dma_fence_get(fence);
 			r = dma_fence_add_callback(fence, &s_fence->cb,
-- 
2.11.0


[-- Attachment #3: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
       [not found]                         ` <53f57508-6669-82ed-1b95-011f30610e05-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-13  9:13                           ` Liu, Monk
       [not found]                             ` <BLUPR12MB0449CD46159EC28E81B823C184480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Liu, Monk @ 2017-10-13  9:13 UTC (permalink / raw)
  To: Koenig, Christian, Nicolai Hähnle,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 5257 bytes --]

your patch looks good,  do you think we should also do this:


 void amd_sched_fence_scheduled(struct amd_sched_fence *fence)
 {
-       int ret = fence_signal(&fence->scheduled);
+       int ret;
+
+       fence_get(&fence->scheduled;)
+       ret = fence_signal(&fence->scheduled);

        if (!ret)
                FENCE_TRACE(&fence->scheduled, "signaled from irq context\n");
        else
                FENCE_TRACE(&fence->scheduled, "was already signaled\n");
+       fence_put(&fence->scheduled);
 }



________________________________
From: Koenig, Christian
Sent: Friday, October 13, 2017 5:00:27 PM
To: Liu, Monk; Nicolai Hähnle; amd-gfx@lists.freedesktop.org
Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

There is chance that free_job() called before that  “trace_amd_sched_process_job”, correct?
Correct, but that is harmless.

Take a look what trace_amd_sched_process_job actually does, it just prints the pointer of the fence structure (but the pointer might be stale at this point).

Nevertheless you are right that this is really ugly.

How about the attached patch to fix the issue?

Regards,
Christian.

Am 13.10.2017 um 10:51 schrieb Liu, Monk:
The free_job() is called in sched_job_finish() which is queued on a WORK and scheduled from that “amd_sched_fence_finished()”
So the finishing timing of free_job() is asynchronized with sched_process_job()

There is chance that free_job() called before that  “trace_amd_sched_process_job”, correct?
And if so the s_fence referred by it maybe a wild pointer


BR Monk


From: Liu, Monk
Sent: 2017年10月13日 16:49
To: Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com><mailto:nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

No that’s not true

The free_job() is called in sched_job_finish() which is queued on a WORK and scheduled from that “amd_sched_fence_finished()”
So the finishing timing of free_job() is asynchronized with sched_process_job()

How can you sure free_job() must before “trace_amd_sched_process_job” ?

From: Koenig, Christian
Sent: 2017年10月13日 16:43
To: Liu, Monk <Monk.Liu@amd.com<mailto:Monk.Liu@amd.com>>; Nicolai Hähnle <nhaehnle@gmail.com<mailto:nhaehnle@gmail.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

The free_job() callback is called only way after the job has finished.

That is one change actually made by you to the code :)

Christian.

Am 13.10.2017 um 10:39 schrieb Liu, Monk:
I doubt it would always work fine…

First, we have FENCE_TRACE reference s_fence->finished after “fence_signal(&fence->finished)”
Second, we have trace_amd_sched_proess_job(s_fence) after “amd_sched_fence_finished()”,

If you put the finished before free_job() and by coincidence the job_finish() get very soon executed you’ll have odds to hit wild pointer on above two cases

BR Monk

From: Koenig, Christian
Sent: 2017年10月13日 16:17
To: Liu, Monk <Monk.Liu@amd.com><mailto:Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com><mailto:nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

Yeah, that change is actually incorrect and should be reverted.

What we really need to do is remove dropping sched_job->s_fence from amd_sched_process_job() into amd_sched_job_finish() directly before the call to free_job().

Regards,
Christian.

Am 13.10.2017 um 09:24 schrieb Liu, Monk:
commit d6c650c0a8f6f671e49553725e1db541376d95f2
Author: Nicolai Hähnle <nicolai.haehnle@amd.com><mailto:nicolai.haehnle@amd.com>
@@ -611,6 +611,10 @@ static int amd_sched_main(void *param)

                fence = sched->ops->run_job(sched_job);
                amd_sched_fence_scheduled(s_fence);
+
+               /* amd_sched_process_job drops the job's reference of the fence. */
+               sched_job->s_fence = NULL;
+
                if (fence) {
                        s_fence->parent = dma_fence_get(fence);
                        r = dma_fence_add_callback(fence, &s_fence->cb,

Hi Nicolai



with this patch, you will break "amdgpu_sched_hw_job_reset()"routine:

void
amd_sched_hw_job_reset(struct amd_gpu_scheduler
*sched)

{

    struct amd_sched_job
*s_job;



    spin_lock(&sched->job_list_lock);

    list_for_each_entry_reverse(s_job,
&sched->ring_mirror_list, node) {

        if (s_job->s_fence->parent
&&

         fence_remove_callback(s_job->s_fence->parent,

                     &s_job->s_fence->cb))
{

            fence_put(s_job->s_fence->parent);

            s_job->s_fence->parent
=
NULL;

            atomic_dec(&sched->hw_rq_count);

        }

    }

    spin_unlock(&sched->job_list_lock);

}




see that without sched_job->s_fence, you cannot remove the callback from its hw fence,



any idea??



BR Monk










[-- Attachment #1.2: Type: text/html, Size: 36917 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
       [not found]                             ` <BLUPR12MB0449CD46159EC28E81B823C184480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-13 10:13                               ` Christian König
       [not found]                                 ` <e9d3674e-8ccf-c709-0dd9-b1fa493e086f-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2017-10-13 10:13 UTC (permalink / raw)
  To: Liu, Monk, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 6606 bytes --]

Good point as well. How about the attached version?

This time we keep an extra reference in amd_sched_process_job() until we 
are sure that we don't need the s_fence any more.

Regards,
Christian.

Am 13.10.2017 um 11:13 schrieb Liu, Monk:
>
> your patch looks good,  do you think we should also do this:
>
>
>  void amd_sched_fence_scheduled(struct amd_sched_fence *fence)
>  {
> -       int ret = fence_signal(&fence->scheduled);
> +       int ret;
> +
> +       fence_get(&fence->scheduled;)
> +       ret = fence_signal(&fence->scheduled);
>
>         if (!ret)
>                 FENCE_TRACE(&fence->scheduled, "signaled from irq 
> context\n");
>         else
>                 FENCE_TRACE(&fence->scheduled, "was already signaled\n");
> +       fence_put(&fence->scheduled);
>  }
>
>
> ------------------------------------------------------------------------
> *From:* Koenig, Christian
> *Sent:* Friday, October 13, 2017 5:00:27 PM
> *To:* Liu, Monk; Nicolai Hähnle; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>>
>> There is chance that free_job() called before that 
>> “trace_amd_sched_process_job”, correct?
>>
> Correct, but that is harmless.
>
> Take a look what trace_amd_sched_process_job actually does, it just 
> prints the pointer of the fence structure (but the pointer might be 
> stale at this point).
>
> Nevertheless you are right that this is really ugly.
>
> How about the attached patch to fix the issue?
>
> Regards,
> Christian.
>
> Am 13.10.2017 um 10:51 schrieb Liu, Monk:
>>
>> The free_job() is called in sched_job_finish() which is queued on a 
>> WORK and scheduled from that “amd_sched_fence_finished()”
>>
>> So the finishing timing of free_job() is asynchronized with 
>> sched_process_job()
>>
>> There is chance that free_job() called before that 
>> “trace_amd_sched_process_job”, correct?
>>
>> And if so the s_fence referred by it maybe a wild pointer
>>
>> BR Monk
>>
>> *From:*Liu, Monk
>> *Sent:* 2017年10月13日16:49
>> *To:* Koenig, Christian <Christian.Koenig-5C7GfCeVMHo@public.gmane.org>; Nicolai Hähnle 
>> <nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> *Subject:* RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>>
>> No that’s not true
>>
>> The free_job() is called in sched_job_finish() which is queued on a 
>> WORK and scheduled from that “amd_sched_fence_finished()”
>>
>> So the finishing timing of free_job() is asynchronized with 
>> sched_process_job()
>>
>> How can you sure free_job() must before “trace_amd_sched_process_job”?
>>
>> *From:*Koenig, Christian
>> *Sent:* 2017年10月13日16:43
>> *To:* Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org <mailto:Monk.Liu-5C7GfCeVMHo@public.gmane.org>>; Nicolai 
>> Hähnle <nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org <mailto:nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>>; 
>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org <mailto:amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
>> *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>>
>> The free_job() callback is called only way after the job has finished.
>>
>> That is one change actually made by you to the code :)
>>
>> Christian.
>>
>> Am 13.10.2017 um 10:39 schrieb Liu, Monk:
>>
>>     I doubt it would always work fine…
>>
>>     First, we have FENCE_TRACE reference s_fence->finished after
>>     “fence_signal(&fence->finished)”
>>
>>     Second, we have trace_amd_sched_proess_job(s_fence) after
>>     “amd_sched_fence_finished()”,
>>
>>     If you put the finished before free_job() and by coincidence the
>>     job_finish() get very soon executed you’ll have odds to hit wild
>>     pointer on above two cases
>>
>>     BR Monk
>>
>>     *From:*Koenig, Christian
>>     *Sent:* 2017年10月13日16:17
>>     *To:* Liu, Monk <Monk.Liu-5C7GfCeVMHo@public.gmane.org> <mailto:Monk.Liu-5C7GfCeVMHo@public.gmane.org>;
>>     Nicolai Hähnle <nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> <mailto:nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>;
>>     amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org <mailto:amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
>>     *Subject:* Re: regression with
>>     d6c650c0a8f6f671e49553725e1db541376d95f2
>>
>>     Yeah, that change is actually incorrect and should be reverted.
>>
>>     What we really need to do is remove dropping sched_job->s_fence
>>     from amd_sched_process_job() into amd_sched_job_finish() directly
>>     before the call to free_job().
>>
>>     Regards,
>>     Christian.
>>
>>     Am 13.10.2017 um 09:24 schrieb Liu, Monk:
>>
>>         commit d6c650c0a8f6f671e49553725e1db541376d95f2
>>         Author: Nicolai Hähnle <nicolai.haehnle-5C7GfCeVMHo@public.gmane.org>
>>         <mailto:nicolai.haehnle-5C7GfCeVMHo@public.gmane.org>
>>         @@ -611,6 +611,10 @@ static int amd_sched_main(void *param)
>>
>>                         fence = sched->ops->run_job(sched_job);
>>         amd_sched_fence_scheduled(s_fence);
>>         +
>>         +               /* amd_sched_process_job drops the job's
>>         reference of the fence. */
>>         +               sched_job->s_fence = NULL;
>>         +
>>                         if (fence) {
>>                                 s_fence->parent = dma_fence_get(fence);
>>                                 r = dma_fence_add_callback(fence,
>>         &s_fence->cb,
>>
>>         Hi Nicolai
>>
>>         with this patch, you will break
>>         "amdgpu_sched_hw_job_reset()"routine:
>>
>>         void
>>
>>         amd_sched_hw_job_reset(structamd_gpu_scheduler
>>
>>         *sched)
>>
>>         {
>>
>>         structamd_sched_job
>>
>>         *s_job;
>>
>>         spin_lock(&sched->job_list_lock);
>>
>>         list_for_each_entry_reverse(s_job,
>>
>>         &sched->ring_mirror_list, node) {
>>
>>         if(s_job->s_fence->parent
>>
>>         &&
>>
>>         fence_remove_callback(s_job->s_fence->parent,
>>
>>         &s_job->s_fence->cb))
>>
>>         {
>>
>>         fence_put(s_job->s_fence->parent);
>>
>>                     s_job->s_fence->parent
>>
>>         =
>>
>>         NULL;
>>
>>         atomic_dec(&sched->hw_rq_count);
>>
>>                 }
>>
>>             }
>>
>>         spin_unlock(&sched->job_list_lock);
>>
>>         }
>>
>>         see that without sched_job->s_fence, you cannot remove the
>>         callback from its hw fence,
>>
>>         any idea??
>>
>>         BR Monk
>>
>


[-- Attachment #1.2: Type: text/html, Size: 46684 bytes --]

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-drm-amd-sched-fix-job-tear-down-order-v2.patch --]
[-- Type: text/x-patch; name="0001-drm-amd-sched-fix-job-tear-down-order-v2.patch", Size: 1895 bytes --]

>From 56079bea614e1239ade60e8d4dd618abb63e06d8 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Christian=20K=C3=B6nig?= <christian.koenig-5C7GfCeVMHo@public.gmane.org>
Date: Fri, 13 Oct 2017 10:58:15 +0200
Subject: [PATCH] drm/amd/sched: fix job tear down order v2
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Move the trace before we signal the scheduler fence and drop the
scheduler fence reference directly before we free the job.

v2: keep extra s_fence reference

Signed-off-by: Christian König <christian.koenig-5C7GfCeVMHo@public.gmane.org>
---
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
index 59f1325d975c..e4d3b4ec4e92 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
@@ -406,6 +406,7 @@ static void amd_sched_job_finish(struct work_struct *work)
 			schedule_delayed_work(&next->work_tdr, sched->timeout);
 	}
 	spin_unlock(&sched->job_list_lock);
+	dma_fence_put(&s_job->s_fence->finished);
 	sched->ops->free_job(s_job);
 }
 
@@ -586,6 +587,7 @@ static void amd_sched_process_job(struct dma_fence *f, struct dma_fence_cb *cb)
 		container_of(cb, struct amd_sched_fence, cb);
 	struct amd_gpu_scheduler *sched = s_fence->sched;
 
+	dma_fence_get(&s_fence->finished);
 	atomic_dec(&sched->hw_rq_count);
 	amd_sched_fence_finished(s_fence);
 
@@ -638,9 +640,6 @@ static int amd_sched_main(void *param)
 		fence = sched->ops->run_job(sched_job);
 		amd_sched_fence_scheduled(s_fence);
 
-		/* amd_sched_process_job drops the job's reference of the fence. */
-		sched_job->s_fence = NULL;
-
 		if (fence) {
 			s_fence->parent = dma_fence_get(fence);
 			r = dma_fence_add_callback(fence, &s_fence->cb,
-- 
2.11.0


[-- Attachment #3: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
       [not found]                                 ` <e9d3674e-8ccf-c709-0dd9-b1fa493e086f-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-13 11:06                                   ` Liu, Monk
       [not found]                                     ` <BLUPR12MB04493DEEB32374DB3F64098D84480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Liu, Monk @ 2017-10-13 11:06 UTC (permalink / raw)
  To: Koenig, Christian, Nicolai Hähnle,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 5850 bytes --]

Yeah, this one looks good

You can put my reviewed-by on it

From: Koenig, Christian
Sent: 2017年10月13日 18:14
To: Liu, Monk <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org
Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

Good point as well. How about the attached version?

This time we keep an extra reference in amd_sched_process_job() until we are sure that we don't need the s_fence any more.

Regards,
Christian.

Am 13.10.2017 um 11:13 schrieb Liu, Monk:

your patch looks good,  do you think we should also do this:


 void amd_sched_fence_scheduled(struct amd_sched_fence *fence)
 {
-       int ret = fence_signal(&fence->scheduled);
+       int ret;
+
+       fence_get(&fence->scheduled;)
+       ret = fence_signal(&fence->scheduled);

        if (!ret)
                FENCE_TRACE(&fence->scheduled, "signaled from irq context\n");
        else
                FENCE_TRACE(&fence->scheduled, "was already signaled\n");
+       fence_put(&fence->scheduled);
 }




________________________________
From: Koenig, Christian
Sent: Friday, October 13, 2017 5:00:27 PM
To: Liu, Monk; Nicolai Hähnle; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

There is chance that free_job() called before that  “trace_amd_sched_process_job”, correct?
Correct, but that is harmless.

Take a look what trace_amd_sched_process_job actually does, it just prints the pointer of the fence structure (but the pointer might be stale at this point).

Nevertheless you are right that this is really ugly.

How about the attached patch to fix the issue?

Regards,
Christian.

Am 13.10.2017 um 10:51 schrieb Liu, Monk:
The free_job() is called in sched_job_finish() which is queued on a WORK and scheduled from that “amd_sched_fence_finished()”
So the finishing timing of free_job() is asynchronized with sched_process_job()

There is chance that free_job() called before that  “trace_amd_sched_process_job”, correct?
And if so the s_fence referred by it maybe a wild pointer


BR Monk


From: Liu, Monk
Sent: 2017年10月13日 16:49
To: Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com><mailto:nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

No that’s not true

The free_job() is called in sched_job_finish() which is queued on a WORK and scheduled from that “amd_sched_fence_finished()”
So the finishing timing of free_job() is asynchronized with sched_process_job()

How can you sure free_job() must before “trace_amd_sched_process_job” ?

From: Koenig, Christian
Sent: 2017年10月13日 16:43
To: Liu, Monk <Monk.Liu@amd.com<mailto:Monk.Liu@amd.com>>; Nicolai Hähnle <nhaehnle@gmail.com<mailto:nhaehnle@gmail.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

The free_job() callback is called only way after the job has finished.

That is one change actually made by you to the code :)

Christian.

Am 13.10.2017 um 10:39 schrieb Liu, Monk:
I doubt it would always work fine…

First, we have FENCE_TRACE reference s_fence->finished after “fence_signal(&fence->finished)”
Second, we have trace_amd_sched_proess_job(s_fence) after “amd_sched_fence_finished()”,

If you put the finished before free_job() and by coincidence the job_finish() get very soon executed you’ll have odds to hit wild pointer on above two cases

BR Monk

From: Koenig, Christian
Sent: 2017年10月13日 16:17
To: Liu, Monk <Monk.Liu@amd.com><mailto:Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com><mailto:nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

Yeah, that change is actually incorrect and should be reverted.

What we really need to do is remove dropping sched_job->s_fence from amd_sched_process_job() into amd_sched_job_finish() directly before the call to free_job().

Regards,
Christian.

Am 13.10.2017 um 09:24 schrieb Liu, Monk:
commit d6c650c0a8f6f671e49553725e1db541376d95f2
Author: Nicolai Hähnle <nicolai.haehnle@amd.com><mailto:nicolai.haehnle@amd.com>
@@ -611,6 +611,10 @@ static int amd_sched_main(void *param)

                fence = sched->ops->run_job(sched_job);
                amd_sched_fence_scheduled(s_fence);
+
+               /* amd_sched_process_job drops the job's reference of the fence. */
+               sched_job->s_fence = NULL;
+
                if (fence) {
                        s_fence->parent = dma_fence_get(fence);
                        r = dma_fence_add_callback(fence, &s_fence->cb,

Hi Nicolai



with this patch, you will break "amdgpu_sched_hw_job_reset()"routine:

void
amd_sched_hw_job_reset(struct amd_gpu_scheduler
*sched)

{

    struct amd_sched_job
*s_job;



    spin_lock(&sched->job_list_lock);

    list_for_each_entry_reverse(s_job,
&sched->ring_mirror_list, node) {

        if (s_job->s_fence->parent
&&

         fence_remove_callback(s_job->s_fence->parent,

                     &s_job->s_fence->cb))
{

            fence_put(s_job->s_fence->parent);

            s_job->s_fence->parent
=
NULL;

            atomic_dec(&sched->hw_rq_count);

        }

    }

    spin_unlock(&sched->job_list_lock);

}




see that without sched_job->s_fence, you cannot remove the callback from its hw fence,



any idea??



BR Monk













[-- Attachment #1.2: Type: text/html, Size: 32174 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
       [not found]                                     ` <BLUPR12MB04493DEEB32374DB3F64098D84480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-13 11:10                                       ` Tom St Denis
       [not found]                                         ` <18c873f5-987d-d21e-0fa3-008b7e91ea7c-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Tom St Denis @ 2017-10-13 11:10 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

At the very least I don't get the hangs like I used to before the 
patchset so this patch isn't regressing any of that behaviour.

Tom


On 13/10/17 07:06 AM, Liu, Monk wrote:
> Yeah, this one looks good
> 
> You can put my reviewed-by on it
> 
> *From:*Koenig, Christian
> *Sent:* 2017年10月13日18:14
> *To:* Liu, Monk <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; 
> amd-gfx@lists.freedesktop.org
> *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
> 
> Good point as well. How about the attached version?
> 
> This time we keep an extra reference in amd_sched_process_job() until we 
> are sure that we don't need the s_fence any more.
> 
> Regards,
> Christian.
> 
> Am 13.10.2017 um 11:13 schrieb Liu, Monk:
> 
>     your patch looks good,  do you think we should also do this:
> 
>       void amd_sched_fence_scheduled(struct amd_sched_fence *fence)
>       {
>     -       int ret = fence_signal(&fence->scheduled);
>     +       int ret;
>     +
>     +       fence_get(&fence->scheduled;)
>     +       ret = fence_signal(&fence->scheduled);
> 
>              if (!ret)
>                      FENCE_TRACE(&fence->scheduled, "signaled from irq
>     context\n");
>              else
>                      FENCE_TRACE(&fence->scheduled, "was already
>     signaled\n");
>     +       fence_put(&fence->scheduled);
>       }
> 
>     ------------------------------------------------------------------------
> 
>     *From:*Koenig, Christian
>     *Sent:* Friday, October 13, 2017 5:00:27 PM
>     *To:* Liu, Monk; Nicolai Hähnle; amd-gfx@lists.freedesktop.org
>     <mailto:amd-gfx@lists.freedesktop.org>
>     *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
> 
>         There is chance that free_job() called before that
>         “trace_amd_sched_process_job”, correct?
> 
>     Correct, but that is harmless.
> 
>     Take a look what trace_amd_sched_process_job actually does, it just
>     prints the pointer of the fence structure (but the pointer might be
>     stale at this point).
> 
>     Nevertheless you are right that this is really ugly.
> 
>     How about the attached patch to fix the issue?
> 
>     Regards,
>     Christian.
> 
>     Am 13.10.2017 um 10:51 schrieb Liu, Monk:
> 
>         The free_job() is called in sched_job_finish() which is queued
>         on a WORK and scheduled from that “amd_sched_fence_finished()”
> 
>         So the finishing timing of free_job() is asynchronized with
>         sched_process_job()
> 
>         There is chance that free_job() called before that
>         “trace_amd_sched_process_job”, correct?
> 
>         And if so the s_fence referred by it maybe a wild pointer
> 
>         BR Monk
> 
>         *From:*Liu, Monk
>         *Sent:* 2017年10月13日16:49
>         *To:* Koenig, Christian <Christian.Koenig@amd.com>
>         <mailto:Christian.Koenig@amd.com>; Nicolai Hähnle
>         <nhaehnle@gmail.com> <mailto:nhaehnle@gmail.com>;
>         amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>         *Subject:* RE: regression with
>         d6c650c0a8f6f671e49553725e1db541376d95f2
> 
>         No that’s not true
> 
>         The free_job() is called in sched_job_finish() which is queued
>         on a WORK and scheduled from that “amd_sched_fence_finished()”
> 
>         So the finishing timing of free_job() is asynchronized with
>         sched_process_job()
> 
>         How can you sure free_job() must before
>         “trace_amd_sched_process_job” ?
> 
>         *From:*Koenig, Christian
>         *Sent:* 2017年10月13日16:43
>         *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>;
>         Nicolai Hähnle <nhaehnle@gmail.com <mailto:nhaehnle@gmail.com>>;
>         amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>         *Subject:* Re: regression with
>         d6c650c0a8f6f671e49553725e1db541376d95f2
> 
>         The free_job() callback is called only way after the job has
>         finished.
> 
>         That is one change actually made by you to the code :)
> 
>         Christian.
> 
>         Am 13.10.2017 um 10:39 schrieb Liu, Monk:
> 
>             I doubt it would always work fine…
> 
>             First, we have FENCE_TRACE reference s_fence->finished after
>             “fence_signal(&fence->finished)”
> 
>             Second, we have trace_amd_sched_proess_job(s_fence) after
>             “amd_sched_fence_finished()”,
> 
>             If you put the finished before free_job() and by coincidence
>             the job_finish() get very soon executed you’ll have odds to
>             hit wild pointer on above two cases
> 
>             BR Monk
> 
>             *From:*Koenig, Christian
>             *Sent:* 2017年10月13日16:17
>             *To:* Liu, Monk <Monk.Liu@amd.com>
>             <mailto:Monk.Liu@amd.com>; Nicolai Hähnle
>             <nhaehnle@gmail.com> <mailto:nhaehnle@gmail.com>;
>             amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>
>             *Subject:* Re: regression with
>             d6c650c0a8f6f671e49553725e1db541376d95f2
> 
>             Yeah, that change is actually incorrect and should be reverted.
> 
>             What we really need to do is remove dropping
>             sched_job->s_fence from amd_sched_process_job() into
>             amd_sched_job_finish() directly before the call to free_job().
> 
>             Regards,
>             Christian.
> 
>             Am 13.10.2017 um 09:24 schrieb Liu, Monk:
> 
>                 commit d6c650c0a8f6f671e49553725e1db541376d95f2
>                 Author: Nicolai Hähnle <nicolai.haehnle@amd.com>
>                 <mailto:nicolai.haehnle@amd.com>
>                 @@ -611,6 +611,10 @@ static int amd_sched_main(void *param)
> 
>                                  fence = sched->ops->run_job(sched_job);
>                                  amd_sched_fence_scheduled(s_fence);
>                 +
>                 +               /* amd_sched_process_job drops the job's
>                 reference of the fence. */
>                 +               sched_job->s_fence = NULL;
>                 +
>                                  if (fence) {
>                                          s_fence->parent =
>                 dma_fence_get(fence);
>                                          r =
>                 dma_fence_add_callback(fence, &s_fence->cb,
> 
>                 Hi Nicolai
> 
>                 with this patch, you will break
>                 "amdgpu_sched_hw_job_reset()"routine:
> 
>                 void
> 
>                 amd_sched_hw_job_reset(struct amd_gpu_scheduler
> 
>                 *sched)
> 
>                 {
> 
>                      struct amd_sched_job
> 
>                 *s_job;
> 
>                      spin_lock(&sched->job_list_lock);
> 
>                      list_for_each_entry_reverse(s_job,
> 
>                 &sched->ring_mirror_list, node) {
> 
>                          if (s_job->s_fence->parent
> 
>                 &&
> 
>                           fence_remove_callback(s_job->s_fence->parent,
> 
>                                       &s_job->s_fence->cb))
> 
>                 {
> 
>                              fence_put(s_job->s_fence->parent);
> 
>                              s_job->s_fence->parent
> 
>                 =
> 
>                 NULL;
> 
>                              atomic_dec(&sched->hw_rq_count);
> 
>                          }
> 
>                      }
> 
>                      spin_unlock(&sched->job_list_lock);
> 
>                 }
> 
>                 see that without sched_job->s_fence, you cannot remove
>                 the callback from its hw fence,
> 
>                 any idea??
> 
>                 BR Monk
> 
> 
> 
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
       [not found]                                         ` <18c873f5-987d-d21e-0fa3-008b7e91ea7c-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-13 11:15                                           ` Liu, Monk
  0 siblings, 0 replies; 15+ messages in thread
From: Liu, Monk @ 2017-10-13 11:15 UTC (permalink / raw)
  To: StDenis, Tom, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Because you didn't try GPU reset feature

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Tom St Denis
Sent: 2017年10月13日 19:11
To: amd-gfx@lists.freedesktop.org
Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

At the very least I don't get the hangs like I used to before the patchset so this patch isn't regressing any of that behaviour.

Tom


On 13/10/17 07:06 AM, Liu, Monk wrote:
> Yeah, this one looks good
> 
> You can put my reviewed-by on it
> 
> *From:*Koenig, Christian
> *Sent:* 2017年10月13日18:14
> *To:* Liu, Monk <Monk.Liu@amd.com>; Nicolai Hähnle 
> <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org
> *Subject:* Re: regression with 
> d6c650c0a8f6f671e49553725e1db541376d95f2
> 
> Good point as well. How about the attached version?
> 
> This time we keep an extra reference in amd_sched_process_job() until 
> we are sure that we don't need the s_fence any more.
> 
> Regards,
> Christian.
> 
> Am 13.10.2017 um 11:13 schrieb Liu, Monk:
> 
>     your patch looks good,  do you think we should also do this:
> 
>       void amd_sched_fence_scheduled(struct amd_sched_fence *fence)
>       {
>     -       int ret = fence_signal(&fence->scheduled);
>     +       int ret;
>     +
>     +       fence_get(&fence->scheduled;)
>     +       ret = fence_signal(&fence->scheduled);
> 
>              if (!ret)
>                      FENCE_TRACE(&fence->scheduled, "signaled from irq
>     context\n");
>              else
>                      FENCE_TRACE(&fence->scheduled, "was already
>     signaled\n");
>     +       fence_put(&fence->scheduled);
>       }
> 
>     
> ----------------------------------------------------------------------
> --
> 
>     *From:*Koenig, Christian
>     *Sent:* Friday, October 13, 2017 5:00:27 PM
>     *To:* Liu, Monk; Nicolai Hähnle; amd-gfx@lists.freedesktop.org
>     <mailto:amd-gfx@lists.freedesktop.org>
>     *Subject:* Re: regression with 
> d6c650c0a8f6f671e49553725e1db541376d95f2
> 
>         There is chance that free_job() called before that
>         “trace_amd_sched_process_job”, correct?
> 
>     Correct, but that is harmless.
> 
>     Take a look what trace_amd_sched_process_job actually does, it just
>     prints the pointer of the fence structure (but the pointer might be
>     stale at this point).
> 
>     Nevertheless you are right that this is really ugly.
> 
>     How about the attached patch to fix the issue?
> 
>     Regards,
>     Christian.
> 
>     Am 13.10.2017 um 10:51 schrieb Liu, Monk:
> 
>         The free_job() is called in sched_job_finish() which is queued
>         on a WORK and scheduled from that “amd_sched_fence_finished()”
> 
>         So the finishing timing of free_job() is asynchronized with
>         sched_process_job()
> 
>         There is chance that free_job() called before that
>         “trace_amd_sched_process_job”, correct?
> 
>         And if so the s_fence referred by it maybe a wild pointer
> 
>         BR Monk
> 
>         *From:*Liu, Monk
>         *Sent:* 2017年10月13日16:49
>         *To:* Koenig, Christian <Christian.Koenig@amd.com>
>         <mailto:Christian.Koenig@amd.com>; Nicolai Hähnle
>         <nhaehnle@gmail.com> <mailto:nhaehnle@gmail.com>;
>         amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>         *Subject:* RE: regression with
>         d6c650c0a8f6f671e49553725e1db541376d95f2
> 
>         No that’s not true
> 
>         The free_job() is called in sched_job_finish() which is queued
>         on a WORK and scheduled from that “amd_sched_fence_finished()”
> 
>         So the finishing timing of free_job() is asynchronized with
>         sched_process_job()
> 
>         How can you sure free_job() must before
>         “trace_amd_sched_process_job” ?
> 
>         *From:*Koenig, Christian
>         *Sent:* 2017年10月13日16:43
>         *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>;
>         Nicolai Hähnle <nhaehnle@gmail.com <mailto:nhaehnle@gmail.com>>;
>         amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>         *Subject:* Re: regression with
>         d6c650c0a8f6f671e49553725e1db541376d95f2
> 
>         The free_job() callback is called only way after the job has
>         finished.
> 
>         That is one change actually made by you to the code :)
> 
>         Christian.
> 
>         Am 13.10.2017 um 10:39 schrieb Liu, Monk:
> 
>             I doubt it would always work fine…
> 
>             First, we have FENCE_TRACE reference s_fence->finished after
>             “fence_signal(&fence->finished)”
> 
>             Second, we have trace_amd_sched_proess_job(s_fence) after
>             “amd_sched_fence_finished()”,
> 
>             If you put the finished before free_job() and by coincidence
>             the job_finish() get very soon executed you’ll have odds to
>             hit wild pointer on above two cases
> 
>             BR Monk
> 
>             *From:*Koenig, Christian
>             *Sent:* 2017年10月13日16:17
>             *To:* Liu, Monk <Monk.Liu@amd.com>
>             <mailto:Monk.Liu@amd.com>; Nicolai Hähnle
>             <nhaehnle@gmail.com> <mailto:nhaehnle@gmail.com>;
>             amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>
>             *Subject:* Re: regression with
>             d6c650c0a8f6f671e49553725e1db541376d95f2
> 
>             Yeah, that change is actually incorrect and should be reverted.
> 
>             What we really need to do is remove dropping
>             sched_job->s_fence from amd_sched_process_job() into
>             amd_sched_job_finish() directly before the call to free_job().
> 
>             Regards,
>             Christian.
> 
>             Am 13.10.2017 um 09:24 schrieb Liu, Monk:
> 
>                 commit d6c650c0a8f6f671e49553725e1db541376d95f2
>                 Author: Nicolai Hähnle <nicolai.haehnle@amd.com>
>                 <mailto:nicolai.haehnle@amd.com>
>                 @@ -611,6 +611,10 @@ static int amd_sched_main(void 
> *param)
> 
>                                  fence = sched->ops->run_job(sched_job);
>                                  amd_sched_fence_scheduled(s_fence);
>                 +
>                 +               /* amd_sched_process_job drops the job's
>                 reference of the fence. */
>                 +               sched_job->s_fence = NULL;
>                 +
>                                  if (fence) {
>                                          s_fence->parent =
>                 dma_fence_get(fence);
>                                          r =
>                 dma_fence_add_callback(fence, &s_fence->cb,
> 
>                 Hi Nicolai
> 
>                 with this patch, you will break
>                 "amdgpu_sched_hw_job_reset()"routine:
> 
>                 void
> 
>                 amd_sched_hw_job_reset(struct amd_gpu_scheduler
> 
>                 *sched)
> 
>                 {
> 
>                      struct amd_sched_job
> 
>                 *s_job;
> 
>                      spin_lock(&sched->job_list_lock);
> 
>                      list_for_each_entry_reverse(s_job,
> 
>                 &sched->ring_mirror_list, node) {
> 
>                          if (s_job->s_fence->parent
> 
>                 &&
> 
>                           
> fence_remove_callback(s_job->s_fence->parent,
> 
>                                       &s_job->s_fence->cb))
> 
>                 {
> 
>                              fence_put(s_job->s_fence->parent);
> 
>                              s_job->s_fence->parent
> 
>                 =
> 
>                 NULL;
> 
>                              atomic_dec(&sched->hw_rq_count);
> 
>                          }
> 
>                      }
> 
>                      spin_unlock(&sched->job_list_lock);
> 
>                 }
> 
>                 see that without sched_job->s_fence, you cannot remove
>                 the callback from its hw fence,
> 
>                 any idea??
> 
>                 BR Monk
> 
> 
> 
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
       [not found]             ` <BLUPR12MB0449FE32E68AC10504D8E7A684480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-17  8:57               ` Nicolai Hähnle
       [not found]                 ` <ae88aac7-0800-df06-5802-ad5376af0c0d-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Nicolai Hähnle @ 2017-10-17  8:57 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 13.10.2017 10:46, Liu, Monk wrote:
> Just revert Nicolai’s patch,if other routine want to reference s_fence, 
> it should get the finished fence in the first place/time,
> 
> For gpu_reset routine, it refers to s_fence only on those unfinished job 
> in sched_hw_job_reset, so totally safe to refer to s_fence pointer
> 
> I wonder what issue Nicolai met with and submitted this patch

The original motivation of my patch was to catch accidental use of 
job->s_fence after the fence was destroyed in amd_sched_process_job. 
Basically, prevent a dangling pointer. I still think that's a reasonable 
idea, though clearly my first attempt at it was just wrong.

In Christian's v2 patch, it might make sense to add

	spin_unlock(&sched->job_list_lock);
+	dma_fence_put(&s_job->s_fence->finished);
+	s_job->s_fence = NULL;
	sched->ops->free_job(s_job);

... though I'm not 100% certain about how the fence lifetimes work.

Cheers,
Nicolai


> 
> BR Monk
> 
> *From:*amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] *On Behalf 
> Of *Liu, Monk
> *Sent:* 2017年10月13日16:40
> *To:* Koenig, Christian <Christian.Koenig@amd.com>; Nicolai Hähnle 
> <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org
> *Subject:* RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
> 
> I doubt it would always work fine…
> 
> First, we have FENCE_TRACE reference s_fence->finished after 
> “fence_signal(&fence->finished)”
> 
> Second, we have trace_amd_sched_proess_job(s_fence) after 
> “amd_sched_fence_finished()”,
> 
> If you put the finished before free_job() and by coincidence the 
> job_finish() get very soon executed you’ll have odds to hit wild pointer 
> on above two cases
> 
> BR Monk
> 
> *From:*Koenig, Christian
> *Sent:* 2017年10月13日16:17
> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; Nicolai 
> Hähnle <nhaehnle@gmail.com <mailto:nhaehnle@gmail.com>>; 
> amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
> *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
> 
> Yeah, that change is actually incorrect and should be reverted.
> 
> What we really need to do is remove dropping sched_job->s_fence from 
> amd_sched_process_job() into amd_sched_job_finish() directly before the 
> call to free_job().
> 
> Regards,
> Christian.
> 
> Am 13.10.2017 um 09:24 schrieb Liu, Monk:
> 
>     commit d6c650c0a8f6f671e49553725e1db541376d95f2
>     Author: Nicolai Hähnle <nicolai.haehnle@amd.com>
>     <mailto:nicolai.haehnle@amd.com>
>     @@ -611,6 +611,10 @@ static int amd_sched_main(void *param)
> 
>                      fence = sched->ops->run_job(sched_job);
>                      amd_sched_fence_scheduled(s_fence);
>     +
>     +               /* amd_sched_process_job drops the job's reference
>     of the fence. */
>     +               sched_job->s_fence = NULL;
>     +
>                      if (fence) {
>                              s_fence->parent = dma_fence_get(fence);
>                              r = dma_fence_add_callback(fence, &s_fence->cb,
> 
>     Hi Nicolai
> 
>     with this patch, you will break "amdgpu_sched_hw_job_reset()"routine:
> 
>     void
> 
>     amd_sched_hw_job_reset(structamd_gpu_scheduler
> 
>     *sched)
> 
>     {
> 
>     structamd_sched_job
> 
>     *s_job;
> 
>     spin_lock(&sched->job_list_lock);
> 
>     list_for_each_entry_reverse(s_job,
> 
>     &sched->ring_mirror_list, node) {
> 
>     if(s_job->s_fence->parent
> 
>     &&
> 
>     fence_remove_callback(s_job->s_fence->parent,
> 
>                           &s_job->s_fence->cb))
> 
>     {
> 
>     fence_put(s_job->s_fence->parent);
> 
>                  s_job->s_fence->parent
> 
>     =
> 
>     NULL;
> 
>     atomic_dec(&sched->hw_rq_count);
> 
>              }
> 
>          }
> 
>     spin_unlock(&sched->job_list_lock);
> 
>     }
> 
>     see that without sched_job->s_fence, you cannot remove the callback
>     from its hw fence,
> 
>     any idea??
> 
>     BR Monk
> 


-- 
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
       [not found]                 ` <ae88aac7-0800-df06-5802-ad5376af0c0d-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-10-17  9:13                   ` Christian König
  0 siblings, 0 replies; 15+ messages in thread
From: Christian König @ 2017-10-17  9:13 UTC (permalink / raw)
  To: Nicolai Hähnle, Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Am 17.10.2017 um 10:57 schrieb Nicolai Hähnle:
> On 13.10.2017 10:46, Liu, Monk wrote:
>> Just revert Nicolai’s patch,if other routine want to reference 
>> s_fence, it should get the finished fence in the first place/time,
>>
>> For gpu_reset routine, it refers to s_fence only on those unfinished 
>> job in sched_hw_job_reset, so totally safe to refer to s_fence pointer
>>
>> I wonder what issue Nicolai met with and submitted this patch
>
> The original motivation of my patch was to catch accidental use of 
> job->s_fence after the fence was destroyed in amd_sched_process_job. 
> Basically, prevent a dangling pointer. I still think that's a 
> reasonable idea, though clearly my first attempt at it was just wrong.

It was very reasonable that you pointed out the problem, it's just that 
the solution to set it to NULL was wrong.

Instead we should just have increased the lifetime of the reference.

>
> In Christian's v2 patch, it might make sense to add
>
>     spin_unlock(&sched->job_list_lock);
> +    dma_fence_put(&s_job->s_fence->finished);
> +    s_job->s_fence = NULL;
>     sched->ops->free_job(s_job);
>
> ... though I'm not 100% certain about how the fence lifetimes work.

That's a bit pointless because sched->ops->free_job() will free s_job as 
the next thing we do here.

Regards,
Christian.

>
> Cheers,
> Nicolai
>
>
>>
>> BR Monk
>>
>> *From:*amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] *On 
>> Behalf Of *Liu, Monk
>> *Sent:* 2017年10月13日16:40
>> *To:* Koenig, Christian <Christian.Koenig@amd.com>; Nicolai Hähnle 
>> <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org
>> *Subject:* RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>>
>> I doubt it would always work fine…
>>
>> First, we have FENCE_TRACE reference s_fence->finished after 
>> “fence_signal(&fence->finished)”
>>
>> Second, we have trace_amd_sched_proess_job(s_fence) after 
>> “amd_sched_fence_finished()”,
>>
>> If you put the finished before free_job() and by coincidence the 
>> job_finish() get very soon executed you’ll have odds to hit wild 
>> pointer on above two cases
>>
>> BR Monk
>>
>> *From:*Koenig, Christian
>> *Sent:* 2017年10月13日16:17
>> *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>; Nicolai 
>> Hähnle <nhaehnle@gmail.com <mailto:nhaehnle@gmail.com>>; 
>> amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>> *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>>
>> Yeah, that change is actually incorrect and should be reverted.
>>
>> What we really need to do is remove dropping sched_job->s_fence from 
>> amd_sched_process_job() into amd_sched_job_finish() directly before 
>> the call to free_job().
>>
>> Regards,
>> Christian.
>>
>> Am 13.10.2017 um 09:24 schrieb Liu, Monk:
>>
>>     commit d6c650c0a8f6f671e49553725e1db541376d95f2
>>     Author: Nicolai Hähnle <nicolai.haehnle@amd.com>
>>     <mailto:nicolai.haehnle@amd.com>
>>     @@ -611,6 +611,10 @@ static int amd_sched_main(void *param)
>>
>>                      fence = sched->ops->run_job(sched_job);
>>                      amd_sched_fence_scheduled(s_fence);
>>     +
>>     +               /* amd_sched_process_job drops the job's reference
>>     of the fence. */
>>     +               sched_job->s_fence = NULL;
>>     +
>>                      if (fence) {
>>                              s_fence->parent = dma_fence_get(fence);
>>                              r = dma_fence_add_callback(fence, 
>> &s_fence->cb,
>>
>>     Hi Nicolai
>>
>>     with this patch, you will break 
>> "amdgpu_sched_hw_job_reset()"routine:
>>
>>     void
>>
>>     amd_sched_hw_job_reset(structamd_gpu_scheduler
>>
>>     *sched)
>>
>>     {
>>
>>     structamd_sched_job
>>
>>     *s_job;
>>
>>     spin_lock(&sched->job_list_lock);
>>
>>     list_for_each_entry_reverse(s_job,
>>
>>     &sched->ring_mirror_list, node) {
>>
>>     if(s_job->s_fence->parent
>>
>>     &&
>>
>>     fence_remove_callback(s_job->s_fence->parent,
>>
>>                           &s_job->s_fence->cb))
>>
>>     {
>>
>>     fence_put(s_job->s_fence->parent);
>>
>>                  s_job->s_fence->parent
>>
>>     =
>>
>>     NULL;
>>
>>     atomic_dec(&sched->hw_rq_count);
>>
>>              }
>>
>>          }
>>
>>     spin_unlock(&sched->job_list_lock);
>>
>>     }
>>
>>     see that without sched_job->s_fence, you cannot remove the callback
>>     from its hw fence,
>>
>>     any idea??
>>
>>     BR Monk
>>
>
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-10-17  9:13 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-13  7:24 regression with d6c650c0a8f6f671e49553725e1db541376d95f2 Liu, Monk
     [not found] ` <BLUPR12MB044941343D6D1A35996C7B4284480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-13  8:16   ` Christian König
     [not found]     ` <b0f57d8a-de01-addd-c17d-7af25e15f36d-5C7GfCeVMHo@public.gmane.org>
2017-10-13  8:39       ` Liu, Monk
     [not found]         ` <BLUPR12MB04490BB257118A34DE346E0984480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-13  8:42           ` Christian König
     [not found]             ` <6cf3017b-d694-1bfd-1f5a-861fd824c014-5C7GfCeVMHo@public.gmane.org>
2017-10-13  8:49               ` Liu, Monk
     [not found]                 ` <BLUPR12MB044988754711ABD7BF66B19184480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-13  8:51                   ` Liu, Monk
     [not found]                     ` <BLUPR12MB044908B9AECE89D7E942EC7F84480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-13  9:00                       ` Christian König
     [not found]                         ` <53f57508-6669-82ed-1b95-011f30610e05-5C7GfCeVMHo@public.gmane.org>
2017-10-13  9:13                           ` Liu, Monk
     [not found]                             ` <BLUPR12MB0449CD46159EC28E81B823C184480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-13 10:13                               ` Christian König
     [not found]                                 ` <e9d3674e-8ccf-c709-0dd9-b1fa493e086f-5C7GfCeVMHo@public.gmane.org>
2017-10-13 11:06                                   ` Liu, Monk
     [not found]                                     ` <BLUPR12MB04493DEEB32374DB3F64098D84480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-13 11:10                                       ` Tom St Denis
     [not found]                                         ` <18c873f5-987d-d21e-0fa3-008b7e91ea7c-5C7GfCeVMHo@public.gmane.org>
2017-10-13 11:15                                           ` Liu, Monk
2017-10-13  8:46           ` Liu, Monk
     [not found]             ` <BLUPR12MB0449FE32E68AC10504D8E7A684480-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-17  8:57               ` Nicolai Hähnle
     [not found]                 ` <ae88aac7-0800-df06-5802-ad5376af0c0d-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-17  9:13                   ` Christian König

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.