* [RFC] a new approach to detect which ring is the real black sheep upon TDR reported
@ 2021-02-26  5:58 Liu, Monk
  2021-02-26  7:57 ` Christian König
  0 siblings, 1 reply; 13+ messages in thread
From: Liu, Monk @ 2021-02-26  5:58 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zhang, Andy, Chen, Horace, Koenig, Christian, Zhang,  Jack (Jian)



[AMD Public Use]

Hi all

The NAVI2X project has hit a really hard-to-solve issue, and it turns out to be a general headache of our TDR mechanism. Consider the scenario below:


  1.  A job1 is running on the compute1 ring
  2.  A job2 is running on the gfx ring
  3.  Job1 is the guilty one, and job1/job2 were scheduled to their rings at almost the same timestamp
  4.  After 2 seconds we receive two TDR reports, one from the GFX ring and one from the compute ring
  5.  The current scheme is that, in the drm scheduler, the head jobs of both rings are considered "bad jobs" and taken away from the mirror list
  6.  The result is that both the real guilty job (job1) and the innocent job (job2) are deleted from their mirror lists, and their corresponding contexts are also treated as guilty (so keeping the innocent process running is not guaranteed)


But what we ideally want is a TDR mechanism that can detect which ring is the guilty one, so that the innocent ring can resubmit all of its pending jobs:

  1.  Job1 is deleted from the compute1 ring's mirror list
  2.  Job2 is kept and resubmitted later, and its owning process/context is not even aware of this TDR at all


Here is a proposal that tries to achieve the above goal; its rough procedure is:

  1.  Once any ring reports a TDR, the head job is *not* treated as a "bad job", and it is *not* deleted from the mirror list in the drm sched functions
  2.  In the vendor's function (our amdgpu driver here):
     *   reset the GPU
     *   repeat the actions below on each ring, *one by one*:

         1. take the head job and submit it on this ring

         2. see if it completes; if not, this job is the real "bad job"

         3. take it away from the mirror list if this head job is the "bad job"

     *   after the above iteration over all rings, we have already cleared out all the bad job(s)
  3.  Resubmit all jobs from each mirror list to their corresponding rings (this is the existing logic)

The idea is to re-run and re-check the head job of each ring in a "serial" way, in order to single out the real black sheep and its guilty context; a rough sketch follows.
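
Something like this, just as a sketch of the loop; all helper names (pick_head_job(), resubmit_job(), wait_for_job(), mark_guilty(), remove_from_mirror_list(), amdgpu_full_gpu_reset()) are hypothetical placeholders, not existing amdgpu/drm_sched functions, and locking/error handling is omitted:

/* Hypothetical sketch of the proposed serial re-check; helper names are
 * placeholders only. */
static void amdgpu_serial_tdr_check(struct amdgpu_device *adev)
{
        int i;

        amdgpu_full_gpu_reset(adev);                    /* step 2.a: reset the GPU */

        for (i = 0; i < adev->num_rings; i++) {         /* step 2.b: each ring, one by one */
                struct amdgpu_ring *ring = adev->rings[i];
                struct drm_sched_job *head = pick_head_job(&ring->sched);

                if (!head)
                        continue;

                resubmit_job(ring, head);               /* 1. submit only the head job */
                if (!wait_for_job(ring, head)) {        /* 2. did it complete in time? */
                        mark_guilty(head);              /* 3. this is the real "bad job" */
                        remove_from_mirror_list(head);
                        amdgpu_full_gpu_reset(adev);    /* assumed: reset again before testing the next ring */
                }
        }

        /* After the loop every mirror list only contains innocent jobs, so the
         * existing recovery path can resubmit them all (step 3 above). */
}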

P.S.: we only need this approach when a GFX/KCQ ring reports a TDR, since those rings affect each other. An SDMA ring timeout definitely proves that the head job on the SDMA ring is really guilty.

Thanks

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------



* Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported
  2021-02-26  5:58 [RFC] a new approach to detect which ring is the real black sheep upon TDR reported Liu, Monk
@ 2021-02-26  7:57 ` Christian König
  2021-02-26 11:54   ` Liu, Monk
  0 siblings, 1 reply; 13+ messages in thread
From: Christian König @ 2021-02-26  7:57 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx; +Cc: Zhang, Andy, Chen, Horace, Zhang, Jack (Jian)



Hi Monk,

in general an interesting idea, but I see two major problems with that:

1. It would make the reset take much longer.

2. Things often get stuck because of timing issues, so a guilty job 
might pass perfectly fine when run a second time.

Apart from that, the whole ring mirror list turned out to be a really bad 
idea. E.g. we still struggle with object lifetime because the concept 
doesn't fit into the object model of the GPU scheduler under Linux.

We should probably work on this separately and straighten up the job 
destruction once more and keep the recovery information in the fence 
instead.

Regards,
Christian.

On 26.02.21 at 06:58, Liu, Monk wrote:
> [...]



* RE: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported
  2021-02-26  7:57 ` Christian König
@ 2021-02-26 11:54   ` Liu, Monk
  2021-02-26 11:57     ` Liu, Monk
  2021-02-26 16:49     ` Andrey Grodzovsky
  0 siblings, 2 replies; 13+ messages in thread
From: Liu, Monk @ 2021-02-26 11:54 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx; +Cc: Zhang, Andy, Chen, Horace, Zhang, Jack (Jian)



[AMD Official Use Only - Internal Distribution Only]

See inline

Thanks

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Friday, February 26, 2021 3:58 PM
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Zhang, Andy <Andy.Zhang@amd.com>; Chen, Horace <Horace.Chen@amd.com>; Zhang, Jack (Jian) <Jack.Zhang1@amd.com>
Subject: Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported

Hi Monk,

in general an interesting idea, but I see two major problems with that:

1. It would make the reset take much longer.

2. Things often get stuck because of timing issues, so a guilty job might pass perfectly fine when run a second time.
[ML] But the innocent ring has already reported a TDR, and the drm sched logic has already deleted this "sched_job" from its mirror list, so you have no chance to re-submit it again after the reset; that's the major problem here.

Apart from that, the whole ring mirror list turned out to be a really bad idea. E.g. we still struggle with object lifetime because the concept doesn't fit into the object model of the GPU scheduler under Linux.

We should probably work on this separately and straighten up the job destruction once more and keep the recovery information in the fence instead.
[ML] We promise our customers that no innocent process will be dropped or cancelled, and our current logic works most of the time; only when different processes are running on the gfx/compute rings do we run into the tricky situation I described here, and this proposal is the only way I can figure out so far. Do you have a better solution or idea that we can review as another candidate RFC? Note that we raised this proposal because we really did hit this trouble and we really do need to resolve it, so even an imperfect solution is still better than just cancelling the innocent job (and its context/process).
Thanks !

Regards,
Christian.
On 26.02.21 at 06:58, Liu, Monk wrote:

[...]


* RE: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported
  2021-02-26 11:54   ` Liu, Monk
@ 2021-02-26 11:57     ` Liu, Monk
  2021-02-26 12:05       ` Christian König
  2021-02-26 16:49     ` Andrey Grodzovsky
  1 sibling, 1 reply; 13+ messages in thread
From: Liu, Monk @ 2021-02-26 11:57 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx; +Cc: Zhang, Andy, Chen, Horace, Zhang, Jack (Jian)



[AMD Official Use Only - Internal Distribution Only]

static void drm_sched_job_timedout(struct work_struct *work)
{
    struct drm_gpu_scheduler *sched;
    struct drm_sched_job *job;

    sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);

    /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
    spin_lock(&sched->job_list_lock);
    job = list_first_entry_or_null(&sched->ring_mirror_list,
                       struct drm_sched_job, node);

    if (job) {
        /*
         * Remove the bad job so it cannot be freed by concurrent
         * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
         * is parked at which point it's safe.
         */
        list_del_init(&job->node);
        spin_unlock(&sched->job_list_lock);

        job->sched->ops->timedout_job(job);

Thanks

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

From: Liu, Monk
Sent: Friday, February 26, 2021 7:54 PM
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Zhang, Andy <Andy.Zhang@amd.com>; Chen, Horace <Horace.Chen@amd.com>; Zhang, Jack (Jian) <Jack.Zhang1@amd.com>
Subject: RE: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported

[...]


* Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported
  2021-02-26 11:57     ` Liu, Monk
@ 2021-02-26 12:05       ` Christian König
  2021-02-27  3:50         ` Re: " Liu, Monk
  0 siblings, 1 reply; 13+ messages in thread
From: Christian König @ 2021-02-26 12:05 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx; +Cc: Zhang, Andy, Chen, Horace, Zhang, Jack (Jian)



Yeah, that is exactly the stuff which doesn't work at all. We got 
feedback from multiple people that this whole approach of tying the job 
to the TDR was not a good idea at all.

What we should do instead is to have a pointer in the scheduler fence to 
which job it belongs. Freeing up the job when the scheduler fence is 
signaled is then the job of the driver and not the scheduler any more.

The scheduler then gives the scheduler fence to the driver when a 
timeout is detected and the driver can do the rest of the handling all 
by itself.
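
Roughly something like this, just to illustrate the direction; the job back-pointer and the fence-based timeout callback below are hypothetical, not today's drm_sched API:

/* Hypothetical sketch only; these members/callbacks do not exist today. */
struct drm_sched_fence {
        struct dma_fence         scheduled;
        struct dma_fence         finished;
        /* ... existing members ... */
        struct drm_sched_job    *job;      /* assumed: back-pointer to the owning job */
};

struct drm_sched_backend_ops {
        /* ... existing callbacks ... */
        /* assumed: on timeout the scheduler hands over only the fence; the
         * driver looks up the job, decides guilt, and frees the job itself
         * once the finished fence signals */
        void (*timedout_fence)(struct drm_sched_fence *fence);
};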

But this problem is orthogonal to the suggested solution here.

> do you have a better solution or idea we review it as another 
> candidate RFC ?

I don't see much other option either. We could do something like only 
allowing one application at a time to use the gfx/compute block, but 
that would be even worse.

If we do this we should probably make it configurable as a module parameter.

Regards,
Christian.

On 26.02.21 at 12:57, Liu, Monk wrote:
> [...]



* Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported
  2021-02-26 11:54   ` Liu, Monk
  2021-02-26 11:57     ` Liu, Monk
@ 2021-02-26 16:49     ` Andrey Grodzovsky
  2021-02-27  3:56       ` 回复: " Liu, Monk
  1 sibling, 1 reply; 13+ messages in thread
From: Andrey Grodzovsky @ 2021-02-26 16:49 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx
  Cc: Zhang, Andy, Chen, Horace, Zhang, Jack (Jian)




On 2021-02-26 6:54 a.m., Liu, Monk wrote:
> [...]
>
> [ML] But the innocent ring has already reported a TDR, and the drm sched
> logic has already deleted this "sched_job" from its mirror list, so you
> have no chance to re-submit it again after the reset; that's the major
> problem here.
>

Just to confirm I understand correctly: Monk reports a scenario where 
the second TDR, reported by the innocent job, is bailing out BEFORE 
having a chance to run drm_sched_stop for that scheduler, which would 
have reinserted the job back into the mirror list (because the first 
TDR run is still in progress and hence amdgpu_device_lock_adev fails for 
the second TDR), and so the innocent job which was extracted from the 
mirror list in drm_sched_job_timedout is now lost.
If so, and as a possible quick fix until we overhaul the entire design as 
suggested in this thread - maybe we can modify the 
drm_sched_backend_ops.timedout_job callback to report back premature 
termination BEFORE drm_sched_stop had a chance to run, and then reinsert 
the job back into the mirror list from within drm_sched_job_timedout? There 
is no problem of racing against a concurrent drm_sched_get_cleanup_job 
once we reinsert there, as we don't reference the job pointer anymore 
after that point, so even if it's already signaled and freed right away - 
it's ok.
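
A rough sketch of that quick fix, assuming timedout_job can report a status back to the scheduler; the enum values and return convention below are made up purely for illustration:

/* Hypothetical status the driver could return from timedout_job(). */
enum drm_sched_tdr_status {
        DRM_SCHED_TDR_HANDLED,  /* full reset path ran, jobs were reinserted */
        DRM_SCHED_TDR_BAILED,   /* handler bailed out before drm_sched_stop() */
};

static void drm_sched_job_timedout(struct work_struct *work)
{
        struct drm_gpu_scheduler *sched;
        struct drm_sched_job *job;

        sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);

        spin_lock(&sched->job_list_lock);
        job = list_first_entry_or_null(&sched->ring_mirror_list,
                                       struct drm_sched_job, node);
        if (job) {
                /* remove the job as today so concurrent cleanup cannot free it */
                list_del_init(&job->node);
                spin_unlock(&sched->job_list_lock);

                if (job->sched->ops->timedout_job(job) == DRM_SCHED_TDR_BAILED) {
                        /* the driver never reached drm_sched_stop(), e.g. another
                         * TDR already holds the adev lock: put the job back so it
                         * is not lost; after reinserting we never dereference the
                         * job pointer again, so a concurrent free is harmless */
                        spin_lock(&sched->job_list_lock);
                        list_add(&job->node, &sched->ring_mirror_list);
                        spin_unlock(&sched->job_list_lock);
                }
        } else {
                spin_unlock(&sched->job_list_lock);
        }

        /* ... re-arm the timeout as in the existing code ... */
}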

Andrey


> [...]


* Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported
  2021-02-26 12:05       ` Christian König
@ 2021-02-27  3:50         ` Liu, Monk
  2021-02-27 16:07           ` Christian König
  0 siblings, 1 reply; 13+ messages in thread
From: Liu, Monk @ 2021-02-27  3:50 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx; +Cc: Zhang, Andy, Chen, Horace, Zhang, Jack (Jian)



[AMD Official Use Only - Internal Distribution Only]

The code I pasted illustrates why the innocent job is already taken out of the mirror list, so my suggested proposal won't work unless we stop deleting the job in the sched_job_timeout() routine. The problem you stated is, to my understanding, also related to my suggested solution: removing the job from the list should be handled by the driver instead of the scheduler.

Let's make the scheduler's duty clear and simple: sched_job_timeout() only gets a notification when a sched_job has timed out, but it does not judge whether the leading job in the mirror list is to be blamed; all that checking should be left to the driver to act on.

>> If we do this we should probably make it configurable as a module parameter.
That's OK; maybe we can reuse the existing "gpu_recovery" parameter and extend it as follows (a rough sketch comes after the list):

  *   0 – no recovery initiated after a job timeout

  *   1 – legacy TDR behavior

  *   2 – enhanced TDR behavior (the one suggested here)
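
For example, reusing the existing amdgpu "gpu_recovery" integer module parameter; the value 2 and the helper below are hypothetical:

/* Hypothetical extension of the existing parameter (today: -1 = auto,
 * 0 = disable, 1 = enable); the value 2 is the proposed enhanced mode. */
int amdgpu_gpu_recovery = -1;
module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
MODULE_PARM_DESC(gpu_recovery,
        "GPU recovery mode (2 = enhanced per-ring re-check (proposed), "
        "1 = legacy TDR, 0 = disabled, -1 = auto)");

/* The amdgpu TDR handler would then pick the recovery path: */
static bool amdgpu_use_enhanced_tdr(void)
{
        return amdgpu_gpu_recovery == 2;   /* hypothetical mode */
}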

From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Friday, February 26, 2021 8:05 PM
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Zhang, Andy <Andy.Zhang@amd.com>; Chen, Horace <Horace.Chen@amd.com>; Zhang, Jack (Jian) <Jack.Zhang1@amd.com>
Subject: Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported

[...]


* Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported
  2021-02-26 16:49     ` Andrey Grodzovsky
@ 2021-02-27  3:56       ` Liu, Monk
  2021-02-27  3:57         ` Liu, Monk
  2021-02-28  0:55         ` Andrey Grodzovsky
  0 siblings, 2 replies; 13+ messages in thread
From: Liu, Monk @ 2021-02-27  3:56 UTC (permalink / raw)
  To: Grodzovsky, Andrey, Koenig, Christian, amd-gfx
  Cc: Zhang, Andy, Chen, Horace, Zhang, Jack (Jian)



[AMD Official Use Only - Internal Distribution Only]

Hi Andrey,

The scenario I hit here is not the one you mentioned; let me explain it in more detail with another, much easier to understand example:

Consider that you have a job1 on KCQ, but the timeout of KCQ is 60 seconds (just as an example).
You also have a job2 on the GFX ring, and the timeout of GFX is 2 seconds.

We submit job1 first, and assume job1 has a bug that will cause a shader hang very soon.
After 10 seconds we submit job2; since KCQ takes 60 seconds to report a TDR, SW knows nothing about the engine already being hung.
After 2 more seconds we get a TDR report from job2 on the GFX ring; sched_job_timeout() thinks the leading job of the GFX ring is the black sheep, so it is deleted from the mirror list.
But in fact this job1 is innocent, and we should insert it back after recovery; because it was already deleted, this innocent job's context/process is really harmed.

Hope the above example helps.

Thanks

From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Sent: Saturday, February 27, 2021 12:50 AM
To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Zhang, Andy <Andy.Zhang@amd.com>; Chen, Horace <Horace.Chen@amd.com>; Zhang, Jack (Jian) <Jack.Zhang1@amd.com>
Subject: Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported



[...]


* Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported
  2021-02-27  3:56       ` Re: " Liu, Monk
@ 2021-02-27  3:57         ` Liu, Monk
  2021-02-28  0:55         ` Andrey Grodzovsky
  1 sibling, 0 replies; 13+ messages in thread
From: Liu, Monk @ 2021-02-27  3:57 UTC (permalink / raw)
  To: Grodzovsky, Andrey, Koenig, Christian, amd-gfx
  Cc: Zhang, Andy, Chen, Horace, Zhang, Jack (Jian)



[AMD Official Use Only - Internal Distribution Only]

Fix typo:

But in fact this job2 is innocent, and we should insert it back after recovery; because it was already deleted, this innocent job's context/process is really harmed.

From: Liu, Monk
Sent: Saturday, February 27, 2021 11:56 AM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Zhang, Andy <Andy.Zhang@amd.com>; Chen, Horace <Horace.Chen@amd.com>; Zhang, Jack (Jian) <Jack.Zhang1@amd.com>
Subject: Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported

H Andrey

The scenario I hit here is not the one you mentioned, let me explain it with more details by another much easier understood example:

Consider ring you have a job1 on KCQ, but the timeout of KCQ is 60 seconds (just for example)
You also have a job2 on GFX ring, and the timeout of GFX is 2 seconds

We submit job1 first, and assume job1 have bug and it will cause shader hang very very soon
After 10 seconds we submit job2, since KCQ have 60 seconds to report TDR thus SW know nothing about the engine already hang
After 2 seconds we got TDR report from job2 on GFX ring, sched_job_timeout() think the leading job of GFX ring is the black sheep so it is deleted from the mirror list
But in fact this job1 is innocent, and we should insert it back after recovery , and due to it was already deleted this innocent job’s context/process is really harmed

Hope above example helps

Thanks

发件人: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com<mailto:Andrey.Grodzovsky@amd.com>>
发送时间: 2021年2月27日 0:50
收件人: Liu, Monk <Monk.Liu@amd.com<mailto:Monk.Liu@amd.com>>; Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
抄送: Zhang, Andy <Andy.Zhang@amd.com<mailto:Andy.Zhang@amd.com>>; Chen, Horace <Horace.Chen@amd.com<mailto:Horace.Chen@amd.com>>; Zhang, Jack (Jian) <Jack.Zhang1@amd.com<mailto:Jack.Zhang1@amd.com>>
主题: Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported



On 2021-02-26 6:54 a.m., Liu, Monk wrote:

[AMD Official Use Only - Internal Distribution Only]

See in line

Thanks

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

From: Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>
Sent: Friday, February 26, 2021 3:58 PM
To: Liu, Monk <Monk.Liu@amd.com><mailto:Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Cc: Zhang, Andy <Andy.Zhang@amd.com><mailto:Andy.Zhang@amd.com>; Chen, Horace <Horace.Chen@amd.com><mailto:Horace.Chen@amd.com>; Zhang, Jack (Jian) <Jack.Zhang1@amd.com><mailto:Jack.Zhang1@amd.com>
Subject: Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported

Hi Monk,

in general an interesting idea, but I see two major problems with that:

1. It would make the reset take much longer.

2. Things get often stuck because of timing issues, so a guilty job might pass perfectly when run a second time.
[ML] but the innocent ring already reported a TDR, and the drm sched logic already deleted this “sched_job” in its mirror list, thus you don’t have chance to re-submit it again after reset, that’s the major problem here.



Just to confirm I understand correctly, Monk reports a scenario where the second TDR that was reported by the innocent job is bailing out BEFORE having a chance to run  drm_sched_stop for that scheduler which should have reinserted the job back into mirror list (because the first TDR run is still in progress and hence amdgpu_device_lock_adev fails for the second TDR) and so the innocent job which was extracted from mirror list in drm_sched_job_timedout is now lost.
If so and as a possible quick fix until we overhaul the entire design as suggested in this thread - maybe we can modify drm_sched_backend_ops.timedout_job callback to report back premature termination BEFORE drm_sched_stop had a chance to run and then reinsert back the job into mirror list from within  drm_sched_job_timedout? There is no problem of racing against concurrent drm_sched_get_cleanup_job once we reinsert there as we don't reference the job pointer anymore after this point and so if it's already signaled and freed right away - it's ok.

Andrey
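
For illustration, a hypothetical sketch of the callback change suggested above - letting timedout_job() report that it bailed out before drm_sched_stop() ran, so drm_sched_job_timedout() can reinsert the job itself. The enum and its values are made up here to show the idea; this is not an existing interface:

    /* Hypothetical status a driver's timeout handler could return. */
    enum drm_sched_timedout_stat {
            DRM_SCHED_TIMEDOUT_HANDLED,     /* full reset ran, job was dealt with       */
            DRM_SCHED_TIMEDOUT_BAILED,      /* bailed before drm_sched_stop(), reinsert */
    };

    struct drm_sched_backend_ops {
            /* ...existing callbacks (run_job, free_job, ...) unchanged... */
            enum drm_sched_timedout_stat (*timedout_job)(struct drm_sched_job *sched_job);
    };

    /* Corresponding sketch of the change in drm_sched_job_timedout(): */
    if (job && job->sched->ops->timedout_job(job) == DRM_SCHED_TIMEDOUT_BAILED) {
            spin_lock(&sched->job_list_lock);
            list_add(&job->node, &sched->ring_mirror_list);
            spin_unlock(&sched->job_list_lock);
    }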



Apart from that the whole ring mirror list turned out to be a really bad idea. E.g. we still struggle with object life time because the concept doesn't fit into the object model of the GPU scheduler under Linux.

We should probably work on this separately and straighten up the job destruction once more and keep the recovery information in the fence instead.
[ML] we claim to our customer that no innocent process will be dropped or cancelled, and our current logic works for the most time, but only when there are different process running on gfx/computes rings then we would run into the tricky situation I stated here, and the proposal is the only way I can figure out so far, do you have a better solution or idea we review it as another candidate RFC ? Be note that we raised this proposal is because we do hit our trouble and we do need to resolve it …. So even a not perfect solution is still better than just cancel the innocent job (and their context/process)
Thanks !

Regards,
Christian.
On 26.02.21 at 06:58, Liu, Monk wrote:

[AMD Public Use]

Hi all

NAVI2X  project hit a really hard to solve issue now, and it is turned out to be a general headache of our TDR mechanism , check below scenario:


  1.  There is a job1 running on compute1 ring at timestamp
  2.  There is a job2 running on gfx ring at timestamp
  3.  Job1 is the guilty one, and job1/job2 were scheduled to their rings at almost the same timestamp
  4.  After 2 seconds we receive two TDR reporting from both GFX ring and compute ring
  5.  Current scheme is that in drm scheduler all the head jobs of those two rings are considered “bad job” and taken away from the mirror list
  6.  The result is both the real guilty job (job1) and the innocent job (job2) were all deleted from mirror list, and their corresponding contexts were also treated as guilty (so the innocent process remains running is not secured)


But by our wish the ideal case is TDR mechanism can detect which ring is the guilty ring and the innocent ring can resubmits all its pending jobs:

  1.  Job1 to be deleted from compute1 ring’s mirror list
  2.  Job2 is kept and resubmitted later and its belonging process/context are even not aware of this TDR at all


Here I have a proposal tend to achieve above goal and it rough procedure is :

  1.  Once any ring reports a TDR, the head job is *not* treated as “bad job”, and it is *not* deleted from the mirror list in drm sched functions
  2.  In vendor’s function (our amdgpu driver here):

     *   reset GPU
     *   repeat below actions on each RINGS * one by one *:

1. take the head job and submit it on this ring

2. see if it completes, if not then this job is the real “bad job”

3.  take it away from mirror list if this head job is “bad job”

     *   After above iteration on all RINGS, we already clears all the bad job(s)

  1.  Resubmit all jobs from each mirror list to their corresponding rings (this is the existed logic)

The idea of this is to use “serial” way to re-run and re-check each head job of each RING, in order to take out the real black sheep and its guilty context.

P.S.: we can use this approaches only on GFX/KCQ ring reports TDR , since those rings are intermutually affected to each other. For SDMA ring timeout it definitely proves the head job on SDMA ring is really guilty.

Thanks

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------




_______________________________________________

amd-gfx mailing list

amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>

https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[-- Attachment #1.2: Type: text/html, Size: 26812 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported
  2021-02-27  3:50         ` Re: " Liu, Monk
@ 2021-02-27 16:07           ` Christian König
  0 siblings, 0 replies; 13+ messages in thread
From: Christian König @ 2021-02-27 16:07 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx
  Cc: Zhang, Andy, Chen, Horace, Zhang, Jack (Jian)


[-- Attachment #1.1: Type: text/plain, Size: 10781 bytes --]

On 27.02.21 at 04:50, Liu, Monk wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
>
> the code I pasted is to illustrate why the innocent job is already 
> taken out in the mirror list thus my suggested proposal won’t work 
> unless we don’t delete the job in sched_job_timeout() routine, and the 
> problem you stated is with my understanding also kind of related with 
> my suggested solution – the job removing from list should be handled 
> by driver instead of scheduler .
>

Yes, that's exactly my thinking as well.

> let make scheduler’s duty clear and simple : the sched_job_timeout() 
> only get notification when a sched_job timedout but it doesn’t judge 
>  if the leading job in mirror list should be blamed , all those 
> checking should be left to driver to take action.
>

I need to take a more detailed look, but it sounds correct as well.

> >>If we do this we should probably make it configurable as a module 
> parameter.
>
> That’s ok,  maybe we can reuse the existed parm “gpu_recovery”, extend 
> it with:
>
> l0 – no recovery initiated after job timeout
>
> l1 – legacy TDR behave
>
> l2 – enhanced TDR behave (the one suggested here)
>

Yes, something like that should work. The key point is that a couple of 
people have already suggested optimizing the reset routine so that it 
doesn't take so long.

So far I have pushed back on this because the reset routine isn't something I 
would optimize for speed. But when it starts to take something like 10 
seconds instead of half a second because you had an extra long-running 
compute job, we will certainly see complaints.
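
For what it is worth, a minimal sketch of what that could look like if the existing gpu_recovery integer parameter is simply taught one more value (the mode numbering follows the proposal above and is not an existing interface):

    /*
     * Sketch only: reuse the existing amdgpu.gpu_recovery module parameter and
     * add a third mode for the enhanced, per-ring re-check TDR behaviour.
     * -1 = auto, 0 = disabled, 1 = legacy TDR, 2 = enhanced TDR (hypothetical)
     */
    int amdgpu_gpu_recovery = -1;
    module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
    MODULE_PARM_DESC(gpu_recovery,
                     "Enable GPU recovery mechanism (2 = enhanced per-ring re-check, 1 = legacy, 0 = disable, -1 = auto)");

The recovery path would then gate the serial per-ring re-check on gpu_recovery == 2, so the longer reset only affects users who opted in.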

Regards,
Christian.

> *From:*Koenig, Christian <Christian.Koenig@amd.com>
> *Sent:*February 26, 2021 20:05
> *To:*Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> *Cc:*Zhang, Andy <Andy.Zhang@amd.com>; Chen, Horace 
> <Horace.Chen@amd.com>; Zhang, Jack (Jian) <Jack.Zhang1@amd.com>
> *Subject:*Re: [RFC] a new approach to detect which ring is the real black 
> sheep upon TDR reported
>
> Yeah that is exactly the stuff which doesn't works at all. We got 
> feedback for multiple people that this whole approach of tying the job 
> to the tdr was not a good idea at all.
>
> What we should do instead is to have a pointer in the scheduler fence 
> to which job it belongs. Freeing up the job when the scheduler fence 
> is signaled is then job of the driver and not the scheduler any more.
>
> The scheduler then gives the scheduler fence to the driver when a 
> timeout is detected and the driver can do the rest of the handling all 
> by itself.
>
> But this problem is orthogonal to the suggested solution here.
>
>
>     do you have a better solution or idea we review it as another
>     candidate RFC ?
>
>
> I don't see much other option either. We could do something like only 
> allowing one application at a time to use the gfx/compute block, but 
> that would be even worse.
>
> If we do this we should probably make it configurable as a module 
> parameter.
>
> Regards,
> Christian.
>
> On 26.02.21 at 12:57, Liu, Monk wrote:
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     static void drm_sched_job_timedout(struct work_struct *work)
>     {
>             struct drm_gpu_scheduler *sched;
>             struct drm_sched_job *job;
>
>             sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>
>             /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
>             spin_lock(&sched->job_list_lock);
>             job = list_first_entry_or_null(&sched->ring_mirror_list,
>                                            struct drm_sched_job, node);
>
>             if (job) {
>                     /*
>                      * Remove the bad job so it cannot be freed by concurrent
>                      * drm_sched_cleanup_jobs. It will be reinserted back after
>                      * sched->thread is parked at which point it's safe.
>                      */
>                     list_del_init(&job->node);
>                     spin_unlock(&sched->job_list_lock);
>
>                     job->sched->ops->timedout_job(job);
>
>     Thanks
>
>     ------------------------------------------
>
>     Monk Liu | Cloud-GPU Core team
>
>     ------------------------------------------
>
>     *From:*Liu, Monk
>     *Sent:* Friday, February 26, 2021 7:54 PM
>     *To:* Koenig, Christian <Christian.Koenig@amd.com>
>     <mailto:Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>     <mailto:amd-gfx@lists.freedesktop.org>
>     *Cc:* Zhang, Andy <Andy.Zhang@amd.com>
>     <mailto:Andy.Zhang@amd.com>; Chen, Horace <Horace.Chen@amd.com>
>     <mailto:Horace.Chen@amd.com>; Zhang, Jack (Jian)
>     <Jack.Zhang1@amd.com> <mailto:Jack.Zhang1@amd.com>
>     *Subject:* RE: [RFC] a new approach to detect which ring is the
>     real black sheep upon TDR reported
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     See in line
>
>     Thanks
>
>     ------------------------------------------
>
>     Monk Liu | Cloud-GPU Core team
>
>     ------------------------------------------
>
>     *From:*Koenig, Christian <Christian.Koenig@amd.com
>     <mailto:Christian.Koenig@amd.com>>
>     *Sent:* Friday, February 26, 2021 3:58 PM
>     *To:* Liu, Monk <Monk.Liu@amd.com <mailto:Monk.Liu@amd.com>>;
>     amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>     *Cc:* Zhang, Andy <Andy.Zhang@amd.com
>     <mailto:Andy.Zhang@amd.com>>; Chen, Horace <Horace.Chen@amd.com
>     <mailto:Horace.Chen@amd.com>>; Zhang, Jack (Jian)
>     <Jack.Zhang1@amd.com <mailto:Jack.Zhang1@amd.com>>
>     *Subject:* Re: [RFC] a new approach to detect which ring is the
>     real black sheep upon TDR reported
>
>     Hi Monk,
>
>     in general an interesting idea, but I see two major problems with
>     that:
>
>     1. It would make the reset take much longer.
>
>     2. Things get often stuck because of timing issues, so a guilty
>     job might pass perfectly when run a second time.
>
>     [ML] but the innocent ring already reported a TDR, and the drm
>     sched logic already deleted this “sched_job” in its mirror list,
>     thus you don’t have chance to re-submit it again after reset,
>     that’s the major problem here.
>
>     Apart from that the whole ring mirror list turned out to be a
>     really bad idea. E.g. we still struggle with object life time
>     because the concept doesn't fit into the object model of the GPU
>     scheduler under Linux.
>
>     We should probably work on this separately and straighten up the
>     job destruction once more and keep the recovery information in the
>     fence instead.
>
>     [ML] we claim to our customer that no innocent process will be
>     dropped or cancelled, and our current logic works for the most
>     time, but only when there are different process running on
>     gfx/computes rings then we would run into the tricky situation I
>     stated here, and the proposal is the only way I can figure out so
>     far, do you have a better solution or idea we review it as another
>     candidate RFC ? Be note that we raised this proposal is because we
>     do hit our trouble and we do need to resolve it …. So even a not
>     perfect solution is still better than just cancel the innocent job
>     (and their context/process)
>
>     Thanks !
>
>
>     Regards,
>     Christian.
>
>     On 26.02.21 at 06:58, Liu, Monk wrote:
>
>         [AMD Public Use]
>
>         Hi all
>
>         NAVI2X  project hit a really hard to solve issue now, and it
>         is turned out to be a general headache of our TDR mechanism ,
>         check below scenario:
>
>          1. There is a job1 running on compute1 ring at timestamp
>          2. There is a job2 running on gfx ring at timestamp
>          3. Job1 is the guilty one, and job1/job2 were scheduled to
>             their rings at almost the same timestamp
>          4. After 2 seconds we receive two TDR reporting from both GFX
>             ring and compute ring
>          5. *Current scheme is that in drm scheduler all the head jobs
>             of those two rings are considered “bad job” and taken away
>             from the mirror list *
>          6. The result is both the real guilty job (job1) and the
>             innocent job (job2) were all deleted from mirror list, and
>             their corresponding contexts were also treated as
>             guilty*(so the innocent process remains running is not
>             secured)*
>
>
>         But by our wish the ideal case is TDR mechanism can detect
>         which ring is the guilty ring and the innocent ring can
>         resubmits all its pending jobs:
>
>          1. Job1 to be deleted from compute1 ring’s mirror list
>          2. Job2 is kept and resubmitted later and its belonging
>             process/context are even not aware of this TDR at all
>
>         Here I have a proposal tend to achieve above goal and it rough
>         procedure is :
>
>          1. Once any ring reports a TDR, the head job is **not**
>             treated as “bad job”, and it is **not** deleted from the
>             mirror list in drm sched functions
>          2. In vendor’s function (our amdgpu driver here):
>
>               * reset GPU
>               * repeat below actions on each RINGS * one by one *:
>
>         1.take the head job and submit it on this ring
>
>         2.see if it completes, if not then this job is the real “bad job”
>
>         3. take it away from mirror list if this head job is “bad job”
>
>               * After above iteration on all RINGS, we already clears
>                 all the bad job(s)
>
>          3. Resubmit all jobs from each mirror list to their
>             corresponding rings (this is the existed logic)
>
>         The idea of this is to use “serial” way to re-run and re-check
>         each head job of each RING, in order to take out the real
>         black sheep and its guilty context.
>
>         P.S.: we can use this approaches only on GFX/KCQ ring reports
>         TDR , since those rings are intermutually affected to each
>         other. For SDMA ring timeout it definitely proves the head job
>         on SDMA ring is really guilty.
>
>         Thanks
>
>         ------------------------------------------
>
>         Monk Liu | Cloud-GPU Core team
>
>         ------------------------------------------
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[-- Attachment #1.2: Type: text/html, Size: 40500 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported
  2021-02-27  3:56       ` Re: " Liu, Monk
  2021-02-27  3:57         ` Liu, Monk
@ 2021-02-28  0:55         ` Andrey Grodzovsky
  2021-02-28  4:10           ` Re: " Liu, Monk
  1 sibling, 1 reply; 13+ messages in thread
From: Andrey Grodzovsky @ 2021-02-28  0:55 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx
  Cc: Zhang, Andy, Chen, Horace, Zhang, Jack (Jian)


[-- Attachment #1.1: Type: text/plain, Size: 8916 bytes --]


On 2021-02-26 10:56 p.m., Liu, Monk wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
>
> H Andrey
>
> The scenario I hit here is not the one you mentioned, let me explain 
> it with more details by another much easier understood example:
>
> Consider ring you have a job1 on KCQ, but the timeout of KCQ is 60 
> seconds (just for example)
>
> You also have a job2 on GFX ring, and the timeout of GFX is 2 seconds
>
> We submit job1 first, and assume job1 have bug and it will cause 
> shader hang very very soon
>
> After 10 seconds we submit job2, since KCQ have 60 seconds to report 
> TDR thus SW know nothing about the engine already hang
>

So the gfx job hangs because it has a dependency on the buggy compute job which 
is already hanging?


> After 2 seconds we got TDR report from job2 on GFX ring, 
> sched_job_timeout() think the leading job of GFX ring is the black 
> sheep so it is deleted from the mirror list
>
> But in fact this job1 is innocent, and we should insert it back after 
> recovery , and due to it was already deleted this innocent job’s 
> context/process is really harmed
>

I am still missing something - we don't ever delete bad jobs or any jobs 
until they are signaled; we reinsert the bad job back into the mirror list 
in drm_sched_stop
(here - 
https://elixir.bootlin.com/linux/v5.11.1/source/drivers/gpu/drm/scheduler/sched_main.c#L385) 
after the sched thread is stopped, and then continue with the reset procedure.
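
For reference, a paraphrased sketch of the reinsertion being pointed at (simplified from the 5.11-era drm_sched_stop(), not a verbatim copy):

    void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
    {
            kthread_park(sched->thread);

            /*
             * Reinsert the bad job that drm_sched_job_timedout() took off the
             * list - now it is safe, as drm_sched_get_cleanup_job() cannot run
             * again until the scheduler thread is unparked.
             */
            if (bad && bad->sched == sched)
                    list_add(&bad->node, &sched->ring_mirror_list);

            /* ...teardown of the remaining pending jobs follows here... */
    }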

Andrey


> Hope above example helps
>
> Thanks
>
> *From:*Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> *Sent:*February 27, 2021 0:50
> *To:*Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> *Cc:*Zhang, Andy <Andy.Zhang@amd.com>; Chen, Horace 
> <Horace.Chen@amd.com>; Zhang, Jack (Jian) <Jack.Zhang1@amd.com>
> *Subject:*Re: [RFC] a new approach to detect which ring is the real black 
> sheep upon TDR reported
>
> On 2021-02-26 6:54 a.m., Liu, Monk wrote:
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     See in line
>
>     Thanks
>
>     ------------------------------------------
>
>     Monk Liu | Cloud-GPU Core team
>
>     ------------------------------------------
>
>     *From:*Koenig, Christian <Christian.Koenig@amd.com>
>     <mailto:Christian.Koenig@amd.com>
>     *Sent:* Friday, February 26, 2021 3:58 PM
>     *To:* Liu, Monk <Monk.Liu@amd.com> <mailto:Monk.Liu@amd.com>;
>     amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>     *Cc:* Zhang, Andy <Andy.Zhang@amd.com>
>     <mailto:Andy.Zhang@amd.com>; Chen, Horace <Horace.Chen@amd.com>
>     <mailto:Horace.Chen@amd.com>; Zhang, Jack (Jian)
>     <Jack.Zhang1@amd.com> <mailto:Jack.Zhang1@amd.com>
>     *Subject:* Re: [RFC] a new approach to detect which ring is the
>     real black sheep upon TDR reported
>
>     Hi Monk,
>
>     in general an interesting idea, but I see two major problems with
>     that:
>
>     1. It would make the reset take much longer.
>
>     2. Things get often stuck because of timing issues, so a guilty
>     job might pass perfectly when run a second time.
>
>     [ML] but the innocent ring already reported a TDR, and the drm
>     sched logic already deleted this “sched_job” in its mirror list,
>     thus you don’t have chance to re-submit it again after reset,
>     that’s the major problem here.
>
> Just to confirm I understand correctly, Monk reports a scenario where 
> the second TDR that was reported by the innocent job is bailing out 
> BEFORE having a chance to run  drm_sched_stop for that scheduler which 
> should have reinserted the job back into mirror list (because the 
> first TDR run is still in progress and hence amdgpu_device_lock_adev 
> fails for the second TDR) and so the innocent job which was extracted 
> from mirror list in drm_sched_job_timedout is now lost.
> If so and as a possible quick fix until we overhaul the entire design 
> as suggested in this thread - maybe we can modify 
> drm_sched_backend_ops.timedout_job callback to report back premature 
> termination BEFORE drm_sched_stop had a chance to run and then 
> reinsert back the job into mirror list from within  
> drm_sched_job_timedout? There is no problem of racing against 
> concurrent drm_sched_get_cleanup_job once we reinsert there as we 
> don't reference the job pointer anymore after this point and so if 
> it's already signaled and freed right away - it's ok.
>
> Andrey
>
>
>     Apart from that the whole ring mirror list turned out to be a
>     really bad idea. E.g. we still struggle with object life time
>     because the concept doesn't fit into the object model of the GPU
>     scheduler under Linux.
>
>     We should probably work on this separately and straighten up the
>     job destruction once more and keep the recovery information in the
>     fence instead.
>
>     [ML] we claim to our customer that no innocent process will be
>     dropped or cancelled, and our current logic works for the most
>     time, but only when there are different process running on
>     gfx/computes rings then we would run into the tricky situation I
>     stated here, and the proposal is the only way I can figure out so
>     far, do you have a better solution or idea we review it as another
>     candidate RFC ? Be note that we raised this proposal is because we
>     do hit our trouble and we do need to resolve it …. So even a not
>     perfect solution is still better than just cancel the innocent job
>     (and their context/process)
>
>     Thanks !
>
>
>     Regards,
>     Christian.
>
>     On 26.02.21 at 06:58, Liu, Monk wrote:
>
>         [AMD Public Use]
>
>         Hi all
>
>         NAVI2X  project hit a really hard to solve issue now, and it
>         is turned out to be a general headache of our TDR mechanism ,
>         check below scenario:
>
>          1. There is a job1 running on compute1 ring at timestamp
>          2. There is a job2 running on gfx ring at timestamp
>          3. Job1 is the guilty one, and job1/job2 were scheduled to
>             their rings at almost the same timestamp
>          4. After 2 seconds we receive two TDR reporting from both GFX
>             ring and compute ring
>          5. *Current scheme is that in drm scheduler all the head jobs
>             of those two rings are considered “bad job” and taken away
>             from the mirror list *
>          6. The result is both the real guilty job (job1) and the
>             innocent job (job2) were all deleted from mirror list, and
>             their corresponding contexts were also treated as
>             guilty*(so the innocent process remains running is not
>             secured)*
>
>
>         But by our wish the ideal case is TDR mechanism can detect
>         which ring is the guilty ring and the innocent ring can
>         resubmits all its pending jobs:
>
>          1. Job1 to be deleted from compute1 ring’s mirror list
>          2. Job2 is kept and resubmitted later and its belonging
>             process/context are even not aware of this TDR at all
>
>         Here I have a proposal tend to achieve above goal and it rough
>         procedure is :
>
>          1. Once any ring reports a TDR, the head job is **not**
>             treated as “bad job”, and it is **not** deleted from the
>             mirror list in drm sched functions
>          2. In vendor’s function (our amdgpu driver here):
>
>               * reset GPU
>               * repeat below actions on each RINGS * one by one *:
>
>         1.take the head job and submit it on this ring
>
>         2.see if it completes, if not then this job is the real “bad job”
>
>         3. take it away from mirror list if this head job is “bad job”
>
>               * After above iteration on all RINGS, we already clears
>                 all the bad job(s)
>
>          3. Resubmit all jobs from each mirror list to their
>             corresponding rings (this is the existed logic)
>
>         The idea of this is to use “serial” way to re-run and re-check
>         each head job of each RING, in order to take out the real
>         black sheep and its guilty context.
>
>         P.S.: we can use this approaches only on GFX/KCQ ring reports
>         TDR , since those rings are intermutually affected to each
>         other. For SDMA ring timeout it definitely proves the head job
>         on SDMA ring is really guilty.
>
>         Thanks
>
>         ------------------------------------------
>
>         Monk Liu | Cloud-GPU Core team
>
>         ------------------------------------------
>
>
>
>     _______________________________________________
>
>     amd-gfx mailing list
>
>     amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>
>     https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>

[-- Attachment #1.2: Type: text/html, Size: 30646 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported
  2021-02-28  0:55         ` Andrey Grodzovsky
@ 2021-02-28  4:10           ` Liu, Monk
  2021-02-28 13:59             ` Andrey Grodzovsky
  0 siblings, 1 reply; 13+ messages in thread
From: Liu, Monk @ 2021-02-28  4:10 UTC (permalink / raw)
  To: Grodzovsky, Andrey, Koenig, Christian, amd-gfx
  Cc: Zhang, Andy, Chen, Horace, Zhang, Jack (Jian)


[-- Attachment #1.1: Type: text/plain, Size: 9927 bytes --]

[AMD Official Use Only - Internal Distribution Only]


>> So the gfx job hangs because it has a dependency on the buggy compute job which is already hanging?
No, there is no dependency between this gfx job and that compute job from a software perspective, but the CUs are shared, so the gfx job is affected by the bug in that compute job.

>> I am still missing something - we don't ever delete bad jobs or any jobs until they are signaled; we reinsert the bad job back into the mirror list in drm_sched_stop
Oh yeah, it is still kept in the mirror list; I thought it was removed for good in the scheduler…
Then my question is: why do we need to remove it in the scheduler at all if we always need to reinsert it back? And even for other vendors the better way is still
to let the vendor driver decide which head job is really at fault.

The real issue we hit is: sometimes when we run a quark test (it hangs the KCQ ring with a bad shader inside), the X server will occasionally crash with a GFX ring TDR report.
The root cause is still what I described before: both the innocent gfx job and the guilty compute job are marked as “guilty” by our driver, so even though they are re-inserted back into the mirror list,
they are both abandoned in drm_sched_resubmit_jobs() because they were both processed by drm_sched_increase_karma().
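
For context, a simplified sketch of the check being referred to, paraphrasing the loop in the 5.11-era drm_sched_resubmit_jobs() (error handling, fence bookkeeping and the guilty-context tracking are condensed into comments):

    list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
            struct drm_sched_fence *s_fence = s_job->s_fence;

            /*
             * A job whose karma was pushed over the hang limit by
             * drm_sched_increase_karma() (and every later job from the same
             * context) gets its finished fence poisoned with -ECANCELED
             * instead of being truly re-run; the driver's run_job() then
             * skips scheduling its IBs, so the job is effectively abandoned.
             */
            if (atomic_read(&s_job->karma) > sched->hang_limit)
                    dma_fence_set_error(&s_fence->finished, -ECANCELED);

            fence = sched->ops->run_job(s_job);
            /* ... */
    }

Since each of the two rings reported its own TDR, both head jobs have been through drm_sched_increase_karma(), so both end up cancelled here - which lines up with the X server crash described above.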


From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Sent: February 28, 2021 8:55
To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Zhang, Andy <Andy.Zhang@amd.com>; Chen, Horace <Horace.Chen@amd.com>; Zhang, Jack (Jian) <Jack.Zhang1@amd.com>
Subject: Re: Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported



On 2021-02-26 10:56 p.m., Liu, Monk wrote:

[AMD Official Use Only - Internal Distribution Only]

H Andrey

The scenario I hit here is not the one you mentioned, let me explain it with more details by another much easier understood example:

Consider ring you have a job1 on KCQ, but the timeout of KCQ is 60 seconds (just for example)
You also have a job2 on GFX ring, and the timeout of GFX is 2 seconds

We submit job1 first, and assume job1 have bug and it will cause shader hang very very soon
After 10 seconds we submit job2, since KCQ have 60 seconds to report TDR thus SW know nothing about the engine already hang



So gfx job hangs because it has a dependency on buggy compute job which already is hanging ?


After 2 seconds we got TDR report from job2 on GFX ring, sched_job_timeout() think the leading job of GFX ring is the black sheep so it is deleted from the mirror list
But in fact this job1 is innocent, and we should insert it back after recovery , and due to it was already deleted this innocent job’s context/process is really harmed



I am still missing something - we don't ever delete bad jobs or any jobs until they are signaled, we reinsert the bad  job back into mirror list in drm_sched_stop
(here - https://elixir.bootlin.com/linux/v5.11.1/source/drivers/gpu/drm/scheduler/sched_main.c#L385) after sched thread is stopped and continue with the reset procedure.

Andrey



Hope above example helps

Thanks

From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com><mailto:Andrey.Grodzovsky@amd.com>
Sent: February 27, 2021 0:50
To: Liu, Monk <Monk.Liu@amd.com><mailto:Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Cc: Zhang, Andy <Andy.Zhang@amd.com><mailto:Andy.Zhang@amd.com>; Chen, Horace <Horace.Chen@amd.com><mailto:Horace.Chen@amd.com>; Zhang, Jack (Jian) <Jack.Zhang1@amd.com><mailto:Jack.Zhang1@amd.com>
Subject: Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported



On 2021-02-26 6:54 a.m., Liu, Monk wrote:

[AMD Official Use Only - Internal Distribution Only]

See in line

Thanks

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

From: Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>
Sent: Friday, February 26, 2021 3:58 PM
To: Liu, Monk <Monk.Liu@amd.com><mailto:Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Cc: Zhang, Andy <Andy.Zhang@amd.com><mailto:Andy.Zhang@amd.com>; Chen, Horace <Horace.Chen@amd.com><mailto:Horace.Chen@amd.com>; Zhang, Jack (Jian) <Jack.Zhang1@amd.com><mailto:Jack.Zhang1@amd.com>
Subject: Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported

Hi Monk,

in general an interesting idea, but I see two major problems with that:

1. It would make the reset take much longer.

2. Things get often stuck because of timing issues, so a guilty job might pass perfectly when run a second time.
[ML] but the innocent ring already reported a TDR, and the drm sched logic already deleted this “sched_job” in its mirror list, thus you don’t have chance to re-submit it again after reset, that’s the major problem here.



Just to confirm I understand correctly, Monk reports a scenario where the second TDR that was reported by the innocent job is bailing out BEFORE having a chance to run  drm_sched_stop for that scheduler which should have reinserted the job back into mirror list (because the first TDR run is still in progress and hence amdgpu_device_lock_adev fails for the second TDR) and so the innocent job which was extracted from mirror list in drm_sched_job_timedout is now lost.
If so and as a possible quick fix until we overhaul the entire design as suggested in this thread - maybe we can modify drm_sched_backend_ops.timedout_job callback to report back premature termination BEFORE drm_sched_stop had a chance to run and then reinsert back the job into mirror list from within  drm_sched_job_timedout? There is no problem of racing against concurrent drm_sched_get_cleanup_job once we reinsert there as we don't reference the job pointer anymore after this point and so if it's already signaled and freed right away - it's ok.

Andrey



Apart from that the whole ring mirror list turned out to be a really bad idea. E.g. we still struggle with object life time because the concept doesn't fit into the object model of the GPU scheduler under Linux.

We should probably work on this separately and straighten up the job destruction once more and keep the recovery information in the fence instead.
[ML] we claim to our customer that no innocent process will be dropped or cancelled, and our current logic works for the most time, but only when there are different process running on gfx/computes rings then we would run into the tricky situation I stated here, and the proposal is the only way I can figure out so far, do you have a better solution or idea we review it as another candidate RFC ? Be note that we raised this proposal is because we do hit our trouble and we do need to resolve it …. So even a not perfect solution is still better than just cancel the innocent job (and their context/process)
Thanks !

Regards,
Christian.
On 26.02.21 at 06:58, Liu, Monk wrote:

[AMD Public Use]

Hi all

NAVI2X  project hit a really hard to solve issue now, and it is turned out to be a general headache of our TDR mechanism , check below scenario:


  1.  There is a job1 running on compute1 ring at timestamp
  2.  There is a job2 running on gfx ring at timestamp
  3.  Job1 is the guilty one, and job1/job2 were scheduled to their rings at almost the same timestamp
  4.  After 2 seconds we receive two TDR reporting from both GFX ring and compute ring
  5.  Current scheme is that in drm scheduler all the head jobs of those two rings are considered “bad job” and taken away from the mirror list
  6.  The result is both the real guilty job (job1) and the innocent job (job2) were all deleted from mirror list, and their corresponding contexts were also treated as guilty (so the innocent process remains running is not secured)


But by our wish the ideal case is TDR mechanism can detect which ring is the guilty ring and the innocent ring can resubmits all its pending jobs:

  1.  Job1 to be deleted from compute1 ring’s mirror list
  2.  Job2 is kept and resubmitted later and its belonging process/context are even not aware of this TDR at all


Here I have a proposal tend to achieve above goal and it rough procedure is :

  1.  Once any ring reports a TDR, the head job is *not* treated as “bad job”, and it is *not* deleted from the mirror list in drm sched functions
  2.  In vendor’s function (our amdgpu driver here):

     *   reset GPU
     *   repeat below actions on each RINGS * one by one *:

1. take the head job and submit it on this ring

2. see if it completes, if not then this job is the real “bad job”

3.  take it away from mirror list if this head job is “bad job”

     *   After above iteration on all RINGS, we already clears all the bad job(s)

  1.  Resubmit all jobs from each mirror list to their corresponding rings (this is the existed logic)

The idea of this is to use “serial” way to re-run and re-check each head job of each RING, in order to take out the real black sheep and its guilty context.

P.S.: we can use this approaches only on GFX/KCQ ring reports TDR , since those rings are intermutually affected to each other. For SDMA ring timeout it definitely proves the head job on SDMA ring is really guilty.

Thanks

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------






_______________________________________________

amd-gfx mailing list

amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>

https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[-- Attachment #1.2: Type: text/html, Size: 30487 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Re: Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported
  2021-02-28  4:10           ` Re: " Liu, Monk
@ 2021-02-28 13:59             ` Andrey Grodzovsky
  0 siblings, 0 replies; 13+ messages in thread
From: Andrey Grodzovsky @ 2021-02-28 13:59 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx
  Cc: Zhang, Andy, Chen, Horace, Zhang, Jack (Jian)


[-- Attachment #1.1: Type: text/plain, Size: 12775 bytes --]


On 2021-02-27 11:10 p.m., Liu, Monk wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
>
> >>So gfx job hangs because it has a dependency on buggy compute job 
> which already is hanging ?
>
> No, there is no dependency between this gfx job and that compute job 
> from a software perspective , but the CU is shared thus gfx is 
> affected by the bug from that compute job
>
> >>I am still missing something - we don't ever delete bad jobs or any 
> jobs until they are signaled, we reinsert the bad  job back into 
> mirror list in drm_sched_stop
>
> Oh yeah, it was still kept in the mirror list, I thought it was 
> removed once for good in scheduler…
>
> then my question is why we need to remove it in scheduler part if we 
> always need to reinsert it back?
>

See the explanation in the original fix, commit 'drm/scheduler: Avoid 
accessing freed bad job.' - the problem with that fix was that, while it 
solved the original race issue, it created another one: if the driver 
prematurely terminates the reset process, either because the guilty job 
has already signaled (an optimization, like we have in non-amdgpu 
drivers) or because of reset lock contention from multiple TDR threads 
(like we have in amdgpu), then we do remove the bad job but never 
reinsert it, since we skip drm_sched_stop. For that issue Luben proposed 
a state machine approach to the entire job life cycle handling (can't 
find the patch set now), but during its review it was decided that the 
optimal approach would be to stop relying on the job and start relying 
on the entity->finish_fence to keep all the info (what Christian 
mentions at the beginning of this thread).


> And even for other vendors the better way is still
>
> let vendor driver decide the heading job.
>
> The real issue we hit is : sometimes if we run a quark test (it hangs 
> kcq ring with a bad shader inside), X server will occasionally crash 
> with a GFX ring TDR report
>
> Root cause is still what I described before:  both this innocent gfx 
> job and the guilty compute job are all marked as “guilty” by our 
> driver, so even they are re-inserted back to mirror list
>
> But they are all abandoned in drm_sched_resubmit_jobs() due to they 
> are all processed by drm_sched_increase_karma()
>

I see now; in this case the main issue is indeed that we cannot rely on 
the head job in the mirror list being the actual bad and guilty job, and this 
then requires some redesign (e.g. along the lines of what you suggested).

Andrey


> *From:*Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> *Sent:*February 28, 2021 8:55
> *To:*Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> *Cc:*Zhang, Andy <Andy.Zhang@amd.com>; Chen, Horace 
> <Horace.Chen@amd.com>; Zhang, Jack (Jian) <Jack.Zhang1@amd.com>
> *Subject:*Re: Re: [RFC] a new approach to detect which ring is the real 
> black sheep upon TDR reported
>
> On 2021-02-26 10:56 p.m., Liu, Monk wrote:
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     H Andrey
>
>     The scenario I hit here is not the one you mentioned, let me
>     explain it with more details by another much easier understood
>     example:
>
>     Consider ring you have a job1 on KCQ, but the timeout of KCQ is 60
>     seconds (just for example)
>
>     You also have a job2 on GFX ring, and the timeout of GFX is 2 seconds
>
>     We submit job1 first, and assume job1 have bug and it will cause
>     shader hang very very soon
>
>     After 10 seconds we submit job2, since KCQ have 60 seconds to
>     report TDR thus SW know nothing about the engine already hang
>
> So gfx job hangs because it has a dependency on buggy compute job 
> which already is hanging ?
>
>     After 2 seconds we got TDR report from job2 on GFX ring,
>     sched_job_timeout() think the leading job of GFX ring is the black
>     sheep so it is deleted from the mirror list
>
>     But in fact this job1 is innocent, and we should insert it back
>     after recovery , and due to it was already deleted this innocent
>     job’s context/process is really harmed
>
> I am still missing something - we don't ever delete bad jobs or any 
> jobs until they are signaled, we reinsert the bad  job back into 
> mirror list in drm_sched_stop
> (here - 
> https://elixir.bootlin.com/linux/v5.11.1/source/drivers/gpu/drm/scheduler/sched_main.c#L385) 
> after sched thread is stopped and continue with the reset procedure.
>
> Andrey
>
>     Hope above example helps
>
>     Thanks
>
>     *From:*Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>     <mailto:Andrey.Grodzovsky@amd.com>
>     *Sent:*February 27, 2021 0:50
>     *To:*Liu, Monk <Monk.Liu@amd.com> <mailto:Monk.Liu@amd.com>;
>     Koenig, Christian <Christian.Koenig@amd.com>
>     <mailto:Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>     <mailto:amd-gfx@lists.freedesktop.org>
>     *Cc:*Zhang, Andy <Andy.Zhang@amd.com> <mailto:Andy.Zhang@amd.com>;
>     Chen, Horace <Horace.Chen@amd.com> <mailto:Horace.Chen@amd.com>;
>     Zhang, Jack (Jian) <Jack.Zhang1@amd.com> <mailto:Jack.Zhang1@amd.com>
>     *Subject:*Re: [RFC] a new approach to detect which ring is the real
>     black sheep upon TDR reported
>
>     On 2021-02-26 6:54 a.m., Liu, Monk wrote:
>
>         [AMD Official Use Only - Internal Distribution Only]
>
>         See in line
>
>         Thanks
>
>         ------------------------------------------
>
>         Monk Liu | Cloud-GPU Core team
>
>         ------------------------------------------
>
>         *From:*Koenig, Christian <Christian.Koenig@amd.com>
>         <mailto:Christian.Koenig@amd.com>
>         *Sent:* Friday, February 26, 2021 3:58 PM
>         *To:* Liu, Monk <Monk.Liu@amd.com> <mailto:Monk.Liu@amd.com>;
>         amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>         *Cc:* Zhang, Andy <Andy.Zhang@amd.com>
>         <mailto:Andy.Zhang@amd.com>; Chen, Horace
>         <Horace.Chen@amd.com> <mailto:Horace.Chen@amd.com>; Zhang,
>         Jack (Jian) <Jack.Zhang1@amd.com> <mailto:Jack.Zhang1@amd.com>
>         *Subject:* Re: [RFC] a new approach to detect which ring is
>         the real black sheep upon TDR reported
>
>         Hi Monk,
>
>         in general an interesting idea, but I see two major problems
>         with that:
>
>         1. It would make the reset take much longer.
>
>         2. Things get often stuck because of timing issues, so a
>         guilty job might pass perfectly when run a second time.
>
>         [ML] but the innocent ring already reported a TDR, and the drm
>         sched logic already deleted this “sched_job” in its mirror
>         list, thus you don’t have chance to re-submit it again after
>         reset, that’s the major problem here.
>
>     Just to confirm I understand correctly, Monk reports a scenario
>     where the second TDR that was reported by the innocent job is
>     bailing out BEFORE having a chance to run  drm_sched_stop for that
>     scheduler which should have reinserted the job back into mirror
>     list (because the first TDR run is still in progress and hence
>     amdgpu_device_lock_adev fails for the second TDR) and so the
>     innocent job which was extracted from mirror list in
>     drm_sched_job_timedout is now lost.
>     If so and as a possible quick fix until we overhaul the entire
>     design as suggested in this thread - maybe we can modify
>     drm_sched_backend_ops.timedout_job callback to report back
>     premature termination BEFORE drm_sched_stop had a chance to run
>     and then reinsert back the job into mirror list from within
>     drm_sched_job_timedout? There is no problem of racing against
>     concurrent drm_sched_get_cleanup_job once we reinsert there as we
>     don't reference the job pointer anymore after this point and so if
>     it's already signaled and freed right away - it's ok.
>
>     Andrey
>
>
>         Apart from that the whole ring mirror list turned out to be a
>         really bad idea. E.g. we still struggle with object life time
>         because the concept doesn't fit into the object model of the
>         GPU scheduler under Linux.
>
>         We should probably work on this separately and straighten up
>         the job destruction once more and keep the recovery
>         information in the fence instead.
>
>         [ML] we claim to our customer that no innocent process will be
>         dropped or cancelled, and our current logic works for the most
>         time, but only when there are different process running on
>         gfx/computes rings then we would run into the tricky situation
>         I stated here, and the proposal is the only way I can figure
>         out so far, do you have a better solution or idea we review it
>         as another candidate RFC ? Be note that we raised this
>         proposal is because we do hit our trouble and we do need to
>         resolve it …. So even a not perfect solution is still better
>         than just cancel the innocent job (and their context/process)
>
>         Thanks !
>
>
>         Regards,
>         Christian.
>
>         On 26.02.21 at 06:58, Liu, Monk wrote:
>
>             [AMD Public Use]
>
>             Hi all
>
>             NAVI2X  project hit a really hard to solve issue now, and
>             it is turned out to be a general headache of our TDR
>             mechanism , check below scenario:
>
>              1. There is a job1 running on compute1 ring at timestamp
>              2. There is a job2 running on gfx ring at timestamp
>              3. Job1 is the guilty one, and job1/job2 were scheduled
>                 to their rings at almost the same timestamp
>              4. After 2 seconds we receive two TDR reporting from both
>                 GFX ring and compute ring
>              5. *Current scheme is that in drm scheduler all the head
>                 jobs of those two rings are considered “bad job” and
>                 taken away from the mirror list *
>              6. The result is both the real guilty job (job1) and the
>                 innocent job (job2) were all deleted from mirror list,
>                 and their corresponding contexts were also treated as
>                 guilty*(so the innocent process remains running is not
>                 secured)*
>
>
>             But by our wish the ideal case is TDR mechanism can detect
>             which ring is the guilty ring and the innocent ring can
>             resubmits all its pending jobs:
>
>              1. Job1 to be deleted from compute1 ring’s mirror list
>              2. Job2 is kept and resubmitted later and its belonging
>                 process/context are even not aware of this TDR at all
>
>             Here I have a proposal tend to achieve above goal and it
>             rough procedure is :
>
>              1. Once any ring reports a TDR, the head job is **not**
>                 treated as “bad job”, and it is **not** deleted from
>                 the mirror list in drm sched functions
>              2. In vendor’s function (our amdgpu driver here):
>
>                   * reset GPU
>                   * repeat below actions on each RINGS * one by one *:
>
>             1.take the head job and submit it on this ring
>
>             2.see if it completes, if not then this job is the real
>             “bad job”
>
>             3. take it away from mirror list if this head job is “bad job”
>
>                   * After above iteration on all RINGS, we already
>                     clears all the bad job(s)
>
>              3. Resubmit all jobs from each mirror list to their
>                 corresponding rings (this is the existed logic)
>
>             The idea of this is to use “serial” way to re-run and
>             re-check each head job of each RING, in order to take out
>             the real black sheep and its guilty context.
>
>             P.S.: we can use this approaches only on GFX/KCQ ring
>             reports TDR , since those rings are intermutually affected
>             to each other. For SDMA ring timeout it definitely proves
>             the head job on SDMA ring is really guilty.
>
>             Thanks
>
>             ------------------------------------------
>
>             Monk Liu | Cloud-GPU Core team
>
>             ------------------------------------------
>
>
>
>
>         _______________________________________________
>
>         amd-gfx mailing list
>
>         amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>
>         https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>

[-- Attachment #1.2: Type: text/html, Size: 39834 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-02-28 13:59 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-26  5:58 [RFC] a new approach to detect which ring is the real black sheep upon TDR reported Liu, Monk
2021-02-26  7:57 ` Christian König
2021-02-26 11:54   ` Liu, Monk
2021-02-26 11:57     ` Liu, Monk
2021-02-26 12:05       ` Christian König
2021-02-27  3:50         ` Re: " Liu, Monk
2021-02-27 16:07           ` Christian König
2021-02-26 16:49     ` Andrey Grodzovsky
2021-02-27  3:56       ` Re: " Liu, Monk
2021-02-27  3:57         ` Liu, Monk
2021-02-28  0:55         ` Andrey Grodzovsky
2021-02-28  4:10           ` Re: " Liu, Monk
2021-02-28 13:59             ` Andrey Grodzovsky

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.