All of lore.kernel.org
 help / color / mirror / Atom feed
* [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
@ 2021-09-01  1:23 Liu, Monk
  2021-09-01  1:58 ` Liu, Monk
  2021-09-01  8:18 ` Daniel Vetter
  0 siblings, 2 replies; 19+ messages in thread
From: Liu, Monk @ 2021-09-01  1:23 UTC (permalink / raw)
  To: Koenig, Christian, Grodzovsky, Andrey, Daniel Vetter, Chen, JingWen
  Cc: DRI Development, amd-gfx

[-- Attachment #1: Type: text/plain, Size: 1160 bytes --]

[AMD Official Use Only]

Hi Daniel/Christian/Andrey

It looks like the feedback from you three is spread across a flood of emails, and the feature we are working on (the diagnostic TDR scheme) has been pending for more than 6 months (we started it in February 2021).

Honestly speaking, the way we are using email right now is not friendly and is quite painful for me.
Can we try to gather all our opinions, suggestions, and even objections here together and go through them one by one? It is too hard for us to reply to each email on a different question.

For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)

This is a fix for the timeout timer in the scheduler. Can we complete this one first? It should already have resolved all the questions and suggestions.

For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

I think I already answered the questions raised by Daniel in the other thread regarding why I use __kthread_should_park().
For the other aspects, can we synthesize all our opinions here?

Thanks !

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------


[-- Attachment #2: Type: text/html, Size: 3762 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
  2021-09-01  1:23 [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread Liu, Monk
@ 2021-09-01  1:58 ` Liu, Monk
  2021-09-01  4:04   ` Andrey Grodzovsky
  2021-09-01  8:18 ` Daniel Vetter
  1 sibling, 1 reply; 19+ messages in thread
From: Liu, Monk @ 2021-09-01  1:58 UTC (permalink / raw)
  To: Koenig, Christian, Grodzovsky, Andrey, Daniel Vetter, Chen, JingWen
  Cc: DRI Development, amd-gfx

[-- Attachment #1: Type: text/plain, Size: 2970 bytes --]

[AMD Official Use Only]

In the previous discussion, you stated that we should drop the "kthread_should_park" check in cleanup_job:

@@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
{
        struct drm_sched_job *job, *next;

-       /*
-        * Don't destroy jobs while the timeout worker is running  OR thread
-        * is being parked and hence assumed to not touch pending_list
-        */
-       if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
-           !cancel_delayed_work(&sched->work_tdr)) ||
-           kthread_should_park())
-               return NULL;

But I suddenly have a question here: if we return the timed-out job regardless of kthread_should_park(), then we are back to the original problem again: the timed-out job suddenly signals, cleanup_job still returns it to sched_main, and the job is freed while it is still being handled by the vendor's timeout callback.

If we return NULL when kthread_should_park() is true in cleanup_job, we can prevent the above scenario from happening: once a job is processed by job_timedout we can stop its scheduler, and after that, even if this job suddenly signals, cleanup_job won't return it, so sched_main won't free it in parallel.
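
To make the scenario concrete, keeping that early return would look roughly like this (a simplified sketch only, with a made-up function name and the rest of the function elided):

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

static struct drm_sched_job *
sketch_get_cleanup_job(struct drm_gpu_scheduler *sched)
{
	/*
	 * While the timeout worker is running, or while the scheduler
	 * thread is being parked for TDR, hand nothing back to sched_main,
	 * not even a job that signaled in the meantime, so it cannot be
	 * freed in parallel with the vendor's timeout callback.
	 */
	if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
	     !cancel_delayed_work(&sched->work_tdr)) ||
	    kthread_should_park())
		return NULL;

	/* ... otherwise pop the first signaled job from pending_list ... */
	return NULL;
}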

What do you think ?
Thanks

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

From: Liu, Monk
Sent: Wednesday, September 1, 2021 9:23 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Daniel Vetter <daniel@ffwll.ch>; Chen, JingWen <JingWen.Chen2@amd.com>
Cc: DRI Development <dri-devel@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org
Subject: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread


[AMD Official Use Only]

Hi Daniel/Christian/Andrey

It looks the voice from you three are spread over those email floods to me, the feature we are working on (diagnostic TDR scheme) is pending there for more than 6 month (we started it from feb 2021).

Honestly speaking the email ways that we are using now is not friendly and quite painful to me ....
Can we try to put all our opinions, suggestions, or even objects here together, let's go through them one by one, it's too hard for us to reply each email on different questions .

For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)

This is a fixing patch on the timeout timer in scheduler, can we complete this one first ? it should already resolved all the questions and suggestions.

For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

I think I already explained the questions raised by Daniel in other thread , regarding why I use __kthread_should_park()
For other aspects, can we put all our opinion synthesized here ?

Thanks !

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------


[-- Attachment #2: Type: text/html, Size: 7739 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
  2021-09-01  1:58 ` Liu, Monk
@ 2021-09-01  4:04   ` Andrey Grodzovsky
  2021-09-01  4:25     ` Jingwen Chen
  0 siblings, 1 reply; 19+ messages in thread
From: Andrey Grodzovsky @ 2021-09-01  4:04 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, Daniel Vetter, Chen, JingWen
  Cc: DRI Development, amd-gfx

[-- Attachment #1: Type: text/plain, Size: 5876 bytes --]

I will answer everything here -

On 2021-08-31 9:58 p.m., Liu, Monk wrote:
>
> [AMD Official Use Only]
>
> In the previous discussion, you guys stated that we should drop the 
> “kthread_should_park” in cleanup_job.
>
> @@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct 
> drm_gpu_scheduler *sched)
>
> {
>
>         struct drm_sched_job *job, *next;
>
> -       /*
>
> -        * Don't destroy jobs while the timeout worker is running  OR 
> thread
>
> -        * is being parked and hence assumed to not touch pending_list
>
> -        */
>
> -       if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>
> - !cancel_delayed_work(&sched->work_tdr)) ||
>
> -           kthread_should_park())
>
> -               return NULL;
>
> But I suddenly have a question here: if return the timedout job no 
> matter kthread_should_park() or not, then we are backing to the 
> original problem again: that the timedout_job is suddenly signaling 
> and cleanup_job still returns it to sched_main and job is freed while 
> it is still handling by vendor’s timeout callback
>
> If we return NULL when kthread_should_park() in cleanup_job, we can 
> prevent above scenario from happening: once a job is processed by 
> job_timedout we can stop its scheduler, and after that even this job 
> suddenly signaled the cleanup_job won’t return it so sched_main won’t 
> free it in parallel …
>
> What do you think ?
>

Does your analysis above take into account that you also submit
'[PATCH 2/2] drm/sched: serialize job_timeout and scheduler'? If so, I
don't see a problem: as long as you put kthread_park(sched->thread) BEFORE
fetching the next bad job from the pending list (under the spinlock), there
is no such issue as in the case you describe, because this potentially bad
job that became signaled will be removed from the pending list before you
even fetch the next job, and by the time you fetch it the scheduler thread
is already stopped anyway.
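
Roughly, the ordering I have in mind looks like this (a simplified sketch with a made-up function name, not the upstream function verbatim; field and lock names follow drm/sched conventions):

#include <linux/kthread.h>
#include <linux/spinlock.h>
#include <drm/gpu_scheduler.h>

static void sketch_sched_job_timedout(struct work_struct *work)
{
	struct drm_gpu_scheduler *sched =
		container_of(work, struct drm_gpu_scheduler, work_tdr.work);
	struct drm_sched_job *job;

	/* Park first: sched_main can no longer pop/free jobs from
	 * pending_list concurrently. */
	kthread_park(sched->thread);

	/* Only then fetch the presumed-bad job, under the list lock. */
	spin_lock(&sched->job_list_lock);
	job = list_first_entry_or_null(&sched->pending_list,
				       struct drm_sched_job, list);
	spin_unlock(&sched->job_list_lock);

	/* Even if this job signaled in the meantime, nobody frees it in
	 * parallel while the thread is parked. */
	if (job)
		sched->ops->timedout_job(job);

	kthread_unpark(sched->thread);
}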

If you don't submit it and we keep the removal hack for now, then there is
also no problem, because we temporarily remove the job we fetch for TDR from
the pending list under the spinlock exactly to avoid this race.


> Thanks
>
> ------------------------------------------
>
> Monk Liu | Cloud-GPU Core team
>
> ------------------------------------------
>
> *From:* Liu, Monk
> *Sent:* Wednesday, September 1, 2021 9:23 AM
> *To:* Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey 
> <Andrey.Grodzovsky@amd.com>; Daniel Vetter <daniel@ffwll.ch>; Chen, 
> JingWen <JingWen.Chen2@amd.com>
> *Cc:* DRI Development <dri-devel@lists.freedesktop.org>; 
> amd-gfx@lists.freedesktop.org
> *Subject:* [diagnostic TDR mode patches] unify our solution 
> opinions/suggestions in one thread
>
> [AMD Official Use Only]
>
> Hi Daniel/Christian/Andrey
>
> It looks the voice from you three are spread over those email floods 
> to me, the feature we are working on (diagnostic TDR scheme) is 
> pending there for more than 6 month (we started it from feb 2021).
>
> Honestly speaking the email ways that we are using now is not friendly 
> and quite painful to me ….
>
> Can we try to put all our opinions, suggestions, or even objects here 
> together, let’s go through them one by one, it’s too hard for us to 
> reply each email on different questions .
>
> For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
>
> This is a fixing patch on the timeout timer in scheduler, can we 
> complete this one first ? it should already resolved all the questions 
> and suggestions.
>

I have no objections to this one besides getting rid of the
kthread_should_park() return-NULL part;
if my answer above is not wrong, then that check seems superfluous to me.


> For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
>
> I think I already explained the questions raised by Daniel in other 
> thread , regarding why I use __kthread_should_park()
>

Is this race-free? Can't the other thread execute kthread_park() after
the check?


> For other aspects, can we put all our opinion synthesized here ?
>

So, to summarize from previous threads: I think the best solution to the
problem being solved in this patch is to do HW fence embedding at the
drm_sched_job level instead of only for amdgpu. By modifying all the drivers
to support this, we can both remove this hack and solve the race against
concurrent drm_sched_cleanup_jobs job freeing, simply by taking a reference
to the job's HW fence at the beginning of drm_sched_job_timedout.
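
In other words, something along these lines at the top of the timeout handler (an illustrative sketch only, with a made-up function name, assuming the HW fence is reachable as job->s_fence->parent):

#include <linux/dma-fence.h>
#include <linux/spinlock.h>
#include <drm/gpu_scheduler.h>

static void sketch_timedout_with_fence_ref(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *job;
	struct dma_fence *fence = NULL;

	spin_lock(&sched->job_list_lock);
	job = list_first_entry_or_null(&sched->pending_list,
				       struct drm_sched_job, list);
	if (job)
		/* Pin the HW fence (and, with fence embedding, the job
		 * itself) for the whole duration of the timeout handling,
		 * so a concurrent free cannot invalidate it. */
		fence = dma_fence_get(job->s_fence->parent);
	spin_unlock(&sched->job_list_lock);

	if (job)
		sched->ops->timedout_job(job);

	/* Drop our reference once the handler is done with the job. */
	dma_fence_put(fence);
}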

If doing this refactoring for all the drivers is not an option now and you
need a quick solution such as the serialization you do here, then take into
account other concurrent TDR handlers that might run. As mentioned before,
without serializing against TDR handlers from other schedulers you just race
against them: e.g., you parked the thread now, but another handler already in
progress will unpark it as part of calling drm_sched_start for other rings.
So you either need a global lock or a dedicated single-threaded queue, as
Daniel suggested.
At minimum we should change cancel_delayed_work in drm_sched_stop to
cancel_delayed_work_sync, to cancel and flush all concurrent TDRs, and
probably move it to the beginning of the function, after kthread_park and
before we start playing with the pending list.
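
Something like this, with only the ordering at the top of drm_sched_stop shown and everything else elided (a sketch under a made-up name, not a complete patch):

#include <linux/kthread.h>
#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

static void sketch_sched_stop(struct drm_gpu_scheduler *sched,
			      struct drm_sched_job *bad)
{
	kthread_park(sched->thread);

	/*
	 * _sync variant: cancel our TDR work and also wait for a handler
	 * that is already running to finish, before we start touching
	 * pending_list below.
	 */
	cancel_delayed_work_sync(&sched->work_tdr);

	/* ... walk pending_list, remove fence callbacks, etc. ... */
}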

P.S. One comment I had regarding the single-threaded queue: in case we have
multiple TDRs arising from the same event, we will proceed to reset multiple
times, something that we mostly avoid today within amdgpu with a trylock
(see amdgpu_device_lock_adev and amdgpu_device_lock_hive_adev).
Daniel mentioned it's not a problem, but I still haven't understood why not.
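
For reference, the trylock idea is essentially the following (hypothetical and simplified; the real helpers are amdgpu_device_lock_adev/amdgpu_device_lock_hive_adev, and the function and flag here are made up):

#include <linux/atomic.h>

/* The first TDR handler to win the atomic swap performs the reset;
 * later handlers triggered by the same event bail out instead of
 * resetting the device again. */
static bool example_trylock_for_reset(atomic_t *in_reset)
{
	/* atomic_cmpxchg returns the old value: 0 means we won the race. */
	return atomic_cmpxchg(in_reset, 0, 1) == 0;
}

With a dedicated ordered workqueue (e.g. one created with
alloc_ordered_workqueue()) the handlers would instead run strictly one after
another, each still performing its own reset, which is exactly the
repeated-reset concern above.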

Andrey


> Thanks !
>
> ------------------------------------------
>
> Monk Liu | Cloud-GPU Core team
>
> ------------------------------------------
>

[-- Attachment #2: Type: text/html, Size: 13192 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
  2021-09-01  4:04   ` Andrey Grodzovsky
@ 2021-09-01  4:25     ` Jingwen Chen
  2021-09-01  4:28       ` Andrey Grodzovsky
  0 siblings, 1 reply; 19+ messages in thread
From: Jingwen Chen @ 2021-09-01  4:25 UTC (permalink / raw)
  To: Andrey Grodzovsky, Liu, Monk, Koenig, Christian, Daniel Vetter
  Cc: DRI Development, amd-gfx

On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote:
> I will answer everything here -
> 
> On 2021-08-31 9:58 p.m., Liu, Monk wrote:
> 
> 
>     [AMD Official Use Only]
> 
>      
> 
>     In the previous discussion, you guys stated that we should drop the
>     “kthread_should_park” in cleanup_job.
> 
>      
> 
>     @@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler
>     *sched)
> 
>     {
> 
>             struct drm_sched_job *job, *next;
> 
>      
> 
>     -       /*
> 
>     -        * Don't destroy jobs while the timeout worker is running  OR
>     thread
> 
>     -        * is being parked and hence assumed to not touch pending_list
> 
>     -        */
> 
>     -       if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
> 
>     -           !cancel_delayed_work(&sched->work_tdr)) ||
> 
>     -           kthread_should_park())
> 
>     -               return NULL;
> 
>      
> 
>     But I suddenly have a question here: if return the timedout job no matter
>     kthread_should_park() or not, then we are backing to the original problem
>     again: that the timedout_job is suddenly signaling and cleanup_job still
>     returns it to sched_main and job is freed while it is still handling by
>     vendor’s timeout callback
> 
>      
> 
>     If we return NULL when kthread_should_park() in cleanup_job, we can prevent
>     above scenario from happening: once a job is processed by job_timedout we
>     can stop its scheduler, and after that even this job suddenly signaled the
>     cleanup_job won’t return it so sched_main won’t free it in parallel …
> 
>      
> 
>     What do you think ?
> 
> 
> Is your analysis above takes into account that you also submit
> '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler' then I don't see a
> problem -
Hi Andrey,
Monk has talked to me, and we agreed that since there are multiple opinions about
'[PATCH 2/2] drm/sched: serialize job_timeout and scheduler', and patch 1 is an
independent patch that fixes a separate error, we should not take patch 2 into
account in this analysis.

> I think that as long as you put kthread_park(sched->thread) BEFORE
> fetching next bad job from pending list (under spinlock) there is no
> such issue as in the case you describe because this potential bad job
> that became signaled will be removed from pending list before you
> even fetch the next job and by the time you fetch it the scheduler
> thread is already stopped anyway
> 
> If you don't submit and we keep the removal hack for now then also no problem
> because
> we temporary remove the job we fetch for TDR from pending list under spinlock
> exactly to avoid this race
> 
So can you help review [PATCH 1/2] drm/sched: fix the bug of time out calculation(v3)?
Patch v3 keeps this kthread_should_park check.

Best Regards,
JingWen
> 
> 
>     Thanks
> 
>      
> 
>     ------------------------------------------
> 
>     Monk Liu | Cloud-GPU Core team
> 
>     ------------------------------------------
> 
>      
> 
>     From: Liu, Monk
>     Sent: Wednesday, September 1, 2021 9:23 AM
>     To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey
>     <Andrey.Grodzovsky@amd.com>; Daniel Vetter <daniel@ffwll.ch>; Chen, JingWen
>     <JingWen.Chen2@amd.com>
>     Cc: DRI Development <dri-devel@lists.freedesktop.org>;
>     amd-gfx@lists.freedesktop.org
>     Subject: [diagnostic TDR mode patches] unify our solution opinions/
>     suggestions in one thread
> 
>      
> 
>     [AMD Official Use Only]
> 
>      
> 
>     Hi Daniel/Christian/Andrey
> 
>      
> 
>     It looks the voice from you three are spread over those email floods to me,
>     the feature we are working on (diagnostic TDR scheme) is pending there for
>     more than 6 month (we started it from feb 2021).
> 
>      
> 
>     Honestly speaking the email ways that we are using now is not friendly and
>     quite painful to me ….
> 
>     Can we try to put all our opinions, suggestions, or even objects here
>     together, let’s go through them one by one, it’s too hard for us to reply
>     each email on different questions .
> 
>      
> 
>     For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
> 
>      
> 
>     This is a fixing patch on the timeout timer in scheduler, can we complete
>     this one first ? it should already resolved all the questions and
>     suggestions.
> 
> 
> I have no objections for this one besides getting rid of the
> kthread_should_park()) return null part,
> if my answer above is not wrong then it seems superfluous to me
> 
> 
>      
> 
>     For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
> 
>      
> 
>     I think I already explained the questions raised by Daniel in other thread
>     , regarding why I use __kthread_should_park()
> 
> 
> Is this race free ? Can't the other thread execute kthread_park after the check
> ?
> 
> 
>     For other aspects, can we put all our opinion synthesized here ?
> 
> 
> So to summarize from previous threads I think that the best solution
> to the problem being solved in this patch is if we do HW fence embedding
> at the drm_sched_job level instead of doing it only for amdgpu, and modifying
> all
> the drivers to support this we can both remove this hack and solve the race
> against concurrent drm_sched_cleanup_jobs job freeing just by taking reference
> to the hw fence of the job at the beginning of drm_sched_job_timedout
> 
> If doing this refactoring for all the drivers is not an option now and you need
> a quick
> solution such as the serialization you do here then take into account other
> concurrent
> TDR handlers that might run, as mentioned before, without serializing against
> other TDR handlers from other
> schedulers you just race here against them, e.g. you parked it now but another
> one in progress will unpark it as part of calling  drm_sched_start for other
> rings.
> So you either need a global lock or dedicated single threaded queue as Daniel
> suggested.
> At minimum we should change cancel_delayed_work in drm_sched_stop to
> cancel_delayed_work_sync
> to cancel and flush all concurrent TDRs and probably move it to the begining of
> the function, after kthread_park
> and before we start playing with the pending list.
> 
> P.S One comment I had regarding single threaded queue is that in case we have
> multiple TDR
> arising from same event we will proceed to resetting multiple times - something
> that with trylock
> we mostly avoid today within amdgpu (see amdgpu_device_lock_adev and
> amdgpu_device_lock_hive_adev)
> Daniel mentioned it's not a problem but I still haven't understood why not.
> 
> Andrey
> 
> 
>      
> 
>     Thanks !
> 
>      
> 
>     ------------------------------------------
> 
>     Monk Liu | Cloud-GPU Core team
> 
>     ------------------------------------------
> 
>      
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
  2021-09-01  4:25     ` Jingwen Chen
@ 2021-09-01  4:28       ` Andrey Grodzovsky
  2021-09-01  4:40         ` Jingwen Chen
  0 siblings, 1 reply; 19+ messages in thread
From: Andrey Grodzovsky @ 2021-09-01  4:28 UTC (permalink / raw)
  To: Jingwen Chen, Liu, Monk, Koenig, Christian, Daniel Vetter
  Cc: DRI Development, amd-gfx


On 2021-09-01 12:25 a.m., Jingwen Chen wrote:
> On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote:
>> I will answer everything here -
>>
>> On 2021-08-31 9:58 p.m., Liu, Monk wrote:
>>
>>
>>      [AMD Official Use Only]
>>
>>       
>>
>>      In the previous discussion, you guys stated that we should drop the
>>      “kthread_should_park” in cleanup_job.
>>
>>       
>>
>>      @@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler
>>      *sched)
>>
>>      {
>>
>>              struct drm_sched_job *job, *next;
>>
>>       
>>
>>      -       /*
>>
>>      -        * Don't destroy jobs while the timeout worker is running  OR
>>      thread
>>
>>      -        * is being parked and hence assumed to not touch pending_list
>>
>>      -        */
>>
>>      -       if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>
>>      -           !cancel_delayed_work(&sched->work_tdr)) ||
>>
>>      -           kthread_should_park())
>>
>>      -               return NULL;
>>
>>       
>>
>>      But I suddenly have a question here: if return the timedout job no matter
>>      kthread_should_park() or not, then we are backing to the original problem
>>      again: that the timedout_job is suddenly signaling and cleanup_job still
>>      returns it to sched_main and job is freed while it is still handling by
>>      vendor’s timeout callback
>>
>>       
>>
>>      If we return NULL when kthread_should_park() in cleanup_job, we can prevent
>>      above scenario from happening: once a job is processed by job_timedout we
>>      can stop its scheduler, and after that even this job suddenly signaled the
>>      cleanup_job won’t return it so sched_main won’t free it in parallel …
>>
>>       
>>
>>      What do you think ?
>>
>>
>> Is your analysis above takes into account that you also submit
>> '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler' then I don't see a
>> problem -
> Hi Andrey,
> Monk has talked to me and we agreed that as there're multiple opinions about the
> '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler' and patch
> 1 is an independent patch to fix some error. So we should not take the patch 2 into
> analysis.
>
>> I think that as long as you put kthread_park(sched->thread) BEFORE
>> fetching next bad job from pending list (under spinlock) there is no
>> such issue as in the case you describe because this potential bad job
>> that became signaled will be removed from pending list before you
>> even fetch the next job and by the time you fetch it the scheduler
>> thread is already stopped anyway
>>
>> If you don't submit and we keep the removal hack for now then also no problem
>> because
>> we temporary remove the job we fetch for TDR from pending list under spinlock
>> exactly to avoid this race
>>
> So can you help review [PATCH 1/2] drm/sched: fix the bug of time out calculation(v3)?
> patch v3 keeps this kthread_should_park check.


But since in both cases it looks like there is no danger of use-after-free,
I see no reason to keep the kthread_should_park check.

Andrey


>
> Best Regards,
> JingWen
>>
>>      Thanks
>>
>>       
>>
>>      ------------------------------------------
>>
>>      Monk Liu | Cloud-GPU Core team
>>
>>      ------------------------------------------
>>
>>       
>>
>>      From: Liu, Monk
>>      Sent: Wednesday, September 1, 2021 9:23 AM
>>      To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey
>>      <Andrey.Grodzovsky@amd.com>; Daniel Vetter <daniel@ffwll.ch>; Chen, JingWen
>>      <JingWen.Chen2@amd.com>
>>      Cc: DRI Development <dri-devel@lists.freedesktop.org>;
>>      amd-gfx@lists.freedesktop.org
>>      Subject: [diagnostic TDR mode patches] unify our solution opinions/
>>      suggestions in one thread
>>
>>       
>>
>>      [AMD Official Use Only]
>>
>>       
>>
>>      Hi Daniel/Christian/Andrey
>>
>>       
>>
>>      It looks the voice from you three are spread over those email floods to me,
>>      the feature we are working on (diagnostic TDR scheme) is pending there for
>>      more than 6 month (we started it from feb 2021).
>>
>>       
>>
>>      Honestly speaking the email ways that we are using now is not friendly and
>>      quite painful to me ….
>>
>>      Can we try to put all our opinions, suggestions, or even objects here
>>      together, let’s go through them one by one, it’s too hard for us to reply
>>      each email on different questions .
>>
>>       
>>
>>      For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
>>
>>       
>>
>>      This is a fixing patch on the timeout timer in scheduler, can we complete
>>      this one first ? it should already resolved all the questions and
>>      suggestions.
>>
>>
>> I have no objections for this one besides getting rid of the
>> kthread_should_park()) return null part,
>> if my answer above is not wrong then it seems superfluous to me
>>
>>
>>       
>>
>>      For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
>>
>>       
>>
>>      I think I already explained the questions raised by Daniel in other thread
>>      , regarding why I use __kthread_should_park()
>>
>>
>> Is this race free ? Can't the other thread execute kthread_park after the check
>> ?
>>
>>
>>      For other aspects, can we put all our opinion synthesized here ?
>>
>>
>> So to summarize from previous threads I think that the best solution
>> to the problem being solved in this patch is if we do HW fence embedding
>> at the drm_sched_job level instead of doing it only for amdgpu, and modifying
>> all
>> the drivers to support this we can both remove this hack and solve the race
>> against concurrent drm_sched_cleanup_jobs job freeing just by taking reference
>> to the hw fence of the job at the beginning of drm_sched_job_timedout
>>
>> If doing this refactoring for all the drivers is not an option now and you need
>> a quick
>> solution such as the serialization you do here then take into account other
>> concurrent
>> TDR handlers that might run, as mentioned before, without serializing against
>> other TDR handlers from other
>> schedulers you just race here against them, e.g. you parked it now but another
>> one in progress will unpark it as part of calling  drm_sched_start for other
>> rings.
>> So you either need a global lock or dedicated single threaded queue as Daniel
>> suggested.
>> At minimum we should change cancel_delayed_work in drm_sched_stop to
>> cancel_delayed_work_sync
>> to cancel and flush all concurrent TDRs and probably move it to the begining of
>> the function, after kthread_park
>> and before we start playing with the pending list.
>>
>> P.S One comment I had regarding single threaded queue is that in case we have
>> multiple TDR
>> arising from same event we will proceed to resetting multiple times - something
>> that with trylock
>> we mostly avoid today within amdgpu (see amdgpu_device_lock_adev and
>> amdgpu_device_lock_hive_adev)
>> Daniel mentioned it's not a problem but I still haven't understood why not.
>>
>> Andrey
>>
>>
>>       
>>
>>      Thanks !
>>
>>       
>>
>>      ------------------------------------------
>>
>>      Monk Liu | Cloud-GPU Core team
>>
>>      ------------------------------------------
>>
>>       
>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
  2021-09-01  4:28       ` Andrey Grodzovsky
@ 2021-09-01  4:40         ` Jingwen Chen
  2021-09-01  4:49           ` Andrey Grodzovsky
  0 siblings, 1 reply; 19+ messages in thread
From: Jingwen Chen @ 2021-09-01  4:40 UTC (permalink / raw)
  To: Andrey Grodzovsky, Liu, Monk, Koenig, Christian, Daniel Vetter
  Cc: DRI Development, amd-gfx

On Wed Sep 01, 2021 at 12:28:59AM -0400, Andrey Grodzovsky wrote:
> 
> On 2021-09-01 12:25 a.m., Jingwen Chen wrote:
> > On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote:
> > > I will answer everything here -
> > > 
> > > On 2021-08-31 9:58 p.m., Liu, Monk wrote:
> > > 
> > > 
> > >      [AMD Official Use Only]
> > > 
> > > 
> > >      In the previous discussion, you guys stated that we should drop the
> > >      “kthread_should_park” in cleanup_job.
> > > 
> > > 
> > >      @@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler
> > >      *sched)
> > > 
> > >      {
> > > 
> > >              struct drm_sched_job *job, *next;
> > > 
> > > 
> > >      -       /*
> > > 
> > >      -        * Don't destroy jobs while the timeout worker is running  OR
> > >      thread
> > > 
> > >      -        * is being parked and hence assumed to not touch pending_list
> > > 
> > >      -        */
> > > 
> > >      -       if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
> > > 
> > >      -           !cancel_delayed_work(&sched->work_tdr)) ||
> > > 
> > >      -           kthread_should_park())
> > > 
> > >      -               return NULL;
> > > 
> > > 
> > >      But I suddenly have a question here: if return the timedout job no matter
> > >      kthread_should_park() or not, then we are backing to the original problem
> > >      again: that the timedout_job is suddenly signaling and cleanup_job still
> > >      returns it to sched_main and job is freed while it is still handling by
> > >      vendor’s timeout callback
> > > 
> > > 
> > >      If we return NULL when kthread_should_park() in cleanup_job, we can prevent
> > >      above scenario from happening: once a job is processed by job_timedout we
> > >      can stop its scheduler, and after that even this job suddenly signaled the
> > >      cleanup_job won’t return it so sched_main won’t free it in parallel …
> > > 
> > > 
> > >      What do you think ?
> > > 
> > > 
> > > Is your analysis above takes into account that you also submit
> > > '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler' then I don't see a
> > > problem -
> > Hi Andrey,
> > Monk has talked to me and we agreed that as there're multiple opinions about the
> > '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler' and patch
> > 1 is an independent patch to fix some error. So we should not take the patch 2 into
> > analysis.
> > 
> > > I think that as long as you put kthread_park(sched->thread) BEFORE
> > > fetching next bad job from pending list (under spinlock) there is no
> > > such issue as in the case you describe because this potential bad job
> > > that became signaled will be removed from pending list before you
> > > even fetch the next job and by the time you fetch it the scheduler
> > > thread is already stopped anyway
> > > 
> > > If you don't submit and we keep the removal hack for now then also no problem
> > > because
> > > we temporary remove the job we fetch for TDR from pending list under spinlock
> > > exactly to avoid this race
> > > 
> > So can you help review [PATCH 1/2] drm/sched: fix the bug of time out calculation(v3)?
> > patch v3 keeps this kthread_should_park check.
> 
> 
> But since in both cases looks like there is no danger of use after free
> then I see no reason to keep kthread_should_park check.
> 
> Andrey
OK, I get it. Since patch v4 has removed this check, can you help review
[PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)?
> 
> 
> > 
> > Best Regards,
> > JingWen
> > > 
> > >      Thanks
> > > 
> > > 
> > >      ------------------------------------------
> > > 
> > >      Monk Liu | Cloud-GPU Core team
> > > 
> > >      ------------------------------------------
> > > 
> > > 
> > >      From: Liu, Monk
> > >      Sent: Wednesday, September 1, 2021 9:23 AM
> > >      To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey
> > >      <Andrey.Grodzovsky@amd.com>; Daniel Vetter <daniel@ffwll.ch>; Chen, JingWen
> > >      <JingWen.Chen2@amd.com>
> > >      Cc: DRI Development <dri-devel@lists.freedesktop.org>;
> > >      amd-gfx@lists.freedesktop.org
> > >      Subject: [diagnostic TDR mode patches] unify our solution opinions/
> > >      suggestions in one thread
> > > 
> > > 
> > >      [AMD Official Use Only]
> > > 
> > > 
> > >      Hi Daniel/Christian/Andrey
> > > 
> > > 
> > >      It looks the voice from you three are spread over those email floods to me,
> > >      the feature we are working on (diagnostic TDR scheme) is pending there for
> > >      more than 6 month (we started it from feb 2021).
> > > 
> > > 
> > >      Honestly speaking the email ways that we are using now is not friendly and
> > >      quite painful to me ….
> > > 
> > >      Can we try to put all our opinions, suggestions, or even objects here
> > >      together, let’s go through them one by one, it’s too hard for us to reply
> > >      each email on different questions .
> > > 
> > > 
> > >      For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
> > > 
> > > 
> > >      This is a fixing patch on the timeout timer in scheduler, can we complete
> > >      this one first ? it should already resolved all the questions and
> > >      suggestions.
> > > 
> > > 
> > > I have no objections for this one besides getting rid of the
> > > kthread_should_park()) return null part,
> > > if my answer above is not wrong then it seems superfluous to me
> > > 
> > > 
> > > 
> > >      For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
> > > 
> > > 
> > >      I think I already explained the questions raised by Daniel in other thread
> > >      , regarding why I use __kthread_should_park()
> > > 
> > > 
> > > Is this race free ? Can't the other thread execute kthread_park after the check
> > > ?
> > > 
> > > 
> > >      For other aspects, can we put all our opinion synthesized here ?
> > > 
> > > 
> > > So to summarize from previous threads I think that the best solution
> > > to the problem being solved in this patch is if we do HW fence embedding
> > > at the drm_sched_job level instead of doing it only for amdgpu, and modifying
> > > all
> > > the drivers to support this we can both remove this hack and solve the race
> > > against concurrent drm_sched_cleanup_jobs job freeing just by taking reference
> > > to the hw fence of the job at the beginning of drm_sched_job_timedout
> > > 
> > > If doing this refactoring for all the drivers is not an option now and you need
> > > a quick
> > > solution such as the serialization you do here then take into account other
> > > concurrent
> > > TDR handlers that might run, as mentioned before, without serializing against
> > > other TDR handlers from other
> > > schedulers you just race here against them, e.g. you parked it now but another
> > > one in progress will unpark it as part of calling  drm_sched_start for other
> > > rings.
> > > So you either need a global lock or dedicated single threaded queue as Daniel
> > > suggested.
> > > At minimum we should change cancel_delayed_work in drm_sched_stop to
> > > cancel_delayed_work_sync
> > > to cancel and flush all concurrent TDRs and probably move it to the begining of
> > > the function, after kthread_park
> > > and before we start playing with the pending list.
> > > 
> > > P.S One comment I had regarding single threaded queue is that in case we have
> > > multiple TDR
> > > arising from same event we will proceed to resetting multiple times - something
> > > that with trylock
> > > we mostly avoid today within amdgpu (see amdgpu_device_lock_adev and
> > > amdgpu_device_lock_hive_adev)
> > > Daniel mentioned it's not a problem but I still haven't understood why not.
> > > 
> > > Andrey
> > > 
> > > 
> > > 
> > >      Thanks !
> > > 
> > > 
> > >      ------------------------------------------
> > > 
> > >      Monk Liu | Cloud-GPU Core team
> > > 
> > >      ------------------------------------------
> > > 
> > > 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
  2021-09-01  4:40         ` Jingwen Chen
@ 2021-09-01  4:49           ` Andrey Grodzovsky
  0 siblings, 0 replies; 19+ messages in thread
From: Andrey Grodzovsky @ 2021-09-01  4:49 UTC (permalink / raw)
  To: Jingwen Chen, Liu, Monk, Koenig, Christian, Daniel Vetter
  Cc: DRI Development, amd-gfx


On 2021-09-01 12:40 a.m., Jingwen Chen wrote:
> On Wed Sep 01, 2021 at 12:28:59AM -0400, Andrey Grodzovsky wrote:
>> On 2021-09-01 12:25 a.m., Jingwen Chen wrote:
>>> On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote:
>>>> I will answer everything here -
>>>>
>>>> On 2021-08-31 9:58 p.m., Liu, Monk wrote:
>>>>
>>>>
>>>>       [AMD Official Use Only]
>>>>
>>>>
>>>>       In the previous discussion, you guys stated that we should drop the
>>>>       “kthread_should_park” in cleanup_job.
>>>>
>>>>
>>>>       @@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler
>>>>       *sched)
>>>>
>>>>       {
>>>>
>>>>               struct drm_sched_job *job, *next;
>>>>
>>>>
>>>>       -       /*
>>>>
>>>>       -        * Don't destroy jobs while the timeout worker is running  OR
>>>>       thread
>>>>
>>>>       -        * is being parked and hence assumed to not touch pending_list
>>>>
>>>>       -        */
>>>>
>>>>       -       if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>
>>>>       -           !cancel_delayed_work(&sched->work_tdr)) ||
>>>>
>>>>       -           kthread_should_park())
>>>>
>>>>       -               return NULL;
>>>>
>>>>
>>>>       But I suddenly have a question here: if return the timedout job no matter
>>>>       kthread_should_park() or not, then we are backing to the original problem
>>>>       again: that the timedout_job is suddenly signaling and cleanup_job still
>>>>       returns it to sched_main and job is freed while it is still handling by
>>>>       vendor’s timeout callback
>>>>
>>>>
>>>>       If we return NULL when kthread_should_park() in cleanup_job, we can prevent
>>>>       above scenario from happening: once a job is processed by job_timedout we
>>>>       can stop its scheduler, and after that even this job suddenly signaled the
>>>>       cleanup_job won’t return it so sched_main won’t free it in parallel …
>>>>
>>>>
>>>>       What do you think ?
>>>>
>>>>
>>>> Is your analysis above takes into account that you also submit
>>>> '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler' then I don't see a
>>>> problem -
>>> Hi Andrey,
>>> Monk has talked to me and we agreed that as there're multiple opinions about the
>>> '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler' and patch
>>> 1 is an independent patch to fix some error. So we should not take the patch 2 into
>>> analysis.
>>>
>>>> I think that as long as you put kthread_park(sched->thread) BEFORE
>>>> fetching next bad job from pending list (under spinlock) there is no
>>>> such issue as in the case you describe because this potential bad job
>>>> that became signaled will be removed from pending list before you
>>>> even fetch the next job and by the time you fetch it the scheduler
>>>> thread is already stopped anyway
>>>>
>>>> If you don't submit and we keep the removal hack for now then also no problem
>>>> because
>>>> we temporary remove the job we fetch for TDR from pending list under spinlock
>>>> exactly to avoid this race
>>>>
>>> So can you help review [PATCH 1/2] drm/sched: fix the bug of time out calculation(v3)?
>>> patch v3 keeps this kthread_should_park check.
>>
>> But since in both cases looks like there is no danger of use after free
>> then I see no reason to keep kthread_should_park check.
>>
>> Andrey
> OK, I get it. So patch v4 has removed this check, can you help review
> [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)?


Sure

Andrey


>>
>>> Best Regards,
>>> JingWen
>>>>       Thanks
>>>>
>>>>
>>>>       ------------------------------------------
>>>>
>>>>       Monk Liu | Cloud-GPU Core team
>>>>
>>>>       ------------------------------------------
>>>>
>>>>
>>>>       From: Liu, Monk
>>>>       Sent: Wednesday, September 1, 2021 9:23 AM
>>>>       To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey
>>>>       <Andrey.Grodzovsky@amd.com>; Daniel Vetter <daniel@ffwll.ch>; Chen, JingWen
>>>>       <JingWen.Chen2@amd.com>
>>>>       Cc: DRI Development <dri-devel@lists.freedesktop.org>;
>>>>       amd-gfx@lists.freedesktop.org
>>>>       Subject: [diagnostic TDR mode patches] unify our solution opinions/
>>>>       suggestions in one thread
>>>>
>>>>
>>>>       [AMD Official Use Only]
>>>>
>>>>
>>>>       Hi Daniel/Christian/Andrey
>>>>
>>>>
>>>>       It looks the voice from you three are spread over those email floods to me,
>>>>       the feature we are working on (diagnostic TDR scheme) is pending there for
>>>>       more than 6 month (we started it from feb 2021).
>>>>
>>>>
>>>>       Honestly speaking the email ways that we are using now is not friendly and
>>>>       quite painful to me ….
>>>>
>>>>       Can we try to put all our opinions, suggestions, or even objects here
>>>>       together, let’s go through them one by one, it’s too hard for us to reply
>>>>       each email on different questions .
>>>>
>>>>
>>>>       For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
>>>>
>>>>
>>>>       This is a fixing patch on the timeout timer in scheduler, can we complete
>>>>       this one first ? it should already resolved all the questions and
>>>>       suggestions.
>>>>
>>>>
>>>> I have no objections for this one besides getting rid of the
>>>> kthread_should_park()) return null part,
>>>> if my answer above is not wrong then it seems superfluous to me
>>>>
>>>>
>>>>
>>>>       For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
>>>>
>>>>
>>>>       I think I already explained the questions raised by Daniel in other thread
>>>>       , regarding why I use __kthread_should_park()
>>>>
>>>>
>>>> Is this race free ? Can't the other thread execute kthread_park after the check
>>>> ?
>>>>
>>>>
>>>>       For other aspects, can we put all our opinion synthesized here ?
>>>>
>>>>
>>>> So to summarize from previous threads I think that the best solution
>>>> to the problem being solved in this patch is if we do HW fence embedding
>>>> at the drm_sched_job level instead of doing it only for amdgpu, and modifying
>>>> all
>>>> the drivers to support this we can both remove this hack and solve the race
>>>> against concurrent drm_sched_cleanup_jobs job freeing just by taking reference
>>>> to the hw fence of the job at the beginning of drm_sched_job_timedout
>>>>
>>>> If doing this refactoring for all the drivers is not an option now and you need
>>>> a quick
>>>> solution such as the serialization you do here then take into account other
>>>> concurrent
>>>> TDR handlers that might run, as mentioned before, without serializing against
>>>> other TDR handlers from other
>>>> schedulers you just race here against them, e.g. you parked it now but another
>>>> one in progress will unpark it as part of calling  drm_sched_start for other
>>>> rings.
>>>> So you either need a global lock or dedicated single threaded queue as Daniel
>>>> suggested.
>>>> At minimum we should change cancel_delayed_work in drm_sched_stop to
>>>> cancel_delayed_work_sync
>>>> to cancel and flush all concurrent TDRs and probably move it to the begining of
>>>> the function, after kthread_park
>>>> and before we start playing with the pending list.
>>>>
>>>> P.S One comment I had regarding single threaded queue is that in case we have
>>>> multiple TDR
>>>> arising from same event we will proceed to resetting multiple times - something
>>>> that with trylock
>>>> we mostly avoid today within amdgpu (see amdgpu_device_lock_adev and
>>>> amdgpu_device_lock_hive_adev)
>>>> Daniel mentioned it's not a problem but I still haven't understood why not.
>>>>
>>>> Andrey
>>>>
>>>>
>>>>
>>>>       Thanks !
>>>>
>>>>
>>>>       ------------------------------------------
>>>>
>>>>       Monk Liu | Cloud-GPU Core team
>>>>
>>>>       ------------------------------------------
>>>>
>>>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
  2021-09-01  1:23 [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread Liu, Monk
  2021-09-01  1:58 ` Liu, Monk
@ 2021-09-01  8:18 ` Daniel Vetter
  2021-09-01 10:19   ` Liu, Monk
  2021-09-01 10:27   ` Liu, Monk
  1 sibling, 2 replies; 19+ messages in thread
From: Daniel Vetter @ 2021-09-01  8:18 UTC (permalink / raw)
  To: Liu, Monk
  Cc: Koenig, Christian, Grodzovsky, Andrey, Chen, JingWen,
	DRI Development, amd-gfx

Hi Monk,

On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk <Monk.Liu@amd.com> wrote:
>
> [AMD Official Use Only]
>
>
> Hi Daniel/Christian/Andrey
>
>
>
> It looks the voice from you three are spread over those email floods to me, the feature we are working on (diagnostic TDR scheme) is pending there for more than 6 month (we started it from feb 2021).

For me your project exists since a few weeks at most, because that is
when your team showed up on dri-devel. That you already spent 6 months
on this within amd, on a code area that very much affects shared code,
without kicking off any thread on dri-devel isn't great, but also not
something we can fix, since time machines don't exist.

So we have to make the best out of the situation and move ahead where
we are. From my understanding you've done a bunch of changes to the
scheduler code. As far as I can see there's been two related things
your team has done:

- remove some allocations from scheduler code, because that can lead
to deadlocks (see the sketch after this list). I kicked off this topic
quite a while ago here

https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vetter@ffwll.ch/

This is just one patch of the entire series. This is an area where we
really need a consistent solution across all drm/sched drivers, not
something that individual drivers just fix in their own way.

- the other one is the timeout issue for the patches you cite here.
Again there's been discussion on this on dri-devel with Boris from
panfrost about how we can handle at least some of the races in tdr.
That resulted in lots of discussions and documentation improvements.
Those patches are merged now, link
https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/

There's been more than just this, also quite some doc patches from
Boris that explain how it's all supposed to work and be race-free.
Again your driver isn't the only one with interesting TDR races.
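
To make the first point concrete, the problematic allocation pattern is
roughly the following (illustrative only; the context struct and function
names are made up, not taken from any of the patches):

#include <linux/slab.h>
#include <drm/gpu_scheduler.h>

/* Made-up context struct, purely for illustration. */
struct example_reset_ctx {
	int reason;
};

static void example_timedout_job(struct drm_sched_job *job)
{
	struct example_reset_ctx *ctx;

	/* Risky in this path: GFP_KERNEL can block on memory reclaim, and
	 * reclaim can end up waiting on dma_fences that only make progress
	 * once this handler completes -- a potential deadlock. */
	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
	kfree(ctx);

	/* Safer alternatives: a non-sleeping allocation (and handling its
	 * failure), or memory preallocated before entering this section. */
	ctx = kmalloc(sizeof(*ctx), GFP_NOWAIT);
	kfree(ctx);
}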

Your team hasn't been active in any of these discussions, but now
suddenly pops up out of nowhere and demands that your approach needs
to land asap. That's really not how upstream works.

The other thing where I'm struggling is that there's a lot of missing
context for outsiders. The patches sometimes come with zero commit
message, for tricky concurrency bugs. And there's no context with what
you've done already on the amdgpu side (since that never showed up on
dri-devel), which makes constructive discussions here really hard.

Now fixing these bugs is obviously good, but the way this is supposed
to work when touching shared infrastructure is:

- Before you start merging anything kick off an RFC thread on
dri-devel (or whatever the topic really is about) about the problem
you have and how your trying to solve it. This can be just text if
it's a big thing, but it can also already include some proof of
concept solution in the form of patches.

- Then we iterate on the solution, across drivers and shared code
_together_. Not "merge amdgpu code first, then get annoyed when the
core changes don't land immediately after you've practically finished
the project".

- This might mean changes to other drivers if we need to adjust interfaces.

On the plus side you can plan much better, because you know you have
upstream buy-in before you start to put in real work on the project.

> Honestly speaking the email ways that we are using now is not friendly and quite painful to me ….

Yes this is painful :-(

I think the best way forward is to go through the above process again
and essentially restart. So submit a complete patch series with
problem descriptions, solution you picked, why you picked that, all
the amdgpu patches to get there and the core patches too. Since it
sounds like a bunch of this has all landed already you probably need a
patch 1 that goes back to 6 months ago so that we can see the overall
direction, and review whether that's the right one or not.

The not-so-painful approach would have been to do this from the start,
6 months ago. It would definitely have helped if the tdr discussion
we've had just a few months ago would have involved your team too, I'm
sure there would have been some good insights from amd's side. I'd
really want you and your engineers involved here, so let's do this
properly!

Cheers, Daniel

> Can we try to put all our opinions, suggestions, or even objects here together, let’s go through them one by one, it’s too hard for us to reply each email on different questions .
>
>
>
> For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
>
>
>
> This is a fixing patch on the timeout timer in scheduler, can we complete this one first ? it should already resolved all the questions and suggestions.
>
>
>
> For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
>
>
>
> I think I already explained the questions raised by Daniel in other thread , regarding why I use __kthread_should_park()
>
> For other aspects, can we put all our opinion synthesized here ?
>
>
>
> Thanks !
>
>
>
> ------------------------------------------
>
> Monk Liu | Cloud-GPU Core team
>
> ------------------------------------------
>
>



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
  2021-09-01  8:18 ` Daniel Vetter
@ 2021-09-01 10:19   ` Liu, Monk
  2021-09-01 15:19     ` Alex Deucher
  2021-09-01 10:27   ` Liu, Monk
  1 sibling, 1 reply; 19+ messages in thread
From: Liu, Monk @ 2021-09-01 10:19 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Koenig, Christian, Grodzovsky, Andrey, Chen, JingWen,
	DRI Development, amd-gfx

[AMD Official Use Only]

Daniel 

From the links you shared, it looks like you (or someone else) have quite a few patches that change DRM_SCHED or even amdgpu. In that case, before they are merged into the kernel tree, I'm wondering whether any AMD developer reviewed them.

They look to me like they conflict somewhat with what we changed in our repo.

It is really chaotic for AMD developers if someone outside of AMD changes our kernel driver (and/or the scheduler) without review by an AMD developer, just as we require your review when we intend to change the scheduler's logic here.

This one changes AMD's code: https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/
and I didn't see any Reviewed-by from AMD developers on it.

This one also touches AMD's code: https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vetter@ffwll.ch/
It conflicts with one patch we submitted (in our repo right now), and I don't see that an AMD developer gave a Reviewed-by on this one either (let me know if I missed it).

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Daniel Vetter
Sent: Wednesday, September 1, 2021 4:18 PM
To: Liu, Monk <Monk.Liu@amd.com>
Cc: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; DRI Development <dri-devel@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org
Subject: Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

Hi Monk,

On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk <Monk.Liu@amd.com> wrote:
>
> [AMD Official Use Only]
>
>
> Hi Daniel/Christian/Andrey
>
>
>
> It looks the voice from you three are spread over those email floods to me, the feature we are working on (diagnostic TDR scheme) is pending there for more than 6 month (we started it from feb 2021).

For me your project exists since a few weeks at most, because that is when your team showed up on dri-devel. That you already spent 6 months on this within amd, on a code area that very much affects shared code, without kicking of any thread on dri-devel isn't great, but also not something we can fix, since time machines don't exist.

So we have to make the best out of the situation and move ahead where we are. From my understanding you've done a bunch of changes to the scheduler code. As far as I can see there's been two related things your team has done:

- remove some allocations from scheduler code, because that can lead to deadlocks. I've kicked up this topic quite a while ago here

https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vetter@ffwll.ch/

This is just one patch of the entire series. This is an area where we really need a consistent solution across all drm/sched drivers, not something that individual drivers just fix in their own way.

- the other one is the timeout issue for the patches you cite here.
Again there's been discussion on this on dri-devel with Boris from panfrost about how we can handle at least some of the races in tdr.
That resulted in lots of discussions and documentation improvements.
Those patches are merged now, link
https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/

There's been more than just this, also quite some doc patches from Boris that explain how it's all supposed to work and be race-free.
Again your driver isn't the only one with interesting TDR races.

Your team hasn't been active in any of these discussions, but now suddenly pops up out of nowhere and demands that your approach needs to land asap. That's really not how upstream works.

The other thing where I'm struggling is that there's a lot of missing context for outsiders. The patches sometimes come with zero commit message, for tricky concurrency bugs. And there's no context with what you've done already on the amdgpu side (since that never showed up on dri-devel), which makes constructive discussions here really hard.

Now fixing these bugs is obviously good, but the way this is supposed to work when touching shared infrastructure is:

- Before you start merging anything kick off an RFC thread on dri-devel (or whatever the topic really is about) about the problem you have and how your trying to solve it. This can be just text if it's a big thing, but it can also already include some proof of concept solution in the form of patches.

- Then we iterate on the solution, across drivers and shared code _together_. Not "merge amdgpu code first, then get annoyed when the core changes don't land immediately after you've practially finished the project".

- This might mean changes to other drivers if we need to adjust interfaces.

On the plus side you can plan much better, because you know you have upstream buy-in before you start to put in real work on the project.

> Honestly speaking the email ways that we are using now is not friendly and quite painful to me ....

Yes this is painful :-(

I think the best way forward is to go through the above process again and essentially restart. So submit a complete patch series with problem descriptions, solution you picked, why you picked that, all the amdgpu patches to get there and the core patches too. Since it sounds like a bunch of this has all landed already you probably need a patch 1 that goes back to 6 months ago so that we can see the overall direction, and review whether that's the right one or not.

The not-so-painful approach would have been to do this from the start,
6 months ago. It would definitely have helped if the tdr discussion we've had just a few months ago would have involved your team too, I'm sure there would have been some good insights from amd's side. I'd really want you and your engineers involved here, so let's do this properly!

Cheers, Daniel

> Can we try to put all our opinions, suggestions, or even objects here together, let's go through them one by one, it's too hard for us to reply each email on different questions .
>
>
>
> For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
>
>
>
> This is a fixing patch on the timeout timer in scheduler, can we complete this one first ? it should already resolved all the questions and suggestions.
>
>
>
> For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
>
>
>
> I think I already explained the questions raised by Daniel in other 
> thread , regarding why I use __kthread_should_park()
>
> For other aspects, can we put all our opinion synthesized here ?
>
>
>
> Thanks !
>
>
>
> ------------------------------------------
>
> Monk Liu | Cloud-GPU Core team
>
> ------------------------------------------
>
>



--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
  2021-09-01  8:18 ` Daniel Vetter
  2021-09-01 10:19   ` Liu, Monk
@ 2021-09-01 10:27   ` Liu, Monk
  1 sibling, 0 replies; 19+ messages in thread
From: Liu, Monk @ 2021-09-01 10:27 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Koenig, Christian, Grodzovsky, Andrey, Chen, JingWen,
	DRI Development, amd-gfx

[AMD Official Use Only]

>> For me your project exists since a few weeks at most, because that is when your team showed up on dri-devel. That you already spent 6 months on this within amd, on a code area that very much affects shared code, without kicking of any thread on dri-devel isn't great, but also not something we can fix, since time machines don't exist.
This is partially true: in the past months our changes only resided in the AMD driver, and it is only now that we have found we need to make changes at the SCHED level.

>> Your team hasn't been active in any of these discussions, but now suddenly pops up out of nowhere and demands that your approach needs to land asap. That's really not how upstream works.
If our changes to the DRI-level part cannot get merged soon, that's fine, we can discuss more; but this did not suddenly pop up from nowhere, we have already worked on it for months inside the AMD driver.

>> I think the best way forward is to go through the above process again and essentially restart. So submit a complete patch series with problem descriptions, solution you picked, why you picked that, all the amdgpu patches to get there and the core patches too. Since it sounds like a bunch of this has all landed already you probably need a patch 1 that goes back to 6 months ago so that we can see the overall direction, and review whether that's the right one or not.

We are not objecting to this approach; we didn't do that before because previously everything we needed to do resided inside the AMD driver ... we tried to avoid changing the DRI/DRM interface part ...

For the patches you showed us in the links, I'm sorry that due to some IT infrastructure reasons my team and I didn't see them before (we mostly work in the AMD repo ... the patches you showed haven't been merged into our repo yet ...)
One thing I also want to emphasize here: if any code needs to change inside the AMD driver, please always let us know and let us review.

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Daniel Vetter
Sent: Wednesday, September 1, 2021 4:18 PM
To: Liu, Monk <Monk.Liu@amd.com>
Cc: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; DRI Development <dri-devel@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org
Subject: Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

Hi Monk,

On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk <Monk.Liu@amd.com> wrote:
>
> [AMD Official Use Only]
>
>
> Hi Daniel/Christian/Andrey
>
>
>
> It looks the voice from you three are spread over those email floods to me, the feature we are working on (diagnostic TDR scheme) is pending there for more than 6 month (we started it from feb 2021).

For me your project exists since a few weeks at most, because that is when your team showed up on dri-devel. That you already spent 6 months on this within amd, on a code area that very much affects shared code, without kicking of any thread on dri-devel isn't great, but also not something we can fix, since time machines don't exist.

So we have to make the best out of the situation and move ahead where we are. From my understanding you've done a bunch of changes to the scheduler code. As far as I can see there's been two related things your team has done:

- remove some allocations from scheduler code, because that can lead to deadlocks. I've kicked up this topic quite a while ago here

https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vetter@ffwll.ch/

This is just one patch of the entire series. This is an area where we really need a consistent solution across all drm/sched drivers, not something that individual drivers just fix in their own way.
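As a purely illustrative sketch of why those allocations are a problem (hypothetical code, not taken from that series): the timeout/reset path is effectively part of the dma-fence signalling path, and memory reclaim can end up waiting on fences, so a sleeping allocation in that path is a deadlock risk; the usual direction is to allocate up front, outside that path.

#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct sketch_reset_ctx {
        int cause;
};

struct sketch_gpu_job {
        struct work_struct reset_work;
        struct sketch_reset_ctx *reset_ctx;     /* preallocated at job creation */
};

/* risky pattern: allocate inside the reset/timeout path */
static void sketch_reset_risky(struct work_struct *w)
{
        struct sketch_reset_ctx *ctx;

        ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);        /* may recurse into reclaim */
        if (!ctx)
                return;
        ctx->cause = 1;
        /* ... perform the reset ... */
        kfree(ctx);
}

/* safer pattern: allocate when the job is created, long before any timeout */
static int sketch_job_init(struct sketch_gpu_job *job)
{
        job->reset_ctx = kmalloc(sizeof(*job->reset_ctx), GFP_KERNEL);
        return job->reset_ctx ? 0 : -ENOMEM;
}

static void sketch_reset_safe(struct work_struct *w)
{
        struct sketch_gpu_job *job = container_of(w, struct sketch_gpu_job, reset_work);

        job->reset_ctx->cause = 1;      /* nothing to allocate here */
        /* ... perform the reset ... */
}

Whatever the actual series does, the sketch only shows why the allocation has to move out of the timeout/reset path.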

- the other one is the timeout issue for the patches you cite here.
Again there's been discussion on this on dri-devel with Boris from panfrost about how we can handle at least some of the races in tdr.
That resulted in lots of discussions and documentation improvements.
Those patches are merged now, link
https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/

There's been more than just this, also quite some doc patches from Boris that explain how it's all supposed to work and be race-free.
Again your driver isn't the only one with interesting TDR races.

Your team hasn't been active in any of these discussions, but now suddenly pops up out of nowhere and demands that your approach needs to land asap. That's really not how upstream works.

The other thing where I'm struggling is that there's a lot of missing context for outsiders. The patches sometimes come with zero commit message, for tricky concurrency bugs. And there's no context with what you've done already on the amdgpu side (since that never showed up on dri-devel), which makes constructive discussions here really hard.

Now fixing these bugs is obviously good, but the way this is supposed to work when touching shared infrastructure is:

- Before you start merging anything kick off an RFC thread on dri-devel (or whatever the topic really is about) about the problem you have and how your trying to solve it. This can be just text if it's a big thing, but it can also already include some proof of concept solution in the form of patches.

- Then we iterate on the solution, across drivers and shared code _together_. Not "merge amdgpu code first, then get annoyed when the core changes don't land immediately after you've practially finished the project".

- This might mean changes to other drivers if we need to adjust interfaces.

On the plus side you can plan much better, because you know you have upstream buy-in before you start to put in real work on the project.

> Honestly speaking the email ways that we are using now is not friendly and quite painful to me ....

Yes this is painful :-(

I think the best way forward is to go through the above process again and essentially restart. So submit a complete patch series with problem descriptions, solution you picked, why you picked that, all the amdgpu patches to get there and the core patches too. Since it sounds like a bunch of this has all landed already you probably need a patch 1 that goes back to 6 months ago so that we can see the overall direction, and review whether that's the right one or not.

The not-so-painful approach would have been to do this from the start,
6 months ago. It would definitely have helped if the tdr discussion we've had just a few months ago would have involved your team too, I'm sure there would have been some good insights from amd's side. I'd really want you and your engineers involved here, so let's do this properly!

Cheers, Daniel

> Can we try to put all our opinions, suggestions, or even objects here together, let's go through them one by one, it's too hard for us to reply each email on different questions .
>
>
>
> For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
>
>
>
> This is a fixing patch on the timeout timer in scheduler, can we complete this one first ? it should already resolved all the questions and suggestions.
>
>
>
> For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
>
>
>
> I think I already explained the questions raised by Daniel in other 
> thread , regarding why I use __kthread_should_park()
>
> For other aspects, can we put all our opinion synthesized here ?
>
>
>
> Thanks !
>
>
>
> ------------------------------------------
>
> Monk Liu | Cloud-GPU Core team
>
> ------------------------------------------
>
>



--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
  2021-09-01 10:19   ` Liu, Monk
@ 2021-09-01 15:19     ` Alex Deucher
  2021-09-01 18:50       ` Dave Airlie
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Deucher @ 2021-09-01 15:19 UTC (permalink / raw)
  To: Liu, Monk
  Cc: Daniel Vetter, Koenig, Christian, Grodzovsky, Andrey, Chen,
	JingWen, DRI Development, amd-gfx

On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk <Monk.Liu@amd.com> wrote:
>
> [AMD Official Use Only]
>
> Daniel
>
> From the link you share it looks you(or someone else) have quite a bunch patches that changes DRM_SCHED or even amdgpu, by that case before they are merged to kernel tree I'm wondering if any AMD develop reviewed them ?
>
> They looks to me somehow conflicting with what we changed in our repo....
>
> It is really a chaos for AMDer if someone else out side of AMD changes our kernel driver (or/and scheduler) without reviewed by AMDer, just like we are requiring your review if we tend to change scheduler's logic here ....
>
> This one changes AMD's code: https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/
> And I didn't see any reviewed-by from AMDers ...
>
> This one also touches AMD's code: https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vetter@ffwll.ch/
> Which is conflicting with one patch we submitted (in our repo rightnow), and neither see AMDder gave a review-by on this one (let me know if I missed it)
>

Monk, this is not how upstream works.  You need to participate.
That's how communities work.  There's a reason all these discussions
happen on public mailing lists.  The patch author can't be expected to
know every person on every vendor team to CC with a patch.  If you
have concerns, you need to raise them when the patches are being
discussed.

Alex


> Thanks
>
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Daniel Vetter
> Sent: Wednesday, September 1, 2021 4:18 PM
> To: Liu, Monk <Monk.Liu@amd.com>
> Cc: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; DRI Development <dri-devel@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org
> Subject: Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
>
> Hi Monk,
>
> On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk <Monk.Liu@amd.com> wrote:
> >
> > [AMD Official Use Only]
> >
> >
> > Hi Daniel/Christian/Andrey
> >
> >
> >
> > It looks the voice from you three are spread over those email floods to me, the feature we are working on (diagnostic TDR scheme) is pending there for more than 6 month (we started it from feb 2021).
>
> For me your project exists since a few weeks at most, because that is when your team showed up on dri-devel. That you already spent 6 months on this within amd, on a code area that very much affects shared code, without kicking of any thread on dri-devel isn't great, but also not something we can fix, since time machines don't exist.
>
> So we have to make the best out of the situation and move ahead where we are. From my understanding you've done a bunch of changes to the scheduler code. As far as I can see there's been two related things your team has done:
>
> - remove some allocations from scheduler code, because that can lead to deadlocks. I've kicked up this topic quite a while ago here
>
> https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vetter@ffwll.ch/
>
> This is just one patch of the entire series. This is an area where we really need a consistent solution across all drm/sched drivers, not something that individual drivers just fix in their own way.
>
> - the other one is the timeout issue for the patches you cite here.
> Again there's been discussion on this on dri-devel with Boris from panfrost about how we can handle at least some of the races in tdr.
> That resulted in lots of discussions and documentation improvements.
> Those patches are merged now, link
> https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/
>
> There's been more than just this, also quite some doc patches from Boris that explain how it's all supposed to work and be race-free.
> Again your driver isn't the only one with interesting TDR races.
>
> Your team hasn't been active in any of these discussions, but now suddenly pops up out of nowhere and demands that your approach needs to land asap. That's really not how upstream works.
>
> The other thing where I'm struggling is that there's a lot of missing context for outsiders. The patches sometimes come with zero commit message, for tricky concurrency bugs. And there's no context with what you've done already on the amdgpu side (since that never showed up on dri-devel), which makes constructive discussions here really hard.
>
> Now fixing these bugs is obviously good, but the way this is supposed to work when touching shared infrastructure is:
>
> - Before you start merging anything kick off an RFC thread on dri-devel (or whatever the topic really is about) about the problem you have and how your trying to solve it. This can be just text if it's a big thing, but it can also already include some proof of concept solution in the form of patches.
>
> - Then we iterate on the solution, across drivers and shared code _together_. Not "merge amdgpu code first, then get annoyed when the core changes don't land immediately after you've practially finished the project".
>
> - This might mean changes to other drivers if we need to adjust interfaces.
>
> On the plus side you can plan much better, because you know you have upstream buy-in before you start to put in real work on the project.
>
> > Honestly speaking the email ways that we are using now is not friendly and quite painful to me ....
>
> Yes this is painful :-(
>
> I think the best way forward is to go through the above process again and essentially restart. So submit a complete patch series with problem descriptions, solution you picked, why you picked that, all the amdgpu patches to get there and the core patches too. Since it sounds like a bunch of this has all landed already you probably need a patch 1 that goes back to 6 months ago so that we can see the overall direction, and review whether that's the right one or not.
>
> The not-so-painful approach would have been to do this from the start,
> 6 months ago. It would definitely have helped if the tdr discussion we've had just a few months ago would have involved your team too, I'm sure there would have been some good insights from amd's side. I'd really want you and your engineers involved here, so let's do this properly!
>
> Cheers, Daniel
>
> > Can we try to put all our opinions, suggestions, or even objects here together, let's go through them one by one, it's too hard for us to reply each email on different questions .
> >
> >
> >
> > For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
> >
> >
> >
> > This is a fixing patch on the timeout timer in scheduler, can we complete this one first ? it should already resolved all the questions and suggestions.
> >
> >
> >
> > For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
> >
> >
> >
> > I think I already explained the questions raised by Daniel in other
> > thread , regarding why I use __kthread_should_park()
> >
> > For other aspects, can we put all our opinion synthesized here ?
> >
> >
> >
> > Thanks !
> >
> >
> >
> > ------------------------------------------
> >
> > Monk Liu | Cloud-GPU Core team
> >
> > ------------------------------------------
> >
> >
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
  2021-09-01 15:19     ` Alex Deucher
@ 2021-09-01 18:50       ` Dave Airlie
  2021-09-02  5:52         ` Liu, Monk
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Airlie @ 2021-09-01 18:50 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Liu, Monk, Daniel Vetter, Koenig, Christian, Grodzovsky, Andrey,
	Chen, JingWen, DRI Development, amd-gfx

On Thu, 2 Sept 2021 at 01:20, Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk <Monk.Liu@amd.com> wrote:
> >
> > [AMD Official Use Only]
> >
> > Daniel
> >
> > From the link you share it looks you(or someone else) have quite a bunch patches that changes DRM_SCHED or even amdgpu, by that case before they are merged to kernel tree I'm wondering if any AMD develop reviewed them ?
> >
> > They looks to me somehow conflicting with what we changed in our repo....
> >
> > It is really a chaos for AMDer if someone else out side of AMD changes our kernel driver (or/and scheduler) without reviewed by AMDer, just like we are requiring your review if we tend to change scheduler's logic here ....
> >
> > This one changes AMD's code: https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/
> > And I didn't see any reviewed-by from AMDers ...
> >
> > This one also touches AMD's code: https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vetter@ffwll.ch/
> > Which is conflicting with one patch we submitted (in our repo rightnow), and neither see AMDder gave a review-by on this one (let me know if I missed it)
> >
>
> Monk, this is not how upstream works.  You need to participate.
> That's how communities work.  There's a reason all these discussions
> happen on public mailing lists.  The patch author can't be expected to
> know every person on every vendor team to CC with a patch.  If you
> have concerns, you need to raise them when the patches are being
> discussed.
>

I'm not sure I can add much to help this along, I'm sure Alex has some
internal training,

Once your driver is upstream, it belongs to upstream, you can maintain
it, but you no longer control it 100%, it's a tradeoff, it's not one
companies always understand.

Usually people are fine developing away internally, but once
interaction with other parts of the kernel/subsystem is required they
have the realisation that they needed to work upstream 6 months
earlier.

The best time to interact with upstream was 6 months ago, the second
best time is now.

Dave.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
  2021-09-01 18:50       ` Dave Airlie
@ 2021-09-02  5:52         ` Liu, Monk
  2021-09-02 11:00           ` Christian König
                             ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Liu, Monk @ 2021-09-02  5:52 UTC (permalink / raw)
  To: Dave Airlie, Alex Deucher
  Cc: Daniel Vetter, Koenig, Christian, Grodzovsky, Andrey, Chen,
	JingWen, DRI Development, amd-gfx

[AMD Official Use Only]

>>>
I'm not sure I can add much to help this along, I'm sure Alex has some internal training,
Once your driver is upstream, it belongs to upstream, you can maintain it, but you no longer control it 100%, it's a tradeoff, it's not one companies always understand.
Usually people are fine developing away internally, but once interaction with other parts of the kernel/subsystem is required they have the realisation that they needed to work upstream 6 months earlier.
The best time to interact with upstream was 6 months ago, the second best time is now.
<<<

Daniel/AlexD

I didn't mean that your changes to the AMD driver need my personal approval or review ... and I have already gotten used to our driver not being 100% under AMDers' control,
but supposedly anyone from the community (including you) who intends to change AMD's driver needs at least to get approval from someone at AMD, e.g. AlexD or Christian; isn't that reasonable?
just like we need your approval if we try to modify DRM-sched, or need panfrost's approval if we need to change panfrost code ...

Only CCing AMD's engineers doesn't look quite proper; how do you know whether your changes (to the AMD code) conflict with AMD's ongoing internal features/refactoring or not?

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Dave Airlie <airlied@gmail.com> 
Sent: Thursday, September 2, 2021 2:51 AM
To: Alex Deucher <alexdeucher@gmail.com>
Cc: Liu, Monk <Monk.Liu@amd.com>; Daniel Vetter <daniel@ffwll.ch>; Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; DRI Development <dri-devel@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org
Subject: Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

On Thu, 2 Sept 2021 at 01:20, Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk <Monk.Liu@amd.com> wrote:
> >
> > [AMD Official Use Only]
> >
> > Daniel
> >
> > From the link you share it looks you(or someone else) have quite a bunch patches that changes DRM_SCHED or even amdgpu, by that case before they are merged to kernel tree I'm wondering if any AMD develop reviewed them ?
> >
> > They looks to me somehow conflicting with what we changed in our repo....
> >
> > It is really a chaos for AMDer if someone else out side of AMD changes our kernel driver (or/and scheduler) without reviewed by AMDer, just like we are requiring your review if we tend to change scheduler's logic here ....
> >
> > This one changes AMD's code: 
> > > https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/
> > And I didn't see any reviewed-by from AMDers ...
> >
> > This one also touches AMD's code: 
> > > https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vetter@ffwll.ch/
> > Which is conflicting with one patch we submitted (in our repo 
> > rightnow), and neither see AMDder gave a review-by on this one (let 
> > me know if I missed it)
> >
>
> Monk, this is not how upstream works.  You need to participate.
> That's how communities work.  There's a reason all these discussions 
> happen on public mailing lists.  The patch author can't be expected to 
> know every person on every vendor team to CC with a patch.  If you 
> have concerns, you need to raise them when the patches are being 
> discussed.
>

I'm not sure I can add much to help this along, I'm sure Alex has some internal training,

Once your driver is upstream, it belongs to upstream, you can maintain it, but you no longer control it 100%, it's a tradeoff, it's not one companies always understand.

Usually people are fine developing away internally, but once interaction with other parts of the kernel/subsystem is required they have the realisation that they needed to work upstream 6 months earlier.

The best time to interact with upstream was 6 months ago, the second best time is now.

Dave.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
  2021-09-02  5:52         ` Liu, Monk
@ 2021-09-02 11:00           ` Christian König
  2021-09-02 16:11             ` Daniel Vetter
  2021-09-02 13:31           ` Alex Deucher
  2021-09-02 16:57           ` Daniel Stone
  2 siblings, 1 reply; 19+ messages in thread
From: Christian König @ 2021-09-02 11:00 UTC (permalink / raw)
  To: Liu, Monk, Dave Airlie, Alex Deucher
  Cc: Daniel Vetter, Grodzovsky, Andrey, Chen, JingWen,
	DRI Development, amd-gfx

Hi Monk,

Am 02.09.21 um 07:52 schrieb Liu, Monk:
> [AMD Official Use Only]
>
> I'm not sure I can add much to help this along, I'm sure Alex has some internal training,
> Once your driver is upstream, it belongs to upstream, you can maintain it, but you no longer control it 100%, it's a tradeoff, it's not one companies always understand.
> Usually people are fine developing away internally, but once interaction with other parts of the kernel/subsystem is required they have the realisation that they needed to work upstream 6 months earlier.
> The best time to interact with upstream was 6 months ago, the second best time is now.
> <<<
>
> Daniel/AlexD
>
> I didn't mean your changes on AMD driver need my personal approval or review ... and  I'm totally already get used that our driver is not 100% under control by AMDers,
> but supposedly any one from community (including you) who tend to change AMD's driver need at least to get approvement from someone in AMD, e.g.: AlexD or Christian, doesn't that reasonable?

I fear that I'm just repeating what Alex said, but to make it clear 
once more: that is *not* necessary!

The shared repository is owned by the upstream maintainers, and they are 
usually free to do restructuring work without getting an acknowledgement 
from every single driver maintainer.

Anybody can of course technically object to upstream design decisions, 
but that means that you need to pay attention to the mailing lists in 
the first place.

> just like we need your approve if we try to modify DRM-sched, or need panfrost's approval if we need to change panfrost code ...
>
> by only CC AMD's engineers looks not quite properly, how do you know if your changes (on AMD code part) are conflicting with AMD's on-going internal features/refactoring or not ?

Well, because AMD is supposed to work in public as much as possible and 
ask upstream before making changes to the code base.

In addition to that, design decisions are supposed to be discussed on the 
mailing list and *not* internally.

Regards,
Christian.

>
> Thanks
>
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
>
> -----Original Message-----
> From: Dave Airlie <airlied@gmail.com>
> Sent: Thursday, September 2, 2021 2:51 AM
> To: Alex Deucher <alexdeucher@gmail.com>
> Cc: Liu, Monk <Monk.Liu@amd.com>; Daniel Vetter <daniel@ffwll.ch>; Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; DRI Development <dri-devel@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org
> Subject: Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
>
> On Thu, 2 Sept 2021 at 01:20, Alex Deucher <alexdeucher@gmail.com> wrote:
>> On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk <Monk.Liu@amd.com> wrote:
>>> [AMD Official Use Only]
>>>
>>> Daniel
>>>
>>>  From the link you share it looks you(or someone else) have quite a bunch patches that changes DRM_SCHED or even amdgpu, by that case before they are merged to kernel tree I'm wondering if any AMD develop reviewed them ?
>>>
>>> They looks to me somehow conflicting with what we changed in our repo....
>>>
>>> It is really a chaos for AMDer if someone else out side of AMD changes our kernel driver (or/and scheduler) without reviewed by AMDer, just like we are requiring your review if we tend to change scheduler's logic here ....
>>>
>>> This one changes AMD's code:
>>> https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/
>>> And I didn't see any reviewed-by from AMDers ...
>>>
>>> This one also touches AMD's code:
>>> https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vetter@ffwll.ch/
>>> Which is conflicting with one patch we submitted (in our repo
>>> rightnow), and neither see AMDder gave a review-by on this one (let
>>> me know if I missed it)
>>>
>> Monk, this is not how upstream works.  You need to participate.
>> That's how communities work.  There's a reason all these discussions
>> happen on public mailing lists.  The patch author can't be expected to
>> know every person on every vendor team to CC with a patch.  If you
>> have concerns, you need to raise them when the patches are being
>> discussed.
>>
> I'm not sure I can add much to help this along, I'm sure Alex has some internal training,
>
> Once your driver is upstream, it belongs to upstream, you can maintain it, but you no longer control it 100%, it's a tradeoff, it's not one companies always understand.
>
> Usually people are fine developing away internally, but once interaction with other parts of the kernel/subsystem is required they have the realisation that they needed to work upstream 6 months earlier.
>
> The best time to interact with upstream was 6 months ago, the second best time is now.
>
> Dave.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
  2021-09-02  5:52         ` Liu, Monk
  2021-09-02 11:00           ` Christian König
@ 2021-09-02 13:31           ` Alex Deucher
  2021-09-02 16:57           ` Daniel Stone
  2 siblings, 0 replies; 19+ messages in thread
From: Alex Deucher @ 2021-09-02 13:31 UTC (permalink / raw)
  To: Liu, Monk
  Cc: Dave Airlie, Daniel Vetter, Koenig, Christian, Grodzovsky,
	Andrey, Chen, JingWen, DRI Development, amd-gfx

On Thu, Sep 2, 2021 at 1:52 AM Liu, Monk <Monk.Liu@amd.com> wrote:
>
> [AMD Official Use Only]
>
> >>>
> I'm not sure I can add much to help this along, I'm sure Alex has some internal training,
> Once your driver is upstream, it belongs to upstream, you can maintain it, but you no longer control it 100%, it's a tradeoff, it's not one companies always understand.
> Usually people are fine developing away internally, but once interaction with other parts of the kernel/subsystem is required they have the realisation that they needed to work upstream 6 months earlier.
> The best time to interact with upstream was 6 months ago, the second best time is now.
> <<<
>
> Daniel/AlexD
>
> I didn't mean your changes on AMD driver need my personal approval or review ... and  I'm totally already get used that our driver is not 100% under control by AMDers,
> but supposedly any one from community (including you) who tend to change AMD's driver need at least to get approvement from someone in AMD, e.g.: AlexD or Christian, doesn't that reasonable?
> just like we need your approve if we try to modify DRM-sched, or need panfrost's approval if we need to change panfrost code ...
>
> by only CC AMD's engineers looks not quite properly, how do you know if your changes (on AMD code part) are conflicting with AMD's on-going internal features/refactoring or not ?
>

We keep as up to date as possible with upstream.  I don't have the
bandwidth to verify every patch, but in most cases I try and look at
them.  In your first example, the patch basically just adds a new
parameter to some common functions.  Drivers that don't need that
parameter don't use it.  It shouldn't really affect the functionality.
There are lots of changes that touch our driver that we are largely
not aware of.  E.g., APIs that we may use may change the function
signatures with no intended functional changes.  If problems are found
they are reported and resolved.  It is a collective effort.  If there
are changes that would conflict with stuff we are doing in our tree we
should bring them up when the relevant patches are being discussed.
We can also make changes to core functionality like scheduler, ttm,
etc. that would affect other drivers.  When we send out the patches we
cc the relevant maintainers, but ultimately the ones who participate
in the discussion set the direction.  That's why participation is
important.
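
As a trivial, entirely hypothetical illustration of that kind of change: a shared helper grows a parameter, and callers that don't need it pass a neutral value, so nothing changes functionally for them.

struct shared_ctx {
        int id;
};

#define SHARED_HELPER_VERBOSE   0x1UL   /* hypothetical new flag for one driver */

/* before: int shared_helper(struct shared_ctx *ctx); */
int shared_helper(struct shared_ctx *ctx, unsigned long flags)
{
        if (flags & SHARED_HELPER_VERBOSE)
                return ctx->id * 2;     /* stand-in for the new behaviour */
        return ctx->id;                 /* old behaviour, unchanged */
}

/* an unaffected driver simply passes 0 and behaves exactly as before */
static int some_driver_path(struct shared_ctx *ctx)
{
        return shared_helper(ctx, 0);
}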

Alex


> Thanks
>
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
>
> -----Original Message-----
> From: Dave Airlie <airlied@gmail.com>
> Sent: Thursday, September 2, 2021 2:51 AM
> To: Alex Deucher <alexdeucher@gmail.com>
> Cc: Liu, Monk <Monk.Liu@amd.com>; Daniel Vetter <daniel@ffwll.ch>; Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; DRI Development <dri-devel@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org
> Subject: Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
>
> On Thu, 2 Sept 2021 at 01:20, Alex Deucher <alexdeucher@gmail.com> wrote:
> >
> > On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk <Monk.Liu@amd.com> wrote:
> > >
> > > [AMD Official Use Only]
> > >
> > > Daniel
> > >
> > > From the link you share it looks you(or someone else) have quite a bunch patches that changes DRM_SCHED or even amdgpu, by that case before they are merged to kernel tree I'm wondering if any AMD develop reviewed them ?
> > >
> > > They looks to me somehow conflicting with what we changed in our repo....
> > >
> > > It is really a chaos for AMDer if someone else out side of AMD changes our kernel driver (or/and scheduler) without reviewed by AMDer, just like we are requiring your review if we tend to change scheduler's logic here ....
> > >
> > > This one changes AMD's code:
> > > https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/
> > > And I didn't see any reviewed-by from AMDers ...
> > >
> > > This one also touches AMD's code:
> > > https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vetter@ffwll.ch/
> > > Which is conflicting with one patch we submitted (in our repo
> > > rightnow), and neither see AMDder gave a review-by on this one (let
> > > me know if I missed it)
> > >
> >
> > Monk, this is not how upstream works.  You need to participate.
> > That's how communities work.  There's a reason all these discussions
> > happen on public mailing lists.  The patch author can't be expected to
> > know every person on every vendor team to CC with a patch.  If you
> > have concerns, you need to raise them when the patches are being
> > discussed.
> >
>
> I'm not sure I can add much to help this along, I'm sure Alex has some internal training,
>
> Once your driver is upstream, it belongs to upstream, you can maintain it, but you no longer control it 100%, it's a tradeoff, it's not one companies always understand.
>
> Usually people are fine developing away internally, but once interaction with other parts of the kernel/subsystem is required they have the realisation that they needed to work upstream 6 months earlier.
>
> The best time to interact with upstream was 6 months ago, the second best time is now.
>
> Dave.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
  2021-09-02 11:00           ` Christian König
@ 2021-09-02 16:11             ` Daniel Vetter
  2021-09-06  6:36               ` Liu, Monk
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Vetter @ 2021-09-02 16:11 UTC (permalink / raw)
  To: Christian König
  Cc: Liu, Monk, Dave Airlie, Alex Deucher, Grodzovsky, Andrey, Chen,
	JingWen, DRI Development, amd-gfx

On Thu, Sep 2, 2021 at 1:00 PM Christian König <christian.koenig@amd.com> wrote:
>
> Hi Monk,
>
> Am 02.09.21 um 07:52 schrieb Liu, Monk:
> > [AMD Official Use Only]
> >
> > I'm not sure I can add much to help this along, I'm sure Alex has some internal training,
> > Once your driver is upstream, it belongs to upstream, you can maintain it, but you no longer control it 100%, it's a tradeoff, it's not one companies always understand.
> > Usually people are fine developing away internally, but once interaction with other parts of the kernel/subsystem is required they have the realisation that they needed to work upstream 6 months earlier.
> > The best time to interact with upstream was 6 months ago, the second best time is now.
> > <<<
> >
> > Daniel/AlexD
> >
> > I didn't mean your changes on AMD driver need my personal approval or review ... and  I'm totally already get used that our driver is not 100% under control by AMDers,
> > but supposedly any one from community (including you) who tend to change AMD's driver need at least to get approvement from someone in AMD, e.g.: AlexD or Christian, doesn't that reasonable?
>
> I'm fearing that just repeating what Alex said, but to make it clear
> once more: That is *not* necessary!
>
> The shared repository is owned by upstream maintainers and they are
> usually free to do restructuring work without getting acknowledge from
> every single driver maintainer.
>
> Anybody can of course technically object to upstream design decisions,
> but that means that you need to pay attention to the mailing lists in
> the first place.
>
> > just like we need your approve if we try to modify DRM-sched, or need panfrost's approval if we need to change panfrost code ...
> >
> > by only CC AMD's engineers looks not quite properly, how do you know if your changes (on AMD code part) are conflicting with AMD's on-going internal features/refactoring or not ?
>
> Well because AMD is supposed to work in public as much as possible and
> ask upstream before doing changes to the code base.
>
> Additional to that design decisions are supposed to be discussed on the
> mailing list and *not* internally.

Yeah I'm honestly really surprised about the course of this discussion
here. With Alex, Christian and others amd has a lot of folks with
years/decades of experience in how to collaborate in upstream, when to
pull in others proactively and when that's not needed, and in general
how to plan upstream work with the least amount of risk and surprises.

I think step zero here needs to be some training at amd and then
re-planning this effort, before we get back to technical stuff.
Otherwise we'll just get bogged down in pain because expectations
about the process don't pan out.
-Daniel

>
> Regards,
> Christian.
>
> >
> > Thanks
> >
> > ------------------------------------------
> > Monk Liu | Cloud-GPU Core team
> > ------------------------------------------
> >
> > -----Original Message-----
> > From: Dave Airlie <airlied@gmail.com>
> > Sent: Thursday, September 2, 2021 2:51 AM
> > To: Alex Deucher <alexdeucher@gmail.com>
> > Cc: Liu, Monk <Monk.Liu@amd.com>; Daniel Vetter <daniel@ffwll.ch>; Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; DRI Development <dri-devel@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org
> > Subject: Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
> >
> > On Thu, 2 Sept 2021 at 01:20, Alex Deucher <alexdeucher@gmail.com> wrote:
> >> On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk <Monk.Liu@amd.com> wrote:
> >>> [AMD Official Use Only]
> >>>
> >>> Daniel
> >>>
> >>>  From the link you share it looks you(or someone else) have quite a bunch patches that changes DRM_SCHED or even amdgpu, by that case before they are merged to kernel tree I'm wondering if any AMD develop reviewed them ?
> >>>
> >>> They looks to me somehow conflicting with what we changed in our repo....
> >>>
> >>> It is really a chaos for AMDer if someone else out side of AMD changes our kernel driver (or/and scheduler) without reviewed by AMDer, just like we are requiring your review if we tend to change scheduler's logic here ....
> >>>
> >>> This one changes AMD's code:
> >>> https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/
> >>> And I didn't see any reviewed-by from AMDers ...
> >>>
> >>> This one also touches AMD's code:
> >>> https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vetter@ffwll.ch/
> >>> Which is conflicting with one patch we submitted (in our repo
> >>> rightnow), and neither see AMDder gave a review-by on this one (let
> >>> me know if I missed it)
> >>>
> >> Monk, this is not how upstream works.  You need to participate.
> >> That's how communities work.  There's a reason all these discussions
> >> happen on public mailing lists.  The patch author can't be expected to
> >> know every person on every vendor team to CC with a patch.  If you
> >> have concerns, you need to raise them when the patches are being
> >> discussed.
> >>
> > I'm not sure I can add much to help this along, I'm sure Alex has some internal training,
> >
> > Once your driver is upstream, it belongs to upstream, you can maintain it, but you no longer control it 100%, it's a tradeoff, it's not one companies always understand.
> >
> > Usually people are fine developing away internally, but once interaction with other parts of the kernel/subsystem is required they have the realisation that they needed to work upstream 6 months earlier.
> >
> > The best time to interact with upstream was 6 months ago, the second best time is now.
> >
> > Dave.
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
  2021-09-02  5:52         ` Liu, Monk
  2021-09-02 11:00           ` Christian König
  2021-09-02 13:31           ` Alex Deucher
@ 2021-09-02 16:57           ` Daniel Stone
  2 siblings, 0 replies; 19+ messages in thread
From: Daniel Stone @ 2021-09-02 16:57 UTC (permalink / raw)
  To: Liu, Monk
  Cc: Dave Airlie, Alex Deucher, Daniel Vetter, Koenig, Christian,
	Grodzovsky, Andrey, Chen, JingWen, DRI Development, amd-gfx

Hi Monk,

On Thu, 2 Sept 2021 at 06:52, Liu, Monk <Monk.Liu@amd.com> wrote:
> I didn't mean your changes on AMD driver need my personal approval or review ... and  I'm totally already get used that our driver is not 100% under control by AMDers,
> but supposedly any one from community (including you) who tend to change AMD's driver need at least to get approvement from someone in AMD, e.g.: AlexD or Christian, doesn't that reasonable?
> just like we need your approve if we try to modify DRM-sched, or need panfrost's approval if we need to change panfrost code ...
>
> by only CC AMD's engineers looks not quite properly, how do you know if your changes (on AMD code part) are conflicting with AMD's on-going internal features/refactoring or not ?

Looking at the patches in question, they were (at least mostly) CCed
both to the amd-gfx@ mailing list and also to ckoenig. Unfortunately
it is not possible for every single patch to get mandatory signoff
from every single stakeholder - e.g. if every AMD patch which touched
the scheduler required explicit approval from Etnaviv, Freedreno,
Lima, Panfrost, and V3D teams, it would become very difficult for AMD
to merge any code.

So the approach is that patches are sent for approval, they are CCed
to people who should be interested, and after some time with no
comments, they may be merged if it seems like a reasonable thing to
do.

The problem with internal work is that, well, it's internal. If the
community sends patches to amd-gfx@, there is no comment from AMD, and
then months later we are told that it should not have happened because
it conflicts with development that AMD has been doing - how should the
rest of the community have known about this? So unfortunately this is
the compromise: if you decide to do private development, not inform
anyone about your plans, and not join in any common discussion, then
it is your responsibility to deal with any changes or conflicts that
happen whilst you are developing privately.

The only way we can successfully have support in the same ecosystem
for AMD, Arm, Broadcom, Intel, NVIDIA, Qualcomm, and VeriSilicon, is
that we are all working together openly. If community development had
to stop because each of these vendors had been doing internal
development for several months without even informing the community of
their plans, any kind of shared development is clearly impossible.

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
  2021-09-02 16:11             ` Daniel Vetter
@ 2021-09-06  6:36               ` Liu, Monk
  2021-09-06 10:35                 ` Jingwen Chen
  0 siblings, 1 reply; 19+ messages in thread
From: Liu, Monk @ 2021-09-06  6:36 UTC (permalink / raw)
  To: Daniel Vetter, Koenig, Christian
  Cc: Dave Airlie, Alex Deucher, Grodzovsky, Andrey, Chen, JingWen,
	DRI Development, amd-gfx

[AMD Official Use Only]

> I'm fearing that just repeating what Alex said, but to make it clear 
> once more: That is *not* necessary!
>
> The shared repository is owned by upstream maintainers and they are 
> usually free to do restructuring work without getting acknowledge from 
> every single driver maintainer.

Hi Daniel

Anyway, thanks for officially confirming the community's working model and policy to me. I don't want to push my opinion here, since it's not my call to change it in any case.
I only want this diagnostic TDR scheme to come to a good end for AMD, or even for all DRM vendors.

How about this: we still have a final patch that has not landed in the DRM scheduler, and I would like JingWen to present it to you and AlexD/Christian/Andrey. I believe you will have concerns or objections regarding this patch, but that's fine; let us figure out together how to make it acceptable to you and to the other vendors working with the DRM scheduler.

P.S.: I have to repeat myself again: we are not popping up a new idea suddenly, it is a disconnect. We didn't have changes (or plans for changes) in the DRM scheduler before, but eventually we found that job_timeout and sched_main must work serialized against each other, otherwise it won't work with the current scheduler's code structure.
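
Roughly, the kind of serialization meant here can be sketched like this (illustrative only, hypothetical names, not the pending patch itself): the timeout handler stops the scheduler's main loop for the duration of the timeout handling, so the two never touch the timed-out job at the same time.

#include <linux/kthread.h>
#include <linux/workqueue.h>

struct sketch_sched {
        struct task_struct *thread;             /* the scheduler main loop */
        struct delayed_work tdr;                /* timeout work */
        void (*handle_timeout)(struct sketch_sched *s);
};

static void sketch_job_timedout(struct work_struct *work)
{
        struct sketch_sched *s = container_of(work, struct sketch_sched, tdr.work);

        /* park the main loop so it cannot free or resubmit jobs concurrently */
        kthread_park(s->thread);

        /* the timed-out job can no longer disappear under the handler */
        s->handle_timeout(s);

        /* resume normal scheduling and cleanup */
        kthread_unpark(s->thread);
}

This assumes the main loop checks kthread_should_park() and parks itself at a safe point; whether parking, a lock, or some other mechanism is the right way to serialize the two is exactly what needs to be agreed on the list.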

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Daniel Vetter <daniel@ffwll.ch> 
Sent: Friday, September 3, 2021 12:11 AM
To: Koenig, Christian <Christian.Koenig@amd.com>
Cc: Liu, Monk <Monk.Liu@amd.com>; Dave Airlie <airlied@gmail.com>; Alex Deucher <alexdeucher@gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; DRI Development <dri-devel@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org
Subject: Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

On Thu, Sep 2, 2021 at 1:00 PM Christian König <christian.koenig@amd.com> wrote:
>
> Hi Monk,
>
> Am 02.09.21 um 07:52 schrieb Liu, Monk:
> > [AMD Official Use Only]
> >
> > I'm not sure I can add much to help this along, I'm sure Alex has 
> > some internal training, Once your driver is upstream, it belongs to upstream, you can maintain it, but you no longer control it 100%, it's a tradeoff, it's not one companies always understand.
> > Usually people are fine developing away internally, but once interaction with other parts of the kernel/subsystem is required they have the realisation that they needed to work upstream 6 months earlier.
> > The best time to interact with upstream was 6 months ago, the second best time is now.
> > <<<
> >
> > Daniel/AlexD
> >
> > I didn't mean your changes on AMD driver need my personal approval 
> > or review ... and  I'm totally already get used that our driver is not 100% under control by AMDers, but supposedly any one from community (including you) who tend to change AMD's driver need at least to get approvement from someone in AMD, e.g.: AlexD or Christian, doesn't that reasonable?
>
> I'm fearing that just repeating what Alex said, but to make it clear 
> once more: That is *not* necessary!
>
> The shared repository is owned by upstream maintainers and they are 
> usually free to do restructuring work without getting acknowledge from 
> every single driver maintainer.
>
> Anybody can of course technically object to upstream design decisions, 
> but that means that you need to pay attention to the mailing lists in 
> the first place.
>
> > just like we need your approve if we try to modify DRM-sched, or need panfrost's approval if we need to change panfrost code ...
> >
> > by only CC AMD's engineers looks not quite properly, how do you know if your changes (on AMD code part) are conflicting with AMD's on-going internal features/refactoring or not ?
>
> Well because AMD is supposed to work in public as much as possible and 
> ask upstream before doing changes to the code base.
>
> Additional to that design decisions are supposed to be discussed on 
> the mailing list and *not* internally.

Yeah I'm honestly really surprised about the course of this discussion here. With Alex, Christian and others amd has a lot of folks with years/decades of experience in how to collaborate in upstream, when to pull in others proactively and when that's not needed, and in general how to plan upstream work with the lest amount of risk and surprises.

I think step zero here needs to be some training at amd and then re-planning this effort, before we get back to technical stuff.
Otherwise we'll just get bogged down in pain because expectations about the process don't pan out.
-Daniel

>
> Regards,
> Christian.
>
> >
> > Thanks
> >
> > ------------------------------------------
> > Monk Liu | Cloud-GPU Core team
> > ------------------------------------------
> >
> > -----Original Message-----
> > From: Dave Airlie <airlied@gmail.com>
> > Sent: Thursday, September 2, 2021 2:51 AM
> > To: Alex Deucher <alexdeucher@gmail.com>
> > Cc: Liu, Monk <Monk.Liu@amd.com>; Daniel Vetter <daniel@ffwll.ch>; 
> > Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey 
> > <Andrey.Grodzovsky@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; 
> > DRI Development <dri-devel@lists.freedesktop.org>; 
> > amd-gfx@lists.freedesktop.org
> > Subject: Re: [diagnostic TDR mode patches] unify our solution 
> > opinions/suggestions in one thread
> >
> > On Thu, 2 Sept 2021 at 01:20, Alex Deucher <alexdeucher@gmail.com> wrote:
> >> On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk <Monk.Liu@amd.com> wrote:
> >>> [AMD Official Use Only]
> >>>
> >>> Daniel
> >>>
> >>>  From the link you share it looks you(or someone else) have quite a bunch patches that changes DRM_SCHED or even amdgpu, by that case before they are merged to kernel tree I'm wondering if any AMD develop reviewed them ?
> >>>
> >>> They look to me as if they somewhat conflict with what we changed in our repo ...
> >>>
> >>> It is really chaotic for AMDers if someone outside of AMD changes our kernel driver (and/or the scheduler) without it being reviewed by an AMDer, just like we require your review if we intend to change the scheduler's logic here ...
> >>>
> >>> This one changes AMD's code:
> >>> https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/
> >>> And I didn't see any Reviewed-by from AMDers ...
> >>>
> >>> This one also touches AMD's code:
> >>> https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vetter@ffwll.ch/
> >>> Which conflicts with one patch we submitted (in our repo right now), and I didn't see an AMDer give a Reviewed-by on this one either (let me know if I missed it).
> >>>
> >> Monk, this is not how upstream works.  You need to participate.
> >> That's how communities work.  There's a reason all these 
> >> discussions happen on public mailing lists.  The patch author can't 
> >> be expected to know every person on every vendor team to CC with a 
> >> patch.  If you have concerns, you need to raise them when the 
> >> patches are being discussed.
> >>
> > I'm not sure I can add much to help this along, I'm sure Alex has 
> > some internal training,
> >
> > Once your driver is upstream, it belongs to upstream, you can maintain it, but you no longer control it 100%, it's a tradeoff, it's not one companies always understand.
> >
> > Usually people are fine developing away internally, but once interaction with other parts of the kernel/subsystem is required they have the realisation that they needed to work upstream 6 months earlier.
> >
> > The best time to interact with upstream was 6 months ago, the second best time is now.
> >
> > Dave.
>


--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
  2021-09-06  6:36               ` Liu, Monk
@ 2021-09-06 10:35                 ` Jingwen Chen
  0 siblings, 0 replies; 19+ messages in thread
From: Jingwen Chen @ 2021-09-06 10:35 UTC (permalink / raw)
  To: Liu, Monk, Daniel Vetter, Koenig,  Christian
  Cc: Dave Airlie, Alex Deucher, Grodzovsky, Andrey, DRI Development, amd-gfx

Hi Christian/Andrey/Daniel,

I read Boris's patch about the ordered workqueue, and I think maybe we can
leverage this change:
https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/

The TDR race condition we are talking about is caused by a bailing job
being deleted from the pending list. If we use an ordered workqueue for the
timeout handler in the driver, there will be no bailing job.
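
For illustration, here is a minimal sketch of that idea, assuming the
drm_sched_init() prototype from Boris's series (which adds a timeout_wq
argument); my_ring_sched_init, hw_submission, timeout and the hang_limit of 0
are made-up placeholders for this sketch, not actual amdgpu code:

#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

/*
 * Sketch: hand the scheduler an ordered workqueue for TDR, so the timeout
 * work items queued on it run strictly one at a time.
 */
static int my_ring_sched_init(struct drm_gpu_scheduler *sched,
			      const struct drm_sched_backend_ops *ops,
			      unsigned int hw_submission,
			      long timeout /* jiffies */,
			      const char *name)
{
	struct workqueue_struct *tdr_wq;

	/* Ordered workqueue: max_active is 1, so work is serialized. */
	tdr_wq = alloc_ordered_workqueue("%s-tdr", 0, name);
	if (!tdr_wq)
		return -ENOMEM;

	/*
	 * Argument order assumed from Boris's series, which inserts
	 * timeout_wq right after the timeout value.
	 */
	return drm_sched_init(sched, ops, hw_submission,
			      /* hang_limit */ 0,
			      timeout, tdr_wq,
			      /* score */ NULL, name);
}

The workqueue would of course have to be destroyed again (and the error path
handled) when the scheduler is torn down; the point here is only that all TDR
handling for this scheduler would go through one serialized queue.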

Do you have any suggestions?

Best Regards,
JingWen Chen

On Mon Sep 06, 2021 at 02:36:52PM +0800, Liu, Monk wrote:
> [AMD Official Use Only]
> 
> > I fear that I'm just repeating what Alex said, but to make it clear 
> > once more: That is *not* necessary!
> >
> > The shared repository is owned by the upstream maintainers, and they are 
> > usually free to do restructuring work without getting an acknowledgement from 
> > every single driver maintainer.
> 
> Hi Daniel
> 
> Anyway, thanks for officially confirming the community's working model & policy to me. I don't want to push my own opinion here, since it's not my call to change it in any case.
> I only want to bring this diagnostic TDR scheme to a good end for AMD, or even for all DRM vendors.
> 
> How about this: we still have a final patch that has not yet landed in the DRM scheduler, and I would like JingWen to present it to you and AlexD/Christian/Andrey. I believe you will have concerns or objections regarding this patch, but that's fine; let us figure out together how to make it acceptable to you and the other vendors working with the DRM scheduler.
> 
> P.S.: I have to repeat myself again: we are not suddenly popping up a new idea, it is a disconnection issue. We didn't have changes (or plan to have changes) in the DRM scheduler before, but eventually we found that we must make job_timeout and sched_main work in a serialized way, otherwise it won't work with the current scheduler's code structure.
> 
> Thanks 
> 
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-09-06 10:35 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-01  1:23 [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread Liu, Monk
2021-09-01  1:58 ` Liu, Monk
2021-09-01  4:04   ` Andrey Grodzovsky
2021-09-01  4:25     ` Jingwen Chen
2021-09-01  4:28       ` Andrey Grodzovsky
2021-09-01  4:40         ` Jingwen Chen
2021-09-01  4:49           ` Andrey Grodzovsky
2021-09-01  8:18 ` Daniel Vetter
2021-09-01 10:19   ` Liu, Monk
2021-09-01 15:19     ` Alex Deucher
2021-09-01 18:50       ` Dave Airlie
2021-09-02  5:52         ` Liu, Monk
2021-09-02 11:00           ` Christian König
2021-09-02 16:11             ` Daniel Vetter
2021-09-06  6:36               ` Liu, Monk
2021-09-06 10:35                 ` Jingwen Chen
2021-09-02 13:31           ` Alex Deucher
2021-09-02 16:57           ` Daniel Stone
2021-09-01 10:27   ` Liu, Monk
