From: Jingwen Chen <Jingwen.Chen2@amd.com>
To: Andrey Grodzovsky <andrey.grodzovsky@amd.com>,
	"Liu, Monk" <Monk.Liu@amd.com>,
	"Koenig, Christian" <Christian.Koenig@amd.com>,
	"Daniel Vetter" <daniel@ffwll.ch>
Cc: DRI Development <dri-devel@lists.freedesktop.org>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Subject: Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
Date: Wed, 1 Sep 2021 12:25:13 +0800
Message-ID: <20210901042513.nn3kdnh6xqkrbarn@wayne-dev>
In-Reply-To: <abf7b9de-67ef-25d4-a4be-11df93d2f799@amd.com>

On Wed, Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote:
> I will answer everything here -
> 
> On 2021-08-31 9:58 p.m., Liu, Monk wrote:
> 
> 
>     [AMD Official Use Only]
> 
>      
> 
>     In the previous discussion, you guys stated that we should drop the
>     “kthread_should_park” in cleanup_job.
> 
>      
> 
>     @@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>     {
>             struct drm_sched_job *job, *next;
>     
>     -       /*
>     -        * Don't destroy jobs while the timeout worker is running OR thread
>     -        * is being parked and hence assumed to not touch pending_list
>     -        */
>     -       if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>     -           !cancel_delayed_work(&sched->work_tdr)) ||
>     -           kthread_should_park())
>     -               return NULL;
> 
>      
> 
>     But I suddenly have a question here: if we return the timed-out job
>     regardless of kthread_should_park(), then we are back to the original
>     problem: the timed-out job suddenly signals, cleanup_job still returns it
>     to sched_main, and the job is freed while the vendor’s timeout callback
>     is still handling it.
> 
>      
> 
>     If we return NULL when kthread_should_park() is true in cleanup_job, we
>     can prevent the above scenario: once a job is being processed by
>     job_timedout we can stop its scheduler, and after that, even if this job
>     suddenly signals, cleanup_job won’t return it, so sched_main won’t free
>     it in parallel …
> 
>      
> 
>     What do you think?
> 
> 
> Does your analysis above take into account that you also submitted
> '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler'? If so, I don't
> see a problem -
Hi Andrey,
Monk has talked to me, and we agreed that since there are multiple opinions about
'[PATCH 2/2] drm/sched: serialize job_timeout and scheduler', and patch 1 is an
independent patch that fixes a separate error, we should leave patch 2 out of
this analysis.

> I think that as long as you put kthread_park(sched->thread) BEFORE
> fetching the next bad job from the pending list (under spinlock), there is
> no such issue as in the case you describe, because the potential bad job
> that became signaled will be removed from the pending list before you
> even fetch the next job, and by the time you fetch it the scheduler
> thread is already stopped anyway.
> 
> If you don't submit patch 2 and we keep the removal hack for now, there is
> also no problem, because we temporarily remove the job we fetch for TDR
> from the pending list under spinlock exactly to avoid this race.
> 
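For reference, a minimal sketch of the ordering you describe (hypothetical and
simplified; the names follow drm/scheduler, but this is not the actual amdgpu
TDR code):

    /* In the timeout path, park the scheduler thread first ... */
    kthread_park(sched->thread);

    /* ... then fetch the bad job under the pending-list lock. Any job that
     * signaled earlier was already removed from pending_list by the scheduler
     * thread before it parked, so the job fetched here cannot be freed by
     * sched_main in parallel. */
    spin_lock(&sched->job_list_lock);
    bad = list_first_entry_or_null(&sched->pending_list,
                                   struct drm_sched_job, list);
    spin_unlock(&sched->job_list_lock);
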
So can you help review [PATCH 1/2] drm/sched: fix the bug of time out
calculation(v3)? Patch v3 keeps this kthread_should_park check.

Best Regards,
JingWen
> 
> 
>     Thanks
> 
>      
> 
>     ------------------------------------------
> 
>     Monk Liu | Cloud-GPU Core team
> 
>     ------------------------------------------
> 
>      
> 
>     From: Liu, Monk
>     Sent: Wednesday, September 1, 2021 9:23 AM
>     To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey
>     <Andrey.Grodzovsky@amd.com>; Daniel Vetter <daniel@ffwll.ch>; Chen, JingWen
>     <JingWen.Chen2@amd.com>
>     Cc: DRI Development <dri-devel@lists.freedesktop.org>;
>     amd-gfx@lists.freedesktop.org
>     Subject: [diagnostic TDR mode patches] unify our solution opinions/
>     suggestions in one thread
> 
>      
> 
>     [AMD Official Use Only]
> 
>      
> 
>     Hi Daniel/Christian/Andrey
> 
>      
> 
>     It looks to me like the opinions from you three are spread across an
>     email flood; the feature we are working on (the diagnostic TDR scheme)
>     has been pending for more than 6 months (we started it in Feb 2021).
> 
>      
> 
>     Honestly speaking, the way we are using email now is not friendly and is
>     quite painful for me ….
> 
>     Can we try to put all our opinions, suggestions, or even objections
>     together here and go through them one by one? It’s too hard for us to
>     reply to each email on a different question.
> 
>      
> 
>     For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
> 
>      
> 
>     This is a fix for the timeout timer in the scheduler. Can we complete
>     this one first? It should already have resolved all the questions and
>     suggestions.
> 
> 
> I have no objections to this one besides getting rid of the
> kthread_should_park() return-NULL part;
> if my answer above is not wrong, then it seems superfluous to me.
> 
> 
>      
> 
>     For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
> 
>      
> 
>     I think I already answered the questions raised by Daniel in the other
>     thread regarding why I use __kthread_should_park().
> 
> 
> Is this race-free? Can't the other thread execute kthread_park after the
> check?
> 
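(For reference, the race being asked about would be this interleaving; a
hypothetical sketch, not actual code:

    serialization path                     another TDR handler
    ------------------                     -------------------
    if (!__kthread_should_park())
            /* concludes thread is live */
                                           kthread_park(sched->thread);
    /* continues as if not parked */

The check and the park are not atomic, so something outside these two calls
has to keep them from overlapping.)
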
> 
>     For other aspects, can we put all our opinion synthesized here ?
> 
> 
> So to summarize from previous threads: I think the best solution to the
> problem being solved in this patch is to do the HW fence embedding at the
> drm_sched_job level instead of only for amdgpu. By modifying all the drivers
> to support this, we can both remove this hack and solve the race against
> concurrent drm_sched_cleanup_jobs job freeing, simply by taking a reference
> to the HW fence of the job at the beginning of drm_sched_job_timedout.
> 
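Concretely, a rough sketch of that idea (hypothetical: drm_sched_job has no
embedded hw_fence member today, and the real handler's status handling and
requeue are elided):

    static void drm_sched_job_timedout(struct work_struct *work)
    {
            struct drm_gpu_scheduler *sched =
                    container_of(work, struct drm_gpu_scheduler, work_tdr.work);
            struct drm_sched_job *job;
            struct dma_fence *fence = NULL;

            spin_lock(&sched->job_list_lock);
            job = list_first_entry_or_null(&sched->pending_list,
                                           struct drm_sched_job, list);
            if (job)
                    /* Hold the embedded HW fence so a concurrent
                     * drm_sched_get_cleanup_job() cannot free the job while
                     * the driver's timeout callback is still using it. */
                    fence = dma_fence_get(&job->hw_fence);
            spin_unlock(&sched->job_list_lock);

            if (!job)
                    return;

            sched->ops->timedout_job(job);
            dma_fence_put(fence);
    }
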
> If doing this refactoring for all the drivers is not an option now and you
> need a quick solution such as the serialization you do here, then take into
> account the other concurrent TDR handlers that might run. As mentioned
> before, without serializing against TDR handlers from other schedulers you
> just race against them: e.g. you parked it now, but another handler in
> progress will unpark it as part of calling drm_sched_start for other rings.
> So you either need a global lock or a dedicated single-threaded queue, as
> Daniel suggested.
> At minimum we should change cancel_delayed_work in drm_sched_stop to
> cancel_delayed_work_sync, to cancel and flush all concurrent TDRs, and
> probably move it to the beginning of the function, after kthread_park and
> before we start playing with the pending list.
> 
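A minimal sketch of that suggested ordering (the pending-list handling is
elided, and moving the flush up like this is the suggestion, not the current
code):

    void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
    {
            kthread_park(sched->thread);

            /* Cancel AND flush any TDR still in flight before touching
             * pending_list; today cancel_delayed_work() is called near the
             * end of the function and does not wait for a running handler. */
            cancel_delayed_work_sync(&sched->work_tdr);

            /* ... iterate and modify sched->pending_list safely here ... */
    }
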
> P.S. One comment I had regarding the single-threaded queue: in case we have
> multiple TDRs arising from the same event, we will proceed to reset multiple
> times - something that with the trylock we mostly avoid today within amdgpu
> (see amdgpu_device_lock_adev and amdgpu_device_lock_hive_adev).
> Daniel mentioned it's not a problem, but I still haven't understood why not.
> 
> Andrey
> 
> 
>      
> 
>     Thanks!
> 
>      
> 
>     ------------------------------------------
> 
>     Monk Liu | Cloud-GPU Core team
> 
>     ------------------------------------------
> 
>      
> 
