From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Zhou, David(ChunMing)"
Subject: Re: [PATCH v5 6/6] drm/amdgpu: Avoid HW reset if guilty job already signaled.
Date: Tue, 23 Apr 2019 15:19:01 +0000
Message-ID: <-hyv5g0n8ru25qelb0v-8u6jdi1vp2c7z1m3f5-uygwc1o5ji6s-9zli9v-srreuk-3pvse1en6kx0-6se95l-6jsafd-a6sboi-j814xf-ijgwfc-qewgmm-vnafjgrn2fq0-jgir949hx4yo-i772hz-tn7ial.1556032736536@email.android.com>
References: <1555599624-12285-1-git-send-email-andrey.grodzovsky@amd.com> <1555599624-12285-6-git-send-email-andrey.grodzovsky@amd.com> <2f8cf80c-3c53-0ee2-968c-08338b9f154e@amd.com> <1b41c4f1-b406-8710-2a7a-e5c54a116fe9@amd.com>
In-Reply-To: <1b41c4f1-b406-8710-2a7a-e5c54a116fe9-5C7GfCeVMHo@public.gmane.org>
To: "Grodzovsky, Andrey", "Zhou, David(ChunMing)", "dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org", "amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org", "eric-WhKQ6XTQaPysTnJN9+BGXg@public.gmane.org", "etnaviv-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org", "ckoenig.leichtzumerken-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org"
Cc: "Kazlauskas, Nicholas", "Liu, Monk"
List-Id: dri-devel@lists.freedesktop.org

Do you mean the fence timer? Why not stop it as well when stopping the scheduler, for the same HW-reset reason?

-------- Original Message --------
Subject: Re: [PATCH v5 6/6] drm/amdgpu: Avoid HW reset if guilty job already signaled.
From: "Grodzovsky, Andrey"
To: "Zhou, David(ChunMing)", dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, amd-gfx@lists.freedesktop.org, eric-WhKQ6XTQaPysTnJN9+BGXg@public.gmane.org, etnaviv-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, ckoenig.leichtzumerken-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
CC: "Kazlauskas, Nicholas", "Liu, Monk"

On 4/22/19 9:09 AM, Zhou, David(ChunMing) wrote:
> +Monk.
>
> GPU reset is used widely in SRIOV, so a virtualization guy needs to take a look.
>
> But out of curiosity: why can the guilty job still signal if the job is already
> set to guilty? Was it set wrongly?
>
>
> -David

It's possible that the job completes at a later time than when its
timeout handler started processing, so in this patch we try to protect
against this by rechecking the HW fence after stopping all SW
schedulers. We do it BEFORE marking the job guilty on its sched_entity,
so at the point we check, the guilty flag is not set yet.

Andrey

>
> On 2019/4/18 23:00, Andrey Grodzovsky wrote:
>> Also reject TDRs if another one is already running.
>>
>> v2:
>> Stop all schedulers across the device and the entire XGMI hive before
>> force-signaling HW fences.
>> Avoid passing job_signaled to helper functions to keep all the decision
>> making about skipping HW reset in one place.
>>
>> v3:
>> Fix SW sched. hang after non-HW reset. sched.hw_rq_count has to be balanced
>> against its decrement in drm_sched_stop in the non-HW-reset case.
>> v4: rebase
>> v5: Revert v3 as we do it now in scheduler code.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 143 +++++++++++++++++++----------
>>   1 file changed, 95 insertions(+), 48 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index a0e165c..85f8792 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3334,8 +3334,6 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>           if (!ring || !ring->sched.thread)
>>                   continue;
>>
>> -         drm_sched_stop(&ring->sched, &job->base);
>> -
>>           /* after all hw jobs are reset, hw fence is meaningless, so force_completion */
>>           amdgpu_fence_driver_force_completion(ring);
>>       }
>> @@ -3343,6 +3341,7 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>       if(job)
>>           drm_sched_increase_karma(&job->base);
>>
>> +    /* Don't suspend on bare metal if we are not going to HW reset the ASIC */
>>       if (!amdgpu_sriov_vf(adev)) {
>>
>>           if (!need_full_reset)
>> @@ -3480,37 +3479,21 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
>>       return r;
>>   }
>>
>> -static void amdgpu_device_post_asic_reset(struct amdgpu_device *adev)
>> +static bool amdgpu_device_lock_adev(struct amdgpu_device *adev, bool trylock)
>>   {
>> -    int i;
>> -
>> -    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> -        struct amdgpu_ring *ring = adev->rings[i];
>> -
>> -        if (!ring || !ring->sched.thread)
>> -            continue;
>> -
>> -        if (!adev->asic_reset_res)
>> -            drm_sched_resubmit_jobs(&ring->sched);
>> +    if (trylock) {
>> +        if (!mutex_trylock(&adev->lock_reset))
>> +            return false;
>> +    } else
>> +        mutex_lock(&adev->lock_reset);
>>
>> -        drm_sched_start(&ring->sched, !adev->asic_reset_res);
>> -    }
>> -
>> -    if (!amdgpu_device_has_dc_support(adev)) {
>> -        drm_helper_resume_force_mode(adev->ddev);
>> -    }
>> -
>> -    adev->asic_reset_res = 0;
>> -}
>> -
>> -static void amdgpu_device_lock_adev(struct amdgpu_device *adev)
>> -{
>> -    mutex_lock(&adev->lock_reset);
>>       atomic_inc(&adev->gpu_reset_counter);
>>       adev->in_gpu_reset = 1;
>>       /* Block kfd: SRIOV would do it separately */
>>       if (!amdgpu_sriov_vf(adev))
>>           amdgpu_amdkfd_pre_reset(adev);
>> +
>> +    return true;
>>   }
>>
>>   static void amdgpu_device_unlock_adev(struct amdgpu_device *adev)
>> @@ -3538,40 +3521,42 @@ static void amdgpu_device_unlock_adev(struct amdgpu_device *adev)
>>   int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>                     struct amdgpu_job *job)
>>   {
>> -    int r;
>> +    struct list_head device_list, *device_list_handle =  NULL;
>> +    bool need_full_reset, job_signaled;
>>       struct amdgpu_hive_info *hive = NULL;
>> -    bool need_full_reset = false;
>>       struct amdgpu_device *tmp_adev = NULL;
>> -    struct list_head device_list, *device_list_handle =  NULL;
>> +    int i, r = 0;
>>
>> +    need_full_reset = job_signaled = false;
>>       INIT_LIST_HEAD(&device_list);
>>
>>       dev_info(adev->dev, "GPU reset begin!\n");
>>
>> +    hive = amdgpu_get_xgmi_hive(adev, false);
>> +
>>       /*
>> -     * In case of XGMI hive disallow concurrent resets to be triggered
>> -     * by different nodes. No point also since the one node already executing
>> -     * reset will also reset all the other nodes in the hive.
>> +     * Here we trylock to avoid chain of resets executing from
>> +     * either trigger by jobs on different adevs in XGMI hive or jobs on
>> +     * different schedulers for same device while this TO handler is running.
>> +     * We always reset all schedulers for device and all devices for XGMI
>> +     * hive so that should take care of them too.
>>        */
>> -    hive = amdgpu_get_xgmi_hive(adev, 0);
>> -    if (hive && adev->gmc.xgmi.num_physical_nodes > 1 &&
>> -        !mutex_trylock(&hive->reset_lock))
>> +
>> +    if (hive && !mutex_trylock(&hive->reset_lock)) {
>> +        DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
>> +             job->base.id, hive->hive_id);
>>           return 0;
>> +    }
>>
>>       /* Start with adev pre asic reset first for soft reset check.*/
>> -    amdgpu_device_lock_adev(adev);
>> -    r = amdgpu_device_pre_asic_reset(adev,
>> -                     job,
>> -                     &need_full_reset);
>> -    if (r) {
>> -        /*TODO Should we stop ?*/
>> -        DRM_ERROR("GPU pre asic reset failed with err, %d for drm dev, %s ",
>> -              r, adev->ddev->unique);
>> -        adev->asic_reset_res = r;
>> +    if (!amdgpu_device_lock_adev(adev, !hive)) {
>> +        DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress",
>> +                 job->base.id);
>> +        return 0;
>>       }
>>
>>       /* Build list of devices to reset */
>> -    if  (need_full_reset && adev->gmc.xgmi.num_physical_nodes > 1) {
>> +    if  (adev->gmc.xgmi.num_physical_nodes > 1) {
>>           if (!hive) {
>>               amdgpu_device_unlock_adev(adev);
>>               return -ENODEV;
>> @@ -3588,13 +3573,56 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>           device_list_handle = &device_list;
>>       }
>>
>> +    /* block all schedulers and reset given job's ring */
>> +    list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>> +        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> +            struct amdgpu_ring *ring = tmp_adev->rings[i];
>> +
>> +            if (!ring || !ring->sched.thread)
>> +                continue;
>> +
>> +            drm_sched_stop(&ring->sched, &job->base);
>> +        }
>> +    }
>> +
>> +
>> +    /*
>> +     * Must check guilty signal here since after this point all old
>> +     * HW fences are force signaled.
>> +     *
>> +     * job->base holds a reference to parent fence
>> +     */
>> +    if (job && job->base.s_fence->parent &&
>> +        dma_fence_is_signaled(job->base.s_fence->parent))
>> +        job_signaled = true;
>> +
>> +    if (!amdgpu_device_ip_need_full_reset(adev))
>> +        device_list_handle = &device_list;
>> +
>> +    if (job_signaled) {
>> +        dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
>> +        goto skip_hw_reset;
>> +    }
>> +
>> +
>> +    /* Guilty job will be freed after this*/
>> +    r = amdgpu_device_pre_asic_reset(adev,
>> +                     job,
>> +                     &need_full_reset);
>> +    if (r) {
>> +        /*TODO Should we stop ?*/
>> +        DRM_ERROR("GPU pre asic reset failed with err, %d for drm dev, %s ",
>> +              r, adev->ddev->unique);
>> +        adev->asic_reset_res = r;
>> +    }
>> +
>>   retry:    /* Rest of adevs pre asic reset from XGMI hive. */
>>       list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>>
>>           if (tmp_adev == adev)
>>               continue;
>>
>> -        amdgpu_device_lock_adev(tmp_adev);
>> +        amdgpu_device_lock_adev(tmp_adev, false);
>>           r = amdgpu_device_pre_asic_reset(tmp_adev,
>>                            NULL,
>>                            &need_full_reset);
>> @@ -3618,9 +3646,28 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>               goto retry;
>>       }
>>
>> +skip_hw_reset:
>> +
>>       /* Post ASIC reset for all devs .*/
>>       list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>> -        amdgpu_device_post_asic_reset(tmp_adev);
>> +        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> +            struct amdgpu_ring *ring = tmp_adev->rings[i];
>> +
>> +            if (!ring || !ring->sched.thread)
>> +                continue;
>> +
>> +            /* No point to resubmit jobs if we didn't HW reset*/
>> +            if (!tmp_adev->asic_reset_res && !job_signaled)
>> +                drm_sched_resubmit_jobs(&ring->sched);
>> +
>> +            drm_sched_start(&ring->sched, !tmp_adev->asic_reset_res);
>> +        }
>> +
>> +        if (!amdgpu_device_has_dc_support(tmp_adev) && !job_signaled) {
>> +            drm_helper_resume_force_mode(tmp_adev->ddev);
>> +        }
>> +
>> +        tmp_adev->asic_reset_res = 0;
>>
>>           if (r) {
>>               /* bad news, how to tell it to userspace ? */
>> @@ -3633,7 +3680,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>           amdgpu_device_unlock_adev(tmp_adev);
>>       }
>>
>> -    if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
>> +    if (hive)
>>           mutex_unlock(&hive->reset_lock);
>>
>>       if (r)

_______________________________________________
amd-gfx mailing list
amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx