From: Christian König
Subject: Re: [PATCH] drm/amdgpu: disable job timeout on GPU reset disabled
Date: Tue, 20 Mar 2018 15:21:59 +0100
Message-ID: <50d6c9f3-4652-d733-586e-c0a80724a4e6@amd.com>
References: <1521439692-14823-1-git-send-email-evan.quan@amd.com>
List-Id: Discussion list for AMD gfx
To: "Deucher, Alexander", "Quan, Evan", "amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"

That's a good point as well. Maybe we should have separate timeouts for gfx and compute?

Something like 5 seconds for gfx and 1 minute (or even longer) for compute?
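
To make the idea concrete, here is a minimal sketch in C of choosing the timeout per ring class. It is only an illustration: `ring_kind`, `job_timeout_ms` and both constants are hypothetical names, not existing amdgpu symbols, and the values are just the ones floated above.

```c
/* Hypothetical ring classes; the real driver has its own ring type enum. */
enum ring_kind { RING_KIND_GFX, RING_KIND_COMPUTE };

/* Values floated in this thread: 5 seconds for gfx, 1 minute for compute. */
#define GFX_TIMEOUT_MS      5000L
#define COMPUTE_TIMEOUT_MS 60000L

/* Return the job timeout in milliseconds for the given ring class. */
long job_timeout_ms(enum ring_kind kind)
{
	return kind == RING_KIND_COMPUTE ? COMPUTE_TIMEOUT_MS : GFX_TIMEOUT_MS;
}
```

With something along these lines, a long-running compute job would only be reported after a minute, while a stuck gfx job would still be flagged quickly.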

Anyway, I agree that we can worry about that later on. The patch is Reviewed-by: Christian König <christian.koenig-5C7GfCeVMHo@public.gmane.org> for now.

Regards,
Christian.

On 20.03.2018 at 15:16, Deucher, Alexander wrote:

My concern was that compute will always have the timeout disabled with no way to override it even if you enable GPU reset.  I guess we can address that down the road.


Acked-by: Alex Deucher <alexander.deucher-5C7GfCeVMHo@public.gmane.org>


From: Koenig, Christian
Sent: Tuesday, March 20, 2018 6:14:29 AM
To: Quan, Evan; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Cc: Deucher, Alexander
Subject: Re: [PATCH] drm/amdgpu: disable job timeout on GPU reset disabled
 
Hi Evan,

That one is perfect if you ask me. I was just reading up on the history of
that patch; Alex, what was your concern with it?

Regarding printing this as an error, that's a really good point as well. We
should probably reduce it to warning or even info severity.

Regards,
Christian.

On 20.03.2018 at 03:11, Quan, Evan wrote:
> Hi Christian,
>
> The messages printed on timeout are errors, not just warnings, although we did not see any real problem (in the dgemm special case). That's why we find them confusing.
> And I suppose you want a fix like my previous patch (see attachment).
>
> Regards,
> Evan
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org]
>> Sent: Monday, March 19, 2018 5:42 PM
>> To: Quan, Evan <Evan.Quan-5C7GfCeVMHo@public.gmane.org>; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> Cc: Deucher, Alexander <Alexander.Deucher-5C7GfCeVMHo@public.gmane.org>
>> Subject: Re: [PATCH] drm/amdgpu: disable job timeout on GPU reset
>> disabled
>>
>> On 19.03.2018 at 07:08, Evan Quan wrote:
>>> Under some heavy compute workloads (the dgemm test), it takes the
>>> ASIC more than 10 seconds to finish a single dispatched job, which
>>> triggers the timeout. This is quite confusing, although it does not
>>> seem to cause any real problems.
>>> As a quick workaround, disable the timeout when GPU reset is
>>> disabled.
>> NAK, I enabled those warnings intentionally even when GPU recovery is
>> disabled, to have a hint in the logs about what goes wrong.
>>
>> Please only increase the timeout for the compute queue and/or add a
>> separate timeout for it.
>>
>> Regards,
>> Christian.
>>
>>
>>> Change-Id: I3a95d856ba4993094dc7b6269649e470c5b053d2
>>> Signed-off-by: Evan Quan <evan.quan-5C7GfCeVMHo@public.gmane.org>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++++++
>>>    1 file changed, 7 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 8bd9c3f..9d6a775 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -861,6 +861,13 @@ static void amdgpu_device_check_arguments(struct amdgpu_device *adev)
>>>              amdgpu_lockup_timeout = 10000;
>>>      }
>>>
>>> +   /*
>>> +    * Disable timeout when GPU reset is disabled to avoid confusing
>>> +    * timeout messages in the kernel log.
>>> +    */
>>> +   if (amdgpu_gpu_recovery == 0 || amdgpu_gpu_recovery == -1)
>>> +           amdgpu_lockup_timeout = INT_MAX;
>>> +
>>>      adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
>>>    }
>>>
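
For reference, the check added by the quoted hunk can be looked at in isolation. The following is a rough stand-alone sketch in C, not driver code: `effective_lockup_timeout` is a hypothetical helper, and its two parameters stand in for the `amdgpu_gpu_recovery` and `amdgpu_lockup_timeout` module parameters used in the patch.

```c
#include <limits.h>

/*
 * Hypothetical stand-alone copy of the check in the quoted hunk:
 * when gpu_recovery is 0 or -1 the timeout is effectively disabled
 * by raising it to INT_MAX, otherwise the configured value is kept.
 */
int effective_lockup_timeout(int gpu_recovery, int lockup_timeout)
{
	if (gpu_recovery == 0 || gpu_recovery == -1)
		return INT_MAX;
	return lockup_timeout;
}
```

Note that the condition treats both 0 and -1 the same way, so only an explicit non-zero, non-negative-one setting keeps the configured timeout in this sketch.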

