That's a good point as well. Maybe we should have separate timeouts for
gfx and compute? Something like 5 seconds for gfx and 1 minute (or even
longer) for compute?

Anyway, I agree that we can worry about that later on; the patch is
Reviewed-by: Christian König for now.

Regards,
Christian.

On 20.03.2018 at 15:16, Deucher, Alexander wrote:
>
> My concern was that compute will always have the timeout disabled with
> no way to override it even if you enable GPU reset.  I guess we can
> address that down the road.
>
> Acked-by: Alex Deucher
>
> ------------------------------------------------------------------------
> *From:* Koenig, Christian
> *Sent:* Tuesday, March 20, 2018 6:14:29 AM
> *To:* Quan, Evan; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> *Cc:* Deucher, Alexander
> *Subject:* Re: [PATCH] drm/amdgpu: disable job timeout on GPU reset disabled
>
> Hi Evan,
>
> that one is perfect if you ask me. I was just reading up on the history
> of that patch. Alex, what was your concern with it?
>
> Regarding printing this as an error, that's a really good point as well.
> We should probably reduce it to warning or even info severity.
>
> Regards,
> Christian.
>
> On 20.03.2018 at 03:11, Quan, Evan wrote:
> > Hi Christian,
> >
> > The messages printed on timeout are errors, not just warnings,
> > although we did not see any real problem (for the dgemm special case).
> > That's why we call it confusing.
> > And I suppose you want a fix like my previous patch (see attachment).
> >
> > Regards,
> > Evan
> >> -----Original Message-----
> >> From: Christian König [mailto:ckoenig.leichtzumerken-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org]
> >> Sent: Monday, March 19, 2018 5:42 PM
> >> To: Quan, Evan ; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> >> Cc: Deucher, Alexander
> >> Subject: Re: [PATCH] drm/amdgpu: disable job timeout on GPU reset disabled
> >>
> >> On 19.03.2018 at 07:08, Evan Quan wrote:
> >>> Under some heavy compute workloads (dgemm test), it takes the ASIC
> >>> more than 10 seconds to finish a single dispatched job, which
> >>> triggers the timeout. It's quite confusing although it does not seem
> >>> to bring any real problems.
> >>> As a quick workaround, we choose to disable the timeout when GPU
> >>> reset is disabled.
> >> NAK, I enabled those warnings intentionally even when GPU recovery is
> >> disabled, to have a hint in the logs about what goes wrong.
> >>
> >> Please only increase the timeout for the compute queue and/or add a
> >> separate timeout for it.
> >>
> >> Regards,
> >> Christian.
> >>
> >>
> >>> Change-Id: I3a95d856ba4993094dc7b6269649e470c5b053d2
> >>> Signed-off-by: Evan Quan
> >>> ---
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++++++
> >>>    1 file changed, 7 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> index 8bd9c3f..9d6a775 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> @@ -861,6 +861,13 @@ static void amdgpu_device_check_arguments(struct amdgpu_device *adev)
> >>>              amdgpu_lockup_timeout = 10000;
> >>>      }
> >>>
> >>> +   /*
> >>> +    * Disable timeout when GPU reset is disabled to avoid confusing
> >>> +    * timeout messages in the kernel log.
> >>> +    */
> >>> +   if (amdgpu_gpu_recovery == 0 || amdgpu_gpu_recovery == -1)
> >>> +           amdgpu_lockup_timeout = INT_MAX;
> >>> +
> >>>      adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
> >>>    }
> >>>
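
The idea floated at the top of this thread, separate timeouts for gfx and compute, could look roughly like the sketch below. This is plain userspace C for illustration only, not driver code: every name in it (`sketch_ring_type`, `sketch_timeouts`, `sketch_job_timeout_ms`) is hypothetical, and the 5 second and 1 minute defaults are just the numbers suggested above. Treating a negative value as "disabled" and mapping it to INT_MAX mirrors what the quoted patch does with amdgpu_lockup_timeout; the rest is an assumption.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical ring classes for this sketch; not the driver's real enum. */
enum sketch_ring_type {
	SKETCH_RING_GFX,
	SKETCH_RING_COMPUTE,
};

/* Assumed defaults, taken from the numbers suggested in the thread. */
#define SKETCH_GFX_TIMEOUT_MS      5000   /* 5 seconds */
#define SKETCH_COMPUTE_TIMEOUT_MS  60000  /* 1 minute */

/*
 * Hypothetical per-class settings.
 * 0 means "use the default", a negative value means "timeout disabled".
 */
struct sketch_timeouts {
	int gfx_ms;
	int compute_ms;
};

/* Return the effective job timeout in milliseconds for one ring class. */
static int sketch_job_timeout_ms(const struct sketch_timeouts *t,
				 enum sketch_ring_type type)
{
	int requested, fallback;

	assert(t != NULL);

	if (type == SKETCH_RING_COMPUTE) {
		requested = t->compute_ms;
		fallback = SKETCH_COMPUTE_TIMEOUT_MS;
	} else {
		requested = t->gfx_ms;
		fallback = SKETCH_GFX_TIMEOUT_MS;
	}

	if (requested < 0)
		return INT_MAX;   /* disabled, same sentinel as the quoted patch */
	if (requested == 0)
		return fallback;  /* nothing configured, use the class default */
	return requested;         /* explicit override for this class */
}
```

With both fields left at 0, gfx would get 5000 ms and compute 60000 ms, and each class can still be overridden or disabled on its own, which is the override concern raised above.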