* [PATCH] drm/amdgpu: disable job timeout on GPU reset disabled
@ 2018-03-19  6:08 Evan Quan
       [not found] ` <1521439692-14823-1-git-send-email-evan.quan-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Evan Quan @ 2018-03-19  6:08 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Alexander.Deucher-5C7GfCeVMHo, Evan Quan

Under some heavy compute workloads (e.g. the dgemm test), the ASIC
takes more than 10 seconds to finish a single dispatched job, which
triggers the job timeout. The resulting messages are confusing,
although the timeout does not seem to cause any real problem.
As a quick workaround, disable the timeout when GPU reset is
disabled.

Change-Id: I3a95d856ba4993094dc7b6269649e470c5b053d2
Signed-off-by: Evan Quan <evan.quan@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8bd9c3f..9d6a775 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -861,6 +861,13 @@ static void amdgpu_device_check_arguments(struct amdgpu_device *adev)
 		amdgpu_lockup_timeout = 10000;
 	}
 
+	/*
+	 * Disable timeout when GPU reset is disabled to avoid confusing
+	 * timeout messages in the kernel log.
+	 */
+	if (amdgpu_gpu_recovery == 0 || amdgpu_gpu_recovery == -1)
+		amdgpu_lockup_timeout = INT_MAX;
+
 	adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
 }
 
-- 
2.7.4
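
For readers who want to see the effect of the added check in isolation, here is a minimal user-space sketch. It is not driver code: the function name `effective_lockup_timeout` and its parameters are hypothetical, and only the compared values (0 and -1) and the use of `INT_MAX` are taken from the diff above.

```c
#include <limits.h>

/*
 * Hypothetical user-space model of the check added by the patch.
 *
 * gpu_recovery:      stands in for the amdgpu_gpu_recovery parameter
 * lockup_timeout_ms: stands in for amdgpu_lockup_timeout before the check
 */
static int effective_lockup_timeout(int gpu_recovery, int lockup_timeout_ms)
{
	/* The condition in the diff covers both 0 and -1. */
	if (gpu_recovery == 0 || gpu_recovery == -1)
		return INT_MAX;	/* largest value an int timeout can hold */

	return lockup_timeout_ms;
}
```

With a 32-bit int, `INT_MAX` milliseconds is roughly 24.8 days, so the timeout is not strictly disabled, only pushed far beyond any realistic job duration.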

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* RE: [PATCH] drm/amdgpu: disable job timeout on GPU reset disabled
       [not found] ` <1521439692-14823-1-git-send-email-evan.quan-5C7GfCeVMHo@public.gmane.org>
@ 2018-03-19  6:12   ` Quan, Evan
  2018-03-19  9:42   ` Christian König
  1 sibling, 0 replies; 7+ messages in thread
From: Quan, Evan @ 2018-03-19  6:12 UTC (permalink / raw)
  To: Quan, Evan, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Deucher, Alexander

Hi Alex,

INT_MAX is used instead of MAX_SCHEDULE_TIMEOUT (which we discussed in another mail thread) because amdgpu_lockup_timeout has data type int.
Using MAX_SCHEDULE_TIMEOUT (data type long) would produce compile warnings.
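
The type concern can be illustrated outside the kernel. The following is a minimal user-space sketch, not driver code: `MAX_SCHEDULE_TIMEOUT` is redefined locally as `LONG_MAX` to mirror the kernel's definition, and the helper `fits_in_int` is hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

/* Local stand-in mirroring the kernel's definition (LONG_MAX). */
#define MAX_SCHEDULE_TIMEOUT LONG_MAX

/* Hypothetical helper: can this long value be stored in an int unchanged? */
static bool fits_in_int(long value)
{
	return value >= INT_MIN && value <= INT_MAX;
}
```

On platforms where long is wider than int (such as 64-bit Linux), `fits_in_int(MAX_SCHEDULE_TIMEOUT)` is false, which is why assigning that constant to an int variable is likely to draw a conversion warning from the compiler, while `INT_MAX` always fits.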

Regards,
Evan

* Re: [PATCH] drm/amdgpu: disable job timeout on GPU reset disabled
       [not found] ` <1521439692-14823-1-git-send-email-evan.quan-5C7GfCeVMHo@public.gmane.org>
  2018-03-19  6:12   ` Quan, Evan
@ 2018-03-19  9:42   ` Christian König
       [not found]     ` <d7a88e66-6533-9c12-c36c-9b3ea569e354-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 7+ messages in thread
From: Christian König @ 2018-03-19  9:42 UTC (permalink / raw)
  To: Evan Quan, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Alexander.Deucher-5C7GfCeVMHo

Am 19.03.2018 um 07:08 schrieb Evan Quan:
> Since under some heavy computing environment(dgemm test), it takes
> the asic over 10+ seconds to finish the dispatched single job
> which will trigger the timeout. It's quite confusing although it
> does not seem to bring any real problems.
> As a quick workround, we choose to disable timeout when GPU reset
> is disabled.

NAK. I enabled those warnings intentionally, even when GPU recovery is
disabled, so that the logs give a hint about what went wrong.

Please only increase the timeout for the compute queues and/or add a
separate timeout for them.

Regards,
Christian.




* RE: [PATCH] drm/amdgpu: disable job timeout on GPU reset disabled
       [not found]     ` <d7a88e66-6533-9c12-c36c-9b3ea569e354-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-03-20  2:11       ` Quan, Evan
       [not found]         ` <DM5PR1201MB248999A09FAAC36F90204EF9E4AB0-grEf7a3NxMAAZHT/xKzwlGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Quan, Evan @ 2018-03-20  2:11 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Deucher, Alexander

[-- Attachment #1: Type: text/plain, Size: 2283 bytes --]

Hi Christian,

The messages printed on timeout are errors, not just warnings, although we did not see any real problem (in the dgemm special case). That is why we call them confusing.
I suppose you want a fix like my previous patch (see attachment).

Regards,
Evan


[-- Attachment #2: Type: message/rfc822, Size: 4878 bytes --]

From: "Quan, Evan" <Evan.Quan-5C7GfCeVMHo@public.gmane.org>
To: "amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org" <amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
Cc: "Deucher, Alexander" <Alexander.Deucher-5C7GfCeVMHo@public.gmane.org>, "Quan, Evan" <Evan.Quan-5C7GfCeVMHo@public.gmane.org>
Subject: [PATCH] drm/amdgpu: no job timeout setting on compute queues
Date: Fri, 16 Mar 2018 04:52:32 +0000
Message-ID: <1521175952-21758-1-git-send-email-evan.quan-5C7GfCeVMHo@public.gmane.org>

Under some heavy compute tests (e.g. dgemm), the ASIC may take
more than 50 seconds to finish a single dispatched job, which
triggers the timeout. This is quite annoying, although it does not
seem to cause any real problem.
As a quick workaround, do not enforce the timeout setting on
compute queues.

Change-Id: I210011a90898617367e897a90e9f8fb2639281a3
Signed-off-by: Evan Quan <evan.quan-5C7GfCeVMHo@public.gmane.org>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 008e198..455a81e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -435,7 +435,9 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
        if (ring->funcs->type != AMDGPU_RING_TYPE_KIQ) {
                r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
                                   num_hw_submission, amdgpu_job_hang_limit,
-                                  msecs_to_jiffies(amdgpu_lockup_timeout), ring->name);
+                                  (ring->funcs->type == AMDGPU_RING_TYPE_COMPUTE) ?
+                                  MAX_SCHEDULE_TIMEOUT : msecs_to_jiffies(amdgpu_lockup_timeout),
+                                  ring->name);
                if (r) {
                        DRM_ERROR("Failed to create scheduler on ring %s.\n",
                                  ring->name);
--
2.7.4
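
The selection made by the ternary in this patch can be modelled in user space as follows. This is a sketch under stated assumptions, not driver code: the enum values are hypothetical stand-ins for the `AMDGPU_RING_TYPE_*` constants, `HZ` is fixed at an example value of 250, and `msecs_to_ticks` is a simplified model of the kernel's `msecs_to_jiffies` (it rounds up and ignores negative inputs).

```c
#include <limits.h>

/* Local stand-in mirroring the kernel's definition (LONG_MAX). */
#define MAX_SCHEDULE_TIMEOUT LONG_MAX

/* Assumption: example tick rate; real kernels configure HZ at build time. */
#define HZ 250

/* Hypothetical stand-ins for the AMDGPU_RING_TYPE_* constants. */
enum ring_type { RING_TYPE_GFX, RING_TYPE_COMPUTE, RING_TYPE_SDMA };

/* Simplified model of msecs_to_jiffies(): milliseconds to ticks, rounded up. */
static long msecs_to_ticks(int ms)
{
	return ((long)ms * HZ + 999) / 1000;
}

/* Compute rings never time out; every other ring uses the module parameter. */
static long sched_timeout_for_ring(enum ring_type type, int lockup_timeout_ms)
{
	return (type == RING_TYPE_COMPUTE) ? MAX_SCHEDULE_TIMEOUT
					   : msecs_to_ticks(lockup_timeout_ms);
}
```

Because the compute branch ignores `lockup_timeout_ms` entirely, no module parameter value can re-enable the timeout on compute rings in this model.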



* Re: [PATCH] drm/amdgpu: disable job timeout on GPU reset disabled
       [not found]         ` <DM5PR1201MB248999A09FAAC36F90204EF9E4AB0-grEf7a3NxMAAZHT/xKzwlGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2018-03-20 10:14           ` Christian König
       [not found]             ` <fffd20df-cbcb-51ae-7de2-915804fce17f-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Christian König @ 2018-03-20 10:14 UTC (permalink / raw)
  To: Quan, Evan, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Deucher, Alexander

Hi Evan,

That one is perfect, if you ask me. I am just reading up on the history
of that patch; Alex, what was your concern with it?

Regarding printing this as an error, that's a really good point as well.
We should probably reduce it to warning or even info severity.

Regards,
Christian.


* Re: [PATCH] drm/amdgpu: disable job timeout on GPU reset disabled
       [not found]             ` <fffd20df-cbcb-51ae-7de2-915804fce17f-5C7GfCeVMHo@public.gmane.org>
@ 2018-03-20 14:16               ` Deucher, Alexander
       [not found]                 ` <DM5PR12MB1820FEE50DE4EBD1E44B676BF7AB0-2J9CzHegvk8qWyLXlBb1HgdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Deucher, Alexander @ 2018-03-20 14:16 UTC (permalink / raw)
  To: Koenig, Christian, Quan, Evan, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 3324 bytes --]

My concern was that compute will always have the timeout disabled with no way to override it even if you enable GPU reset.  I guess we can address that down the road.


Acked-by: Alex Deucher <alexander.deucher-5C7GfCeVMHo@public.gmane.org>


* Re: [PATCH] drm/amdgpu: disable job timeout on GPU reset disabled
       [not found]                 ` <DM5PR12MB1820FEE50DE4EBD1E44B676BF7AB0-2J9CzHegvk8qWyLXlBb1HgdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-03-20 14:21                   ` Christian König
  0 siblings, 0 replies; 7+ messages in thread
From: Christian König @ 2018-03-20 14:21 UTC (permalink / raw)
  To: Deucher, Alexander, Quan, Evan, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 3917 bytes --]

That's a good point as well. Maybe we should have separate timeouts for
gfx and compute?

Something like 5 seconds for gfx and 1 minute (or even longer) for compute?
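
One way to picture the idea of separate per-queue timeouts with an override is the sketch below. It is purely illustrative and not taken from any patch: the enum, the macro names and the function `timeout_ms_for_queue` are hypothetical, and only the 5 second and 1 minute figures come from the suggestion above.

```c
/* Hypothetical queue kinds for illustration only. */
enum queue_kind { QUEUE_GFX, QUEUE_COMPUTE };

/* Defaults taken from the figures suggested in this mail. */
#define DEFAULT_GFX_TIMEOUT_MS      5000
#define DEFAULT_COMPUTE_TIMEOUT_MS 60000

/*
 * Return the timeout for a queue kind in milliseconds.
 * override_ms > 0 replaces the default; anything else means "not set".
 */
static int timeout_ms_for_queue(enum queue_kind kind, int override_ms)
{
	if (override_ms > 0)
		return override_ms;

	return (kind == QUEUE_COMPUTE) ? DEFAULT_COMPUTE_TIMEOUT_MS
				       : DEFAULT_GFX_TIMEOUT_MS;
}
```

Keeping an explicit override per queue kind would address the concern that compute otherwise has no way to get a finite timeout.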

Anyway, I agree that we can worry about that later on; the patch is
Reviewed-by: Christian König <christian.koenig-5C7GfCeVMHo@public.gmane.org> for now.

Regards,
Christian.

