* [PATCH v2] drm/amd/amdgpu:Fix compute ring unable to detect hang.
@ 2019-09-19 7:08 Jesse Zhang
From: Jesse Zhang @ 2019-09-19 7:08 UTC (permalink / raw)
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang
When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.
In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.
Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
drivers/gpu/drm/amd/amdgpu/soc15.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 7c7e9f5..6cd5548 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -687,6 +687,16 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
adev->rev_id = soc15_get_rev_id(adev);
adev->nbio.funcs->detect_hw_virt(adev);
+ /*
+ * If running under SR-IOV or passthrough mode and the user did not set
+ * a custom value for the compute ring timeout, set the timeout to be the
+ * same as the gfx ring timeout so that the compute ring can detect a
+ * true hang.
+ */
+ if ((amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev)) &&
+ (adev->compute_timeout == MAX_SCHEDULE_TIMEOUT))
+ adev->compute_timeout = adev->gfx_timeout;
+
if (amdgpu_sriov_vf(adev))
adev->virt.ops = &xgpu_ai_virt_ops;
--
2.7.4
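The fallback added to soc15_set_ip_blocks() above can be modelled outside the kernel. What follows is a minimal sketch, not driver code: `struct timeouts_model`, `apply_virt_compute_fallback()` and `NO_TIMEOUT` are hypothetical names invented for illustration, and `LONG_MAX` is used as a stand-in for MAX_SCHEDULE_TIMEOUT.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's MAX_SCHEDULE_TIMEOUT (assumption of this sketch). */
#define NO_TIMEOUT LONG_MAX

/* Hypothetical model of the two timeout fields the patch touches. */
struct timeouts_model {
	long gfx_timeout;
	long compute_timeout;
};

/*
 * Mirrors the added check: the compute ring inherits the gfx timeout
 * only when the device is virtualized (SR-IOV VF or passthrough) and
 * the compute timeout still holds the "infinite" default.
 */
static void apply_virt_compute_fallback(struct timeouts_model *t,
					bool sriov_vf, bool passthrough)
{
	assert(t != NULL);
	if ((sriov_vf || passthrough) && t->compute_timeout == NO_TIMEOUT)
		t->compute_timeout = t->gfx_timeout;
}
```

One consequence visible in the sketch: because the check compares against the infinite value, a compute timeout that was explicitly set to infinite cannot be told apart from the untouched default.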
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
* [PATCH v3] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Jesse Zhang @ 2019-09-19 7:58 UTC (permalink / raw)
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang
When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.
In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.
Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 5 +----
drivers/gpu/drm/amd/amdgpu/soc15.c | 10 ++++++++++
2 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index cbcaa7c..963b6d1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -468,10 +468,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
* For sriov case, always use the timeout
* as gfx ring
*/
- if (!amdgpu_sriov_vf(ring->adev))
- timeout = adev->compute_timeout;
- else
- timeout = adev->gfx_timeout;
+ timeout = adev->compute_timeout;
break;
case AMDGPU_RING_TYPE_SDMA:
timeout = adev->sdma_timeout;
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 7c7e9f5..6cd5548 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -687,6 +687,16 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
adev->rev_id = soc15_get_rev_id(adev);
adev->nbio.funcs->detect_hw_virt(adev);
+ /*
+ * If running under SR-IOV or passthrough mode and the user did not set
+ * a custom value for the compute ring timeout, set the timeout to be the
+ * same as the gfx ring timeout so that the compute ring can detect a
+ * true hang.
+ */
+ if ((amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev)) &&
+ (adev->compute_timeout == MAX_SCHEDULE_TIMEOUT))
+ adev->compute_timeout = adev->gfx_timeout;
+
if (amdgpu_sriov_vf(adev))
adev->virt.ops = &xgpu_ai_virt_ops;
--
2.7.4
* [PATCH v4] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Jesse Zhang @ 2019-09-19 8:00 UTC (permalink / raw)
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang
When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.
In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.
Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 5 +----
drivers/gpu/drm/amd/amdgpu/soc15.c | 10 ++++++++++
2 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index cbcaa7c..963b6d1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -468,10 +468,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
* For sriov case, always use the timeout
* as gfx ring
*/
- if (!amdgpu_sriov_vf(ring->adev))
- timeout = adev->compute_timeout;
- else
- timeout = adev->gfx_timeout;
+ timeout = adev->compute_timeout;
break;
case AMDGPU_RING_TYPE_SDMA:
timeout = adev->sdma_timeout;
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 7c7e9f5..6cd5548 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -687,6 +687,16 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
adev->rev_id = soc15_get_rev_id(adev);
adev->nbio.funcs->detect_hw_virt(adev);
+ /*
+ * If running under SR-IOV or passthrough mode and the user did not set
+ * a custom value for the compute ring timeout, set the timeout to be the
+ * same as the gfx ring timeout so that the compute ring can detect a
+ * true hang.
+ */
+ if ((amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev)) &&
+ (adev->compute_timeout == MAX_SCHEDULE_TIMEOUT))
+ adev->compute_timeout = adev->gfx_timeout;
+
if (amdgpu_sriov_vf(adev))
adev->virt.ops = &xgpu_ai_virt_ops;
--
2.7.4
* Re: [PATCH v4] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Christian König @ 2019-09-19 8:14 UTC (permalink / raw)
To: Jesse Zhang, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
On 19.09.19 at 10:00, Jesse Zhang wrote:
> When a compute fence does not signal, the compute ring cannot detect a
> hardware hang because its timeout value is infinite by default.
>
> In SR-IOV and passthrough mode, if the user does not declare a custom
> timeout value for the compute ring, use the gfx ring timeout value as the
> default, so that the compute ring can detect a true hardware hang.
>
> Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
> Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 5 +----
> drivers/gpu/drm/amd/amdgpu/soc15.c | 10 ++++++++++
> 2 files changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index cbcaa7c..963b6d1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -468,10 +468,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
> * For sriov case, always use the timeout
> * as gfx ring
> */
Please also remove the comment since that is now stale.
Apart from that looks good to me,
Christian.
> - if (!amdgpu_sriov_vf(ring->adev))
> - timeout = adev->compute_timeout;
> - else
> - timeout = adev->gfx_timeout;
> + timeout = adev->compute_timeout;
> break;
> case AMDGPU_RING_TYPE_SDMA:
> timeout = adev->sdma_timeout;
> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
> index 7c7e9f5..6cd5548 100644
> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> @@ -687,6 +687,16 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
> adev->rev_id = soc15_get_rev_id(adev);
> adev->nbio.funcs->detect_hw_virt(adev);
>
> + /*
> + * If running under SR-IOV or passthrough mode and the user did not set
> + * a custom value for the compute ring timeout, set the timeout to be the
> + * same as the gfx ring timeout so that the compute ring can detect a
> + * true hang.
> + */
> + if ((amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev)) &&
> + (adev->compute_timeout == MAX_SCHEDULE_TIMEOUT))
> + adev->compute_timeout = adev->gfx_timeout;
> +
> if (amdgpu_sriov_vf(adev))
> adev->virt.ops = &xgpu_ai_virt_ops;
>
* [PATCH v5] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Jesse Zhang @ 2019-09-19 10:09 UTC (permalink / raw)
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang
When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.
In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.
Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 4 +++-
2 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3b5282b..03ac5a1da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
amdgpu_device_check_block_size(adev);
- ret = amdgpu_device_get_job_timeout_settings(adev);
- if (ret) {
- dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
- return ret;
- }
-
adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
return ret;
@@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
if (r)
return r;
+ r = amdgpu_device_get_job_timeout_settings(adev);
+ if (r) {
+ dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
+ return r;
+ }
+
/* doorbell bar mapping and doorbell index init*/
amdgpu_device_doorbell_init(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 420888e..1236245 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1378,10 +1378,12 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
}
/*
* There is only one value specified and
- * it should apply to all non-compute jobs.
+ * it should apply to all jobs.
*/
if (index == 1)
adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
+ if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
+ adev->compute_timeout = adev->gfx_timeout;
}
return ret;
--
2.7.4
* Re: [PATCH v5] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Christian König @ 2019-09-19 12:12 UTC (permalink / raw)
To: Jesse Zhang, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
On 19.09.19 at 12:09, Jesse Zhang wrote:
> When a compute fence does not signal, the compute ring cannot detect a
> hardware hang because its timeout value is infinite by default.
>
> In SR-IOV and passthrough mode, if the user does not declare a custom
> timeout value for the compute ring, use the gfx ring timeout value as the
> default, so that the compute ring can detect a true hardware hang.
>
> Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
> Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 4 +++-
> 2 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 3b5282b..03ac5a1da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
>
> amdgpu_device_check_block_size(adev);
>
> - ret = amdgpu_device_get_job_timeout_settings(adev);
> - if (ret) {
> - dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
> - return ret;
> - }
> -
> adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
>
> return ret;
> @@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> if (r)
> return r;
>
> + r = amdgpu_device_get_job_timeout_settings(adev);
> + if (r) {
> + dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
> + return r;
> + }
> +
I assume that you moved the code because the SRIOV/passthrough setting
was not yet available at the earlier call site?
But even with this change you can still remove the extra SRIOV check in
amdgpu_fence.c.
Regards,
Christian.
> /* doorbell bar mapping and doorbell index init*/
> amdgpu_device_doorbell_init(adev);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 420888e..1236245 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1378,10 +1378,12 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
> }
> /*
> * There is only one value specified and
> - * it should apply to all non-compute jobs.
> + * it should apply to all jobs.
> */
> if (index == 1)
> adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
> + if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
> + adev->compute_timeout = adev->gfx_timeout;
> }
>
> return ret;
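The ordering question raised in this reply can be illustrated with a small model. This is only a sketch under the reviewer's stated assumption that virtualization is detected after the point where the timeouts were originally parsed; `struct device_model`, `model_detect_virt()` and `model_timeout_settings()` are hypothetical names, not amdgpu functions, and the timeout setup is reduced to the single dependency under discussion.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_TIMEOUT LONG_MAX            /* stand-in for MAX_SCHEDULE_TIMEOUT */
#define DEFAULT_GFX_TIMEOUT_MS 10000L  /* the 10000 default used by the driver */

/* Hypothetical, much-reduced stand-in for struct amdgpu_device. */
struct device_model {
	bool virt_detected;   /* written by the detection step */
	long gfx_timeout;
	long compute_timeout;
};

/* Models the step that discovers SR-IOV or passthrough operation. */
static void model_detect_virt(struct device_model *dev, bool is_virtualized)
{
	assert(dev != NULL);
	dev->virt_detected = is_virtualized;
}

/* Models timeout setup that reads the detection result. */
static void model_timeout_settings(struct device_model *dev)
{
	assert(dev != NULL);
	dev->gfx_timeout = DEFAULT_GFX_TIMEOUT_MS;
	dev->compute_timeout = dev->virt_detected ? dev->gfx_timeout
						  : NO_TIMEOUT;
}
```

If `model_timeout_settings()` runs before `model_detect_virt()`, the flag is still false and the compute timeout stays infinite even on a virtualized device; running the two steps in the opposite order gives the intended finite value.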
* [PATCH v6] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Jesse Zhang @ 2019-09-20 2:36 UTC (permalink / raw)
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang
When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.
In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.
Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 4 +++-
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 13 +------------
3 files changed, 10 insertions(+), 19 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3b5282b..03ac5a1da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
amdgpu_device_check_block_size(adev);
- ret = amdgpu_device_get_job_timeout_settings(adev);
- if (ret) {
- dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
- return ret;
- }
-
adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
return ret;
@@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
if (r)
return r;
+ r = amdgpu_device_get_job_timeout_settings(adev);
+ if (r) {
+ dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
+ return r;
+ }
+
/* doorbell bar mapping and doorbell index init*/
amdgpu_device_doorbell_init(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 420888e..1236245 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1378,10 +1378,12 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
}
/*
* There is only one value specified and
- * it should apply to all non-compute jobs.
+ * it should apply to all jobs.
*/
if (index == 1)
adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
+ if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
+ adev->compute_timeout = adev->gfx_timeout;
}
return ret;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index cbcaa7c..9ef53ca 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -460,18 +460,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
timeout = adev->gfx_timeout;
break;
case AMDGPU_RING_TYPE_COMPUTE:
- /*
- * For non-sriov case, no timeout enforce
- * on compute ring by default. Unless user
- * specifies a timeout for compute ring.
- *
- * For sriov case, always use the timeout
- * as gfx ring
- */
- if (!amdgpu_sriov_vf(ring->adev))
- timeout = adev->compute_timeout;
- else
- timeout = adev->gfx_timeout;
+ timeout = adev->compute_timeout;
break;
case AMDGPU_RING_TYPE_SDMA:
timeout = adev->sdma_timeout;
--
2.7.4
* [PATCH v7] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Jesse Zhang @ 2019-09-20 2:38 UTC (permalink / raw)
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang
When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.
In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.
Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 4 +++-
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 13 +------------
3 files changed, 10 insertions(+), 19 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3b5282b..03ac5a1da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
amdgpu_device_check_block_size(adev);
- ret = amdgpu_device_get_job_timeout_settings(adev);
- if (ret) {
- dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
- return ret;
- }
-
adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
return ret;
@@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
if (r)
return r;
+ r = amdgpu_device_get_job_timeout_settings(adev);
+ if (r) {
+ dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
+ return r;
+ }
+
/* doorbell bar mapping and doorbell index init*/
amdgpu_device_doorbell_init(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 420888e..1236245 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1378,10 +1378,12 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
}
/*
* There is only one value specified and
- * it should apply to all non-compute jobs.
+ * it should apply to all jobs.
*/
if (index == 1)
adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
+ if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
+ adev->compute_timeout = adev->gfx_timeout;
}
return ret;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index cbcaa7c..9ef53ca 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -460,18 +460,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
timeout = adev->gfx_timeout;
break;
case AMDGPU_RING_TYPE_COMPUTE:
- /*
- * For non-sriov case, no timeout enforce
- * on compute ring by default. Unless user
- * specifies a timeout for compute ring.
- *
- * For sriov case, always use the timeout
- * as gfx ring
- */
- if (!amdgpu_sriov_vf(ring->adev))
- timeout = adev->compute_timeout;
- else
- timeout = adev->gfx_timeout;
+ timeout = adev->compute_timeout;
break;
case AMDGPU_RING_TYPE_SDMA:
timeout = adev->sdma_timeout;
--
2.7.4
* [PATCH v8] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Jesse Zhang @ 2019-09-20 6:57 UTC (permalink / raw)
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang
When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.
In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.
Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 7 ++++++-
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 13 +------------
3 files changed, 13 insertions(+), 19 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3b5282b..03ac5a1da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
amdgpu_device_check_block_size(adev);
- ret = amdgpu_device_get_job_timeout_settings(adev);
- if (ret) {
- dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
- return ret;
- }
-
adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
return ret;
@@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
if (r)
return r;
+ r = amdgpu_device_get_job_timeout_settings(adev);
+ if (r) {
+ dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
+ return r;
+ }
+
/* doorbell bar mapping and doorbell index init*/
amdgpu_device_doorbell_init(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 420888e..98be49b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1338,10 +1338,15 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
/*
* By default timeout for non compute jobs is 10000.
* And there is no timeout enforced on compute jobs.
+ * In SR-IOV or passthrough mode, the timeout for compute
+ * jobs is 10000 by default.
*/
adev->gfx_timeout = msecs_to_jiffies(10000);
adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
- adev->compute_timeout = MAX_SCHEDULE_TIMEOUT;
+ if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
+ adev->compute_timeout = adev->gfx_timeout;
+ else
+ adev->compute_timeout = MAX_SCHEDULE_TIMEOUT;
if (strnlen(input, AMDGPU_MAX_TIMEOUT_PARAM_LENTH)) {
while ((timeout_setting = strsep(&input, ",")) &&
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index cbcaa7c..9ef53ca 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -460,18 +460,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
timeout = adev->gfx_timeout;
break;
case AMDGPU_RING_TYPE_COMPUTE:
- /*
- * For non-sriov case, no timeout enforce
- * on compute ring by default. Unless user
- * specifies a timeout for compute ring.
- *
- * For sriov case, always use the timeout
- * as gfx ring
- */
- if (!amdgpu_sriov_vf(ring->adev))
- timeout = adev->compute_timeout;
- else
- timeout = adev->gfx_timeout;
+ timeout = adev->compute_timeout;
break;
case AMDGPU_RING_TYPE_SDMA:
timeout = adev->sdma_timeout;
--
2.7.4
* Re: [PATCH v8] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Christian König @ 2019-09-20 14:29 UTC (permalink / raw)
To: Jesse Zhang, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
On 20.09.19 at 08:57, Jesse Zhang wrote:
> When a compute fence does not signal, the compute ring cannot detect a
> hardware hang because its timeout value is infinite by default.
>
> In SR-IOV and passthrough mode, if the user does not declare a custom
> timeout value for the compute ring, use the gfx ring timeout value as the
> default, so that the compute ring can detect a true hardware hang.
>
> Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
> Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 7 ++++++-
> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 13 +------------
> 3 files changed, 13 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 3b5282b..03ac5a1da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
>
> amdgpu_device_check_block_size(adev);
>
> - ret = amdgpu_device_get_job_timeout_settings(adev);
> - if (ret) {
> - dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
> - return ret;
> - }
> -
> adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
>
> return ret;
> @@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> if (r)
> return r;
>
> + r = amdgpu_device_get_job_timeout_settings(adev);
> + if (r) {
> + dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
> + return r;
> + }
> +
> /* doorbell bar mapping and doorbell index init*/
> amdgpu_device_doorbell_init(adev);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 420888e..98be49b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1338,10 +1338,15 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
> /*
> * By default timeout for non compute jobs is 10000.
> * And there is no timeout enforced on compute jobs.
> + * In SR-IOV or passthrough mode, the timeout for compute
> + * jobs is 10000 by default.
> */
> adev->gfx_timeout = msecs_to_jiffies(10000);
> adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
> - adev->compute_timeout = MAX_SCHEDULE_TIMEOUT;
> + if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
> + adev->compute_timeout = adev->gfx_timeout;
> + else
> + adev->compute_timeout = MAX_SCHEDULE_TIMEOUT;
>
> if (strnlen(input, AMDGPU_MAX_TIMEOUT_PARAM_LENTH)) {
> while ((timeout_setting = strsep(&input, ",")) &&
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index cbcaa7c..9ef53ca 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -460,18 +460,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
> timeout = adev->gfx_timeout;
> break;
> case AMDGPU_RING_TYPE_COMPUTE:
> - /*
> - * For non-sriov case, no timeout enforce
> - * on compute ring by default. Unless user
> - * specifies a timeout for compute ring.
> - *
> - * For sriov case, always use the timeout
> - * as gfx ring
> - */
> - if (!amdgpu_sriov_vf(ring->adev))
> - timeout = adev->compute_timeout;
> - else
> - timeout = adev->gfx_timeout;
> + timeout = adev->compute_timeout;
> break;
> case AMDGPU_RING_TYPE_SDMA:
> timeout = adev->sdma_timeout;
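Taken together, the reviewed v8 sets the compute default at the point where the other defaults are set, and the fence driver simply reads the per-ring value. The following is a rough userspace model of that combined behaviour, not the driver code: all type and function names are hypothetical, timeouts are kept in milliseconds rather than jiffies, `LONG_MAX` stands in for MAX_SCHEDULE_TIMEOUT, and the mapping for the video ring type is an assumption because that case does not appear in the quoted hunk.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_TIMEOUT LONG_MAX  /* stand-in for MAX_SCHEDULE_TIMEOUT */

/* Hypothetical ring types; RING_VIDEO is an assumption of this sketch. */
enum ring_type_model { RING_GFX, RING_COMPUTE, RING_SDMA, RING_VIDEO };

/* Per-device timeouts, in milliseconds for simplicity. */
struct timeouts_model {
	long gfx, compute, sdma, video;
};

/* Defaults as described by the v8 comment: 10000 for non-compute jobs,
 * and for compute jobs either 10000 (virtualized) or no timeout. */
static void set_default_timeouts(struct timeouts_model *t, bool virtualized)
{
	assert(t != NULL);
	t->gfx = 10000;
	t->sdma = t->video = t->gfx;
	t->compute = virtualized ? t->gfx : NO_TIMEOUT;
}

/* After the patch, the compute case needs no SR-IOV special-casing. */
static long ring_timeout(const struct timeouts_model *t,
			 enum ring_type_model type)
{
	assert(t != NULL);
	switch (type) {
	case RING_GFX:
		return t->gfx;
	case RING_COMPUTE:
		return t->compute;
	case RING_SDMA:
		return t->sdma;
	case RING_VIDEO:
		break;
	}
	return t->video;
}
```

Keeping the virtualization decision in one place means a user-supplied value parsed afterwards can still override the compute default, which the earlier unconditional SR-IOV branch in the fence driver did not allow.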