* [PATCH v2] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Jesse Zhang @ 2019-09-19  7:08 UTC
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang

When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.

In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.

Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/soc15.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 7c7e9f5..6cd5548 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -687,6 +687,16 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
 	adev->rev_id = soc15_get_rev_id(adev);
 	adev->nbio.funcs->detect_hw_virt(adev);
 
+	/*
+	 * If running under SR-IOV or passthrough mode and user did not set
+	 * custom value for compute ring timeout, set timeout to be the same
+	 * as gfx ring timeout so that the compute ring can detect a true
+	 * hang.
+	 */
+	if ((amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev)) &&
+	    (adev->compute_timeout == MAX_SCHEDULE_TIMEOUT))
+		adev->compute_timeout = adev->gfx_timeout;
+
 	if (amdgpu_sriov_vf(adev))
 		adev->virt.ops = &xgpu_ai_virt_ops;
 
-- 
2.7.4
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

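The fallback that v2 adds to soc15_set_ip_blocks() can be sketched outside the kernel as follows. This is only a minimal model under stated assumptions: `struct timeout_model` and `apply_compute_timeout_fallback()` are hypothetical stand-ins, not driver code, and the two booleans replace the `amdgpu_sriov_vf()` and `amdgpu_passthrough()` checks. `MAX_SCHEDULE_TIMEOUT` is defined as `LONG_MAX`, which is how the kernel defines it.

```c
#include <limits.h>
#include <stdbool.h>

/* The kernel defines MAX_SCHEDULE_TIMEOUT as LONG_MAX ("wait forever"). */
#define MAX_SCHEDULE_TIMEOUT LONG_MAX

/* Hypothetical, reduced model of the fields the patch touches. */
struct timeout_model {
	bool sriov_vf;        /* stands in for amdgpu_sriov_vf(adev) */
	bool passthrough;     /* stands in for amdgpu_passthrough(adev) */
	long gfx_timeout;     /* gfx ring timeout */
	long compute_timeout; /* compute ring timeout */
};

/*
 * Mirror of the condition added in v2: only when virtualized AND the
 * compute timeout is still at its "infinite" default does the compute
 * ring inherit the gfx timeout. A user-supplied value is left alone.
 */
static void apply_compute_timeout_fallback(struct timeout_model *m)
{
	if ((m->sriov_vf || m->passthrough) &&
	    m->compute_timeout == MAX_SCHEDULE_TIMEOUT)
		m->compute_timeout = m->gfx_timeout;
}
```

One consequence of comparing against MAX_SCHEDULE_TIMEOUT is that, in this model, a user who explicitly asks for an infinite compute timeout cannot be told apart from one who set nothing.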
* [PATCH v3] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Jesse Zhang @ 2019-09-19  7:58 UTC
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang

When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.

In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.

Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  5 +----
 drivers/gpu/drm/amd/amdgpu/soc15.c        | 10 ++++++++++
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index cbcaa7c..963b6d1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -468,10 +468,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 			 * For sriov case, always use the timeout
 			 * as gfx ring
 			 */
-			if (!amdgpu_sriov_vf(ring->adev))
-				timeout = adev->compute_timeout;
-			else
-				timeout = adev->gfx_timeout;
+			timeout = adev->compute_timeout;
 			break;
 		case AMDGPU_RING_TYPE_SDMA:
 			timeout = adev->sdma_timeout;
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 7c7e9f5..6cd5548 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -687,6 +687,16 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
 	adev->rev_id = soc15_get_rev_id(adev);
 	adev->nbio.funcs->detect_hw_virt(adev);
 
+	/*
+	 * If running under SR-IOV or passthrough mode and user did not set
+	 * custom value for compute ring timeout, set timeout to be the same
+	 * as gfx ring timeout so that the compute ring can detect a true
+	 * hang.
+	 */
+	if ((amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev)) &&
+	    (adev->compute_timeout == MAX_SCHEDULE_TIMEOUT))
+		adev->compute_timeout = adev->gfx_timeout;
+
 	if (amdgpu_sriov_vf(adev))
 		adev->virt.ops = &xgpu_ai_virt_ops;
 
-- 
2.7.4

* [PATCH v4] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Jesse Zhang @ 2019-09-19  8:00 UTC
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang

When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.

In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.

Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  5 +----
 drivers/gpu/drm/amd/amdgpu/soc15.c        | 10 ++++++++++
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index cbcaa7c..963b6d1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -468,10 +468,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 			 * For sriov case, always use the timeout
 			 * as gfx ring
 			 */
-			if (!amdgpu_sriov_vf(ring->adev))
-				timeout = adev->compute_timeout;
-			else
-				timeout = adev->gfx_timeout;
+			timeout = adev->compute_timeout;
 			break;
 		case AMDGPU_RING_TYPE_SDMA:
 			timeout = adev->sdma_timeout;
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 7c7e9f5..6cd5548 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -687,6 +687,16 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
 	adev->rev_id = soc15_get_rev_id(adev);
 	adev->nbio.funcs->detect_hw_virt(adev);
 
+	/*
+	 * If running under SR-IOV or passthrough mode and user did not set
+	 * custom value for compute ring timeout, set timeout to be the same
+	 * as gfx ring timeout so that the compute ring can detect a true
+	 * hang.
+	 */
+	if ((amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev)) &&
+	    (adev->compute_timeout == MAX_SCHEDULE_TIMEOUT))
+		adev->compute_timeout = adev->gfx_timeout;
+
 	if (amdgpu_sriov_vf(adev))
 		adev->virt.ops = &xgpu_ai_virt_ops;
 
-- 
2.7.4

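With the SR-IOV special case removed from amdgpu_fence_driver_init_ring(), the per-ring choice reduces to a plain lookup by ring type. The sketch below illustrates that shape only; `enum ring_kind`, `struct ring_timeouts` and `pick_ring_timeout()` are hypothetical names, and mapping every remaining ring type to the video timeout is an assumption, since the hunk above shows only the gfx, compute and SDMA cases.

```c
#include <limits.h>

/* Hypothetical ring kinds; the driver uses AMDGPU_RING_TYPE_* values. */
enum ring_kind { RING_GFX, RING_COMPUTE, RING_SDMA, RING_OTHER };

/* Hypothetical container for the four per-device timeout fields. */
struct ring_timeouts {
	long gfx;
	long compute;
	long sdma;
	long video;
};

/*
 * After v3/v4 the compute case no longer checks for SR-IOV here: the
 * caller is expected to have stored the right value in t->compute.
 */
static long pick_ring_timeout(const struct ring_timeouts *t, enum ring_kind kind)
{
	switch (kind) {
	case RING_GFX:
		return t->gfx;
	case RING_COMPUTE:
		return t->compute;
	case RING_SDMA:
		return t->sdma;
	default:
		/* Assumption: all remaining ring kinds use the video timeout. */
		return t->video;
	}
}
```

The design point is that the virtualization policy lives in one place (where the timeouts are initialized) and the fence code only reads the result.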
* Re: [PATCH v4] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Christian König @ 2019-09-19  8:14 UTC
To: Jesse Zhang, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 19.09.19 at 10:00, Jesse Zhang wrote:
> When a compute fence does not signal, the compute ring cannot detect a
> hardware hang because its timeout value is infinite by default.
>
> In SR-IOV and passthrough mode, if the user does not declare a custom
> timeout value for the compute ring, use the gfx ring timeout value as the
> default, so that the compute ring can detect a true hardware hang.
>
> Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
> Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  5 +----
>  drivers/gpu/drm/amd/amdgpu/soc15.c        | 10 ++++++++++
>  2 files changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index cbcaa7c..963b6d1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -468,10 +468,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>  			 * For sriov case, always use the timeout
>  			 * as gfx ring
>  			 */

Please also remove the comment, since it is now stale.

Apart from that, looks good to me,
Christian.

> -			if (!amdgpu_sriov_vf(ring->adev))
> -				timeout = adev->compute_timeout;
> -			else
> -				timeout = adev->gfx_timeout;
> +			timeout = adev->compute_timeout;
>  			break;
>  		case AMDGPU_RING_TYPE_SDMA:
>  			timeout = adev->sdma_timeout;
> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
> index 7c7e9f5..6cd5548 100644
> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> @@ -687,6 +687,16 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
>  	adev->rev_id = soc15_get_rev_id(adev);
>  	adev->nbio.funcs->detect_hw_virt(adev);
>  
> +	/*
> +	 * If running under SR-IOV or passthrough mode and user did not set
> +	 * custom value for compute ring timeout, set timeout to be the same
> +	 * as gfx ring timeout so that the compute ring can detect a true
> +	 * hang.
> +	 */
> +	if ((amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev)) &&
> +	    (adev->compute_timeout == MAX_SCHEDULE_TIMEOUT))
> +		adev->compute_timeout = adev->gfx_timeout;
> +
>  	if (amdgpu_sriov_vf(adev))
>  		adev->virt.ops = &xgpu_ai_virt_ops;
>  

* [PATCH v5] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Jesse Zhang @ 2019-09-19 10:09 UTC
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang

When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.

In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.

Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +++-
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3b5282b..03ac5a1da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
 
 	amdgpu_device_check_block_size(adev);
 
-	ret = amdgpu_device_get_job_timeout_settings(adev);
-	if (ret) {
-		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
-		return ret;
-	}
-
 	adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
 
 	return ret;
@@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	if (r)
 		return r;
 
+	r = amdgpu_device_get_job_timeout_settings(adev);
+	if (r) {
+		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
+		return r;
+	}
+
 	/* doorbell bar mapping and doorbell index init*/
 	amdgpu_device_doorbell_init(adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 420888e..1236245 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1378,10 +1378,12 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
 		}
 		/*
 		 * There is only one value specified and
-		 * it should apply to all non-compute jobs.
+		 * it should apply to all jobs.
 		 */
 		if (index == 1)
 			adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
+		if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
+			adev->compute_timeout = adev->gfx_timeout;
 	}
 
 	return ret;
-- 
2.7.4

* Re: [PATCH v5] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Christian König @ 2019-09-19 12:12 UTC
To: Jesse Zhang, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 19.09.19 at 12:09, Jesse Zhang wrote:
> When a compute fence does not signal, the compute ring cannot detect a
> hardware hang because its timeout value is infinite by default.
>
> In SR-IOV and passthrough mode, if the user does not declare a custom
> timeout value for the compute ring, use the gfx ring timeout value as the
> default, so that the compute ring can detect a true hardware hang.
>
> Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
> Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +++-
>  2 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 3b5282b..03ac5a1da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
>  
>  	amdgpu_device_check_block_size(adev);
>  
> -	ret = amdgpu_device_get_job_timeout_settings(adev);
> -	if (ret) {
> -		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
> -		return ret;
> -	}
> -
>  	adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
>  
>  	return ret;
> @@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  	if (r)
>  		return r;
>  
> +	r = amdgpu_device_get_job_timeout_settings(adev);
> +	if (r) {
> +		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
> +		return r;
> +	}
> +

I assume you moved the code because the SR-IOV/passthrough setting is not
yet available at the earlier point?

But even with this change you can still remove the extra SR-IOV check in
amdgpu_fence.c.

Regards,
Christian.

>  	/* doorbell bar mapping and doorbell index init*/
>  	amdgpu_device_doorbell_init(adev);
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 420888e..1236245 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1378,10 +1378,12 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
>  		}
>  		/*
>  		 * There is only one value specified and
> -		 * it should apply to all non-compute jobs.
> +		 * it should apply to all jobs.
>  		 */
>  		if (index == 1)
>  			adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
> +		if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
> +			adev->compute_timeout = adev->gfx_timeout;
>  	}
>  
>  	return ret;

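If the assumption raised in the review holds, the move matters because the virtualization flags are only meaningful after detection has run. The toy model below illustrates that ordering dependency and nothing more: `struct vdev_model`, `detect_virtualization()` and `init_job_timeouts()` are hypothetical, module-parameter parsing is left out entirely, and values are plain milliseconds rather than jiffies.

```c
#include <limits.h>
#include <stdbool.h>

#define NO_TIMEOUT LONG_MAX /* stands in for MAX_SCHEDULE_TIMEOUT */

/* Hypothetical device model. */
struct vdev_model {
	bool hw_is_vf;        /* what the hardware really is */
	bool sriov_vf;        /* what the driver knows; false until detected */
	long gfx_timeout;     /* milliseconds in this model */
	long compute_timeout; /* milliseconds in this model */
};

/* Stand-in for the detection step (detect_hw_virt in the driver). */
static void detect_virtualization(struct vdev_model *d)
{
	d->sriov_vf = d->hw_is_vf;
}

/* Timeout setup that consults the (possibly not yet valid) flag. */
static void init_job_timeouts(struct vdev_model *d)
{
	d->gfx_timeout = 10000;
	d->compute_timeout = d->sriov_vf ? d->gfx_timeout : NO_TIMEOUT;
}
```

Calling `init_job_timeouts()` before `detect_virtualization()` on a virtual function leaves the compute timeout infinite in this model; calling it afterwards yields 10000.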
* [PATCH v6] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Jesse Zhang @ 2019-09-20  2:36 UTC
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang

When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.

In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.

Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 13 +------------
 3 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3b5282b..03ac5a1da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
 
 	amdgpu_device_check_block_size(adev);
 
-	ret = amdgpu_device_get_job_timeout_settings(adev);
-	if (ret) {
-		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
-		return ret;
-	}
-
 	adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
 
 	return ret;
@@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	if (r)
 		return r;
 
+	r = amdgpu_device_get_job_timeout_settings(adev);
+	if (r) {
+		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
+		return r;
+	}
+
 	/* doorbell bar mapping and doorbell index init*/
 	amdgpu_device_doorbell_init(adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 420888e..1236245 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1378,10 +1378,12 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
 		}
 		/*
 		 * There is only one value specified and
-		 * it should apply to all non-compute jobs.
+		 * it should apply to all jobs.
 		 */
 		if (index == 1)
 			adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
+		if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
+			adev->compute_timeout = adev->gfx_timeout;
 	}
 
 	return ret;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index cbcaa7c..9ef53ca 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -460,18 +460,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 			timeout = adev->gfx_timeout;
 			break;
 		case AMDGPU_RING_TYPE_COMPUTE:
-			/*
-			 * For non-sriov case, no timeout enforce
-			 * on compute ring by default. Unless user
-			 * specifies a timeout for compute ring.
-			 *
-			 * For sriov case, always use the timeout
-			 * as gfx ring
-			 */
-			if (!amdgpu_sriov_vf(ring->adev))
-				timeout = adev->compute_timeout;
-			else
-				timeout = adev->gfx_timeout;
+			timeout = adev->compute_timeout;
 			break;
 		case AMDGPU_RING_TYPE_SDMA:
 			timeout = adev->sdma_timeout;
-- 
2.7.4

* [PATCH v7] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Jesse Zhang @ 2019-09-20  2:38 UTC
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang

When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.

In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.

Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 13 +------------
 3 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3b5282b..03ac5a1da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
 
 	amdgpu_device_check_block_size(adev);
 
-	ret = amdgpu_device_get_job_timeout_settings(adev);
-	if (ret) {
-		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
-		return ret;
-	}
-
 	adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
 
 	return ret;
@@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	if (r)
 		return r;
 
+	r = amdgpu_device_get_job_timeout_settings(adev);
+	if (r) {
+		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
+		return r;
+	}
+
 	/* doorbell bar mapping and doorbell index init*/
 	amdgpu_device_doorbell_init(adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 420888e..1236245 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1378,10 +1378,12 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
 		}
 		/*
 		 * There is only one value specified and
-		 * it should apply to all non-compute jobs.
+		 * it should apply to all jobs.
 		 */
 		if (index == 1)
 			adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
+		if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
+			adev->compute_timeout = adev->gfx_timeout;
 	}
 
 	return ret;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index cbcaa7c..9ef53ca 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -460,18 +460,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 			timeout = adev->gfx_timeout;
 			break;
 		case AMDGPU_RING_TYPE_COMPUTE:
-			/*
-			 * For non-sriov case, no timeout enforce
-			 * on compute ring by default. Unless user
-			 * specifies a timeout for compute ring.
-			 *
-			 * For sriov case, always use the timeout
-			 * as gfx ring
-			 */
-			if (!amdgpu_sriov_vf(ring->adev))
-				timeout = adev->compute_timeout;
-			else
-				timeout = adev->gfx_timeout;
+			timeout = adev->compute_timeout;
 			break;
 		case AMDGPU_RING_TYPE_SDMA:
 			timeout = adev->sdma_timeout;
-- 
2.7.4

* [PATCH v8] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Jesse Zhang @ 2019-09-20  6:57 UTC
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang

When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.

In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.

Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  7 ++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 13 +------------
 3 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3b5282b..03ac5a1da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
 
 	amdgpu_device_check_block_size(adev);
 
-	ret = amdgpu_device_get_job_timeout_settings(adev);
-	if (ret) {
-		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
-		return ret;
-	}
-
 	adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
 
 	return ret;
@@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	if (r)
 		return r;
 
+	r = amdgpu_device_get_job_timeout_settings(adev);
+	if (r) {
+		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
+		return r;
+	}
+
 	/* doorbell bar mapping and doorbell index init*/
 	amdgpu_device_doorbell_init(adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 420888e..98be49b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1338,10 +1338,15 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
 	/*
 	 * By default timeout for non compute jobs is 10000.
 	 * And there is no timeout enforced on compute jobs.
+	 * In SR-IOV or passthrough mode, the timeout for compute
+	 * jobs is 10000 by default.
 	 */
 	adev->gfx_timeout = msecs_to_jiffies(10000);
 	adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
-	adev->compute_timeout = MAX_SCHEDULE_TIMEOUT;
+	if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
+		adev->compute_timeout = adev->gfx_timeout;
+	else
+		adev->compute_timeout = MAX_SCHEDULE_TIMEOUT;
 
 	if (strnlen(input, AMDGPU_MAX_TIMEOUT_PARAM_LENTH)) {
 		while ((timeout_setting = strsep(&input, ",")) &&
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index cbcaa7c..9ef53ca 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -460,18 +460,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 			timeout = adev->gfx_timeout;
 			break;
 		case AMDGPU_RING_TYPE_COMPUTE:
-			/*
-			 * For non-sriov case, no timeout enforce
-			 * on compute ring by default. Unless user
-			 * specifies a timeout for compute ring.
-			 *
-			 * For sriov case, always use the timeout
-			 * as gfx ring
-			 */
-			if (!amdgpu_sriov_vf(ring->adev))
-				timeout = adev->compute_timeout;
-			else
-				timeout = adev->gfx_timeout;
+			timeout = adev->compute_timeout;
 			break;
 		case AMDGPU_RING_TYPE_SDMA:
 			timeout = adev->sdma_timeout;
-- 
2.7.4

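The default selection that v8 moves into amdgpu_device_get_job_timeout_settings() can be summarized by the sketch below. It is a reduced illustration, not the driver function: `struct job_timeouts` and `set_default_job_timeouts()` are hypothetical names, the values stay in milliseconds instead of going through msecs_to_jiffies(), and the later parsing of the lockup_timeout parameter (which may override these defaults) is omitted.

```c
#include <limits.h>
#include <stdbool.h>

#define DEFAULT_TIMEOUT_MS 10000L
#define NO_TIMEOUT LONG_MAX /* stands in for MAX_SCHEDULE_TIMEOUT */

/* Hypothetical container for the four per-device job timeouts. */
struct job_timeouts {
	long gfx;
	long compute;
	long sdma;
	long video;
};

/*
 * Defaults as described by the v8 comment: 10000 for non-compute jobs,
 * no timeout for compute jobs on bare metal, and 10000 for compute jobs
 * under SR-IOV or passthrough.
 */
static void set_default_job_timeouts(struct job_timeouts *t,
				     bool sriov_vf, bool passthrough)
{
	t->gfx = DEFAULT_TIMEOUT_MS;
	t->sdma = t->video = t->gfx;
	if (sriov_vf || passthrough)
		t->compute = t->gfx;
	else
		t->compute = NO_TIMEOUT;
}
```

Because the virtualized default is chosen before any user value is applied, an explicit compute timeout from the user would simply replace it, and no comparison against the infinite sentinel is needed.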
* Re: [PATCH v8] drm/amd/amdgpu:Fix compute ring unable to detect hang.
From: Christian König @ 2019-09-20 14:29 UTC
To: Jesse Zhang, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 20.09.19 at 08:57, Jesse Zhang wrote:
> When a compute fence does not signal, the compute ring cannot detect a
> hardware hang because its timeout value is infinite by default.
>
> In SR-IOV and passthrough mode, if the user does not declare a custom
> timeout value for the compute ring, use the gfx ring timeout value as the
> default, so that the compute ring can detect a true hardware hang.
>
> Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
> Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  7 ++++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 13 +------------
>  3 files changed, 13 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 3b5282b..03ac5a1da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
>  
>  	amdgpu_device_check_block_size(adev);
>  
> -	ret = amdgpu_device_get_job_timeout_settings(adev);
> -	if (ret) {
> -		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
> -		return ret;
> -	}
> -
>  	adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
>  
>  	return ret;
> @@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  	if (r)
>  		return r;
>  
> +	r = amdgpu_device_get_job_timeout_settings(adev);
> +	if (r) {
> +		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
> +		return r;
> +	}
> +
>  	/* doorbell bar mapping and doorbell index init*/
>  	amdgpu_device_doorbell_init(adev);
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 420888e..98be49b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1338,10 +1338,15 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
>  	/*
>  	 * By default timeout for non compute jobs is 10000.
>  	 * And there is no timeout enforced on compute jobs.
> +	 * In SR-IOV or passthrough mode, the timeout for compute
> +	 * jobs is 10000 by default.
>  	 */
>  	adev->gfx_timeout = msecs_to_jiffies(10000);
>  	adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
> -	adev->compute_timeout = MAX_SCHEDULE_TIMEOUT;
> +	if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
> +		adev->compute_timeout = adev->gfx_timeout;
> +	else
> +		adev->compute_timeout = MAX_SCHEDULE_TIMEOUT;
>  
>  	if (strnlen(input, AMDGPU_MAX_TIMEOUT_PARAM_LENTH)) {
>  		while ((timeout_setting = strsep(&input, ",")) &&
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index cbcaa7c..9ef53ca 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -460,18 +460,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>  			timeout = adev->gfx_timeout;
>  			break;
>  		case AMDGPU_RING_TYPE_COMPUTE:
> -			/*
> -			 * For non-sriov case, no timeout enforce
> -			 * on compute ring by default. Unless user
> -			 * specifies a timeout for compute ring.
> -			 *
> -			 * For sriov case, always use the timeout
> -			 * as gfx ring
> -			 */
> -			if (!amdgpu_sriov_vf(ring->adev))
> -				timeout = adev->compute_timeout;
> -			else
> -				timeout = adev->gfx_timeout;
> +			timeout = adev->compute_timeout;
>  			break;
>  		case AMDGPU_RING_TYPE_SDMA:
>  			timeout = adev->sdma_timeout;
