* [PATCH 1/2] drm/amdgpu: log on non-zero error counter per IP before GPU reset
@ 2020-02-14 14:06 Guchun Chen
  2020-02-14 14:06 ` [PATCH 2/2] drm/amdgpu: record non-zero error counter info in NBIO before resetting GPU Guchun Chen
  0 siblings, 1 reply; 3+ messages in thread
From: Guchun Chen @ 2020-02-14 14:06 UTC
To: amd-gfx, Hawking.Zhang, Dennis.Li, Tao.Zhou1, John.Clements; +Cc: Guchun Chen

Once a sync flood interrupt is triggered by a RAS error, query and print
the non-zero error counters before the actual GPU recovery job starts.
This helps the user quickly find the source of the RAS error.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 33 +++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index cef94e2169fe..6a9a45d6b9e4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1319,6 +1319,33 @@ static int amdgpu_ras_interrupt_remove_all(struct amdgpu_device *adev)
 }
 /* ih end */
 
+/* traversal all IPs except NBIO to query error counter */
+static void amdgpu_ras_log_on_err_counter(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct ras_manager *obj;
+
+	if (!con)
+		return;
+
+	list_for_each_entry(obj, &con->head, node) {
+		struct ras_query_if info = {
+			.head = obj->head,
+		};
+
+		/*
+		 * PCIE_BIF IP has one different isr by ras controller
+		 * interrupt, the specific ras counter query will be
+		 * done in that isr. So skip such block from common
+		 * sync flood interrupt isr calling.
+		 */
+		if (info.head.block == AMDGPU_RAS_BLOCK__PCIE_BIF)
+			continue;
+
+		amdgpu_ras_error_query(adev, &info);
+	}
+}
+
 /* recovery begin */
 
 /* return 0 on success.
@@ -1373,6 +1400,12 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
 	struct amdgpu_ras *ras =
 		container_of(work, struct amdgpu_ras, recovery_work);
 
+	/*
+	 * Query and print non zero error counter per IP block for
+	 * awareness before recovering GPU.
+	 */
+	amdgpu_ras_log_on_err_counter(ras->adev);
+
 	if (amdgpu_device_should_recover_gpu(ras->adev))
 		amdgpu_device_gpu_recover(ras->adev, 0);
 	atomic_set(&ras->in_recovery, 0);
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
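The traversal in patch 1 can be illustrated outside the kernel. The following is a minimal user-space sketch, not the driver code: it uses a plain array instead of the kernel list, and every name in it (`enum ras_block`, `struct err_counter`, `block_name`, `log_nonzero_counters`) is a hypothetical stand-in. It only shows the control flow: visit every block, skip the one that reports through its own interrupt handler, and print only non-zero counters.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's RAS block ids. */
enum ras_block { BLOCK_UMC, BLOCK_GFX, BLOCK_PCIE_BIF, BLOCK_COUNT };

/* Hypothetical per-block counter record (not the driver's struct). */
struct err_counter {
	enum ras_block block;
	unsigned long ce_count; /* correctable errors */
	unsigned long ue_count; /* uncorrectable errors */
};

static const char *block_name(enum ras_block b)
{
	static const char *const names[BLOCK_COUNT] = {
		"umc", "gfx", "pcie_bif"
	};
	return names[b];
}

/*
 * Walk all blocks before a (simulated) reset. PCIE_BIF is skipped
 * because, as in the patch, it is assumed to be handled by its own
 * interrupt handler. Returns the number of blocks that were logged.
 */
static size_t log_nonzero_counters(const struct err_counter *c, size_t n)
{
	size_t logged = 0;

	for (size_t i = 0; i < n; i++) {
		if (c[i].block == BLOCK_PCIE_BIF)
			continue;
		if (!c[i].ce_count && !c[i].ue_count)
			continue;

		printf("%s: %lu correctable, %lu uncorrectable errors\n",
		       block_name(c[i].block), c[i].ce_count, c[i].ue_count);
		logged++;
	}
	return logged;
}
```

Calling `log_nonzero_counters()` on an array holding one UMC entry with errors, one clean GFX entry and one PCIE_BIF entry with errors would log only the UMC entry.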
* [PATCH 2/2] drm/amdgpu: record non-zero error counter info in NBIO before resetting GPU
  2020-02-14 14:06 [PATCH 1/2] drm/amdgpu: log on non-zero error counter per IP before GPU reset Guchun Chen
@ 2020-02-14 14:06 ` Guchun Chen
  2020-02-14 14:59   ` Zhang, Hawking
  0 siblings, 1 reply; 3+ messages in thread
From: Guchun Chen @ 2020-02-14 14:06 UTC
To: amd-gfx, Hawking.Zhang, Dennis.Li, Tao.Zhou1, John.Clements; +Cc: Guchun Chen

When an NBIO RAS error happens, record the error counter information
before triggering the GPU reset. This fixes the issue of the error
counter value being missed when it is read from debugfs.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
index 65eb378fa035..149d386590df 100644
--- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
+++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
@@ -318,6 +318,7 @@ static void nbio_v7_4_handle_ras_controller_intr_no_bifring(struct amdgpu_device
 {
 	uint32_t bif_doorbell_intr_cntl;
 	struct ras_manager *obj = amdgpu_ras_find_obj(adev, adev->nbio.ras_if);
+	struct ras_err_data err_data = {0, 0, 0, NULL};
 
 	bif_doorbell_intr_cntl = RREG32_SOC15(NBIO, 0, mmBIF_DOORBELL_INT_CNTL);
 	if (REG_GET_FIELD(bif_doorbell_intr_cntl,
@@ -332,7 +333,19 @@ static void nbio_v7_4_handle_ras_controller_intr_no_bifring(struct amdgpu_device
 		 * clear error status after ras_controller_intr according to
 		 * hw team and count ue number for query
 		 */
-		nbio_v7_4_query_ras_error_count(adev, &obj->err_data);
+		nbio_v7_4_query_ras_error_count(adev, &err_data);
+
+		/* logging on error counter and printing for awareness */
+		obj->err_data.ue_count += err_data.ue_count;
+		obj->err_data.ce_count += err_data.ce_count;
+
+		if (err_data.ce_count)
+			DRM_INFO("%ld correctable errors detected in %s block\n",
+				obj->err_data.ce_count, adev->nbio.ras_if->name);
+
+		if (err_data.ue_count)
+			DRM_INFO("%ld uncorrectable errors detected in %s block\n",
+				obj->err_data.ue_count, adev->nbio.ras_if->name);
 
 		DRM_WARN("RAS controller interrupt triggered by NBIF error\n");
 
-- 
2.17.1
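The bookkeeping in patch 2 (query into a zeroed local record, add it to the stored totals, log only when the new query found something) can be sketched in user space as follows. This is an illustration under assumptions, not the driver code: `struct err_data`, `query_fn`, `fake_query` and `record_errors` are hypothetical names, and `fake_query` simply pretends the hardware reported two correctable and one uncorrectable error.

```c
#include <stdio.h>

/* Hypothetical, simplified counterpart of the driver's error record. */
struct err_data {
	unsigned long ue_count; /* uncorrectable errors */
	unsigned long ce_count; /* correctable errors */
};

/* Hypothetical hardware query: adds newly observed errors to *out. */
typedef void (*query_fn)(struct err_data *out);

/* Fake query used for illustration only. */
static void fake_query(struct err_data *out)
{
	out->ce_count += 2;
	out->ue_count += 1;
}

/*
 * Query into a zeroed local record first, then fold the result into
 * the running totals. Logging is gated on the fresh counts, while the
 * message reports the accumulated total, mirroring the patch.
 */
static void record_errors(struct err_data *total, query_fn query,
			  const char *name)
{
	struct err_data delta = { 0, 0 };

	query(&delta);

	total->ue_count += delta.ue_count;
	total->ce_count += delta.ce_count;

	if (delta.ce_count)
		printf("%lu correctable errors detected in %s block\n",
		       total->ce_count, name);
	if (delta.ue_count)
		printf("%lu uncorrectable errors detected in %s block\n",
		       total->ue_count, name);
}
```

Keeping the fresh counts in a separate record is what makes it possible to tell "this interrupt found new errors" apart from "errors were already recorded earlier", while the stored totals stay available for later reads.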
* RE: [PATCH 2/2] drm/amdgpu: record non-zero error counter info in NBIO before resetting GPU
  2020-02-14 14:06 ` [PATCH 2/2] drm/amdgpu: record non-zero error counter info in NBIO before resetting GPU Guchun Chen
@ 2020-02-14 14:59   ` Zhang, Hawking
  0 siblings, 0 replies; 3+ messages in thread
From: Zhang, Hawking @ 2020-02-14 14:59 UTC
To: Chen, Guchun, amd-gfx, Li, Dennis, Zhou1, Tao, Clements, John

[AMD Official Use Only - Internal Distribution Only]

Series is Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

Regards,
Hawking

-----Original Message-----
From: Chen, Guchun <Guchun.Chen@amd.com>
Sent: Friday, February 14, 2020 22:07
To: amd-gfx@lists.freedesktop.org; Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John <John.Clements@amd.com>
Cc: Chen, Guchun <Guchun.Chen@amd.com>
Subject: [PATCH 2/2] drm/amdgpu: record non-zero error counter info in NBIO before resetting GPU

When NBIO's RAS error happens, before trigging GPU reset, it's needed to record error counter information, which can correct the error counter value missed issue when reading from debugfs.