All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/20] RAS support for amdgpu
@ 2019-03-05 20:41 Alex Deucher
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Alex Deucher

This patch set adds initial RAS (Reliability, Availability, Serviceability)
support to amdgpu on supported boards.  Features include SRAM and VRAM ECC,
bad page tracking, and error containment.

Eric Huang (2):
  drm/amdkfd: add RAS capabilities in topology for Vega20 (v2)
  drm/amdkfd: add RAS ECC event support (v2)

Feifei Xu (1):
  drm/amdgpu: enable ras on gfx9 (v2)

xinhui pan (17):
  drm/amdgpu: add ta ras fw info (v2)
  drm/amdgpu: export ta fw info
  drm/amdgpu: add module parameters for ras
  drm/amdgpu: add ta_ras_if.h
  drm/amdgpu: add psp ras callback func and macro
  drm/amdgpu: add psp ras subsystem infrastructure (v2)
  drm/amdgpu: add psp v11 ras callback
  drm/amdgpu: add psp cmd submit timeout
  drm/amdgpu: add amdgpu_ras.c to support ras (v2)
  drm/amdgpu: add debugfs ctrl node
  drm/amdgpu: reserve bad pages during recovery
  drm/amdgpu: enable ras on sdma4
  drm/amdgpu: enable ras on gmc9
  drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2
  drm/amdgpu: add ioctl query for enabled ras features (v2)
  drm/amdgpu: skip gpu reset when ras error occured
  drm/amdgpu: add human readable debugfs control support (v2)

 drivers/gpu/drm/amd/amdgpu/Makefile        |    2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |    2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |    1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c    |   17 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h    |    2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   14 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   17 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |    3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h    |    2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |   31 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |  220 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h    |   32 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 1438 ++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  229 ++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h   |    4 +
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      |  174 +++
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  277 ++++
 drivers/gpu/drm/amd/amdgpu/psp_v11_0.c     |   57 +
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c     |  186 ++-
 drivers/gpu/drm/amd/amdgpu/ta_ras_if.h     |  108 ++
 drivers/gpu/drm/amd/amdkfd/kfd_device.c    |   11 +
 drivers/gpu/drm/amd/amdkfd/kfd_events.c    |   18 +-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h      |    3 +
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c  |   16 +
 drivers/gpu/drm/amd/amdkfd/kfd_topology.h  |    4 +
 include/uapi/drm/amdgpu_drm.h              |   35 +
 include/uapi/linux/kfd_ioctl.h             |   12 +-
 27 files changed, 2910 insertions(+), 5 deletions(-)
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
 create mode 100644 drivers/gpu/drm/amd/amdgpu/ta_ras_if.h

-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 01/20] drm/amdgpu: add ta ras fw info (v2)
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 02/20] drm/amdgpu: export ta fw info Alex Deucher
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Alex Deucher, xinhui pan

From: xinhui pan <xinhui.pan@amd.com>

Add ras fw part, xgmi and ras fw are combined together in ta binary.
Reading the data from the info is not implemented yet.

v2: squash in "drm/amdgpu: fix NULL pointer when ta is missing"

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h | 4 ++++
 drivers/gpu/drm/amd/amdgpu/psp_v11_0.c  | 7 +++++++
 2 files changed, 11 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
index 2ef98cc755d6..49c3942e469c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
@@ -150,9 +150,13 @@ struct psp_context
 
 	/* xgmi ta firmware and buffer */
 	const struct firmware		*ta_fw;
+	uint32_t			ta_fw_version;
 	uint32_t			ta_xgmi_ucode_version;
 	uint32_t			ta_xgmi_ucode_size;
 	uint8_t				*ta_xgmi_start_addr;
+	uint32_t			ta_ras_ucode_version;
+	uint32_t			ta_ras_ucode_size;
+	uint8_t				*ta_ras_start_addr;
 	struct psp_xgmi_context		xgmi_context;
 };
 
diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
index 2c584cc9375f..8fb267684787 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
@@ -116,6 +116,13 @@ static int psp_v11_0_init_microcode(struct psp_context *psp)
 		adev->psp.ta_xgmi_ucode_size = le32_to_cpu(ta_hdr->ta_xgmi_size_bytes);
 		adev->psp.ta_xgmi_start_addr = (uint8_t *)ta_hdr +
 			le32_to_cpu(ta_hdr->header.ucode_array_offset_bytes);
+
+		adev->psp.ta_fw_version = le32_to_cpu(ta_hdr->header.ucode_version);
+
+		adev->psp.ta_ras_ucode_version = le32_to_cpu(ta_hdr->ta_ras_ucode_version);
+		adev->psp.ta_ras_ucode_size = le32_to_cpu(ta_hdr->ta_ras_size_bytes);
+		adev->psp.ta_ras_start_addr = (uint8_t *)adev->psp.ta_xgmi_start_addr +
+			le32_to_cpu(ta_hdr->ta_ras_offset_bytes);
 	}
 
 	return 0;
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 02/20] drm/amdgpu: export ta fw info
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
  2019-03-05 20:41   ` [PATCH 01/20] drm/amdgpu: add ta ras fw info (v2) Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 03/20] drm/amdgpu: add module parameters for ras Alex Deucher
                     ` (17 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Alex Deucher, xinhui pan

From: xinhui pan <xinhui.pan@amd.com>

Output the ta fw, aka xgmi/ras, via debugfs.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 21 +++++++++++++++++++++
 include/uapi/drm/amdgpu_drm.h           |  1 +
 2 files changed, 22 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index b65e18101108..4d0ccca891cb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -295,6 +295,17 @@ static int amdgpu_firmware_info(struct drm_amdgpu_info_firmware *fw_info,
 		fw_info->ver = adev->pm.fw_version;
 		fw_info->feature = 0;
 		break;
+	case AMDGPU_INFO_FW_TA:
+		if (query_fw->index > 1)
+			return -EINVAL;
+		if (query_fw->index == 0) {
+			fw_info->ver = adev->psp.ta_fw_version;
+			fw_info->feature = adev->psp.ta_xgmi_ucode_version;
+		} else {
+			fw_info->ver = adev->psp.ta_fw_version;
+			fw_info->feature = adev->psp.ta_ras_ucode_version;
+		}
+		break;
 	case AMDGPU_INFO_FW_SDMA:
 		if (query_fw->index >= adev->sdma.num_instances)
 			return -EINVAL;
@@ -1327,6 +1338,16 @@ static int amdgpu_debugfs_firmware_info(struct seq_file *m, void *data)
 	seq_printf(m, "ASD feature version: %u, firmware version: 0x%08x\n",
 		   fw_info.feature, fw_info.ver);
 
+	query_fw.fw_type = AMDGPU_INFO_FW_TA;
+	for (i = 0; i < 2; i++) {
+		query_fw.index = i;
+		ret = amdgpu_firmware_info(&fw_info, &query_fw, adev);
+		if (ret)
+			continue;
+		seq_printf(m, "TA %s feature version: %u, firmware version: 0x%08x\n",
+				i ? "RAS" : "XGMI", fw_info.feature, fw_info.ver);
+	}
+
 	/* SMC */
 	query_fw.fw_type = AMDGPU_INFO_FW_SMC;
 	ret = amdgpu_firmware_info(&fw_info, &query_fw, adev);
diff --git a/include/uapi/drm/amdgpu_drm.h b/include/uapi/drm/amdgpu_drm.h
index 4a53f6cfa034..e5275d4481f5 100644
--- a/include/uapi/drm/amdgpu_drm.h
+++ b/include/uapi/drm/amdgpu_drm.h
@@ -680,6 +680,7 @@ struct drm_amdgpu_cs_chunk_data {
 	#define AMDGPU_INFO_FW_GFX_RLC_RESTORE_LIST_SRM_MEM 0x11
 	/* Subquery id: Query DMCU firmware version */
 	#define AMDGPU_INFO_FW_DMCU		0x12
+	#define AMDGPU_INFO_FW_TA		0x13
 /* number of bytes moved for TTM migration */
 #define AMDGPU_INFO_NUM_BYTES_MOVED		0x0f
 /* the used VRAM size */
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 03/20] drm/amdgpu: add module parameters for ras
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
  2019-03-05 20:41   ` [PATCH 01/20] drm/amdgpu: add ta ras fw info (v2) Alex Deucher
  2019-03-05 20:41   ` [PATCH 02/20] drm/amdgpu: export ta fw info Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 04/20] drm/amdgpu: add ta_ras_if.h Alex Deucher
                     ` (16 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Alex Deucher, xinhui pan, Hawking Zhang

From: xinhui pan <xinhui.pan@amd.com>

Allow RAS feature enable/disable via boot parameter.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 17 +++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 374e1d2f2456..369548e2f0f6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -157,6 +157,8 @@ extern int amdgpu_emu_mode;
 extern uint amdgpu_smu_memory_pool_size;
 extern uint amdgpu_dc_feature_mask;
 extern struct amdgpu_mgpu_info mgpu_info;
+extern int amdgpu_ras_enable;
+extern uint amdgpu_ras_mask;
 
 #ifdef CONFIG_DRM_AMDGPU_SI
 extern int amdgpu_si_support;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index d35c5a273e84..b4ac80d20f75 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -137,6 +137,8 @@ uint amdgpu_dc_feature_mask = 0;
 struct amdgpu_mgpu_info mgpu_info = {
 	.mutex = __MUTEX_INITIALIZER(mgpu_info.mutex),
 };
+int amdgpu_ras_enable = -1;
+uint amdgpu_ras_mask = 0xffffffff;
 
 /**
  * DOC: vramlimit (int)
@@ -495,6 +497,21 @@ module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
 MODULE_PARM_DESC(emu_mode, "Emulation mode, (1 = enable, 0 = disable)");
 module_param_named(emu_mode, amdgpu_emu_mode, int, 0444);
 
+/**
+ * DOC: amdgpu_ras_enable (int)
+ * Enable RAS features on the GPU (0 = disable, 1 = enable, -1 = auto (default))
+ */
+MODULE_PARM_DESC(amdgpu_ras_enable, "Enable RAS features on the GPU (0 = disable, 1 = enable, -1 = auto (default))");
+module_param_named(ras_enable, amdgpu_ras_enable, int, 0444);
+
+/**
+ * DOC: amdgpu_ras_mask (uint)
+ * Mask of RAS features to enable (default 0xffffffff), only valid when ras_enable == 1
+ * See the flags in drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+ */
+MODULE_PARM_DESC(amdgpu_ras_mask, "Mask of RAS features to enable (default 0xffffffff), only valid when ras_enable == 1");
+module_param_named(ras_mask, amdgpu_ras_mask, uint, 0444);
+
 /**
  * DOC: si_support (int)
  * Set SI support driver. This parameter works after set config CONFIG_DRM_AMDGPU_SI. For SI asic, when radeon driver is enabled,
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 04/20] drm/amdgpu: add ta_ras_if.h
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
                     ` (2 preceding siblings ...)
  2019-03-05 20:41   ` [PATCH 03/20] drm/amdgpu: add module parameters for ras Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 05/20] drm/amdgpu: add psp ras callback func and macro Alex Deucher
                     ` (15 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Alex Deucher, xinhui pan

From: xinhui pan <xinhui.pan@amd.com>

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/ta_ras_if.h | 108 +++++++++++++++++++++++++
 1 file changed, 108 insertions(+)
 create mode 100644 drivers/gpu/drm/amd/amdgpu/ta_ras_if.h

diff --git a/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h b/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
new file mode 100644
index 000000000000..0b4e7b55595a
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
@@ -0,0 +1,108 @@
+/****************************************************************************\
+* 
+*  File Name      ta_ras_if.h
+*  Project        AMD PSP SW IP Module
+*
+*  Description    Interface to the RAS Trusted Application
+*
+*  Copyright 2019 Advanced Micro Devices, Inc.
+*
+* Permission is hereby granted, free of charge, to any person obtaining a copy of this software 
+* and associated documentation files (the "Software"), to deal in the Software without restriction,
+* including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
+* and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so,
+* subject to the following conditions:
+*
+* The above copyright notice and this permission notice shall be included in all copies or substantial
+* portions of the Software.
+*
+* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+* THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+* OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+* ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+* OTHER DEALINGS IN THE SOFTWARE.
+*/
+#ifndef _TA_RAS_IF_H
+#define _TA_RAS_IF_H
+
+/* Responses have bit 31 set */
+#define RSP_ID_MASK (1U << 31)
+#define RSP_ID(cmdId) (((uint32_t)(cmdId)) | RSP_ID_MASK)
+
+#define TA_NUM_BLOCK_MAX		14
+
+enum ras_command {
+	TA_RAS_COMMAND__ENABLE_FEATURES = 0,
+	TA_RAS_COMMAND__DISABLE_FEATURES,
+	TA_RAS_COMMAND__TRIGGER_ERROR,
+};
+
+enum ta_ras_status {
+	TA_RAS_STATUS__SUCCESS				= 0x00,
+	TA_RAS_STATUS__RESET_NEEDED			= 0x01,
+	TA_RAS_STATUS__ERROR_INVALID_PARAMETER		= 0x02,
+	TA_RAS_STATUS__ERROR_RAS_NOT_AVAILABLE		= 0x03,
+	TA_RAS_STATUS__ERROR_RAS_DUPLICATE_CMD		= 0x04,
+	TA_RAS_STATUS__ERROR_INJECTION_FAILED		= 0x05
+};
+
+enum ta_ras_block {
+	TA_RAS_BLOCK__UMC = 0,
+	TA_RAS_BLOCK__SDMA,
+	TA_RAS_BLOCK__GFX,
+	TA_RAS_BLOCK__MMHUB,
+	TA_RAS_BLOCK__ATHUB,
+	TA_RAS_BLOCK__PCIE_BIF,
+	TA_RAS_BLOCK__HDP,
+	TA_RAS_BLOCK__XGMI_WAFL,
+	TA_RAS_BLOCK__DF,
+	TA_RAS_BLOCK__SMN,
+	TA_RAS_BLOCK__SEM,
+	TA_RAS_BLOCK__MP0,
+	TA_RAS_BLOCK__MP1,
+	TA_RAS_BLOCK__FUSE = (TA_NUM_BLOCK_MAX - 1),
+};
+
+enum ta_ras_error_type {
+	TA_RAS_ERROR__NONE				= 0,
+	TA_RAS_ERROR__PARITY				= 1,
+	TA_RAS_ERROR__SINGLE_CORRECTABLE		= 2,
+	TA_RAS_ERROR__MULTI_UNCORRECTABLE		= 4,
+	TA_RAS_ERROR__POISON				= 8
+};
+
+struct ta_ras_enable_features_input {
+	enum ta_ras_block       block_id;
+	enum ta_ras_error_type  error_type;
+};
+
+struct ta_ras_disable_features_input {
+	enum ta_ras_block       block_id;
+	enum ta_ras_error_type  error_type;
+};
+
+struct ta_ras_trigger_error_input {
+	enum ta_ras_block		block_id;
+	enum ta_ras_error_type		inject_error_type;
+	uint32_t			sub_block_index;
+	uint64_t			address;
+	uint64_t			value;
+};
+
+union ta_ras_cmd_input {
+	struct ta_ras_enable_features_input	enable_features;
+	struct ta_ras_disable_features_input	disable_features;
+	struct ta_ras_trigger_error_input	trigger_error;
+};
+
+struct ta_ras_shared_memory {
+	uint32_t			cmd_id;
+	uint32_t			resp_id;
+	enum ta_ras_status		ras_status;
+	uint32_t			reserved;
+	union ta_ras_cmd_input		ras_in_message;
+};
+
+#endif // TL_RAS_IF_H_
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 05/20] drm/amdgpu: add psp ras callback func and macro
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
                     ` (3 preceding siblings ...)
  2019-03-05 20:41   ` [PATCH 04/20] drm/amdgpu: add ta_ras_if.h Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 06/20] drm/amdgpu: add psp ras subsystem infrastructure (v2) Alex Deucher
                     ` (14 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Alex Deucher, xinhui pan, Hawking Zhang

From: xinhui pan <xinhui.pan@amd.com>

Define the driver side interface for ras ta.

Acked-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
index 49c3942e469c..3e6fcc9cdef0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
@@ -28,6 +28,7 @@
 #include "amdgpu.h"
 #include "psp_gfx_if.h"
 #include "ta_xgmi_if.h"
+#include "ta_ras_if.h"
 
 #define PSP_FENCE_BUFFER_SIZE	0x1000
 #define PSP_CMD_BUFFER_SIZE	0x1000
@@ -88,6 +89,9 @@ struct psp_funcs
 	int (*xgmi_set_topology_info)(struct psp_context *psp, int number_devices,
 				      struct psp_xgmi_topology_info *topology);
 	bool (*support_vmr_ring)(struct psp_context *psp);
+	int (*ras_trigger_error)(struct psp_context *psp,
+			struct ta_ras_trigger_error_input *info);
+	int (*ras_cure_posion)(struct psp_context *psp, uint64_t *mode_ptr);
 };
 
 struct psp_xgmi_context {
@@ -211,6 +215,13 @@ struct psp_xgmi_topology_info {
 
 #define amdgpu_psp_check_fw_loading_status(adev, i) (adev)->firmware.funcs->check_fw_loading_status((adev), (i))
 
+#define psp_ras_trigger_error(psp, info) \
+	((psp)->funcs->ras_trigger_error ? \
+	(psp)->funcs->ras_trigger_error((psp), (info)) : -EINVAL)
+#define psp_ras_cure_posion(psp, addr) \
+	((psp)->funcs->ras_cure_posion ? \
+	(psp)->funcs->ras_cure_posion(psp, (addr)) : -EINVAL)
+
 extern const struct amd_ip_funcs psp_ip_funcs;
 
 extern const struct amdgpu_ip_block_version psp_v3_1_ip_block;
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 06/20] drm/amdgpu: add psp ras subsystem infrastructure (v2)
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
                     ` (4 preceding siblings ...)
  2019-03-05 20:41   ` [PATCH 05/20] drm/amdgpu: add psp ras callback func and macro Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 07/20] drm/amdgpu: add psp v11 ras callback Alex Deucher
                     ` (13 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Alex Deucher, xinhui pan

From: xinhui pan <xinhui.pan@amd.com>

Add ras fw loading, init, terminate.
Add ras cmd submit helper.
Add ras feature enable/disable common function.

v2: squash in unused variable warning fix

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 214 ++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h |  16 ++
 2 files changed, 230 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 3091488cd8cc..c1b862016245 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -466,6 +466,206 @@ static int psp_xgmi_initialize(struct psp_context *psp)
 	return ret;
 }
 
+// ras begin
+static void psp_prep_ras_ta_load_cmd_buf(struct psp_gfx_cmd_resp *cmd,
+		uint64_t ras_ta_mc, uint64_t ras_mc_shared,
+		uint32_t ras_ta_size, uint32_t shared_size)
+{
+	cmd->cmd_id = GFX_CMD_ID_LOAD_TA;
+	cmd->cmd.cmd_load_ta.app_phy_addr_lo = lower_32_bits(ras_ta_mc);
+	cmd->cmd.cmd_load_ta.app_phy_addr_hi = upper_32_bits(ras_ta_mc);
+	cmd->cmd.cmd_load_ta.app_len = ras_ta_size;
+
+	cmd->cmd.cmd_load_ta.cmd_buf_phy_addr_lo = lower_32_bits(ras_mc_shared);
+	cmd->cmd.cmd_load_ta.cmd_buf_phy_addr_hi = upper_32_bits(ras_mc_shared);
+	cmd->cmd.cmd_load_ta.cmd_buf_len = shared_size;
+}
+
+static int psp_ras_init_shared_buf(struct psp_context *psp)
+{
+	int ret;
+
+	/*
+	 * Allocate 16k memory aligned to 4k from Frame Buffer (local
+	 * physical) for ras ta <-> Driver
+	 */
+	ret = amdgpu_bo_create_kernel(psp->adev, PSP_RAS_SHARED_MEM_SIZE,
+			PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM,
+			&psp->ras.ras_shared_bo,
+			&psp->ras.ras_shared_mc_addr,
+			&psp->ras.ras_shared_buf);
+
+	return ret;
+}
+
+static int psp_ras_load(struct psp_context *psp)
+{
+	int ret;
+	struct psp_gfx_cmd_resp *cmd;
+
+	/*
+	 * TODO: bypass the loading in sriov for now
+	 */
+	if (amdgpu_sriov_vf(psp->adev))
+		return 0;
+
+	cmd = kzalloc(sizeof(struct psp_gfx_cmd_resp), GFP_KERNEL);
+	if (!cmd)
+		return -ENOMEM;
+
+	memset(psp->fw_pri_buf, 0, PSP_1_MEG);
+	memcpy(psp->fw_pri_buf, psp->ta_ras_start_addr, psp->ta_ras_ucode_size);
+
+	psp_prep_ras_ta_load_cmd_buf(cmd, psp->fw_pri_mc_addr,
+			psp->ras.ras_shared_mc_addr,
+			psp->ta_ras_ucode_size, PSP_RAS_SHARED_MEM_SIZE);
+
+	ret = psp_cmd_submit_buf(psp, NULL, cmd,
+			psp->fence_buf_mc_addr);
+
+	if (!ret) {
+		psp->ras.ras_initialized = 1;
+		psp->ras.session_id = cmd->resp.session_id;
+	}
+
+	kfree(cmd);
+
+	return ret;
+}
+
+static void psp_prep_ras_ta_unload_cmd_buf(struct psp_gfx_cmd_resp *cmd,
+						uint32_t ras_session_id)
+{
+	cmd->cmd_id = GFX_CMD_ID_UNLOAD_TA;
+	cmd->cmd.cmd_unload_ta.session_id = ras_session_id;
+}
+
+static int psp_ras_unload(struct psp_context *psp)
+{
+	int ret;
+	struct psp_gfx_cmd_resp *cmd;
+
+	/*
+	 * TODO: bypass the unloading in sriov for now
+	 */
+	if (amdgpu_sriov_vf(psp->adev))
+		return 0;
+
+	cmd = kzalloc(sizeof(struct psp_gfx_cmd_resp), GFP_KERNEL);
+	if (!cmd)
+		return -ENOMEM;
+
+	psp_prep_ras_ta_unload_cmd_buf(cmd, psp->ras.session_id);
+
+	ret = psp_cmd_submit_buf(psp, NULL, cmd,
+			psp->fence_buf_mc_addr);
+
+	kfree(cmd);
+
+	return ret;
+}
+
+static void psp_prep_ras_ta_invoke_cmd_buf(struct psp_gfx_cmd_resp *cmd,
+		uint32_t ta_cmd_id,
+		uint32_t ras_session_id)
+{
+	cmd->cmd_id = GFX_CMD_ID_INVOKE_CMD;
+	cmd->cmd.cmd_invoke_cmd.session_id = ras_session_id;
+	cmd->cmd.cmd_invoke_cmd.ta_cmd_id = ta_cmd_id;
+	/* Note: cmd_invoke_cmd.buf is not used for now */
+}
+
+int psp_ras_invoke(struct psp_context *psp, uint32_t ta_cmd_id)
+{
+	int ret;
+	struct psp_gfx_cmd_resp *cmd;
+
+	/*
+	 * TODO: bypass the loading in sriov for now
+	 */
+	if (amdgpu_sriov_vf(psp->adev))
+		return 0;
+
+	cmd = kzalloc(sizeof(struct psp_gfx_cmd_resp), GFP_KERNEL);
+	if (!cmd)
+		return -ENOMEM;
+
+	psp_prep_ras_ta_invoke_cmd_buf(cmd, ta_cmd_id,
+			psp->ras.session_id);
+
+	ret = psp_cmd_submit_buf(psp, NULL, cmd,
+			psp->fence_buf_mc_addr);
+
+	kfree(cmd);
+
+	return ret;
+}
+
+int psp_ras_enable_features(struct psp_context *psp,
+		union ta_ras_cmd_input *info, bool enable)
+{
+	struct ta_ras_shared_memory *ras_cmd;
+	int ret;
+
+	if (!psp->ras.ras_initialized)
+		return -EINVAL;
+
+	ras_cmd = (struct ta_ras_shared_memory *)psp->ras.ras_shared_buf;
+	memset(ras_cmd, 0, sizeof(struct ta_ras_shared_memory));
+
+	if (enable)
+		ras_cmd->cmd_id = TA_RAS_COMMAND__ENABLE_FEATURES;
+	else
+		ras_cmd->cmd_id = TA_RAS_COMMAND__DISABLE_FEATURES;
+
+	ras_cmd->ras_in_message = *info;
+
+	ret = psp_ras_invoke(psp, ras_cmd->cmd_id);
+	if (ret)
+		return -EINVAL;
+
+	return ras_cmd->ras_status;
+}
+
+static int psp_ras_terminate(struct psp_context *psp)
+{
+	int ret;
+
+	if (!psp->ras.ras_initialized)
+		return 0;
+
+	ret = psp_ras_unload(psp);
+	if (ret)
+		return ret;
+
+	psp->ras.ras_initialized = 0;
+
+	/* free ras shared memory */
+	amdgpu_bo_free_kernel(&psp->ras.ras_shared_bo,
+			&psp->ras.ras_shared_mc_addr,
+			&psp->ras.ras_shared_buf);
+
+	return 0;
+}
+
+static int psp_ras_initialize(struct psp_context *psp)
+{
+	int ret;
+
+	if (!psp->ras.ras_initialized) {
+		ret = psp_ras_init_shared_buf(psp);
+		if (ret)
+			return ret;
+	}
+
+	ret = psp_ras_load(psp);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+// ras end
+
 static int psp_hw_start(struct psp_context *psp)
 {
 	struct amdgpu_device *adev = psp->adev;
@@ -502,6 +702,12 @@ static int psp_hw_start(struct psp_context *psp)
 			dev_err(psp->adev->dev,
 				"XGMI: Failed to initialize XGMI session\n");
 	}
+
+	ret = psp_ras_initialize(psp);
+	if (ret)
+		dev_err(psp->adev->dev,
+				"RAS: Failed to initialize RAS\n");
+
 	return 0;
 }
 
@@ -753,6 +959,8 @@ static int psp_hw_fini(void *handle)
 	    psp->xgmi_context.initialized == 1)
                 psp_xgmi_terminate(psp);
 
+	psp_ras_terminate(psp);
+
 	psp_ring_destroy(psp, PSP_RING_TYPE__KM);
 
 	amdgpu_bo_free_kernel(&psp->tmr_bo, &psp->tmr_mc_addr, &psp->tmr_buf);
@@ -786,6 +994,12 @@ static int psp_suspend(void *handle)
 		}
 	}
 
+	ret = psp_ras_terminate(psp);
+	if (ret) {
+		DRM_ERROR("Failed to terminate ras ta\n");
+		return ret;
+	}
+
 	ret = psp_ring_stop(psp, PSP_RING_TYPE__KM);
 	if (ret) {
 		DRM_ERROR("PSP ring stop failed\n");
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
index 3e6fcc9cdef0..9c067e760a5f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
@@ -34,6 +34,7 @@
 #define PSP_CMD_BUFFER_SIZE	0x1000
 #define PSP_ASD_SHARED_MEM_SIZE 0x4000
 #define PSP_XGMI_SHARED_MEM_SIZE 0x4000
+#define PSP_RAS_SHARED_MEM_SIZE 0x4000
 #define PSP_1_MEG		0x100000
 #define PSP_TMR_SIZE	0x400000
 
@@ -102,6 +103,15 @@ struct psp_xgmi_context {
 	void                            *xgmi_shared_buf;
 };
 
+struct psp_ras_context {
+	/*ras fw*/
+	bool			ras_initialized;
+	uint32_t		session_id;
+	struct amdgpu_bo	*ras_shared_bo;
+	uint64_t		ras_shared_mc_addr;
+	void			*ras_shared_buf;
+};
+
 struct psp_context
 {
 	struct amdgpu_device            *adev;
@@ -162,6 +172,7 @@ struct psp_context
 	uint32_t			ta_ras_ucode_size;
 	uint8_t				*ta_ras_start_addr;
 	struct psp_xgmi_context		xgmi_context;
+	struct psp_ras_context		ras;
 };
 
 struct amdgpu_psp_funcs {
@@ -232,6 +243,11 @@ extern const struct amdgpu_ip_block_version psp_v10_0_ip_block;
 
 int psp_gpu_reset(struct amdgpu_device *adev);
 int psp_xgmi_invoke(struct psp_context *psp, uint32_t ta_cmd_id);
+
+int psp_ras_invoke(struct psp_context *psp, uint32_t ta_cmd_id);
+int psp_ras_enable_features(struct psp_context *psp,
+		union ta_ras_cmd_input *info, bool enable);
+
 extern const struct amdgpu_ip_block_version psp_v11_0_ip_block;
 
 #endif
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 07/20] drm/amdgpu: add psp v11 ras callback
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
                     ` (5 preceding siblings ...)
  2019-03-05 20:41   ` [PATCH 06/20] drm/amdgpu: add psp ras subsystem infrastructure (v2) Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 08/20] drm/amdgpu: add psp cmd submit timeout Alex Deucher
                     ` (12 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Alex Deucher, xinhui pan, Hawking Zhang

From: xinhui pan <xinhui.pan@amd.com>

Add trigger_error and cure_posion.

Acked-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 50 ++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
index 8fb267684787..2b3429d90690 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
@@ -722,6 +722,54 @@ static int psp_v11_0_xgmi_get_node_id(struct psp_context *psp, uint64_t *node_id
 	return 0;
 }
 
+static int psp_v11_0_ras_trigger_error(struct psp_context *psp,
+		struct ta_ras_trigger_error_input *info)
+{
+	struct ta_ras_shared_memory *ras_cmd;
+	int ret;
+
+	if (!psp->ras.ras_initialized)
+		return -EINVAL;
+
+	ras_cmd = (struct ta_ras_shared_memory *)psp->ras.ras_shared_buf;
+	memset(ras_cmd, 0, sizeof(struct ta_ras_shared_memory));
+
+	ras_cmd->cmd_id = TA_RAS_COMMAND__TRIGGER_ERROR;
+	ras_cmd->ras_in_message.trigger_error = *info;
+
+	ret = psp_ras_invoke(psp, ras_cmd->cmd_id);
+	if (ret)
+		return -EINVAL;
+
+	return ras_cmd->ras_status;
+}
+
+static int psp_v11_0_ras_cure_posion(struct psp_context *psp, uint64_t *mode_ptr)
+{
+#if 0
+	// not support yet.
+	struct ta_ras_shared_memory *ras_cmd;
+	int ret;
+
+	if (!psp->ras.ras_initialized)
+		return -EINVAL;
+
+	ras_cmd = (struct ta_ras_shared_memory *)psp->ras.ras_shared_buf;
+	memset(ras_cmd, 0, sizeof(struct ta_ras_shared_memory));
+
+	ras_cmd->cmd_id = TA_RAS_COMMAND__CURE_POISON;
+	ras_cmd->ras_in_message.cure_poison.mode_ptr = mode_ptr;
+
+	ret = psp_ras_invoke(psp, ras_cmd->cmd_id);
+	if (ret)
+		return -EINVAL;
+
+	return ras_cmd->ras_status;
+#else
+	return -EINVAL;
+#endif
+}
+
 static const struct psp_funcs psp_v11_0_funcs = {
 	.init_microcode = psp_v11_0_init_microcode,
 	.bootloader_load_sysdrv = psp_v11_0_bootloader_load_sysdrv,
@@ -738,6 +786,8 @@ static const struct psp_funcs psp_v11_0_funcs = {
 	.xgmi_get_hive_id = psp_v11_0_xgmi_get_hive_id,
 	.xgmi_get_node_id = psp_v11_0_xgmi_get_node_id,
 	.support_vmr_ring = psp_v11_0_support_vmr_ring,
+	.ras_trigger_error = psp_v11_0_ras_trigger_error,
+	.ras_cure_posion = psp_v11_0_ras_cure_posion,
 };
 
 void psp_v11_0_set_psp_funcs(struct psp_context *psp)
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 08/20] drm/amdgpu: add psp cmd submit timeout
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
                     ` (6 preceding siblings ...)
  2019-03-05 20:41   ` [PATCH 07/20] drm/amdgpu: add psp v11 ras callback Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 09/20] drm/amdgpu: add amdgpu_ras.c to support ras (v2) Alex Deucher
                     ` (11 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Alex Deucher, xinhui pan

From: xinhui pan <xinhui.pan@amd.com>

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index c1b862016245..7e3e1d588d74 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -120,6 +120,7 @@ psp_cmd_submit_buf(struct psp_context *psp,
 {
 	int ret;
 	int index;
+	int timeout = 2000;
 
 	memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
 
@@ -133,8 +134,11 @@ psp_cmd_submit_buf(struct psp_context *psp,
 		return ret;
 	}
 
-	while (*((unsigned int *)psp->fence_buf) != index)
+	while (*((unsigned int *)psp->fence_buf) != index) {
+		if (--timeout == 0)
+			return -EINVAL;
 		msleep(1);
+	}
 
 	/* In some cases, psp response status is not 0 even there is no
 	 * problem while the command is submitted. Some version of PSP FW
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 09/20] drm/amdgpu: add amdgpu_ras.c to support ras (v2)
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
                     ` (7 preceding siblings ...)
  2019-03-05 20:41   ` [PATCH 08/20] drm/amdgpu: add psp cmd submit timeout Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 10/20] drm/amdgpu: add debugfs ctrl node Alex Deucher
                     ` (10 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Alex Deucher, xinhui pan

From: xinhui pan <xinhui.pan@amd.com>

add obj management.
add feature control.
add debugfs infrastructure.
add sysfs infrastructure.
add IH infrastructure.
add recovery infrastructure.

It is a framework. Other IPs need call amdgpu_ras_xxx function instead of
psp_ras_xxx functions.

v2: squash in warning fixes

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/Makefile        |    2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |    9 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h    |    1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 1247 ++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  217 ++++
 5 files changed, 1475 insertions(+), 1 deletion(-)
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h

diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile b/drivers/gpu/drm/amd/amdgpu/Makefile
index 851001ced5e8..6039944abb71 100644
--- a/drivers/gpu/drm/amd/amdgpu/Makefile
+++ b/drivers/gpu/drm/amd/amdgpu/Makefile
@@ -53,7 +53,7 @@ amdgpu-y += amdgpu_device.o amdgpu_kms.o \
 	amdgpu_ucode.o amdgpu_bo_list.o amdgpu_ctx.o amdgpu_sync.o \
 	amdgpu_gtt_mgr.o amdgpu_vram_mgr.o amdgpu_virt.o amdgpu_atomfirmware.o \
 	amdgpu_vf_error.o amdgpu_sched.o amdgpu_debugfs.o amdgpu_ids.o \
-	amdgpu_gmc.o amdgpu_xgmi.o amdgpu_csa.o
+	amdgpu_gmc.o amdgpu_xgmi.o amdgpu_csa.o amdgpu_ras.o
 
 # add asic specific block
 amdgpu-$(CONFIG_DRM_AMDGPU_CIK)+= cik.o cik_ih.o kv_smc.o kv_dpm.o \
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 05cd5c460a1d..d6f130bdb985 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -60,6 +60,7 @@
 #include "amdgpu_pm.h"
 
 #include "amdgpu_xgmi.h"
+#include "amdgpu_ras.h"
 
 MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
 MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
@@ -1638,6 +1639,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
 {
 	int i, r;
 
+	r = amdgpu_ras_init(adev);
+	if (r)
+		return r;
+
 	for (i = 0; i < adev->num_ip_blocks; i++) {
 		if (!adev->ip_blocks[i].status.valid)
 			continue;
@@ -1869,6 +1874,8 @@ static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
 {
 	int i, r;
 
+	amdgpu_ras_pre_fini(adev);
+
 	if (adev->gmc.xgmi.num_physical_nodes > 1)
 		amdgpu_xgmi_remove_device(adev);
 
@@ -1937,6 +1944,8 @@ static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
 		adev->ip_blocks[i].status.late_initialized = false;
 	}
 
+	amdgpu_ras_fini(adev);
+
 	if (amdgpu_sriov_vf(adev))
 		if (amdgpu_virt_release_full_gpu(adev, false))
 			DRM_ERROR("failed to release exclusive mode on fini\n");
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
index 9c067e760a5f..cde113f07c96 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
@@ -110,6 +110,7 @@ struct psp_ras_context {
 	struct amdgpu_bo	*ras_shared_bo;
 	uint64_t		ras_shared_mc_addr;
 	void			*ras_shared_buf;
+	struct amdgpu_ras	*ras;
 };
 
 struct psp_context
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
new file mode 100644
index 000000000000..d7c7090fade9
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -0,0 +1,1247 @@
+/*
+ * Copyright 2018 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ *
+ */
+#include <linux/debugfs.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include "amdgpu.h"
+#include "amdgpu_ras.h"
+
+struct ras_ih_data {
+	/* interrupt bottom half */
+	struct work_struct ih_work;
+	int inuse;
+	/* IP callback */
+	ras_ih_cb cb;
+	/* full of entries */
+	unsigned char *ring;
+	unsigned int ring_size;
+	unsigned int element_size;
+	unsigned int aligned_element_size;
+	unsigned int rptr;
+	unsigned int wptr;
+};
+
+struct ras_fs_data {
+	char sysfs_name[32];
+	char debugfs_name[32];
+};
+
+struct ras_err_data {
+	unsigned long ue_count;
+	unsigned long ce_count;
+};
+
+struct ras_err_handler_data {
+	/* point to bad pages array */
+	struct {
+		unsigned long bp;
+		struct amdgpu_bo *bo;
+	} *bps;
+	/* the count of entries */
+	int count;
+	/* the space can place new entries */
+	int space_left;
+	/* last reserved entry's index + 1 */
+	int last_reserved;
+};
+
+struct ras_manager {
+	struct ras_common_if head;
+	/* reference count */
+	int use;
+	/* ras block link */
+	struct list_head node;
+	/* the device */
+	struct amdgpu_device *adev;
+	/* debugfs */
+	struct dentry *ent;
+	/* sysfs */
+	struct device_attribute sysfs_attr;
+	int attr_inuse;
+
+	/* fs node name */
+	struct ras_fs_data fs_data;
+
+	/* IH data */
+	struct ras_ih_data ih_data;
+
+	struct ras_err_data err_data;
+};
+
+const char *ras_error_string[] = {
+	"none",
+	"parity",
+	"single_correctable",
+	"multi_uncorrectable",
+	"poison",
+};
+
+const char *ras_block_string[] = {
+	"umc",
+	"sdma",
+	"gfx",
+	"mmhub",
+	"athub",
+	"pcie_bif",
+	"hdp",
+	"xgmi_wafl",
+	"df",
+	"smn",
+	"sem",
+	"mp0",
+	"mp1",
+	"fuse",
+};
+
+#define ras_err_str(i) (ras_error_string[ffs(i)])
+#define ras_block_str(i) (ras_block_string[i])
+
+static void amdgpu_ras_self_test(struct amdgpu_device *adev)
+{
+	/* TODO */
+}
+
+static ssize_t amdgpu_ras_debugfs_read(struct file *f, char __user *buf,
+					size_t size, loff_t *pos)
+{
+	struct ras_manager *obj = (struct ras_manager *)file_inode(f)->i_private;
+	struct ras_query_if info = {
+		.head = obj->head,
+	};
+	ssize_t s;
+	char val[128];
+
+	if (amdgpu_ras_error_query(obj->adev, &info))
+		return -EINVAL;
+
+	s = snprintf(val, sizeof(val), "%s: %lu\n%s: %lu\n",
+			"ue", info.ue_count,
+			"ce", info.ce_count);
+	if (*pos >= s)
+		return 0;
+
+	s -= *pos;
+	s = min_t(u64, s, size);
+
+
+	if (copy_to_user(buf, &val[*pos], s))
+		return -EINVAL;
+
+	*pos += s;
+
+	return s;
+}
+
+static ssize_t amdgpu_ras_debugfs_write(struct file *f, const char __user *buf,
+		size_t size, loff_t *pos)
+{
+	struct ras_manager *obj = (struct ras_manager *)file_inode(f)->i_private;
+	struct ras_inject_if info = {
+		.head = obj->head,
+	};
+	ssize_t s = min_t(u64, 64, size);
+	char val[64];
+	char *str = val;
+	memset(val, 0, sizeof(val));
+
+	if (*pos)
+		return -EINVAL;
+
+	if (copy_from_user(str, buf, s))
+		return -EINVAL;
+
+	/* only care ue/ce for now. */
+	if (memcmp(str, "ue", 2) == 0) {
+		info.head.type = AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE;
+		str += 2;
+	} else if (memcmp(str, "ce", 2) == 0) {
+		info.head.type = AMDGPU_RAS_ERROR__SINGLE_CORRECTABLE;
+		str += 2;
+	}
+
+	if (sscanf(str, "0x%llx 0x%llx", &info.address, &info.value) != 2) {
+		if (sscanf(str, "%llu %llu", &info.address, &info.value) != 2)
+			return -EINVAL;
+	}
+
+	*pos = s;
+
+	if (amdgpu_ras_error_inject(obj->adev, &info))
+		return -EINVAL;
+
+	return size;
+}
+
+static const struct file_operations amdgpu_ras_debugfs_ops = {
+	.owner = THIS_MODULE,
+	.read = amdgpu_ras_debugfs_read,
+	.write = amdgpu_ras_debugfs_write,
+	.llseek = default_llseek
+};
+
+static ssize_t amdgpu_ras_sysfs_read(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct ras_manager *obj = container_of(attr, struct ras_manager, sysfs_attr);
+	struct ras_query_if info = {
+		.head = obj->head,
+	};
+
+	if (amdgpu_ras_error_query(obj->adev, &info))
+		return -EINVAL;
+
+	return snprintf(buf, PAGE_SIZE, "%s: %lu\n%s: %lu\n",
+			"ue", info.ue_count,
+			"ce", info.ce_count);
+}
+
+/* obj begin */
+
+#define get_obj(obj) do { (obj)->use++; } while (0)
+#define alive_obj(obj) ((obj)->use)
+
+static inline void put_obj(struct ras_manager *obj)
+{
+	if (obj && --obj->use == 0)
+		list_del(&obj->node);
+	if (obj && obj->use < 0) {
+		 DRM_ERROR("RAS ERROR: Unbalance obj(%s) use\n", obj->head.name);
+	}
+}
+
+/* make one obj and return it. */
+static struct ras_manager *amdgpu_ras_create_obj(struct amdgpu_device *adev,
+		struct ras_common_if *head)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct ras_manager *obj;
+
+	if (!con)
+		return NULL;
+
+	if (head->block >= AMDGPU_RAS_BLOCK_COUNT)
+		return NULL;
+
+	obj = &con->objs[head->block];
+	/* already exist. return obj? */
+	if (alive_obj(obj))
+		return NULL;
+
+	obj->head = *head;
+	obj->adev = adev;
+	list_add(&obj->node, &con->head);
+	get_obj(obj);
+
+	return obj;
+}
+
+/* return an obj equal to head, or the first when head is NULL */
+static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
+		struct ras_common_if *head)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct ras_manager *obj;
+	int i;
+
+	if (!con)
+		return NULL;
+
+	if (head) {
+		if (head->block >= AMDGPU_RAS_BLOCK_COUNT)
+			return NULL;
+
+		obj = &con->objs[head->block];
+
+		if (alive_obj(obj)) {
+			WARN_ON(head->block != obj->head.block);
+			return obj;
+		}
+	} else {
+		for (i = 0; i < AMDGPU_RAS_BLOCK_COUNT; i++) {
+			obj = &con->objs[i];
+			if (alive_obj(obj)) {
+				WARN_ON(i != obj->head.block);
+				return obj;
+			}
+		}
+	}
+
+	return NULL;
+}
+/* obj end */
+
+/* feature ctl begin */
+static int amdgpu_ras_is_feature_allowed(struct amdgpu_device *adev,
+		struct ras_common_if *head)
+{
+	return amdgpu_ras_enable && (amdgpu_ras_mask & BIT(head->block));
+}
+
+static int amdgpu_ras_is_feature_enabled(struct amdgpu_device *adev,
+		struct ras_common_if *head)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+
+	return con->features & BIT(head->block);
+}
+
+/*
+ * if obj is not created, then create one.
+ * set feature enable flag.
+ */
+static int __amdgpu_ras_feature_enable(struct amdgpu_device *adev,
+		struct ras_common_if *head, int enable)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct ras_manager *obj = amdgpu_ras_find_obj(adev, head);
+
+	if (!amdgpu_ras_is_feature_allowed(adev, head))
+		return 0;
+	if (!(!!enable ^ !!amdgpu_ras_is_feature_enabled(adev, head)))
+		return 0;
+
+	if (enable) {
+		if (!obj) {
+			obj = amdgpu_ras_create_obj(adev, head);
+			if (!obj)
+				return -EINVAL;
+		} else {
+			/* In case we create obj somewhere else */
+			get_obj(obj);
+		}
+		con->features |= BIT(head->block);
+	} else {
+		if (obj && amdgpu_ras_is_feature_enabled(adev, head)) {
+			con->features &= ~BIT(head->block);
+			put_obj(obj);
+		}
+	}
+
+	return 0;
+}
+
+/* wrapper of psp_ras_enable_features */
+int amdgpu_ras_feature_enable(struct amdgpu_device *adev,
+		struct ras_common_if *head, bool enable)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	union ta_ras_cmd_input info;
+	int ret;
+
+	if (!con)
+		return -EINVAL;
+
+	if (!enable) {
+		info.disable_features = (struct ta_ras_disable_features_input) {
+			.block_id =  head->block,
+			.error_type = head->type,
+		};
+	} else {
+		info.enable_features = (struct ta_ras_enable_features_input) {
+			.block_id =  head->block,
+			.error_type = head->type,
+		};
+	}
+
+	/* Do not enable if it is not allowed. */
+	WARN_ON(enable && !amdgpu_ras_is_feature_allowed(adev, head));
+	/* Are we alerady in that state we are going to set? */
+	if (!(!!enable ^ !!amdgpu_ras_is_feature_enabled(adev, head)))
+		return 0;
+
+	ret = psp_ras_enable_features(&adev->psp, &info, enable);
+	if (ret) {
+		DRM_ERROR("RAS ERROR: %s %s feature failed ret %d\n",
+				enable ? "enable":"disable",
+				ras_block_str(head->block),
+				ret);
+		return -EINVAL;
+	}
+
+	/* setup the obj */
+	__amdgpu_ras_feature_enable(adev, head, enable);
+
+	return 0;
+}
+
+static int amdgpu_ras_disable_all_features(struct amdgpu_device *adev,
+		bool bypass)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct ras_manager *obj, *tmp;
+
+	list_for_each_entry_safe(obj, tmp, &con->head, node) {
+		/* bypass psp.
+		 * aka just release the obj and corresponding flags
+		 */
+		if (bypass) {
+			if (__amdgpu_ras_feature_enable(adev, &obj->head, 0))
+				break;
+		} else {
+			if (amdgpu_ras_feature_enable(adev, &obj->head, 0))
+				break;
+		}
+	};
+
+	return con->features;
+}
+
+static int amdgpu_ras_enable_all_features(struct amdgpu_device *adev,
+		bool bypass)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	int ras_block_count = AMDGPU_RAS_BLOCK_COUNT;
+	int i;
+
+	for (i = 0; i < ras_block_count; i++) {
+		struct ras_common_if head = {
+			.block = i,
+			.type = AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE,
+			.sub_block_index = 0,
+		};
+		strcpy(head.name, ras_block_str(i));
+		if (bypass) {
+			/*
+			 * bypass psp. vbios enable ras for us.
+			 * so just create the obj
+			 */
+			if (__amdgpu_ras_feature_enable(adev, &head, 1))
+				break;
+		} else {
+			if (amdgpu_ras_feature_enable(adev, &head, 1))
+				break;
+		}
+	};
+
+	return con->features;
+}
+/* feature ctl end */
+
+/* query/inject/cure begin */
+int amdgpu_ras_error_query(struct amdgpu_device *adev,
+		struct ras_query_if *info)
+{
+	struct ras_manager *obj = amdgpu_ras_find_obj(adev, &info->head);
+
+	if (!obj)
+		return -EINVAL;
+	/* TODO might read the register to read the count */
+
+	info->ue_count = obj->err_data.ue_count;
+	info->ce_count = obj->err_data.ce_count;
+
+	return 0;
+}
+
+/* wrapper of psp_ras_trigger_error */
+int amdgpu_ras_error_inject(struct amdgpu_device *adev,
+		struct ras_inject_if *info)
+{
+	struct ras_manager *obj = amdgpu_ras_find_obj(adev, &info->head);
+	struct ta_ras_trigger_error_input block_info = {
+		.block_id = info->head.block,
+		.inject_error_type = info->head.type,
+		.sub_block_index = info->head.sub_block_index,
+		.address = info->address,
+		.value = info->value,
+	};
+	int ret = 0;
+
+	if (!obj)
+		return -EINVAL;
+
+	ret = psp_ras_trigger_error(&adev->psp, &block_info);
+	if (ret)
+		DRM_ERROR("RAS ERROR: inject %s error failed ret %d\n",
+				ras_block_str(info->head.block),
+				ret);
+
+	return ret;
+}
+
+int amdgpu_ras_error_cure(struct amdgpu_device *adev,
+		struct ras_cure_if *info)
+{
+	/* psp fw has no cure interface for now. */
+	return 0;
+}
+
+/* get the total error counts on all IPs */
+int amdgpu_ras_query_error_count(struct amdgpu_device *adev,
+		bool is_ce)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct ras_manager *obj;
+	struct ras_err_data data = {0, 0};
+
+	if (!con)
+		return -EINVAL;
+
+	list_for_each_entry(obj, &con->head, node) {
+		struct ras_query_if info = {
+			.head = obj->head,
+		};
+
+		if (amdgpu_ras_error_query(adev, &info))
+			return -EINVAL;
+
+		data.ce_count += info.ce_count;
+		data.ue_count += info.ue_count;
+	}
+
+	return is_ce ? data.ce_count : data.ue_count;
+}
+/* query/inject/cure end */
+
+
+/* sysfs begin */
+
+static ssize_t amdgpu_ras_sysfs_features_read(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct amdgpu_ras *con =
+		container_of(attr, struct amdgpu_ras, features_attr);
+	struct drm_device *ddev = dev_get_drvdata(dev);
+	struct amdgpu_device *adev = ddev->dev_private;
+	struct ras_common_if head;
+	int ras_block_count = AMDGPU_RAS_BLOCK_COUNT;
+	int i;
+	ssize_t s;
+	struct ras_manager *obj;
+
+	s = scnprintf(buf, PAGE_SIZE, "feature mask: 0x%x\n", con->features);
+
+	for (i = 0; i < ras_block_count; i++) {
+		head.block = i;
+
+		if (amdgpu_ras_is_feature_enabled(adev, &head)) {
+			obj = amdgpu_ras_find_obj(adev, &head);
+			s += scnprintf(&buf[s], PAGE_SIZE - s,
+					"%s: %s\n",
+					ras_block_str(i),
+					ras_err_str(obj->head.type));
+		} else
+			s += scnprintf(&buf[s], PAGE_SIZE - s,
+					"%s: disabled\n",
+					ras_block_str(i));
+	}
+
+	return s;
+}
+
+static int amdgpu_ras_sysfs_create_feature_node(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct attribute *attrs[] = {
+		&con->features_attr.attr,
+		NULL
+	};
+	struct attribute_group group = {
+		.name = "ras",
+		.attrs = attrs,
+	};
+
+	con->features_attr = (struct device_attribute) {
+		.attr = {
+			.name = "features",
+			.mode = S_IRUGO,
+		},
+			.show = amdgpu_ras_sysfs_features_read,
+	};
+
+	return sysfs_create_group(&adev->dev->kobj, &group);
+}
+
+static int amdgpu_ras_sysfs_remove_feature_node(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct attribute *attrs[] = {
+		&con->features_attr.attr,
+		NULL
+	};
+	struct attribute_group group = {
+		.name = "ras",
+		.attrs = attrs,
+	};
+
+	sysfs_remove_group(&adev->dev->kobj, &group);
+
+	return 0;
+}
+
+int amdgpu_ras_sysfs_create(struct amdgpu_device *adev,
+		struct ras_fs_if *head)
+{
+	struct ras_manager *obj = amdgpu_ras_find_obj(adev, &head->head);
+
+	if (!obj || obj->attr_inuse)
+		return -EINVAL;
+
+	get_obj(obj);
+
+	memcpy(obj->fs_data.sysfs_name,
+			head->sysfs_name,
+			sizeof(obj->fs_data.sysfs_name));
+
+	obj->sysfs_attr = (struct device_attribute){
+		.attr = {
+			.name = obj->fs_data.sysfs_name,
+			.mode = S_IRUGO,
+		},
+			.show = amdgpu_ras_sysfs_read,
+	};
+
+	if (sysfs_add_file_to_group(&adev->dev->kobj,
+				&obj->sysfs_attr.attr,
+				"ras")) {
+		put_obj(obj);
+		return -EINVAL;
+	}
+
+	obj->attr_inuse = 1;
+
+	return 0;
+}
+
+int amdgpu_ras_sysfs_remove(struct amdgpu_device *adev,
+		struct ras_common_if *head)
+{
+	struct ras_manager *obj = amdgpu_ras_find_obj(adev, head);
+
+	if (!obj || !obj->attr_inuse)
+		return -EINVAL;
+
+	sysfs_remove_file_from_group(&adev->dev->kobj,
+				&obj->sysfs_attr.attr,
+				"ras");
+	obj->attr_inuse = 0;
+	put_obj(obj);
+
+	return 0;
+}
+
+static int amdgpu_ras_sysfs_remove_all(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct ras_manager *obj, *tmp;
+
+	list_for_each_entry_safe(obj, tmp, &con->head, node) {
+		amdgpu_ras_sysfs_remove(adev, &obj->head);
+	}
+
+	amdgpu_ras_sysfs_remove_feature_node(adev);
+
+	return 0;
+}
+/* sysfs end */
+
+/* debugfs begin */
+int amdgpu_ras_debugfs_create(struct amdgpu_device *adev,
+		struct ras_fs_if *head)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct ras_manager *obj = amdgpu_ras_find_obj(adev, &head->head);
+	struct dentry *ent;
+
+	if (!obj || obj->ent)
+		return -EINVAL;
+
+	get_obj(obj);
+
+	memcpy(obj->fs_data.debugfs_name,
+			head->debugfs_name,
+			sizeof(obj->fs_data.debugfs_name));
+
+	ent = debugfs_create_file(obj->fs_data.debugfs_name,
+			S_IWUGO | S_IRUGO, con->dir,
+			obj, &amdgpu_ras_debugfs_ops);
+
+	if (IS_ERR(ent))
+		return -EINVAL;
+
+	obj->ent = ent;
+
+	return 0;
+}
+
+int amdgpu_ras_debugfs_remove(struct amdgpu_device *adev,
+		struct ras_common_if *head)
+{
+	struct ras_manager *obj = amdgpu_ras_find_obj(adev, head);
+
+	if (!obj || !obj->ent)
+		return 0;
+
+	debugfs_remove(obj->ent);
+	obj->ent = NULL;
+	put_obj(obj);
+
+	return 0;
+}
+
+static int amdgpu_ras_debugfs_remove_all(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct ras_manager *obj, *tmp;
+
+	list_for_each_entry_safe(obj, tmp, &con->head, node) {
+		amdgpu_ras_debugfs_remove(adev, &obj->head);
+	}
+
+	debugfs_remove(con->dir);
+	con->dir = NULL;
+
+	return 0;
+}
+/* debugfs end */
+
+/* ras fs */
+
+static int amdgpu_ras_fs_init(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct drm_minor *minor = adev->ddev->primary;
+	struct dentry *root = minor->debugfs_root, *dir;
+
+	dir = debugfs_create_dir("ras", root);
+	if (IS_ERR(dir))
+		return -EINVAL;
+
+	con->dir = dir;
+
+	amdgpu_ras_sysfs_create_feature_node(adev);
+
+	return 0;
+}
+
+static int amdgpu_ras_fs_fini(struct amdgpu_device *adev)
+{
+	amdgpu_ras_debugfs_remove_all(adev);
+	amdgpu_ras_sysfs_remove_all(adev);
+	return 0;
+}
+/* ras fs end */
+
+/* ih begin */
+static void amdgpu_ras_interrupt_handler(struct ras_manager *obj)
+{
+	struct ras_ih_data *data = &obj->ih_data;
+	struct amdgpu_iv_entry entry;
+	int ret;
+
+	while (data->rptr != data->wptr) {
+		rmb();
+		memcpy(&entry, &data->ring[data->rptr],
+				data->element_size);
+
+		wmb();
+		data->rptr = (data->aligned_element_size +
+				data->rptr) % data->ring_size;
+
+		/* Let IP handle its data, maybe we need get the output
+		 * from the callback to udpate the error type/count, etc
+		 */
+		if (data->cb) {
+			ret = data->cb(obj->adev, &entry);
+			/* ue will trigger an interrupt, and in that case
+			 * we need do a reset to recovery the whole system.
+			 * But leave IP do that recovery, here we just dispatch
+			 * the error.
+			 */
+			if (ret == AMDGPU_RAS_UE) {
+				obj->err_data.ue_count++;
+			}
+			/* Might need get ce count by register, but not all IP
+			 * saves ce count, some IP just use one bit or two bits
+			 * to indicate ce happened.
+			 */
+		}
+	}
+}
+
+static void amdgpu_ras_interrupt_process_handler(struct work_struct *work)
+{
+	struct ras_ih_data *data =
+		container_of(work, struct ras_ih_data, ih_work);
+	struct ras_manager *obj =
+		container_of(data, struct ras_manager, ih_data);
+
+	amdgpu_ras_interrupt_handler(obj);
+}
+
+int amdgpu_ras_interrupt_dispatch(struct amdgpu_device *adev,
+		struct ras_dispatch_if *info)
+{
+	struct ras_manager *obj = amdgpu_ras_find_obj(adev, &info->head);
+	struct ras_ih_data *data = &obj->ih_data;
+
+	if (!obj)
+		return -EINVAL;
+
+	if (data->inuse == 0)
+		return 0;
+
+	/* Might be overflow... */
+	memcpy(&data->ring[data->wptr], info->entry,
+			data->element_size);
+
+	wmb();
+	data->wptr = (data->aligned_element_size +
+			data->wptr) % data->ring_size;
+
+	schedule_work(&data->ih_work);
+
+	return 0;
+}
+
+int amdgpu_ras_interrupt_remove_handler(struct amdgpu_device *adev,
+		struct ras_ih_if *info)
+{
+	struct ras_manager *obj = amdgpu_ras_find_obj(adev, &info->head);
+	struct ras_ih_data *data;
+
+	if (!obj)
+		return -EINVAL;
+
+	data = &obj->ih_data;
+	if (data->inuse == 0)
+		return 0;
+
+	cancel_work_sync(&data->ih_work);
+
+	kfree(data->ring);
+	memset(data, 0, sizeof(*data));
+	put_obj(obj);
+
+	return 0;
+}
+
+int amdgpu_ras_interrupt_add_handler(struct amdgpu_device *adev,
+		struct ras_ih_if *info)
+{
+	struct ras_manager *obj = amdgpu_ras_find_obj(adev, &info->head);
+	struct ras_ih_data *data;
+
+	if (!obj) {
+		/* in case we registe the IH before enable ras feature */
+		obj = amdgpu_ras_create_obj(adev, &info->head);
+		if (!obj)
+			return -EINVAL;
+	} else
+		get_obj(obj);
+
+	data = &obj->ih_data;
+	/* add the callback.etc */
+	*data = (struct ras_ih_data) {
+		.inuse = 0,
+		.cb = info->cb,
+		.element_size = sizeof(struct amdgpu_iv_entry),
+		.rptr = 0,
+		.wptr = 0,
+	};
+
+	INIT_WORK(&data->ih_work, amdgpu_ras_interrupt_process_handler);
+
+	data->aligned_element_size = ALIGN(data->element_size, 8);
+	/* the ring can store 64 iv entries. */
+	data->ring_size = 64 * data->aligned_element_size;
+	data->ring = kmalloc(data->ring_size, GFP_KERNEL);
+	if (!data->ring) {
+		put_obj(obj);
+		return -ENOMEM;
+	}
+
+	/* IH is ready */
+	data->inuse = 1;
+
+	return 0;
+}
+
+static int amdgpu_ras_interrupt_remove_all(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct ras_manager *obj, *tmp;
+
+	list_for_each_entry_safe(obj, tmp, &con->head, node) {
+		struct ras_ih_if info = {
+			.head = obj->head,
+		};
+		amdgpu_ras_interrupt_remove_handler(adev, &info);
+	}
+
+	return 0;
+}
+/* ih end */
+
+/* recovery begin */
+static void amdgpu_ras_do_recovery(struct work_struct *work)
+{
+	struct amdgpu_ras *ras =
+		container_of(work, struct amdgpu_ras, recovery_work);
+
+	amdgpu_device_gpu_recover(ras->adev, 0);
+	atomic_set(&ras->in_recovery, 0);
+}
+
+static int amdgpu_ras_release_vram(struct amdgpu_device *adev,
+		struct amdgpu_bo **bo_ptr)
+{
+	/* no need to free it actually. */
+	amdgpu_bo_free_kernel(bo_ptr, NULL, NULL);
+	return 0;
+}
+
+/* reserve vram with size@offset */
+static int amdgpu_ras_reserve_vram(struct amdgpu_device *adev,
+		uint64_t offset, uint64_t size,
+		struct amdgpu_bo **bo_ptr)
+{
+	struct ttm_operation_ctx ctx = { false, false };
+	struct amdgpu_bo_param bp;
+	int r = 0;
+	int i;
+	struct amdgpu_bo *bo;
+
+	if (bo_ptr)
+		*bo_ptr = NULL;
+	memset(&bp, 0, sizeof(bp));
+	bp.size = size;
+	bp.byte_align = PAGE_SIZE;
+	bp.domain = AMDGPU_GEM_DOMAIN_VRAM;
+	bp.flags = AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS |
+		AMDGPU_GEM_CREATE_NO_CPU_ACCESS;
+	bp.type = ttm_bo_type_kernel;
+	bp.resv = NULL;
+
+	r = amdgpu_bo_create(adev, &bp, &bo);
+	if (r)
+		return -EINVAL;
+
+	r = amdgpu_bo_reserve(bo, false);
+	if (r)
+		goto error_reserve;
+
+	offset = ALIGN(offset, PAGE_SIZE);
+	for (i = 0; i < bo->placement.num_placement; ++i) {
+		bo->placements[i].fpfn = offset >> PAGE_SHIFT;
+		bo->placements[i].lpfn = (offset + size) >> PAGE_SHIFT;
+	}
+
+	ttm_bo_mem_put(&bo->tbo, &bo->tbo.mem);
+	r = ttm_bo_mem_space(&bo->tbo, &bo->placement, &bo->tbo.mem, &ctx);
+	if (r)
+		goto error_pin;
+
+	r = amdgpu_bo_pin_restricted(bo,
+			AMDGPU_GEM_DOMAIN_VRAM,
+			offset,
+			offset + size);
+	if (r)
+		goto error_pin;
+
+	if (bo_ptr)
+		*bo_ptr = bo;
+
+	amdgpu_bo_unreserve(bo);
+	return r;
+
+error_pin:
+	amdgpu_bo_unreserve(bo);
+error_reserve:
+	amdgpu_bo_unref(&bo);
+	return r;
+}
+
+/* alloc/realloc bps array */
+static int amdgpu_ras_realloc_eh_data_space(struct amdgpu_device *adev,
+		struct ras_err_handler_data *data, int pages)
+{
+	unsigned int old_space = data->count + data->space_left;
+	unsigned int new_space = old_space + pages;
+	unsigned int align_space = ALIGN(new_space, 1024);
+	void *tmp = kmalloc(align_space * sizeof(*data->bps), GFP_KERNEL);
+
+	if (!tmp)
+		return -ENOMEM;
+
+	if (data->bps) {
+		memcpy(tmp, data->bps,
+				data->count * sizeof(*data->bps));
+		kfree(data->bps);
+	}
+
+	data->bps = tmp;
+	data->space_left += align_space - old_space;
+	return 0;
+}
+
+/* it deal with vram only. */
+int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
+		unsigned long *bps, int pages)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct ras_err_handler_data *data = con->eh_data;
+	int i = pages;
+	int ret = 0;
+
+	if (!con || !data || !bps || pages <= 0)
+		return 0;
+
+	mutex_lock(&con->recovery_lock);
+	if (!data)
+		goto out;
+
+	if (data->space_left <= pages)
+		if (amdgpu_ras_realloc_eh_data_space(adev, data, pages)) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+	while (i--)
+		data->bps[data->count++].bp = bps[i];
+
+	data->space_left -= pages;
+out:
+	mutex_unlock(&con->recovery_lock);
+
+	return ret;
+}
+
+/* called in gpu recovery/init */
+int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct ras_err_handler_data *data = con->eh_data;
+	uint64_t bp;
+	struct amdgpu_bo *bo;
+	int i;
+
+	if (!con || !data)
+		return 0;
+
+	mutex_lock(&con->recovery_lock);
+	/* reserve vram at driver post stage. */
+	for (i = data->last_reserved; i < data->count; i++) {
+		bp = data->bps[i].bp;
+
+		if (amdgpu_ras_reserve_vram(adev, bp << PAGE_SHIFT,
+					PAGE_SIZE, &bo))
+			DRM_ERROR("RAS ERROR: reserve vram %llx fail\n", bp);
+
+		data->bps[i].bo = bo;
+		data->last_reserved = i + 1;
+	}
+	mutex_unlock(&con->recovery_lock);
+	return 0;
+}
+
+/* called when driver unload */
+static int amdgpu_ras_release_bad_pages(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct ras_err_handler_data *data = con->eh_data;
+	struct amdgpu_bo *bo;
+	int i;
+
+	if (!con || !data)
+		return 0;
+
+	mutex_lock(&con->recovery_lock);
+	for (i = data->last_reserved - 1; i >= 0; i--) {
+		bo = data->bps[i].bo;
+
+		amdgpu_ras_release_vram(adev, &bo);
+
+		data->bps[i].bo = bo;
+		data->last_reserved = i;
+	}
+	mutex_unlock(&con->recovery_lock);
+	return 0;
+}
+
+static int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
+{
+	/* TODO
+	 * write the array to eeprom when SMU disabled.
+	 */
+	return 0;
+}
+
+static int amdgpu_ras_load_bad_pages(struct amdgpu_device *adev)
+{
+	/* TODO
+	 * read the array to eeprom when SMU disabled.
+	 */
+	return 0;
+}
+
+static int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct ras_err_handler_data **data = &con->eh_data;
+
+	*data = kmalloc(sizeof(**data),
+			GFP_KERNEL|__GFP_ZERO);
+	if (!*data)
+		return -ENOMEM;
+
+	mutex_init(&con->recovery_lock);
+	INIT_WORK(&con->recovery_work, amdgpu_ras_do_recovery);
+	atomic_set(&con->in_recovery, 0);
+	con->adev = adev;
+
+	amdgpu_ras_load_bad_pages(adev);
+	amdgpu_ras_reserve_bad_pages(adev);
+
+	return 0;
+}
+
+static int amdgpu_ras_recovery_fini(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct ras_err_handler_data *data = con->eh_data;
+
+	cancel_work_sync(&con->recovery_work);
+	amdgpu_ras_save_bad_pages(adev);
+	amdgpu_ras_release_bad_pages(adev);
+
+	mutex_lock(&con->recovery_lock);
+	con->eh_data = NULL;
+	kfree(data->bps);
+	kfree(data);
+	mutex_unlock(&con->recovery_lock);
+
+	return 0;
+}
+/* recovery end */
+
+struct ras_DID_capability {
+	u16 did;
+	u8 rid;
+	u32 capability;
+};
+
+static const struct ras_DID_capability supported_DID_array[] = {
+	{0x66a0, 0x00, AMDGPU_RAS_BLOCK_MASK},
+	{0x66a0, 0x02, AMDGPU_RAS_BLOCK_MASK},
+	{0x66a1, 0x00, AMDGPU_RAS_BLOCK_MASK},
+	{0x66a1, 0x01, AMDGPU_RAS_BLOCK_MASK},
+	{0x66a1, 0x04, AMDGPU_RAS_BLOCK_MASK},
+	{0x66a3, 0x00, AMDGPU_RAS_BLOCK_MASK},
+	{0x66a7, 0x00, AMDGPU_RAS_BLOCK_MASK},
+};
+
+static uint32_t amdgpu_ras_check_supported(struct amdgpu_device *adev)
+{
+	/* TODO need check vbios table */
+	int i;
+	int did = adev->pdev->device;
+	int rid = adev->pdev->revision;
+
+	for (i = 0; i < ARRAY_SIZE(supported_DID_array); i++) {
+		if (did == supported_DID_array[i].did &&
+				rid == supported_DID_array[i].rid) {
+			return supported_DID_array[i].capability;
+		}
+	}
+	return 0;
+}
+
+int amdgpu_ras_init(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	uint32_t supported = amdgpu_ras_check_supported(adev);
+
+	if (con || supported == 0)
+		return 0;
+
+	con = kmalloc(sizeof(struct amdgpu_ras) +
+			sizeof(struct ras_manager) * AMDGPU_RAS_BLOCK_COUNT,
+			GFP_KERNEL|__GFP_ZERO);
+	if (!con)
+		return -ENOMEM;
+
+	con->objs = (struct ras_manager *)(con + 1);
+
+	amdgpu_ras_set_context(adev, con);
+
+	con->supported = supported;
+	con->features = 0;
+	INIT_LIST_HEAD(&con->head);
+
+	if (amdgpu_ras_recovery_init(adev))
+		goto recovery_out;
+
+	amdgpu_ras_mask &= AMDGPU_RAS_BLOCK_MASK;
+
+	amdgpu_ras_enable_all_features(adev, 1);
+
+	if (amdgpu_ras_fs_init(adev))
+		goto fs_out;
+
+	amdgpu_ras_self_test(adev);
+	return 0;
+fs_out:
+	amdgpu_ras_recovery_fini(adev);
+recovery_out:
+	amdgpu_ras_set_context(adev, NULL);
+	kfree(con);
+
+	return -EINVAL;
+}
+
+/* do some fini work before IP fini as dependence */
+int amdgpu_ras_pre_fini(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+
+	if (!con)
+		return 0;
+
+	/* Need disable ras on all IPs here before ip [hw/sw]fini */
+	amdgpu_ras_disable_all_features(adev, 0);
+	amdgpu_ras_recovery_fini(adev);
+	return 0;
+}
+
+int amdgpu_ras_fini(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+
+	if (!con)
+		return 0;
+
+	amdgpu_ras_fs_fini(adev);
+	amdgpu_ras_interrupt_remove_all(adev);
+
+	WARN(con->features, "Feature mask is not cleared");
+
+	if (con->features)
+		amdgpu_ras_disable_all_features(adev, 1);
+
+	amdgpu_ras_set_context(adev, NULL);
+	kfree(con);
+
+	return 0;
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
new file mode 100644
index 000000000000..d6d4bbb92a42
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -0,0 +1,217 @@
+/*
+ * Copyright 2018 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ *
+ */
+#ifndef _AMDGPU_RAS_H
+#define _AMDGPU_RAS_H
+
+#include <linux/debugfs.h>
+#include <linux/list.h>
+#include "amdgpu.h"
+#include "amdgpu_psp.h"
+#include "ta_ras_if.h"
+
+enum amdgpu_ras_block {
+	AMDGPU_RAS_BLOCK__UMC = 0,
+	AMDGPU_RAS_BLOCK__SDMA,
+	AMDGPU_RAS_BLOCK__GFX,
+	AMDGPU_RAS_BLOCK__MMHUB,
+	AMDGPU_RAS_BLOCK__ATHUB,
+	AMDGPU_RAS_BLOCK__PCIE_BIF,
+	AMDGPU_RAS_BLOCK__HDP,
+	AMDGPU_RAS_BLOCK__XGMI_WAFL,
+	AMDGPU_RAS_BLOCK__DF,
+	AMDGPU_RAS_BLOCK__SMN,
+	AMDGPU_RAS_BLOCK__SEM,
+	AMDGPU_RAS_BLOCK__MP0,
+	AMDGPU_RAS_BLOCK__MP1,
+	AMDGPU_RAS_BLOCK__FUSE,
+
+	AMDGPU_RAS_BLOCK__LAST
+};
+
+#define AMDGPU_RAS_BLOCK_COUNT	AMDGPU_RAS_BLOCK__LAST
+#define AMDGPU_RAS_BLOCK_MASK	((1ULL << AMDGPU_RAS_BLOCK_COUNT) - 1)
+
+enum amdgpu_ras_error_type {
+	AMDGPU_RAS_ERROR__NONE							= 0,
+	AMDGPU_RAS_ERROR__PARITY						= 1,
+	AMDGPU_RAS_ERROR__SINGLE_CORRECTABLE					= 2,
+	AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE					= 4,
+	AMDGPU_RAS_ERROR__POISON						= 8,
+};
+
+enum amdgpu_ras_ret {
+	AMDGPU_RAS_SUCCESS = 0,
+	AMDGPU_RAS_FAIL,
+	AMDGPU_RAS_UE,
+	AMDGPU_RAS_CE,
+	AMDGPU_RAS_PT,
+};
+
+struct ras_common_if {
+	enum amdgpu_ras_block block;
+	enum amdgpu_ras_error_type type;
+	uint32_t sub_block_index;
+	/* block name */
+	char name[32];
+};
+
+typedef int (*ras_ih_cb)(struct amdgpu_device *adev,
+		struct amdgpu_iv_entry *entry);
+
+struct amdgpu_ras {
+	/* ras infrastructure */
+	uint32_t supported;
+	uint32_t features;
+	struct list_head head;
+	/* debugfs */
+	struct dentry *dir;
+	/* sysfs */
+	struct device_attribute features_attr;
+	/* block array */
+	struct ras_manager *objs;
+
+	/* gpu recovery */
+	struct work_struct recovery_work;
+	atomic_t in_recovery;
+	struct amdgpu_device *adev;
+	/* error handler data */
+	struct ras_err_handler_data *eh_data;
+	struct mutex recovery_lock;
+};
+
+/* interfaces for IP */
+
+struct ras_fs_if {
+	struct ras_common_if head;
+	char sysfs_name[32];
+	char debugfs_name[32];
+};
+
+struct ras_query_if {
+	struct ras_common_if head;
+	unsigned long ue_count;
+	unsigned long ce_count;
+};
+
+struct ras_inject_if {
+	struct ras_common_if head;
+	uint64_t address;
+	uint64_t value;
+};
+
+struct ras_cure_if {
+	struct ras_common_if head;
+	uint64_t address;
+};
+
+struct ras_ih_if {
+	struct ras_common_if head;
+	ras_ih_cb cb;
+};
+
+struct ras_dispatch_if {
+	struct ras_common_if head;
+	struct amdgpu_iv_entry *entry;
+};
+
+/* work flow
+ * vbios
+ * 1: ras feature enable (enabled by default)
+ * psp
+ * 2: ras framework init (in ip_init)
+ * IP
+ * 3: IH add
+ * 4: debugfs/sysfs create
+ * 5: query/inject
+ * 6: debugfs/sysfs remove
+ * 7: IH remove
+ * 8: feature disable
+ */
+
+#define amdgpu_ras_get_context(adev)		((adev)->psp.ras.ras)
+#define amdgpu_ras_set_context(adev, ras_con)	((adev)->psp.ras.ras = (ras_con))
+
+/* check if ras is supported on block, say, sdma, gfx */
+static inline int amdgpu_ras_is_supported(struct amdgpu_device *adev,
+		unsigned int block)
+{
+	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
+
+	return ras && (ras->supported & (1 << block));
+}
+
+int amdgpu_ras_query_error_count(struct amdgpu_device *adev,
+		bool is_ce);
+
+/* error handling functions */
+int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
+		unsigned long *bps, int pages);
+
+int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev);
+
+static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev,
+		bool is_baco)
+{
+	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
+
+	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
+		schedule_work(&ras->recovery_work);
+	return 0;
+}
+
+/* called in ip_init and ip_fini */
+int amdgpu_ras_init(struct amdgpu_device *adev);
+int amdgpu_ras_fini(struct amdgpu_device *adev);
+int amdgpu_ras_pre_fini(struct amdgpu_device *adev);
+
+int amdgpu_ras_feature_enable(struct amdgpu_device *adev,
+		struct ras_common_if *head, bool enable);
+
+int amdgpu_ras_sysfs_create(struct amdgpu_device *adev,
+		struct ras_fs_if *head);
+
+int amdgpu_ras_sysfs_remove(struct amdgpu_device *adev,
+		struct ras_common_if *head);
+
+int amdgpu_ras_debugfs_create(struct amdgpu_device *adev,
+		struct ras_fs_if *head);
+
+int amdgpu_ras_debugfs_remove(struct amdgpu_device *adev,
+		struct ras_common_if *head);
+
+int amdgpu_ras_error_query(struct amdgpu_device *adev,
+		struct ras_query_if *info);
+
+int amdgpu_ras_error_inject(struct amdgpu_device *adev,
+		struct ras_inject_if *info);
+
+int amdgpu_ras_interrupt_add_handler(struct amdgpu_device *adev,
+		struct ras_ih_if *info);
+
+int amdgpu_ras_interrupt_remove_handler(struct amdgpu_device *adev,
+		struct ras_ih_if *info);
+
+int amdgpu_ras_interrupt_dispatch(struct amdgpu_device *adev,
+		struct ras_dispatch_if *info);
+#endif
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 10/20] drm/amdgpu: add debugfs ctrl node
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
                     ` (8 preceding siblings ...)
  2019-03-05 20:41   ` [PATCH 09/20] drm/amdgpu: add amdgpu_ras.c to support ras (v2) Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 11/20] drm/amdgpu: reserve bad pages during recovery Alex Deucher
                     ` (9 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Alex Deucher, xinhui pan

From: xinhui pan <xinhui.pan@amd.com>

allow userspace enable/disable ras

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 122 ++++++++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |   9 ++
 2 files changed, 121 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index d7c7090fade9..ca9c7d1ede2f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -200,6 +200,90 @@ static const struct file_operations amdgpu_ras_debugfs_ops = {
 	.llseek = default_llseek
 };
 
+/*
+ * DOC: ras debugfs control interface
+ *
+ * It accepts struct ras_debug_if who has two members.
+ *
+ * First member: ras_debug_if::head or ras_debug_if::inject.
+ * It is used to indicate which IP block will be under control.
+ * Its contents are not human readable, IOW, write it by your programs.
+ *
+ * head has four members, they are block, type, sub_block_index, name.
+ * block: which IP will be under control.
+ * type: what kind of error will be enabled/disabled/injected.
+ * sub_block_index: some IPs have subcomponets. say, GFX, sDMA.
+ * name: the name of IP.
+ *
+ * inject has two more members than head, they are address, value.
+ * As their names indicate, inject operation will write the
+ * value to the address.
+ *
+ * Second member: struct ras_debug_if::op.
+ * It has three kinds of operations.
+ *  0: disable RAS on the block. Take ::head as its data.
+ *  1: enable RAS on the block. Take ::head as its data.
+ *  2: inject errors on the block. Take ::inject as its data.
+ *
+ * How to check the result?
+ *
+ * For disable/enable, please check ras features at
+ * /sys/class/drm/card[0/1/2...]/device/ras/features
+ *
+ * For inject, please check corresponding err count at
+ * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
+ *
+ * NOTE: operation is only allowed on blocks which are supported.
+ * Please check ras mask at /sys/module/amdgpu/parameters/ras_mask
+ */
+static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *buf,
+		size_t size, loff_t *pos)
+{
+	struct amdgpu_device *adev = (struct amdgpu_device *)file_inode(f)->i_private;
+	struct ras_debug_if data;
+	int ret = 0;
+
+	if (size < sizeof(data))
+		return -EINVAL;
+
+	memset(&data, 0, sizeof(data));
+
+	if (*pos)
+		return -EINVAL;
+
+	if (copy_from_user(&data, buf, sizeof(data)))
+		return -EINVAL;
+
+	*pos = size;
+
+	if (!amdgpu_ras_is_supported(adev, data.head.block))
+		return -EINVAL;
+
+	switch (data.op) {
+	case 0:
+		ret = amdgpu_ras_feature_enable(adev, &data.head, 0);
+		break;
+	case 1:
+		ret = amdgpu_ras_feature_enable(adev, &data.head, 1);
+		break;
+	case 2:
+		ret = amdgpu_ras_error_inject(adev, &data.inject);
+		break;
+	};
+
+	if (ret)
+		return -EINVAL;
+
+	return size;
+}
+
+static const struct file_operations amdgpu_ras_debugfs_ctrl_ops = {
+	.owner = THIS_MODULE,
+	.read = NULL,
+	.write = amdgpu_ras_debugfs_ctrl_write,
+	.llseek = default_llseek
+};
+
 static ssize_t amdgpu_ras_sysfs_read(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -657,6 +741,31 @@ static int amdgpu_ras_sysfs_remove_all(struct amdgpu_device *adev)
 /* sysfs end */
 
 /* debugfs begin */
+static int amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct drm_minor *minor = adev->ddev->primary;
+	struct dentry *root = minor->debugfs_root, *dir;
+	struct dentry *ent;
+
+	dir = debugfs_create_dir("ras", root);
+	if (IS_ERR(dir))
+		return -EINVAL;
+
+	con->dir = dir;
+
+	ent = debugfs_create_file("ras_ctrl",
+			S_IWUGO | S_IRUGO, con->dir,
+			adev, &amdgpu_ras_debugfs_ctrl_ops);
+	if (IS_ERR(ent)) {
+		debugfs_remove(con->dir);
+		return -EINVAL;
+	}
+
+	con->ent = ent;
+	return 0;
+}
+
 int amdgpu_ras_debugfs_create(struct amdgpu_device *adev,
 		struct ras_fs_if *head)
 {
@@ -709,8 +818,10 @@ static int amdgpu_ras_debugfs_remove_all(struct amdgpu_device *adev)
 		amdgpu_ras_debugfs_remove(adev, &obj->head);
 	}
 
+	debugfs_remove(con->ent);
 	debugfs_remove(con->dir);
 	con->dir = NULL;
+	con->ent = NULL;
 
 	return 0;
 }
@@ -720,17 +831,8 @@ static int amdgpu_ras_debugfs_remove_all(struct amdgpu_device *adev)
 
 static int amdgpu_ras_fs_init(struct amdgpu_device *adev)
 {
-	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
-	struct drm_minor *minor = adev->ddev->primary;
-	struct dentry *root = minor->debugfs_root, *dir;
-
-	dir = debugfs_create_dir("ras", root);
-	if (IS_ERR(dir))
-		return -EINVAL;
-
-	con->dir = dir;
-
 	amdgpu_ras_sysfs_create_feature_node(adev);
+	amdgpu_ras_debugfs_create_ctrl_node(adev);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index d6d4bbb92a42..599dcd01aec7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -86,6 +86,8 @@ struct amdgpu_ras {
 	struct list_head head;
 	/* debugfs */
 	struct dentry *dir;
+	/* debugfs ctrl */
+	struct dentry *ent;
 	/* sysfs */
 	struct device_attribute features_attr;
 	/* block array */
@@ -135,6 +137,13 @@ struct ras_dispatch_if {
 	struct amdgpu_iv_entry *entry;
 };
 
+struct ras_debug_if {
+	union {
+		struct ras_common_if head;
+		struct ras_inject_if inject;
+	};
+	int op;
+};
 /* work flow
  * vbios
  * 1: ras feature enable (enabled by default)
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 11/20] drm/amdgpu: reserve bad pages during recovery
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
                     ` (9 preceding siblings ...)
  2019-03-05 20:41   ` [PATCH 10/20] drm/amdgpu: add debugfs ctrl node Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 12/20] drm/amdgpu: enable ras on sdma4 Alex Deucher
                     ` (8 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Alex Deucher, xinhui pan

From: xinhui pan <xinhui.pan@amd.com>

Mark vram pages with errors as bad and prevent the driver
from using them.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d6f130bdb985..a7876fafc558 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3396,6 +3396,11 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 						break;
 				}
 			}
+
+			list_for_each_entry(tmp_adev, device_list_handle,
+					gmc.xgmi.head) {
+				amdgpu_ras_reserve_bad_pages(tmp_adev);
+			}
 		}
 	}
 
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 12/20] drm/amdgpu: enable ras on sdma4
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
                     ` (10 preceding siblings ...)
  2019-03-05 20:41   ` [PATCH 11/20] drm/amdgpu: reserve bad pages during recovery Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 13/20] drm/amdgpu: enable ras on gfx9 (v2) Alex Deucher
                     ` (7 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Eric Huang, Feifei Xu, Alex Deucher, xinhui pan

From: xinhui pan <xinhui.pan@amd.com>

register IH, enable ras features on sdma.
create sysfs debugfs file for sdma.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Signed-off-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Eric Huang <JinhuiEric.Huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h |   4 +
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c   | 184 ++++++++++++++++++++++-
 2 files changed, 187 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
index 16b1a6ae5ba6..c17af30e758d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
@@ -30,6 +30,8 @@
 enum amdgpu_sdma_irq {
 	AMDGPU_SDMA_IRQ_TRAP0 = 0,
 	AMDGPU_SDMA_IRQ_TRAP1,
+	AMDGPU_SDMA_IRQ_ECC0,
+	AMDGPU_SDMA_IRQ_ECC1,
 
 	AMDGPU_SDMA_IRQ_LAST
 };
@@ -49,9 +51,11 @@ struct amdgpu_sdma {
 	struct amdgpu_sdma_instance instance[AMDGPU_MAX_SDMA_INSTANCES];
 	struct amdgpu_irq_src	trap_irq;
 	struct amdgpu_irq_src	illegal_inst_irq;
+	struct amdgpu_irq_src	ecc_irq;
 	int			num_instances;
 	uint32_t                    srbm_soft_reset;
 	bool			has_page_queue;
+	struct ras_common_if	*ras_if;
 };
 
 /*
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
index 127b85983e8f..058b9daec514 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
@@ -41,6 +41,8 @@
 #include "ivsrcid/sdma0/irqsrcs_sdma0_4_0.h"
 #include "ivsrcid/sdma1/irqsrcs_sdma1_4_0.h"
 
+#include "amdgpu_ras.h"
+
 MODULE_FIRMWARE("amdgpu/vega10_sdma.bin");
 MODULE_FIRMWARE("amdgpu/vega10_sdma1.bin");
 MODULE_FIRMWARE("amdgpu/vega12_sdma.bin");
@@ -1493,6 +1495,83 @@ static int sdma_v4_0_early_init(void *handle)
 	return 0;
 }
 
+static int sdma_v4_0_process_ras_data_cb(struct amdgpu_device *adev,
+		struct amdgpu_iv_entry *entry);
+
+static int sdma_v4_0_late_init(void *handle)
+{
+	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
+	struct ras_common_if **ras_if = &adev->sdma.ras_if;
+	struct ras_ih_if ih_info = {
+		.cb = sdma_v4_0_process_ras_data_cb,
+	};
+	struct ras_fs_if fs_info = {
+		.sysfs_name = "sdma_err_count",
+		.debugfs_name = "sdma_err_inject",
+	};
+	struct ras_common_if ras_block = {
+		.block = AMDGPU_RAS_BLOCK__SDMA,
+		.type = AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE,
+		.sub_block_index = 0,
+		.name = "sdma",
+	};
+	int r;
+
+	if (!amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__SDMA)) {
+		amdgpu_ras_feature_enable(adev, &ras_block, 0);
+		return 0;
+	}
+
+	*ras_if = kmalloc(sizeof(**ras_if), GFP_KERNEL);
+	if (!*ras_if)
+		return -ENOMEM;
+
+	**ras_if = ras_block;
+
+	r = amdgpu_ras_feature_enable(adev, *ras_if, 1);
+	if (r)
+		goto feature;
+
+	ih_info.head = **ras_if;
+	fs_info.head = **ras_if;
+
+	r = amdgpu_ras_interrupt_add_handler(adev, &ih_info);
+	if (r)
+		goto interrupt;
+
+	r = amdgpu_ras_debugfs_create(adev, &fs_info);
+	if (r)
+		goto debugfs;
+
+	r = amdgpu_ras_sysfs_create(adev, &fs_info);
+	if (r)
+		goto sysfs;
+
+	r = amdgpu_irq_get(adev, &adev->sdma.ecc_irq, AMDGPU_SDMA_IRQ_ECC0);
+	if (r)
+		goto irq;
+
+	r = amdgpu_irq_get(adev, &adev->sdma.ecc_irq, AMDGPU_SDMA_IRQ_ECC1);
+	if (r) {
+		amdgpu_irq_put(adev, &adev->sdma.ecc_irq, AMDGPU_SDMA_IRQ_ECC0);
+		goto irq;
+	}
+
+	return 0;
+irq:
+	amdgpu_ras_sysfs_remove(adev, *ras_if);
+sysfs:
+	amdgpu_ras_debugfs_remove(adev, *ras_if);
+debugfs:
+	amdgpu_ras_interrupt_remove_handler(adev, &ih_info);
+interrupt:
+	amdgpu_ras_feature_enable(adev, *ras_if, 0);
+feature:
+	kfree(*ras_if);
+	*ras_if = NULL;
+	return -EINVAL;
+}
+
 static int sdma_v4_0_sw_init(void *handle)
 {
 	struct amdgpu_ring *ring;
@@ -1511,6 +1590,18 @@ static int sdma_v4_0_sw_init(void *handle)
 	if (r)
 		return r;
 
+	/* SDMA SRAM ECC event */
+	r = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_SDMA0, SDMA0_4_0__SRCID__SDMA_SRAM_ECC,
+			&adev->sdma.ecc_irq);
+	if (r)
+		return r;
+
+	/* SDMA SRAM ECC event */
+	r = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_SDMA1, SDMA1_4_0__SRCID__SDMA_SRAM_ECC,
+			&adev->sdma.ecc_irq);
+	if (r)
+		return r;
+
 	for (i = 0; i < adev->sdma.num_instances; i++) {
 		ring = &adev->sdma.instance[i].ring;
 		ring->ring_obj = NULL;
@@ -1561,6 +1652,22 @@ static int sdma_v4_0_sw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 	int i;
 
+	if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__SDMA) &&
+			adev->sdma.ras_if) {
+		struct ras_common_if *ras_if = adev->sdma.ras_if;
+		struct ras_ih_if ih_info = {
+			.head = *ras_if,
+		};
+
+		/*remove fs first*/
+		amdgpu_ras_debugfs_remove(adev, ras_if);
+		amdgpu_ras_sysfs_remove(adev, ras_if);
+		/*remove the IH*/
+		amdgpu_ras_interrupt_remove_handler(adev, &ih_info);
+		amdgpu_ras_feature_enable(adev, ras_if, 0);
+		kfree(ras_if);
+	}
+
 	for (i = 0; i < adev->sdma.num_instances; i++) {
 		amdgpu_ring_fini(&adev->sdma.instance[i].ring);
 		if (adev->sdma.has_page_queue)
@@ -1598,6 +1705,9 @@ static int sdma_v4_0_hw_fini(void *handle)
 	if (amdgpu_sriov_vf(adev))
 		return 0;
 
+	amdgpu_irq_put(adev, &adev->sdma.ecc_irq, AMDGPU_SDMA_IRQ_ECC0);
+	amdgpu_irq_put(adev, &adev->sdma.ecc_irq, AMDGPU_SDMA_IRQ_ECC1);
+
 	sdma_v4_0_ctx_switch_enable(adev, false);
 	sdma_v4_0_enable(adev, false);
 
@@ -1714,6 +1824,50 @@ static int sdma_v4_0_process_trap_irq(struct amdgpu_device *adev,
 	return 0;
 }
 
+static int sdma_v4_0_process_ras_data_cb(struct amdgpu_device *adev,
+		struct amdgpu_iv_entry *entry)
+{
+	uint32_t instance, err_source;
+
+	switch (entry->client_id) {
+	case SOC15_IH_CLIENTID_SDMA0:
+		instance = 0;
+		break;
+	case SOC15_IH_CLIENTID_SDMA1:
+		instance = 1;
+		break;
+	default:
+		return 0;
+	}
+
+	switch (entry->src_id) {
+	case SDMA0_4_0__SRCID__SDMA_SRAM_ECC:
+		err_source = 0;
+		break;
+	case SDMA0_4_0__SRCID__SDMA_ECC:
+		err_source = 1;
+		break;
+	default:
+		return 0;
+	}
+
+	amdgpu_ras_reset_gpu(adev, 0);
+
+	return AMDGPU_RAS_UE;
+}
+
+static int sdma_v4_0_process_ecc_irq(struct amdgpu_device *adev,
+				      struct amdgpu_irq_src *source,
+				      struct amdgpu_iv_entry *entry)
+{
+	struct ras_dispatch_if ih_data = {
+		.head = *adev->sdma.ras_if,
+		.entry = entry,
+	};
+	amdgpu_ras_interrupt_dispatch(adev, &ih_data);
+	return 0;
+}
+
 static int sdma_v4_0_process_illegal_inst_irq(struct amdgpu_device *adev,
 					      struct amdgpu_irq_src *source,
 					      struct amdgpu_iv_entry *entry)
@@ -1741,6 +1895,25 @@ static int sdma_v4_0_process_illegal_inst_irq(struct amdgpu_device *adev,
 	return 0;
 }
 
+static int sdma_v4_0_set_ecc_irq_state(struct amdgpu_device *adev,
+					struct amdgpu_irq_src *source,
+					unsigned type,
+					enum amdgpu_interrupt_state state)
+{
+	u32 sdma_edc_config;
+
+	u32 reg_offset = (type == AMDGPU_SDMA_IRQ_ECC0) ?
+		sdma_v4_0_get_reg_offset(adev, 0, mmSDMA0_EDC_CONFIG) :
+		sdma_v4_0_get_reg_offset(adev, 1, mmSDMA0_EDC_CONFIG);
+
+	sdma_edc_config = RREG32(reg_offset);
+	sdma_edc_config = REG_SET_FIELD(sdma_edc_config, SDMA0_EDC_CONFIG, ECC_INT_ENABLE,
+		       state == AMDGPU_IRQ_STATE_ENABLE ? 1 : 0);
+	WREG32(reg_offset, sdma_edc_config);
+
+	return 0;
+}
+
 static void sdma_v4_0_update_medium_grain_clock_gating(
 		struct amdgpu_device *adev,
 		bool enable)
@@ -1906,7 +2079,7 @@ static void sdma_v4_0_get_clockgating_state(void *handle, u32 *flags)
 const struct amd_ip_funcs sdma_v4_0_ip_funcs = {
 	.name = "sdma_v4_0",
 	.early_init = sdma_v4_0_early_init,
-	.late_init = NULL,
+	.late_init = sdma_v4_0_late_init,
 	.sw_init = sdma_v4_0_sw_init,
 	.sw_fini = sdma_v4_0_sw_fini,
 	.hw_init = sdma_v4_0_hw_init,
@@ -2008,11 +2181,20 @@ static const struct amdgpu_irq_src_funcs sdma_v4_0_illegal_inst_irq_funcs = {
 	.process = sdma_v4_0_process_illegal_inst_irq,
 };
 
+static const struct amdgpu_irq_src_funcs sdma_v4_0_ecc_irq_funcs = {
+	.set = sdma_v4_0_set_ecc_irq_state,
+	.process = sdma_v4_0_process_ecc_irq,
+};
+
+
+
 static void sdma_v4_0_set_irq_funcs(struct amdgpu_device *adev)
 {
 	adev->sdma.trap_irq.num_types = AMDGPU_SDMA_IRQ_LAST;
 	adev->sdma.trap_irq.funcs = &sdma_v4_0_trap_irq_funcs;
 	adev->sdma.illegal_inst_irq.funcs = &sdma_v4_0_illegal_inst_irq_funcs;
+	adev->sdma.ecc_irq.num_types = AMDGPU_SDMA_IRQ_LAST;
+	adev->sdma.ecc_irq.funcs = &sdma_v4_0_ecc_irq_funcs;
 }
 
 /**
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 13/20] drm/amdgpu: enable ras on gfx9 (v2)
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
                     ` (11 preceding siblings ...)
  2019-03-05 20:41   ` [PATCH 12/20] drm/amdgpu: enable ras on sdma4 Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 14/20] drm/amdgpu: enable ras on gmc9 Alex Deucher
                     ` (6 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Alex Deucher, Feifei Xu, xinhui pan

From: Feifei Xu <Feifei.Xu@amd.com>

Register ecc interrupts and ecc interrupt handler on gfx9.
Add ras support on gfx9

v2: squash in warning fix

Signed-off-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h |   3 +
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c   | 173 ++++++++++++++++++++++++
 2 files changed, 176 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index f790e15bcd08..09fc53af3d35 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -258,6 +258,9 @@ struct amdgpu_gfx {
 	/* pipe reservation */
 	struct mutex			pipe_reserve_mutex;
 	DECLARE_BITMAP			(pipe_reserve_bitmap, AMDGPU_MAX_COMPUTE_QUEUES);
+
+	/*ras */
+	struct ras_common_if		*ras_if;
 };
 
 #define amdgpu_gfx_get_gpu_clock_counter(adev) (adev)->gfx.funcs->get_gpu_clock_counter((adev))
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 5533f6e4f4a4..88c45f990f05 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -40,6 +40,8 @@
 
 #include "ivsrcid/gfx/irqsrcs_gfx_9_0.h"
 
+#include "amdgpu_ras.h"
+
 #define GFX9_NUM_GFX_RINGS     1
 #define GFX9_MEC_HPD_SIZE 4096
 #define RLCG_UCODE_LOADING_START_ADDRESS 0x00002000L
@@ -1638,6 +1640,18 @@ static int gfx_v9_0_sw_init(void *handle)
 	if (r)
 		return r;
 
+	/* ECC error */
+	r = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_GRBM_CP, GFX_9_0__SRCID__CP_ECC_ERROR,
+			      &adev->gfx.cp_ecc_error_irq);
+	if (r)
+		return r;
+
+	/* FUE error */
+	r = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_GRBM_CP, GFX_9_0__SRCID__CP_FUE_ERROR,
+			      &adev->gfx.cp_ecc_error_irq);
+	if (r)
+		return r;
+
 	adev->gfx.gfx_current_status = AMDGPU_GFX_NORMAL_MODE;
 
 	gfx_v9_0_scratch_init(adev);
@@ -1730,6 +1744,20 @@ static int gfx_v9_0_sw_fini(void *handle)
 	int i;
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
+	if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX) &&
+			adev->gfx.ras_if) {
+		struct ras_common_if *ras_if = adev->gfx.ras_if;
+		struct ras_ih_if ih_info = {
+			.head = *ras_if,
+		};
+
+		amdgpu_ras_debugfs_remove(adev, ras_if);
+		amdgpu_ras_sysfs_remove(adev, ras_if);
+		amdgpu_ras_interrupt_remove_handler(adev,  &ih_info);
+		amdgpu_ras_feature_enable(adev, ras_if, 0);
+		kfree(ras_if);
+	}
+
 	amdgpu_bo_free_kernel(&adev->gds.oa_gfx_bo, NULL, NULL);
 	amdgpu_bo_free_kernel(&adev->gds.gws_gfx_bo, NULL, NULL);
 	amdgpu_bo_free_kernel(&adev->gds.gds_gfx_bo, NULL, NULL);
@@ -3304,6 +3332,7 @@ static int gfx_v9_0_hw_fini(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
+	amdgpu_irq_put(adev, &adev->gfx.cp_ecc_error_irq, 0);
 	amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
 	amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
 
@@ -3493,6 +3522,77 @@ static int gfx_v9_0_early_init(void *handle)
 	return 0;
 }
 
+static int gfx_v9_0_process_ras_data_cb(struct amdgpu_device *adev,
+		struct amdgpu_iv_entry *entry);
+
+static int gfx_v9_0_ecc_late_init(void *handle)
+{
+	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
+	struct ras_common_if **ras_if = &adev->gfx.ras_if;
+	struct ras_ih_if ih_info = {
+		.cb = gfx_v9_0_process_ras_data_cb,
+	};
+	struct ras_fs_if fs_info = {
+		.sysfs_name = "gfx_err_count",
+		.debugfs_name = "gfx_err_inject",
+	};
+	struct ras_common_if ras_block = {
+		.block = AMDGPU_RAS_BLOCK__GFX,
+		.type = AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE,
+		.sub_block_index = 0,
+		.name = "gfx",
+	};
+	int r;
+
+	if (!amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX)) {
+		amdgpu_ras_feature_enable(adev, &ras_block, 0);
+		return 0;
+	}
+
+	*ras_if = kmalloc(sizeof(**ras_if), GFP_KERNEL);
+	if (!*ras_if)
+		return -ENOMEM;
+
+	**ras_if = ras_block;
+
+	r = amdgpu_ras_feature_enable(adev, *ras_if, 1);
+	if (r)
+		goto feature;
+
+	ih_info.head = **ras_if;
+	fs_info.head = **ras_if;
+
+	r = amdgpu_ras_interrupt_add_handler(adev, &ih_info);
+	if (r)
+		goto interrupt;
+
+	r = amdgpu_ras_debugfs_create(adev, &fs_info);
+	if (r)
+		goto debugfs;
+
+	r = amdgpu_ras_sysfs_create(adev, &fs_info);
+	if (r)
+		goto sysfs;
+
+	r = amdgpu_irq_get(adev, &adev->gfx.cp_ecc_error_irq, 0);
+	if (r)
+		goto irq;
+
+	return 0;
+irq:
+	amdgpu_ras_sysfs_remove(adev, *ras_if);
+sysfs:
+	amdgpu_ras_debugfs_remove(adev, *ras_if);
+debugfs:
+	amdgpu_ras_interrupt_remove_handler(adev, &ih_info);
+interrupt:
+	amdgpu_ras_feature_enable(adev, *ras_if, 0);
+feature:
+	kfree(*ras_if);
+	*ras_if = NULL;
+	return -EINVAL;
+}
+
 static int gfx_v9_0_late_init(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
@@ -3506,6 +3606,10 @@ static int gfx_v9_0_late_init(void *handle)
 	if (r)
 		return r;
 
+	r = gfx_v9_0_ecc_late_init(handle);
+	if (r)
+		return r;
+
 	return 0;
 }
 
@@ -4542,6 +4646,45 @@ static int gfx_v9_0_set_priv_inst_fault_state(struct amdgpu_device *adev,
 	return 0;
 }
 
+#define ENABLE_ECC_ON_ME_PIPE(me, pipe)				\
+	WREG32_FIELD15(GC, 0, CP_ME##me##_PIPE##pipe##_INT_CNTL,\
+			CP_ECC_ERROR_INT_ENABLE, 1)
+
+#define DISABLE_ECC_ON_ME_PIPE(me, pipe)			\
+	WREG32_FIELD15(GC, 0, CP_ME##me##_PIPE##pipe##_INT_CNTL,\
+			CP_ECC_ERROR_INT_ENABLE, 0)
+
+static int gfx_v9_0_set_cp_ecc_error_state(struct amdgpu_device *adev,
+					      struct amdgpu_irq_src *source,
+					      unsigned type,
+					      enum amdgpu_interrupt_state state)
+{
+	switch (state) {
+	case AMDGPU_IRQ_STATE_DISABLE:
+		WREG32_FIELD15(GC, 0, CP_INT_CNTL_RING0,
+				CP_ECC_ERROR_INT_ENABLE, 0);
+		DISABLE_ECC_ON_ME_PIPE(1, 0);
+		DISABLE_ECC_ON_ME_PIPE(1, 1);
+		DISABLE_ECC_ON_ME_PIPE(1, 2);
+		DISABLE_ECC_ON_ME_PIPE(1, 3);
+		break;
+
+	case AMDGPU_IRQ_STATE_ENABLE:
+		WREG32_FIELD15(GC, 0, CP_INT_CNTL_RING0,
+				CP_ECC_ERROR_INT_ENABLE, 1);
+		ENABLE_ECC_ON_ME_PIPE(1, 0);
+		ENABLE_ECC_ON_ME_PIPE(1, 1);
+		ENABLE_ECC_ON_ME_PIPE(1, 2);
+		ENABLE_ECC_ON_ME_PIPE(1, 3);
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+
 static int gfx_v9_0_set_eop_interrupt_state(struct amdgpu_device *adev,
 					    struct amdgpu_irq_src *src,
 					    unsigned type,
@@ -4658,6 +4801,27 @@ static int gfx_v9_0_priv_inst_irq(struct amdgpu_device *adev,
 	return 0;
 }
 
+static int gfx_v9_0_process_ras_data_cb(struct amdgpu_device *adev,
+		struct amdgpu_iv_entry *entry)
+{
+	/* TODO ue will trigger an interrupt. */
+	amdgpu_ras_reset_gpu(adev, 0);
+	return AMDGPU_RAS_UE;
+}
+
+static int gfx_v9_0_cp_ecc_error_irq(struct amdgpu_device *adev,
+				  struct amdgpu_irq_src *source,
+				  struct amdgpu_iv_entry *entry)
+{
+	struct ras_dispatch_if ih_data = {
+		.head = *adev->gfx.ras_if,
+		.entry = entry,
+	};
+	DRM_ERROR("CP ECC ERROR IRQ\n");
+	amdgpu_ras_interrupt_dispatch(adev, &ih_data);
+	return 0;
+}
+
 static const struct amd_ip_funcs gfx_v9_0_ip_funcs = {
 	.name = "gfx_v9_0",
 	.early_init = gfx_v9_0_early_init,
@@ -4819,6 +4983,12 @@ static const struct amdgpu_irq_src_funcs gfx_v9_0_priv_inst_irq_funcs = {
 	.process = gfx_v9_0_priv_inst_irq,
 };
 
+static const struct amdgpu_irq_src_funcs gfx_v9_0_cp_ecc_error_irq_funcs = {
+	.set = gfx_v9_0_set_cp_ecc_error_state,
+	.process = gfx_v9_0_cp_ecc_error_irq,
+};
+
+
 static void gfx_v9_0_set_irq_funcs(struct amdgpu_device *adev)
 {
 	adev->gfx.eop_irq.num_types = AMDGPU_CP_IRQ_LAST;
@@ -4829,6 +4999,9 @@ static void gfx_v9_0_set_irq_funcs(struct amdgpu_device *adev)
 
 	adev->gfx.priv_inst_irq.num_types = 1;
 	adev->gfx.priv_inst_irq.funcs = &gfx_v9_0_priv_inst_irq_funcs;
+
+	adev->gfx.cp_ecc_error_irq.num_types = 2; /*C5 ECC error and C9 FUE error*/
+	adev->gfx.cp_ecc_error_irq.funcs = &gfx_v9_0_cp_ecc_error_irq_funcs;
 }
 
 static void gfx_v9_0_set_rlc_funcs(struct amdgpu_device *adev)
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 14/20] drm/amdgpu: enable ras on gmc9
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
                     ` (12 preceding siblings ...)
  2019-03-05 20:41   ` [PATCH 13/20] drm/amdgpu: enable ras on gfx9 (v2) Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 15/20] drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2 Alex Deucher
                     ` (5 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Alex Deucher, xinhui pan

From: xinhui pan <xinhui.pan@amd.com>

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h |   2 +
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c   | 276 ++++++++++++++++++++++++
 2 files changed, 278 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
index d6c10b4d68c0..6ce45664ff87 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -144,6 +144,8 @@ struct amdgpu_gmc {
 	const struct amdgpu_gmc_funcs	*gmc_funcs;
 
 	struct amdgpu_xgmi xgmi;
+	struct amdgpu_irq_src	ecc_irq;
+	struct ras_common_if    *ras_if;
 };
 
 #define amdgpu_gmc_flush_gpu_tlb(adev, vmid, type) (adev)->gmc.gmc_funcs->flush_gpu_tlb((adev), (vmid), (type))
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index d91274037c9f..1a164fe2dd9a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -47,6 +47,8 @@
 
 #include "ivsrcid/vmc/irqsrcs_vmc_1_0.h"
 
+#include "amdgpu_ras.h"
+
 /* add these here since we already include dce12 headers and these are for DCN */
 #define mmHUBP0_DCSURF_PRI_VIEWPORT_DIMENSION                                                          0x055d
 #define mmHUBP0_DCSURF_PRI_VIEWPORT_DIMENSION_BASE_IDX                                                 2
@@ -199,6 +201,175 @@ static const uint32_t ecc_umcch_eccctrl_addrs[] = {
 	UMCCH_ECCCTRL_ADDR15,
 };
 
+static const uint32_t ecc_umc_mcumc_ctrl_addrs[] = {
+	(0x000143c0 + 0x00000000),
+	(0x000143c0 + 0x00000800),
+	(0x000143c0 + 0x00001000),
+	(0x000143c0 + 0x00001800),
+	(0x000543c0 + 0x00000000),
+	(0x000543c0 + 0x00000800),
+	(0x000543c0 + 0x00001000),
+	(0x000543c0 + 0x00001800),
+	(0x000943c0 + 0x00000000),
+	(0x000943c0 + 0x00000800),
+	(0x000943c0 + 0x00001000),
+	(0x000943c0 + 0x00001800),
+	(0x000d43c0 + 0x00000000),
+	(0x000d43c0 + 0x00000800),
+	(0x000d43c0 + 0x00001000),
+	(0x000d43c0 + 0x00001800),
+	(0x001143c0 + 0x00000000),
+	(0x001143c0 + 0x00000800),
+	(0x001143c0 + 0x00001000),
+	(0x001143c0 + 0x00001800),
+	(0x001543c0 + 0x00000000),
+	(0x001543c0 + 0x00000800),
+	(0x001543c0 + 0x00001000),
+	(0x001543c0 + 0x00001800),
+	(0x001943c0 + 0x00000000),
+	(0x001943c0 + 0x00000800),
+	(0x001943c0 + 0x00001000),
+	(0x001943c0 + 0x00001800),
+	(0x001d43c0 + 0x00000000),
+	(0x001d43c0 + 0x00000800),
+	(0x001d43c0 + 0x00001000),
+	(0x001d43c0 + 0x00001800),
+};
+
+static const uint32_t ecc_umc_mcumc_ctrl_mask_addrs[] = {
+	(0x000143e0 + 0x00000000),
+	(0x000143e0 + 0x00000800),
+	(0x000143e0 + 0x00001000),
+	(0x000143e0 + 0x00001800),
+	(0x000543e0 + 0x00000000),
+	(0x000543e0 + 0x00000800),
+	(0x000543e0 + 0x00001000),
+	(0x000543e0 + 0x00001800),
+	(0x000943e0 + 0x00000000),
+	(0x000943e0 + 0x00000800),
+	(0x000943e0 + 0x00001000),
+	(0x000943e0 + 0x00001800),
+	(0x000d43e0 + 0x00000000),
+	(0x000d43e0 + 0x00000800),
+	(0x000d43e0 + 0x00001000),
+	(0x000d43e0 + 0x00001800),
+	(0x001143e0 + 0x00000000),
+	(0x001143e0 + 0x00000800),
+	(0x001143e0 + 0x00001000),
+	(0x001143e0 + 0x00001800),
+	(0x001543e0 + 0x00000000),
+	(0x001543e0 + 0x00000800),
+	(0x001543e0 + 0x00001000),
+	(0x001543e0 + 0x00001800),
+	(0x001943e0 + 0x00000000),
+	(0x001943e0 + 0x00000800),
+	(0x001943e0 + 0x00001000),
+	(0x001943e0 + 0x00001800),
+	(0x001d43e0 + 0x00000000),
+	(0x001d43e0 + 0x00000800),
+	(0x001d43e0 + 0x00001000),
+	(0x001d43e0 + 0x00001800),
+};
+
+static const uint32_t ecc_umc_mcumc_status_addrs[] = {
+	(0x000143c2 + 0x00000000),
+	(0x000143c2 + 0x00000800),
+	(0x000143c2 + 0x00001000),
+	(0x000143c2 + 0x00001800),
+	(0x000543c2 + 0x00000000),
+	(0x000543c2 + 0x00000800),
+	(0x000543c2 + 0x00001000),
+	(0x000543c2 + 0x00001800),
+	(0x000943c2 + 0x00000000),
+	(0x000943c2 + 0x00000800),
+	(0x000943c2 + 0x00001000),
+	(0x000943c2 + 0x00001800),
+	(0x000d43c2 + 0x00000000),
+	(0x000d43c2 + 0x00000800),
+	(0x000d43c2 + 0x00001000),
+	(0x000d43c2 + 0x00001800),
+	(0x001143c2 + 0x00000000),
+	(0x001143c2 + 0x00000800),
+	(0x001143c2 + 0x00001000),
+	(0x001143c2 + 0x00001800),
+	(0x001543c2 + 0x00000000),
+	(0x001543c2 + 0x00000800),
+	(0x001543c2 + 0x00001000),
+	(0x001543c2 + 0x00001800),
+	(0x001943c2 + 0x00000000),
+	(0x001943c2 + 0x00000800),
+	(0x001943c2 + 0x00001000),
+	(0x001943c2 + 0x00001800),
+	(0x001d43c2 + 0x00000000),
+	(0x001d43c2 + 0x00000800),
+	(0x001d43c2 + 0x00001000),
+	(0x001d43c2 + 0x00001800),
+};
+
+static int gmc_v9_0_ecc_interrupt_state(struct amdgpu_device *adev,
+		struct amdgpu_irq_src *src,
+		unsigned type,
+		enum amdgpu_interrupt_state state)
+{
+	u32 bits, i, tmp, reg;
+
+	bits = 0x7f;
+
+	switch (state) {
+	case AMDGPU_IRQ_STATE_DISABLE:
+		for (i = 0; i < ARRAY_SIZE(ecc_umc_mcumc_ctrl_addrs); i++) {
+			reg = ecc_umc_mcumc_ctrl_addrs[i];
+			tmp = RREG32(reg);
+			tmp &= ~bits;
+			WREG32(reg, tmp);
+		}
+		for (i = 0; i < ARRAY_SIZE(ecc_umc_mcumc_ctrl_mask_addrs); i++) {
+			reg = ecc_umc_mcumc_ctrl_mask_addrs[i];
+			tmp = RREG32(reg);
+			tmp &= ~bits;
+			WREG32(reg, tmp);
+		}
+		break;
+	case AMDGPU_IRQ_STATE_ENABLE:
+		for (i = 0; i < ARRAY_SIZE(ecc_umc_mcumc_ctrl_addrs); i++) {
+			reg = ecc_umc_mcumc_ctrl_addrs[i];
+			tmp = RREG32(reg);
+			tmp |= bits;
+			WREG32(reg, tmp);
+		}
+		for (i = 0; i < ARRAY_SIZE(ecc_umc_mcumc_ctrl_mask_addrs); i++) {
+			reg = ecc_umc_mcumc_ctrl_mask_addrs[i];
+			tmp = RREG32(reg);
+			tmp |= bits;
+			WREG32(reg, tmp);
+		}
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static int gmc_v9_0_process_ras_data_cb(struct amdgpu_device *adev,
+		struct amdgpu_iv_entry *entry)
+{
+	amdgpu_ras_reset_gpu(adev, 0);
+	return AMDGPU_RAS_UE;
+}
+
+static int gmc_v9_0_process_ecc_irq(struct amdgpu_device *adev,
+		struct amdgpu_irq_src *source,
+		struct amdgpu_iv_entry *entry)
+{
+	struct ras_dispatch_if ih_data = {
+		.head = *adev->gmc.ras_if,
+		.entry = entry,
+	};
+	amdgpu_ras_interrupt_dispatch(adev, &ih_data);
+	return 0;
+}
+
 static int gmc_v9_0_vm_fault_interrupt_state(struct amdgpu_device *adev,
 					struct amdgpu_irq_src *src,
 					unsigned type,
@@ -350,10 +521,19 @@ static const struct amdgpu_irq_src_funcs gmc_v9_0_irq_funcs = {
 	.process = gmc_v9_0_process_interrupt,
 };
 
+
+static const struct amdgpu_irq_src_funcs gmc_v9_0_ecc_funcs = {
+	.set = gmc_v9_0_ecc_interrupt_state,
+	.process = gmc_v9_0_process_ecc_irq,
+};
+
 static void gmc_v9_0_set_irq_funcs(struct amdgpu_device *adev)
 {
 	adev->gmc.vm_fault.num_types = 1;
 	adev->gmc.vm_fault.funcs = &gmc_v9_0_irq_funcs;
+
+	adev->gmc.ecc_irq.num_types = 1;
+	adev->gmc.ecc_irq.funcs = &gmc_v9_0_ecc_funcs;
 }
 
 static uint32_t gmc_v9_0_get_invalidate_req(unsigned int vmid,
@@ -723,6 +903,75 @@ static int gmc_v9_0_allocate_vm_inv_eng(struct amdgpu_device *adev)
 	return 0;
 }
 
+static int gmc_v9_0_ecc_late_init(void *handle)
+{
+	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
+	struct ras_common_if **ras_if = &adev->gmc.ras_if;
+	struct ras_ih_if ih_info = {
+		.cb = gmc_v9_0_process_ras_data_cb,
+	};
+	struct ras_fs_if fs_info = {
+		.sysfs_name = "umc_err_count",
+		.debugfs_name = "umc_err_inject",
+	};
+	struct ras_common_if ras_block = {
+		.block = AMDGPU_RAS_BLOCK__UMC,
+		.type = AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE,
+		.sub_block_index = 0,
+		.name = "umc",
+	};
+	int r;
+
+	if (!amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__UMC)) {
+		amdgpu_ras_feature_enable(adev, &ras_block, 0);
+		return 0;
+	}
+
+	*ras_if = kmalloc(sizeof(**ras_if), GFP_KERNEL);
+	if (!*ras_if)
+		return -ENOMEM;
+
+	**ras_if = ras_block;
+
+	r = amdgpu_ras_feature_enable(adev, *ras_if, 1);
+	if (r)
+		goto feature;
+
+	ih_info.head = **ras_if;
+	fs_info.head = **ras_if;
+
+	r = amdgpu_ras_interrupt_add_handler(adev, &ih_info);
+	if (r)
+		goto interrupt;
+
+	r = amdgpu_ras_debugfs_create(adev, &fs_info);
+	if (r)
+		goto debugfs;
+
+	r = amdgpu_ras_sysfs_create(adev, &fs_info);
+	if (r)
+		goto sysfs;
+
+	r = amdgpu_irq_get(adev, &adev->gmc.ecc_irq, 0);
+	if (r)
+		goto irq;
+
+	return 0;
+irq:
+	amdgpu_ras_sysfs_remove(adev, *ras_if);
+sysfs:
+	amdgpu_ras_debugfs_remove(adev, *ras_if);
+debugfs:
+	amdgpu_ras_interrupt_remove_handler(adev, &ih_info);
+interrupt:
+	amdgpu_ras_feature_enable(adev, *ras_if, 0);
+feature:
+	kfree(*ras_if);
+	*ras_if = NULL;
+	return -EINVAL;
+}
+
+
 static int gmc_v9_0_late_init(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
@@ -748,6 +997,10 @@ static int gmc_v9_0_late_init(void *handle)
 		}
 	}
 
+	r = gmc_v9_0_ecc_late_init(handle);
+	if (r)
+		return r;
+
 	return amdgpu_irq_get(adev, &adev->gmc.vm_fault, 0);
 }
 
@@ -959,6 +1212,12 @@ static int gmc_v9_0_sw_init(void *handle)
 	if (r)
 		return r;
 
+	/* interrupt sent to DF. */
+	r = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_DF, 0,
+			&adev->gmc.ecc_irq);
+	if (r)
+		return r;
+
 	/* Set the internal MC address mask
 	 * This is the max address of the GPU's
 	 * internal address space.
@@ -1024,6 +1283,22 @@ static int gmc_v9_0_sw_fini(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
+	if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__UMC) &&
+			adev->gmc.ras_if) {
+		struct ras_common_if *ras_if = adev->gmc.ras_if;
+		struct ras_ih_if ih_info = {
+			.head = *ras_if,
+		};
+
+		/*remove fs first*/
+		amdgpu_ras_debugfs_remove(adev, ras_if);
+		amdgpu_ras_sysfs_remove(adev, ras_if);
+		/*remove the IH*/
+		amdgpu_ras_interrupt_remove_handler(adev, &ih_info);
+		amdgpu_ras_feature_enable(adev, ras_if, 0);
+		kfree(ras_if);
+	}
+
 	amdgpu_gem_force_release(adev);
 	amdgpu_vm_manager_fini(adev);
 
@@ -1170,6 +1445,7 @@ static int gmc_v9_0_hw_fini(void *handle)
 		return 0;
 	}
 
+	amdgpu_irq_put(adev, &adev->gmc.ecc_irq, 0);
 	amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);
 	gmc_v9_0_gart_disable(adev);
 
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 15/20] drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
                     ` (13 preceding siblings ...)
  2019-03-05 20:41   ` [PATCH 14/20] drm/amdgpu: enable ras on gmc9 Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 16/20] drm/amdgpu: add ioctl query for enabled ras features (v2) Alex Deucher
                     ` (4 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Alex Deucher, xinhui pan

From: xinhui pan <xinhui.pan@amd.com>

Add AMDGPU_CTX_QUERY2_FLAGS_RAS_CE/UE which indicate if any error happened
between previous query and this query.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 17 +++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h |  2 ++
 include/uapi/drm/amdgpu_drm.h           |  3 +++
 3 files changed, 22 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 7b526593eb77..736ed1d67ec2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -26,6 +26,7 @@
 #include <drm/drm_auth.h>
 #include "amdgpu.h"
 #include "amdgpu_sched.h"
+#include "amdgpu_ras.h"
 
 #define to_amdgpu_ctx_entity(e)	\
 	container_of((e), struct amdgpu_ctx_entity, entity)
@@ -344,6 +345,7 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
 {
 	struct amdgpu_ctx *ctx;
 	struct amdgpu_ctx_mgr *mgr;
+	uint32_t ras_counter;
 
 	if (!fpriv)
 		return -EINVAL;
@@ -368,6 +370,21 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
 	if (atomic_read(&ctx->guilty))
 		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
 
+	/*query ue count*/
+	ras_counter = amdgpu_ras_query_error_count(adev, false);
+	/*ras counter is monotonic increasing*/
+	if (ras_counter != ctx->ras_counter_ue) {
+		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
+		ctx->ras_counter_ue = ras_counter;
+	}
+
+	/*query ce count*/
+	ras_counter = amdgpu_ras_query_error_count(adev, true);
+	if (ras_counter != ctx->ras_counter_ce) {
+		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
+		ctx->ras_counter_ce = ras_counter;
+	}
+
 	mutex_unlock(&mgr->lock);
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
index b3b012c0a7da..8e561daa64cb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
@@ -49,6 +49,8 @@ struct amdgpu_ctx {
 	enum drm_sched_priority		override_priority;
 	struct mutex			lock;
 	atomic_t			guilty;
+	uint32_t			ras_counter_ce;
+	uint32_t			ras_counter_ue;
 };
 
 struct amdgpu_ctx_mgr {
diff --git a/include/uapi/drm/amdgpu_drm.h b/include/uapi/drm/amdgpu_drm.h
index e5275d4481f5..722598b25f37 100644
--- a/include/uapi/drm/amdgpu_drm.h
+++ b/include/uapi/drm/amdgpu_drm.h
@@ -210,6 +210,9 @@ union drm_amdgpu_bo_list {
 #define AMDGPU_CTX_QUERY2_FLAGS_VRAMLOST (1<<1)
 /* indicate some job from this context once cause gpu hang */
 #define AMDGPU_CTX_QUERY2_FLAGS_GUILTY   (1<<2)
+/* indicate some errors are detected by RAS */
+#define AMDGPU_CTX_QUERY2_FLAGS_RAS_CE   (1<<3)
+#define AMDGPU_CTX_QUERY2_FLAGS_RAS_UE   (1<<4)
 
 /* Context priority level */
 #define AMDGPU_CTX_PRIORITY_UNSET       -2048
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 16/20] drm/amdgpu: add ioctl query for enabled ras features (v2)
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
                     ` (14 preceding siblings ...)
  2019-03-05 20:41   ` [PATCH 15/20] drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2 Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 17/20] drm/amdgpu: skip gpu reset when ras error occured Alex Deucher
                     ` (3 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Alex Deucher, xinhui pan

From: xinhui pan <xinhui.pan@amd.com>

Add a query for userspace to check which RAS features
are enabled.

v2: squash in warning fix

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 10 ++++++++
 include/uapi/drm/amdgpu_drm.h           | 31 +++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 4d0ccca891cb..df40f35db89c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -39,6 +39,7 @@
 #include "amdgpu_amdkfd.h"
 #include "amdgpu_gem.h"
 #include "amdgpu_display.h"
+#include "amdgpu_ras.h"
 
 static void amdgpu_unregister_gpu_instance(struct amdgpu_device *adev)
 {
@@ -919,6 +920,15 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file
 	case AMDGPU_INFO_VRAM_LOST_COUNTER:
 		ui32 = atomic_read(&adev->vram_lost_counter);
 		return copy_to_user(out, &ui32, min(size, 4u)) ? -EFAULT : 0;
+	case AMDGPU_INFO_RAS_ENABLED_FEATURES: {
+		struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
+
+		if (!ras)
+			return -EINVAL;
+		return copy_to_user(out, &ras->features,
+				min_t(u32, size, sizeof(ras->features))) ?
+			-EFAULT : 0;
+	}
 	default:
 		DRM_DEBUG_KMS("Invalid request %d\n", info->query);
 		return -EINVAL;
diff --git a/include/uapi/drm/amdgpu_drm.h b/include/uapi/drm/amdgpu_drm.h
index 722598b25f37..e3a97da4add9 100644
--- a/include/uapi/drm/amdgpu_drm.h
+++ b/include/uapi/drm/amdgpu_drm.h
@@ -737,6 +737,37 @@ struct drm_amdgpu_cs_chunk_data {
 /* Number of VRAM page faults on CPU access. */
 #define AMDGPU_INFO_NUM_VRAM_CPU_PAGE_FAULTS	0x1E
 #define AMDGPU_INFO_VRAM_LOST_COUNTER		0x1F
+/* query ras mask of enabled features*/
+#define AMDGPU_INFO_RAS_ENABLED_FEATURES	0x20
+
+/* RAS MASK: UMC (VRAM) */
+#define AMDGPU_INFO_RAS_ENABLED_UMC			(1 << 0)
+/* RAS MASK: SDMA */
+#define AMDGPU_INFO_RAS_ENABLED_SDMA			(1 << 1)
+/* RAS MASK: GFX */
+#define AMDGPU_INFO_RAS_ENABLED_GFX			(1 << 2)
+/* RAS MASK: MMHUB */
+#define AMDGPU_INFO_RAS_ENABLED_MMHUB			(1 << 3)
+/* RAS MASK: ATHUB */
+#define AMDGPU_INFO_RAS_ENABLED_ATHUB			(1 << 4)
+/* RAS MASK: PCIE */
+#define AMDGPU_INFO_RAS_ENABLED_PCIE			(1 << 5)
+/* RAS MASK: HDP */
+#define AMDGPU_INFO_RAS_ENABLED_HDP			(1 << 6)
+/* RAS MASK: XGMI */
+#define AMDGPU_INFO_RAS_ENABLED_XGMI			(1 << 7)
+/* RAS MASK: DF */
+#define AMDGPU_INFO_RAS_ENABLED_DF			(1 << 8)
+/* RAS MASK: SMN */
+#define AMDGPU_INFO_RAS_ENABLED_SMN			(1 << 9)
+/* RAS MASK: SEM */
+#define AMDGPU_INFO_RAS_ENABLED_SEM			(1 << 10)
+/* RAS MASK: MP0 */
+#define AMDGPU_INFO_RAS_ENABLED_MP0			(1 << 11)
+/* RAS MASK: MP1 */
+#define AMDGPU_INFO_RAS_ENABLED_MP1			(1 << 12)
+/* RAS MASK: FUSE */
+#define AMDGPU_INFO_RAS_ENABLED_FUSE			(1 << 13)
 
 #define AMDGPU_INFO_MMR_SE_INDEX_SHIFT	0
 #define AMDGPU_INFO_MMR_SE_INDEX_MASK	0xff
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 17/20] drm/amdgpu: skip gpu reset when ras error occured
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
                     ` (15 preceding siblings ...)
  2019-03-05 20:41   ` [PATCH 16/20] drm/amdgpu: add ioctl query for enabled ras features (v2) Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 18/20] drm/amdgpu: add human readable debugfs control support (v2) Alex Deucher
                     ` (2 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Alex Deucher, xinhui pan

From: xinhui pan <xinhui.pan@amd.com>

gpu reset is not stable on vega20 A1.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 599dcd01aec7..02cb9a13ddc5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -182,10 +182,13 @@ int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev);
 static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev,
 		bool is_baco)
 {
+	/* remove me when gpu reset works on vega20 A1. */
+#if 0
 	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
 
 	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
 		schedule_work(&ras->recovery_work);
+#endif
 	return 0;
 }
 
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 18/20] drm/amdgpu: add human readable debugfs control support (v2)
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
                     ` (16 preceding siblings ...)
  2019-03-05 20:41   ` [PATCH 17/20] drm/amdgpu: skip gpu reset when ras error occured Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 19/20] drm/amdkfd: add RAS capabilities in topology for Vega20 (v2) Alex Deucher
  2019-03-05 20:41   ` [PATCH 20/20] drm/amdkfd: add RAS ECC event support (v2) Alex Deucher
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Alex Deucher, xinhui pan

From: xinhui pan <xinhui.pan@amd.com>

Currently, the debugfs control node can't parse bash-like commands.
Now add such support for any tester that uses scripts.

v2: squash in fixes for input validation

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 115 +++++++++++++++++++++---
 1 file changed, 102 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index ca9c7d1ede2f..1df6b03a3680 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -200,14 +200,87 @@ static const struct file_operations amdgpu_ras_debugfs_ops = {
 	.llseek = default_llseek
 };
 
+static int amdgpu_ras_find_block_id_by_name(const char *name, int *block_id)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(ras_block_string); i++) {
+		*block_id = i;
+		if (strcmp(name, ras_block_str(i)) == 0)
+			return 0;
+	}
+	return -EINVAL;
+}
+
+static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
+		const char __user *buf, size_t size,
+		loff_t *pos, struct ras_debug_if *data)
+{
+	ssize_t s = min_t(u64, 64, size);
+	char str[65];
+	char block_name[33];
+	char err[9] = "ue";
+	int op = -1;
+	int block_id;
+	u64 address, value;
+
+	if (*pos)
+		return -EINVAL;
+	*pos = size;
+
+	memset(str, 0, sizeof(str));
+	memset(data, 0, sizeof(*data));
+
+	if (copy_from_user(str, buf, s))
+		return -EINVAL;
+
+	if (sscanf(str, "disable %32s", block_name) == 1)
+		op = 0;
+	else if (sscanf(str, "enable %32s %8s", block_name, err) == 2)
+		op = 1;
+	else if (sscanf(str, "inject %32s %8s", block_name, err) == 2)
+		op = 2;
+	else if (sscanf(str, "%32s", block_name) == 1)
+		/* ascii string, but commands are not matched. */
+		return -EINVAL;
+
+	if (op != -1) {
+		if (amdgpu_ras_find_block_id_by_name(block_name, &block_id))
+			return -EINVAL;
+
+		data->head.block = block_id;
+		data->head.type = memcmp("ue", err, 2) == 0 ?
+			AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE :
+			AMDGPU_RAS_ERROR__SINGLE_CORRECTABLE;
+		data->op = op;
+
+		if (op == 2) {
+			if (sscanf(str, "%*s %*s %*s %llu %llu",
+						&address, &value) != 2)
+				if (sscanf(str, "%*s %*s %*s 0x%llx 0x%llx",
+							&address, &value) != 2)
+					return -EINVAL;
+			data->inject.address = address;
+			data->inject.value = value;
+		}
+	} else {
+		if (size < sizeof(data))
+			return -EINVAL;
+
+		if (copy_from_user(data, buf, sizeof(*data)))
+			return -EINVAL;
+	}
+
+	return 0;
+}
 /*
  * DOC: ras debugfs control interface
  *
  * It accepts struct ras_debug_if who has two members.
  *
  * First member: ras_debug_if::head or ras_debug_if::inject.
- * It is used to indicate which IP block will be under control.
- * Its contents are not human readable, IOW, write it by your programs.
+ *
+ * head is used to indicate which IP block will be under control.
  *
  * head has four members, they are block, type, sub_block_index, name.
  * block: which IP will be under control.
@@ -225,6 +298,28 @@ static const struct file_operations amdgpu_ras_debugfs_ops = {
  *  1: enable RAS on the block. Take ::head as its data.
  *  2: inject errors on the block. Take ::inject as its data.
  *
+ * How to use the interface?
+ * programs:
+ * copy the struct ras_debug_if in your codes and initialize it.
+ * write the struct to the control node.
+ *
+ * bash:
+ * echo op block [error [address value]] > .../ras/ras_ctrl
+ *	op: disable, enable, inject
+ *		disable: only block is needed
+ *		enable: block and error are needed
+ *		inject: error, address, value are needed
+ *	block: umc, smda, gfx, .........
+ *		see ras_block_string[] for details
+ *	error: ue, ce
+ *		ue: multi_uncorrectable
+ *		ce: single_correctable
+ *
+ * here are some examples for bash commands,
+ *	echo inject umc ue 0x0 0x0 > /sys/kernel/debug/dri/0/ras/ras_ctrl
+ *	echo inject umc ce 0 0 > /sys/kernel/debug/dri/0/ras/ras_ctrl
+ *	echo disable umc > /sys/kernel/debug/dri/0/ras/ras_ctrl
+ *
  * How to check the result?
  *
  * For disable/enable, please check ras features at
@@ -243,19 +338,10 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
 	struct ras_debug_if data;
 	int ret = 0;
 
-	if (size < sizeof(data))
-		return -EINVAL;
-
-	memset(&data, 0, sizeof(data));
-
-	if (*pos)
-		return -EINVAL;
-
-	if (copy_from_user(&data, buf, sizeof(data)))
+	ret = amdgpu_ras_debugfs_ctrl_parse_data(f, buf, size, pos, &data);
+	if (ret)
 		return -EINVAL;
 
-	*pos = size;
-
 	if (!amdgpu_ras_is_supported(adev, data.head.block))
 		return -EINVAL;
 
@@ -269,6 +355,9 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
 	case 2:
 		ret = amdgpu_ras_error_inject(adev, &data.inject);
 		break;
+	default:
+		ret = -EINVAL;
+		break;
 	};
 
 	if (ret)
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 19/20] drm/amdkfd: add RAS capabilities in topology for Vega20 (v2)
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
                     ` (17 preceding siblings ...)
  2019-03-05 20:41   ` [PATCH 18/20] drm/amdgpu: add human readable debugfs control support (v2) Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  2019-03-05 20:41   ` [PATCH 20/20] drm/amdkfd: add RAS ECC event support (v2) Alex Deucher
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Eric Huang, Alex Deucher, Felix Kuehling

From: Eric Huang <JinhuiEric.Huang@amd.com>

It is to collaborate with HSA_CAPABILITY in libhsakmt.

v2: squash in NULL pointer check

Signed-off-by: Eric Huang <JinhuiEric.Huang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 16 ++++++++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_topology.h |  4 ++++
 2 files changed, 20 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index 09da91644f9f..2cb09e088dce 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -37,6 +37,7 @@
 #include "kfd_device_queue_manager.h"
 #include "kfd_iommu.h"
 #include "amdgpu_amdkfd.h"
+#include "amdgpu_ras.h"
 
 /* topology_device_list - Master list of all topology devices */
 static struct list_head topology_device_list;
@@ -1197,6 +1198,7 @@ int kfd_topology_add_device(struct kfd_dev *gpu)
 	void *crat_image = NULL;
 	size_t image_size = 0;
 	int proximity_domain;
+	struct amdgpu_ras *ctx;
 
 	INIT_LIST_HEAD(&temp_topology_device_list);
 
@@ -1328,6 +1330,20 @@ int kfd_topology_add_device(struct kfd_dev *gpu)
 		dev->node_props.capability |= HSA_CAP_ATS_PRESENT;
 	}
 
+	ctx = amdgpu_ras_get_context((struct amdgpu_device *)(dev->gpu->kgd));
+	if (ctx) {
+		/* kfd only concerns sram ecc on GFX/SDMA and HBM ecc on UMC */
+		dev->node_props.capability |=
+			(((ctx->features & BIT(AMDGPU_RAS_BLOCK__SDMA)) != 0) ||
+			 ((ctx->features & BIT(AMDGPU_RAS_BLOCK__GFX)) != 0)) ?
+			HSA_CAP_SRAM_EDCSUPPORTED : 0;
+		dev->node_props.capability |= ((ctx->features & BIT(AMDGPU_RAS_BLOCK__UMC)) != 0) ?
+			HSA_CAP_MEM_EDCSUPPORTED : 0;
+
+		dev->node_props.capability |= (ctx->features != 0) ?
+			HSA_CAP_RASEVENTNOTIFY : 0;
+	}
+
 	kfd_debug_print_topology();
 
 	if (!res)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.h b/drivers/gpu/drm/amd/amdkfd/kfd_topology.h
index 92a19be07344..84710cfd23c2 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.h
@@ -48,6 +48,10 @@
 #define HSA_CAP_DOORBELL_TYPE_2_0		0x2
 #define HSA_CAP_AQL_QUEUE_DOUBLE_MAP		0x00004000
 
+#define HSA_CAP_SRAM_EDCSUPPORTED		0x00080000
+#define HSA_CAP_MEM_EDCSUPPORTED		0x00100000
+#define HSA_CAP_RASEVENTNOTIFY			0x00200000
+
 struct kfd_node_properties {
 	uint64_t hive_id;
 	uint32_t cpu_cores_count;
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 20/20] drm/amdkfd: add RAS ECC event support (v2)
       [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
                     ` (18 preceding siblings ...)
  2019-03-05 20:41   ` [PATCH 19/20] drm/amdkfd: add RAS capabilities in topology for Vega20 (v2) Alex Deucher
@ 2019-03-05 20:41   ` Alex Deucher
  19 siblings, 0 replies; 21+ messages in thread
From: Alex Deucher @ 2019-03-05 20:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Eric Huang, Alex Deucher, Felix Kuehling

From: Eric Huang <JinhuiEric.Huang@amd.com>

RAS ECC event will combine with GPU reset event, due to
ECC interrupts are caused by uncorrectable error that triggers
GPU reset.

v2: Fix misleading-indentation warning

Signed-off-by: Eric Huang <JinhuiEric.Huang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      |  1 +
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  1 +
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c     |  2 ++
 drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 11 +++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_events.c    | 18 +++++++++++++++++-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h      |  3 +++
 include/uapi/linux/kfd_ioctl.h             | 12 +++++++++++-
 8 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 0e1711a75b68..e6a503760b62 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -229,5 +229,6 @@ int kgd2kfd_quiesce_mm(struct mm_struct *mm);
 int kgd2kfd_resume_mm(struct mm_struct *mm);
 int kgd2kfd_schedule_evict_and_restore_process(struct mm_struct *mm,
 					       struct dma_fence *fence);
+void kgd2kfd_set_sram_ecc_flag(struct kfd_dev *kfd);
 
 #endif /* AMDGPU_AMDKFD_H_INCLUDED */
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 88c45f990f05..6bb71f6ee18e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -4805,6 +4805,7 @@ static int gfx_v9_0_process_ras_data_cb(struct amdgpu_device *adev,
 		struct amdgpu_iv_entry *entry)
 {
 	/* TODO ue will trigger an interrupt. */
+	kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
 	amdgpu_ras_reset_gpu(adev, 0);
 	return AMDGPU_RAS_UE;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 1a164fe2dd9a..84904bd680df 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -354,6 +354,7 @@ static int gmc_v9_0_ecc_interrupt_state(struct amdgpu_device *adev,
 static int gmc_v9_0_process_ras_data_cb(struct amdgpu_device *adev,
 		struct amdgpu_iv_entry *entry)
 {
+	kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
 	amdgpu_ras_reset_gpu(adev, 0);
 	return AMDGPU_RAS_UE;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
index 058b9daec514..f7a6fafd70ae 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
@@ -1851,6 +1851,8 @@ static int sdma_v4_0_process_ras_data_cb(struct amdgpu_device *adev,
 		return 0;
 	}
 
+	kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
+
 	amdgpu_ras_reset_gpu(adev, 0);
 
 	return AMDGPU_RAS_UE;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 8be9677c0c07..b3cdbf79f47b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -466,6 +466,8 @@ struct kfd_dev *kgd2kfd_probe(struct kgd_dev *kgd,
 	memset(&kfd->doorbell_available_index, 0,
 		sizeof(kfd->doorbell_available_index));
 
+	atomic_set(&kfd->sram_ecc_flag, 0);
+
 	return kfd;
 }
 
@@ -661,6 +663,9 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd)
 		return ret;
 	count = atomic_dec_return(&kfd_locked);
 	WARN_ONCE(count != 0, "KFD reset ref. error");
+
+	atomic_set(&kfd->sram_ecc_flag, 0);
+
 	return 0;
 }
 
@@ -1024,6 +1029,12 @@ int kfd_gtt_sa_free(struct kfd_dev *kfd, struct kfd_mem_obj *mem_obj)
 	return 0;
 }
 
+void kgd2kfd_set_sram_ecc_flag(struct kfd_dev *kfd)
+{
+	if (kfd)
+		atomic_inc(&kfd->sram_ecc_flag);
+}
+
 #if defined(CONFIG_DEBUG_FS)
 
 /* This function will send a package to HIQ to hang the HWS
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
index e9f0e0a1b41c..6e1d41c5bf86 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
@@ -1011,25 +1011,41 @@ void kfd_signal_vm_fault_event(struct kfd_dev *dev, unsigned int pasid,
 void kfd_signal_reset_event(struct kfd_dev *dev)
 {
 	struct kfd_hsa_hw_exception_data hw_exception_data;
+	struct kfd_hsa_memory_exception_data memory_exception_data;
 	struct kfd_process *p;
 	struct kfd_event *ev;
 	unsigned int temp;
 	uint32_t id, idx;
+	int reset_cause = atomic_read(&dev->sram_ecc_flag) ?
+			KFD_HW_EXCEPTION_ECC :
+			KFD_HW_EXCEPTION_GPU_HANG;
 
 	/* Whole gpu reset caused by GPU hang and memory is lost */
 	memset(&hw_exception_data, 0, sizeof(hw_exception_data));
 	hw_exception_data.gpu_id = dev->id;
 	hw_exception_data.memory_lost = 1;
+	hw_exception_data.reset_cause = reset_cause;
+
+	memset(&memory_exception_data, 0, sizeof(memory_exception_data));
+	memory_exception_data.ErrorType = KFD_MEM_ERR_SRAM_ECC;
+	memory_exception_data.gpu_id = dev->id;
+	memory_exception_data.failure.imprecise = true;
 
 	idx = srcu_read_lock(&kfd_processes_srcu);
 	hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
 		mutex_lock(&p->event_mutex);
 		id = KFD_FIRST_NONSIGNAL_EVENT_ID;
-		idr_for_each_entry_continue(&p->event_idr, ev, id)
+		idr_for_each_entry_continue(&p->event_idr, ev, id) {
 			if (ev->type == KFD_EVENT_TYPE_HW_EXCEPTION) {
 				ev->hw_exception_data = hw_exception_data;
 				set_event(ev);
 			}
+			if (ev->type == KFD_EVENT_TYPE_MEMORY &&
+			    reset_cause == KFD_HW_EXCEPTION_ECC) {
+				ev->memory_exception_data = memory_exception_data;
+				set_event(ev);
+			}
+		}
 		mutex_unlock(&p->event_mutex);
 	}
 	srcu_read_unlock(&kfd_processes_srcu, idx);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 0eeee3c6d6dc..9e0230965675 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -276,6 +276,9 @@ struct kfd_dev {
 	uint64_t hive_id;
 
 	bool pci_atomic_requested;
+
+	/* SRAM ECC flag */
+	atomic_t sram_ecc_flag;
 };
 
 enum kfd_mempool {
diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index e622fd1fbd46..dc067ed0b72d 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -211,6 +211,11 @@ struct kfd_ioctl_dbg_wave_control_args {
 #define KFD_HW_EXCEPTION_GPU_HANG	0
 #define KFD_HW_EXCEPTION_ECC		1
 
+/* For kfd_hsa_memory_exception_data.ErrorType */
+#define KFD_MEM_ERR_NO_RAS		0
+#define KFD_MEM_ERR_SRAM_ECC		1
+#define KFD_MEM_ERR_POISON_CONSUMED	2
+#define KFD_MEM_ERR_GPU_HANG		3
 
 struct kfd_ioctl_create_event_args {
 	__u64 event_page_offset;	/* from KFD */
@@ -250,7 +255,12 @@ struct kfd_hsa_memory_exception_data {
 	struct kfd_memory_exception_failure failure;
 	__u64 va;
 	__u32 gpu_id;
-	__u32 pad;
+	__u32 ErrorType; /* 0 = no RAS error,
+			  * 1 = ECC_SRAM,
+			  * 2 = Link_SYNFLOOD (poison),
+			  * 3 = GPU hang (not attributable to a specific cause),
+			  * other values reserved
+			  */
 };
 
 /* hw exception data */
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-03-05 20:41 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-05 20:41 [PATCH 00/20] RAS support for amdgpu Alex Deucher
     [not found] ` <20190305204136.17587-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
2019-03-05 20:41   ` [PATCH 01/20] drm/amdgpu: add ta ras fw info (v2) Alex Deucher
2019-03-05 20:41   ` [PATCH 02/20] drm/amdgpu: export ta fw info Alex Deucher
2019-03-05 20:41   ` [PATCH 03/20] drm/amdgpu: add module parameters for ras Alex Deucher
2019-03-05 20:41   ` [PATCH 04/20] drm/amdgpu: add ta_ras_if.h Alex Deucher
2019-03-05 20:41   ` [PATCH 05/20] drm/amdgpu: add psp ras callback func and macro Alex Deucher
2019-03-05 20:41   ` [PATCH 06/20] drm/amdgpu: add psp ras subsystem infrastructure (v2) Alex Deucher
2019-03-05 20:41   ` [PATCH 07/20] drm/amdgpu: add psp v11 ras callback Alex Deucher
2019-03-05 20:41   ` [PATCH 08/20] drm/amdgpu: add psp cmd submit timeout Alex Deucher
2019-03-05 20:41   ` [PATCH 09/20] drm/amdgpu: add amdgpu_ras.c to support ras (v2) Alex Deucher
2019-03-05 20:41   ` [PATCH 10/20] drm/amdgpu: add debugfs ctrl node Alex Deucher
2019-03-05 20:41   ` [PATCH 11/20] drm/amdgpu: reserve bad pages during recovery Alex Deucher
2019-03-05 20:41   ` [PATCH 12/20] drm/amdgpu: enable ras on sdma4 Alex Deucher
2019-03-05 20:41   ` [PATCH 13/20] drm/amdgpu: enable ras on gfx9 (v2) Alex Deucher
2019-03-05 20:41   ` [PATCH 14/20] drm/amdgpu: enable ras on gmc9 Alex Deucher
2019-03-05 20:41   ` [PATCH 15/20] drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2 Alex Deucher
2019-03-05 20:41   ` [PATCH 16/20] drm/amdgpu: add ioctl query for enabled ras features (v2) Alex Deucher
2019-03-05 20:41   ` [PATCH 17/20] drm/amdgpu: skip gpu reset when ras error occured Alex Deucher
2019-03-05 20:41   ` [PATCH 18/20] drm/amdgpu: add human readable debugfs control support (v2) Alex Deucher
2019-03-05 20:41   ` [PATCH 19/20] drm/amdkfd: add RAS capabilities in topology for Vega20 (v2) Alex Deucher
2019-03-05 20:41   ` [PATCH 20/20] drm/amdkfd: add RAS ECC event support (v2) Alex Deucher

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.