* [PATCH 0/9] BAD GPU retirement policy by total bad pages
@ 2020-07-23  8:33 Guchun Chen
  2020-07-23  8:33 ` [PATCH 1/9] drm/amdgpu: add bad page count threshold in module parameter Guchun Chen
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: Guchun Chen @ 2020-07-23  8:33 UTC (permalink / raw)
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	Stanley.Yang, Tao.Zhou1, John.Clements, lijo.lazar
  Cc: Guchun Chen

This series enables the GPU retirement feature, which is triggered
when the number of bad pages detected by RAS ECC reaches the
threshold value.

When the number of saved bad pages written to the eeprom reaches the
threshold, a RAS recovery is issued immediately. That recovery fails
on purpose, to tell the user that the GPU is BAD and needs to be
retired for further checking.

During bootup, a similar BAD GPU check is performed when the eeprom
is initialized, and it aborts driver initialization so that the user
is made aware of the problem.

The user can set bad_page_threshold=0 when loading the amdgpu driver
to disable this feature and bring up the GPU as usual.

Guchun Chen (9):
  drm/amdgpu: add bad page count threshold in module parameter
  drm/amdgpu: validate bad page threshold in ras
  drm/amdgpu: add bad gpu tag definition
  drm/amdgpu: break driver init process when it's bad GPU
  drm/amdgpu: skip bad page reservation once issuing from eeprom write
  drm/amdgpu: schedule ras recovery when reaching bad page threshold
  drm/amdgpu: break GPU recovery once it's bad
  drm/amdgpu: restore ras flags when user resets eeprom
  drm/amdgpu: calculate actual size instead of hardcode size

 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  29 ++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  11 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       |  77 ++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       |  19 ++-
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    | 109 ++++++++++++++++--
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h    |   8 +-
 7 files changed, 230 insertions(+), 24 deletions(-)

-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* [PATCH 1/9] drm/amdgpu: add bad page count threshold in module parameter
  2020-07-23  8:33 [PATCH 0/9] BAD GPU retirement policy by total bad pages Guchun Chen
@ 2020-07-23  8:33 ` Guchun Chen
  2020-07-23  8:33 ` [PATCH 2/9] drm/amdgpu: validate bad page threshold in ras Guchun Chen
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Guchun Chen @ 2020-07-23  8:33 UTC (permalink / raw)
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	Stanley.Yang, Tao.Zhou1, John.Clements, lijo.lazar
  Cc: Guchun Chen

bad_page_threshold can be specified to detect and retire a
bad GPU once the number of faulty pages reaches it.

When it is -1, RAS uses the typical bad page failure value.
When it is 0, the bad GPU check is disabled.

v2: correct documentation of this parameter.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 06bfb8658dec..bb83ffb5e26a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -181,6 +181,7 @@ extern uint amdgpu_dm_abm_level;
 extern struct amdgpu_mgpu_info mgpu_info;
 extern int amdgpu_ras_enable;
 extern uint amdgpu_ras_mask;
+extern int amdgpu_bad_page_threshold;
 extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
 extern int amdgpu_discovery;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index d28b95f721c4..5a517f2fce35 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -161,6 +161,7 @@ struct amdgpu_mgpu_info mgpu_info = {
 };
 int amdgpu_ras_enable = -1;
 uint amdgpu_ras_mask = 0xffffffff;
+int amdgpu_bad_page_threshold = -1;
 
 /**
  * DOC: vramlimit (int)
@@ -801,6 +802,16 @@ module_param_named(tmz, amdgpu_tmz, int, 0444);
 MODULE_PARM_DESC(reset_method, "GPU reset method (-1 = auto (default), 0 = legacy, 1 = mode0, 2 = mode1, 3 = mode2, 4 = baco)");
 module_param_named(reset_method, amdgpu_reset_method, int, 0444);
 
+/**
+ * DOC: bad_page_threshold (int)
+ * Bad page threshold configuration is applied by GPU retirement policy.
+ * The detail is to specify the threshold value of faulty pages detected by
+ * ECC, that may result in GPU's retirement if total faulty pages by ECC
+ * exceed threshold value.
+ */
+MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default typical value), 0 = disable)");
+module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
+
 static const struct pci_device_id pciidlist[] = {
 #ifdef  CONFIG_DRM_AMDGPU_SI
 	{0x1002, 0x6780, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_TAHITI},
-- 
2.17.1


* [PATCH 2/9] drm/amdgpu: validate bad page threshold in ras
  2020-07-23  8:33 [PATCH 0/9] BAD GPU retirement policy by total bad pages Guchun Chen
  2020-07-23  8:33 ` [PATCH 1/9] drm/amdgpu: add bad page count threshold in module parameter Guchun Chen
@ 2020-07-23  8:33 ` Guchun Chen
  2020-07-23  8:33 ` [PATCH 3/9] drm/amdgpu: add bad gpu tag definition Guchun Chen
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Guchun Chen @ 2020-07-23  8:33 UTC (permalink / raw)
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	Stanley.Yang, Tao.Zhou1, John.Clements, lijo.lazar
  Cc: Guchun Chen

The bad page threshold must lie in the range from -1 to the
maximum record count of the eeprom. It determines when the GPU
should be retired.

v2: When using the default typical value, take the minimum of the
typical value and the eeprom max record length.
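
The resolution rules above can be sketched in user space as follows.
This is only an illustration under stated assumptions: resolve_threshold()
and its parameters are hypothetical names, not driver API, and the
capacity passed in by the caller stands in for EEPROM_MAX_RECORD_NUM.
The sketch casts max_records to int so that the range check stays a
signed comparison and -1 is not promoted to a large unsigned value.

```c
#include <stdint.h>

/* One expected bad page per 100MB of VRAM, mirroring RAS_BAD_PAGE_RATE. */
#define BAD_PAGE_RATE		(100ULL * 1024 * 1024)
/* Sentinel stored when the retirement check is disabled. */
#define THRESHOLD_DISABLED	0xFFFFFFFFu

/*
 * Hypothetical stand-in for amdgpu_ras_validate_threshold():
 * maps the module parameter to the effective threshold.
 */
static uint32_t resolve_threshold(int param, uint64_t vram_bytes,
				  uint32_t max_records)
{
	/* Clamp to [-1, max_records]; the cast keeps the compare signed. */
	if (param < -1)
		param = -1;
	else if (param > (int)max_records)
		param = (int)max_records;

	if (param == -1) {
		/* Auto: typical value, capped by the eeprom capacity. */
		uint64_t typical = vram_bytes / BAD_PAGE_RATE;

		if (typical > max_records)
			typical = max_records;
		return (uint32_t)typical;
	}

	if (param == 0)
		return THRESHOLD_DISABLED;

	return (uint32_t)param;
}
```

For example, with 16 GiB of VRAM and an assumed capacity of 256 records,
the auto setting resolves to the typical value, while with 32 GiB it is
capped at the capacity.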

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       | 45 +++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       |  3 ++
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    |  5 +++
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h    |  2 +
 4 files changed, 55 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 6f06e1214622..8daeb54917ed 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -69,6 +69,9 @@ const char *ras_block_string[] = {
 /* inject address is 52 bits */
 #define	RAS_UMC_INJECT_ADDR_LIMIT	(0x1ULL << 52)
 
+/* typical ECC bad page rate(1 bad page per 100MB VRAM) */
+#define RAS_BAD_PAGE_RATE		(100 * 1024 * 1024ULL)
+
 enum amdgpu_ras_retire_page_reservation {
 	AMDGPU_RAS_RETIRE_PAGE_RESERVED,
 	AMDGPU_RAS_RETIRE_PAGE_PENDING,
@@ -1700,6 +1703,44 @@ static bool amdgpu_ras_check_bad_page(struct amdgpu_device *adev,
 	return ret;
 }
 
+static void amdgpu_ras_validate_threshold(struct amdgpu_device *adev,
+					uint32_t max_length)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	int tmp_threshold = amdgpu_bad_page_threshold;
+	u64 val;
+
+	/*
+	 * Justification of value bad_page_cnt_threshold in ras structure
+	 *
+	 * Generally, -1 <= amdgpu_bad_page_threshold <= max record length
+	 * in eeprom.
+	 *    - If amdgpu_bad_page_threshold = -1,
+	 *      bad_page_cnt_threshold = typical value by formula.
+	 *    - If amdgpu_bad_page_threshold = 0,
+	 *      bad_page_cnt_threshold = 0xFFFFFFFF,
+	 *      and disable GPU retirement feature accordingly.
+	 *    - When the value from user is 0 < amdgpu_bad_page_threshold <
+	 *      max record length in eeprom, use it directly.
+	 */
+
+	if (tmp_threshold < -1)
+		tmp_threshold = -1;
+	else if (tmp_threshold > (int)max_length)
+		tmp_threshold = max_length;
+
+	if (tmp_threshold == -1) {
+		val = adev->gmc.mc_vram_size;
+		do_div(val, RAS_BAD_PAGE_RATE);
+		con->bad_page_cnt_threshold = min(lower_32_bits(val),
+						max_length);
+	} else if (tmp_threshold == 0) {
+		con->bad_page_cnt_threshold = 0xFFFFFFFF;
+	} else {
+		con->bad_page_cnt_threshold = tmp_threshold;
+	}
+}
+
 /* called in gpu recovery/init */
 int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev)
 {
@@ -1777,6 +1818,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
 {
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 	struct ras_err_handler_data **data;
+	uint32_t max_eeprom_records_len = 0;
 	int ret;
 
 	if (con)
@@ -1795,6 +1837,9 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
 	atomic_set(&con->in_recovery, 0);
 	con->adev = adev;
 
+	max_eeprom_records_len = amdgpu_ras_eeprom_get_record_max_length();
+	amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
+
 	ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
 	if (ret)
 		goto free;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index b2667342cf67..4672649a9293 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -336,6 +336,9 @@ struct amdgpu_ras {
 	struct amdgpu_ras_eeprom_control eeprom_control;
 
 	bool error_query_ready;
+
+	/* bad page count threshold */
+	uint32_t bad_page_cnt_threshold;
 };
 
 struct ras_fs_data {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index c0096097bbcf..a2c982b1eac6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -499,6 +499,11 @@ int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
 	return ret == num ? 0 : -EIO;
 }
 
+inline uint32_t amdgpu_ras_eeprom_get_record_max_length(void)
+{
+	return EEPROM_MAX_RECORD_NUM;
+}
+
 /* Used for testing if bugs encountered */
 #if 0
 void amdgpu_ras_eeprom_test(struct amdgpu_ras_eeprom_control *control)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
index 7e8647a05df7..b272840cb069 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
@@ -85,6 +85,8 @@ int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
 					    bool write,
 					    int num);
 
+inline uint32_t amdgpu_ras_eeprom_get_record_max_length(void);
+
 void amdgpu_ras_eeprom_test(struct amdgpu_ras_eeprom_control *control);
 
 #endif // _AMDGPU_RAS_EEPROM_H
-- 
2.17.1


* [PATCH 3/9] drm/amdgpu: add bad gpu tag definition
  2020-07-23  8:33 [PATCH 0/9] BAD GPU retirement policy by total bad pages Guchun Chen
  2020-07-23  8:33 ` [PATCH 1/9] drm/amdgpu: add bad page count threshold in module parameter Guchun Chen
  2020-07-23  8:33 ` [PATCH 2/9] drm/amdgpu: validate bad page threshold in ras Guchun Chen
@ 2020-07-23  8:33 ` Guchun Chen
  2020-07-23  8:33 ` [PATCH 4/9] drm/amdgpu: break driver init process when it's bad GPU Guchun Chen
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Guchun Chen @ 2020-07-23  8:33 UTC (permalink / raw)
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	Stanley.Yang, Tao.Zhou1, John.Clements, lijo.lazar
  Cc: Guchun Chen

This tag will be used for bad GPU detection.
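
As a rough illustration of how the tag value relates to its ASCII
spelling, the sketch below unpacks a 32-bit header tag into characters,
most significant byte first. The helper names tag_to_string() and
tag_is_bad_gpu() are hypothetical and exist only for this example; the
two constants are the ones used by the eeprom table code.

```c
#include <stdint.h>

#define EEPROM_TABLE_HDR_VAL 0x414d4452u	/* normal table header */
#define EEPROM_TABLE_HDR_BAD 0x42414447u	/* bad GPU tag */

/* Hypothetical helper: spell a tag as four ASCII characters plus NUL. */
static void tag_to_string(uint32_t tag, char out[5])
{
	int i;

	for (i = 0; i < 4; i++)
		out[i] = (char)((tag >> (24 - 8 * i)) & 0xff);
	out[4] = '\0';
}

/* Hypothetical helper: true when the header carries the bad GPU tag. */
static int tag_is_bad_gpu(uint32_t tag)
{
	return tag == EEPROM_TABLE_HDR_BAD;
}
```

Spelled this way, the bad GPU tag reads "BADG" and the normal header
reads "AMDR".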

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index a2c982b1eac6..35c0c849d49b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -46,6 +46,9 @@
 #define EEPROM_TABLE_HDR_VAL 0x414d4452
 #define EEPROM_TABLE_VER 0x00010000
 
+/* Bad GPU tag 'BADG' */
+#define EEPROM_TABLE_HDR_BAD 0x42414447
+
 /* Assume 2 Mbit size */
 #define EEPROM_SIZE_BYTES 256000
 #define EEPROM_PAGE__SIZE_BYTES 256
-- 
2.17.1


* [PATCH 4/9] drm/amdgpu: break driver init process when it's bad GPU
  2020-07-23  8:33 [PATCH 0/9] BAD GPU retirement policy by total bad pages Guchun Chen
                   ` (2 preceding siblings ...)
  2020-07-23  8:33 ` [PATCH 3/9] drm/amdgpu: add bad gpu tag definition Guchun Chen
@ 2020-07-23  8:33 ` Guchun Chen
  2020-07-23  8:33 ` [PATCH 5/9] drm/amdgpu: skip bad page reservation once issuing from eeprom write Guchun Chen
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Guchun Chen @ 2020-07-23  8:33 UTC (permalink / raw)
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	Stanley.Yang, Tao.Zhou1, John.Clements, lijo.lazar
  Cc: Guchun Chen

When the bad GPU tag is retrieved from the eeprom, GPU init should
fail, as the GPU needs to be retired for further checking.
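
The intended error propagation can be sketched as below. This is a
simplified model, not the driver code: recovery_init_status() is a
hypothetical name, and eeprom_err stands for whatever error the eeprom
init step produced.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of the return policy: only the bad GPU case is
 * reported to the caller; any other failure is swallowed so that
 * driver init can continue.
 */
static int recovery_init_status(int eeprom_err, bool bad_gpu)
{
	int ret = eeprom_err;

	if (bad_gpu)
		ret = -EINVAL;

	/* Resources would be released here in either failure case. */

	if (!bad_gpu)
		ret = 0;

	return ret;
}
```

With this policy, an eeprom failure such as -ENOENT still lets init
proceed, while a bad GPU tag makes the caller see -EINVAL.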

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c     | 12 +++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c        | 15 +++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 11 ++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h |  3 ++-
 4 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 2662cd7c8685..882f8a0964a5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2059,13 +2059,19 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
 	 * it should be called after amdgpu_device_ip_hw_init_phase2  since
 	 * for some ASICs the RAS EEPROM code relies on SMU fully functioning
 	 * for I2C communication which only true at this point.
-	 * recovery_init may fail, but it can free all resources allocated by
-	 * itself and its failure should not stop amdgpu init process.
+	 *
+	 * amdgpu_ras_recovery_init may fail, but the caller only cares about
+	 * failure in the bad gpu case and stops the amdgpu init process
+	 * accordingly. For other failures, it still releases all
+	 * the resources and prints an error message, rather than returning a
+	 * negative value to the caller.
 	 *
 	 * Note: theoretically, this should be called before all vram allocations
 	 * to protect retired page from abusing
 	 */
-	amdgpu_ras_recovery_init(adev);
+	r = amdgpu_ras_recovery_init(adev);
+	if (r)
+		goto init_failed;
 
 	if (adev->gmc.xgmi.num_physical_nodes > 1)
 		amdgpu_xgmi_add_device(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 8daeb54917ed..06db2f0b78d7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1819,6 +1819,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 	struct ras_err_handler_data **data;
 	uint32_t max_eeprom_records_len = 0;
+	bool bad_gpu = false;
 	int ret;
 
 	if (con)
@@ -1840,9 +1841,12 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
 	max_eeprom_records_len = amdgpu_ras_eeprom_get_record_max_length();
 	amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
 
-	ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
-	if (ret)
+	ret = amdgpu_ras_eeprom_init(&con->eeprom_control, &bad_gpu);
+	/* Only fail this call and halt booting when bad_gpu is true. */
+	if (bad_gpu) {
+		ret = -EINVAL;
 		goto free;
+	}
 
 	if (con->eeprom_control.num_recs) {
 		ret = amdgpu_ras_load_bad_pages(adev);
@@ -1865,6 +1869,13 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
 out:
 	dev_warn(adev->dev, "Failed to initialize ras recovery!\n");
 
+	/*
+	 * Except bad_gpu case, other failure cases in this function
+	 * would not fail amdgpu driver init.
+	 */
+	if (!bad_gpu)
+		ret = 0;
+
 	return ret;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 35c0c849d49b..3f1b167afe6b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -241,12 +241,14 @@ int amdgpu_ras_eeprom_reset_table(struct amdgpu_ras_eeprom_control *control)
 
 }
 
-int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control)
+int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
+			bool *bad_gpu)
 {
 	int ret = 0;
 	struct amdgpu_device *adev = to_amdgpu_device(control);
 	unsigned char buff[EEPROM_ADDRESS_SIZE + EEPROM_TABLE_HEADER_SIZE] = { 0 };
 	struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr;
+	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
 	struct i2c_msg msg = {
 			.addr	= 0,
 			.flags	= I2C_M_RD,
@@ -254,6 +256,8 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control)
 			.buf	= buff,
 	};
 
+	*bad_gpu = false;
+
 	/* Verify i2c adapter is initialized */
 	if (!adev->pm.smu_i2c.algo)
 		return -ENOENT;
@@ -282,6 +286,11 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control)
 		DRM_DEBUG_DRIVER("Found existing EEPROM table with %d records",
 				 control->num_recs);
 
+	} else if ((ras->bad_page_cnt_threshold != 0xFFFFFFFF) && (
+			hdr->header == EEPROM_TABLE_HDR_BAD)) {
+		*bad_gpu = true;
+		DRM_ERROR("Detect BAD GPU TAG in eeprom table and "
+			"GPU shall be retired.\n");
 	} else {
 		DRM_INFO("Creating new EEPROM table");
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
index b272840cb069..a2de243da31d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
@@ -77,7 +77,8 @@ struct eeprom_table_record {
 	unsigned char mcumc_id;
 }__attribute__((__packed__));
 
-int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control);
+int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
+			bool *bad_gpu);
 int amdgpu_ras_eeprom_reset_table(struct amdgpu_ras_eeprom_control *control);
 
 int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
-- 
2.17.1


* [PATCH 5/9] drm/amdgpu: skip bad page reservation once issuing from eeprom write
  2020-07-23  8:33 [PATCH 0/9] BAD GPU retirement policy by total bad pages Guchun Chen
                   ` (3 preceding siblings ...)
  2020-07-23  8:33 ` [PATCH 4/9] drm/amdgpu: break driver init process when it's bad GPU Guchun Chen
@ 2020-07-23  8:33 ` Guchun Chen
  2020-07-23  8:33 ` [PATCH 6/9] drm/amdgpu: schedule ras recovery when reaching bad page threshold Guchun Chen
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Guchun Chen @ 2020-07-23  8:33 UTC (permalink / raw)
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	Stanley.Yang, Tao.Zhou1, John.Clements, lijo.lazar
  Cc: Guchun Chen

When RAS recovery is issued from the eeprom write itself,
bad page reservation should be skipped; otherwise, writing
to the eeprom would be called recursively.
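
A minimal sketch of the guard, assuming a plain flags word: the
RAS_FLAG_* names are shortened stand-ins for the AMDGPU_RAS_FLAG_*
bits, and in_task_ctx is a hypothetical parameter standing in for the
kernel's in_task().

```c
#include <stdbool.h>
#include <stdint.h>

#define RAS_FLAG_INIT_BY_VBIOS		(0x1u << 0)
#define RAS_FLAG_INIT_NEED_RESET	(0x1u << 1)
#define RAS_FLAG_SKIP_BAD_PAGE_RESV	(0x1u << 2)

/*
 * Hypothetical predicate: reserve bad pages before a reset only when
 * running in task context and the recovery was not triggered by the
 * eeprom write path.
 */
static bool should_reserve_bad_pages(uint32_t flags, bool in_task_ctx)
{
	return !(flags & RAS_FLAG_SKIP_BAD_PAGE_RESV) && in_task_ctx;
}
```

Because each flag occupies its own bit, setting the skip bit leaves the
other flags untouched.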

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c |  2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 14 +++++++++++---
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 06db2f0b78d7..4c86c7a64bcc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -62,8 +62,6 @@ const char *ras_block_string[] = {
 #define ras_err_str(i) (ras_error_string[ffs(i)])
 #define ras_block_str(i) (ras_block_string[i])
 
-#define AMDGPU_RAS_FLAG_INIT_BY_VBIOS		1
-#define AMDGPU_RAS_FLAG_INIT_NEED_RESET		2
 #define RAS_DEFAULT_FLAGS (AMDGPU_RAS_FLAG_INIT_BY_VBIOS)
 
 /* inject address is 52 bits */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 4672649a9293..cf9f60202334 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -31,6 +31,10 @@
 #include "ta_ras_if.h"
 #include "amdgpu_ras_eeprom.h"
 
+#define AMDGPU_RAS_FLAG_INIT_BY_VBIOS		(0x1 << 0)
+#define AMDGPU_RAS_FLAG_INIT_NEED_RESET		(0x1 << 1)
+#define AMDGPU_RAS_FLAG_SKIP_BAD_PAGE_RESV	(0x1 << 2)
+
 enum amdgpu_ras_block {
 	AMDGPU_RAS_BLOCK__UMC = 0,
 	AMDGPU_RAS_BLOCK__SDMA,
@@ -503,10 +507,14 @@ static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev)
 {
 	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
 
-	/* save bad page to eeprom before gpu reset,
-	 * i2c may be unstable in gpu reset
+	/*
+	 * Save bad page to eeprom before gpu reset, i2c may be unstable
+	 * in gpu reset.
+	 *
+	 * Also, exclude the case when ras recovery issuer is
+	 * eeprom page write itself.
 	 */
-	if (in_task())
+	if (!(ras->flags & AMDGPU_RAS_FLAG_SKIP_BAD_PAGE_RESV) && in_task())
 		amdgpu_ras_reserve_bad_pages(adev);
 
 	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
-- 
2.17.1


* [PATCH 6/9] drm/amdgpu: schedule ras recovery when reaching bad page threshold
  2020-07-23  8:33 [PATCH 0/9] BAD GPU retirement policy by total bad pages Guchun Chen
                   ` (4 preceding siblings ...)
  2020-07-23  8:33 ` [PATCH 5/9] drm/amdgpu: skip bad page reservation once issuing from eeprom write Guchun Chen
@ 2020-07-23  8:33 ` Guchun Chen
  2020-07-23  8:33 ` [PATCH 7/9] drm/amdgpu: break GPU recovery once it's bad Guchun Chen
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Guchun Chen @ 2020-07-23  8:33 UTC (permalink / raw)
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	Stanley.Yang, Tao.Zhou1, John.Clements, lijo.lazar
  Cc: Guchun Chen

Once the number of bad pages saved to the eeprom reaches the
configured threshold, RAS recovery is issued to notify the user.
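
The trigger condition can be sketched as a small predicate. The
function name should_retire() and its parameters are hypothetical; the
sentinel matches the value the series stores when the feature is
disabled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sentinel stored when the retirement check is disabled. */
#define THRESHOLD_DISABLED 0xFFFFFFFFu

/*
 * Hypothetical predicate: on a write, retire the GPU once the records
 * already saved plus the records about to be saved reach the threshold.
 */
static bool should_retire(bool write, uint32_t num_recs, uint32_t num_new,
			  uint32_t threshold)
{
	if (!write || threshold == THRESHOLD_DISABLED)
		return false;

	return num_recs + num_new >= threshold;
}
```

Note that the comparison is inclusive, so the write that brings the
total up to the threshold is the one that marks the GPU as bad.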

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    | 36 ++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 3f1b167afe6b..0cd594c74bff 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -395,8 +395,10 @@ int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
 	int i, ret = 0;
 	struct i2c_msg *msgs, *msg;
 	unsigned char *buffs, *buff;
+	bool sched_ras_recovery = false;
 	struct eeprom_table_record *record;
 	struct amdgpu_device *adev = to_amdgpu_device(control);
+	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
 
 	if (adev->asic_type != CHIP_VEGA20 && adev->asic_type != CHIP_ARCTURUS)
 		return 0;
@@ -414,11 +416,29 @@ int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
 		goto free_buff;
 	}
 
+	/*
+	 * If the number of saved bad pages reaches the bad page threshold for
+	 * the whole VRAM, update the table header to mark a BAD GPU and
+	 * schedule a ras recovery after the eeprom write is done, so that
+	 * the latest records are not lost.
+	 *
+	 * This new header will be picked up and checked in the bootup by
+	 * ras recovery, which may break bootup process to notify user this
+	 * GPU is bad and to retire such GPU.
+	 */
+	if (write && (ras->bad_page_cnt_threshold != 0xFFFFFFFF) &&
+		((control->num_recs + num) >= ras->bad_page_cnt_threshold)) {
+		dev_warn(adev->dev,
+			"Saved bad pages(%d) reaches threshold value(%d).\n",
+			control->num_recs + num, ras->bad_page_cnt_threshold);
+		control->tbl_hdr.header = EEPROM_TABLE_HDR_BAD;
+		sched_ras_recovery = true;
+	}
+
 	/* In case of overflow just start from beginning to not lose newest records */
 	if (write && (control->next_addr + EEPROM_TABLE_RECORD_SIZE * num > EEPROM_SIZE_BYTES))
 		control->next_addr = EEPROM_RECORD_START;
 
-
 	/*
 	 * TODO Currently makes EEPROM writes for each record, this creates
 	 * internal fragmentation. Optimized the code to do full page write of
@@ -494,6 +514,20 @@ int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
 		__update_tbl_checksum(control, records, num, old_hdr_byte_sum);
 
 		__update_table_header(control, buffs);
+
+		if (sched_ras_recovery) {
+			/*
+			 * Before scheduling ras recovery, assert the related
+			 * flag first, which shall bypass common bad page
+			 * reservation execution in amdgpu_ras_reset_gpu.
+			 */
+			amdgpu_ras_get_context(adev)->flags |=
+				AMDGPU_RAS_FLAG_SKIP_BAD_PAGE_RESV;
+
+			dev_warn(adev->dev, "Conduct ras recovery due to bad "
+				"page threshold reached.\n");
+			amdgpu_ras_reset_gpu(adev);
+		}
 	} else if (!__validate_tbl_checksum(control, records, num)) {
 		DRM_WARN("EEPROM Table checksum mismatch!");
 		/* TODO Uncomment when EEPROM read/write is relliable */
-- 
2.17.1


* [PATCH 7/9] drm/amdgpu: break GPU recovery once it's bad
  2020-07-23  8:33 [PATCH 0/9] BAD GPU retirement policy by total bad pages Guchun Chen
                   ` (5 preceding siblings ...)
  2020-07-23  8:33 ` [PATCH 6/9] drm/amdgpu: schedule ras recovery when reaching bad page threshold Guchun Chen
@ 2020-07-23  8:33 ` Guchun Chen
  2020-07-23  8:33 ` [PATCH 8/9] drm/amdgpu: restore ras flags when user resets eeprom Guchun Chen
  2020-07-23  8:33 ` [PATCH 9/9] drm/amdgpu: calculate actual size instead of hardcode size Guchun Chen
  8 siblings, 0 replies; 10+ messages in thread
From: Guchun Chen @ 2020-07-23  8:33 UTC (permalink / raw)
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	Stanley.Yang, Tao.Zhou1, John.Clements, lijo.lazar
  Cc: Guchun Chen

When the GPU executes recovery and retrieves the bad GPU tag
from the external eeprom device, the recovery is stopped and
a GPU retirement message is printed for the user's awareness.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 17 ++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       | 13 ++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       |  2 +
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    | 43 +++++++++++++++++++
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h    |  3 ++
 5 files changed, 75 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 882f8a0964a5..771c4e6b7a0f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4139,8 +4139,20 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 
 				amdgpu_fbdev_set_suspend(tmp_adev, 0);
 
-				/* must succeed. */
-				amdgpu_ras_resume(tmp_adev);
+				/*
+				 * The GPU is BAD once faulty pages by ECC have
+				 * reached the threshold, and ras recovery is
+				 * scheduled. So add one check here to break
+				 * recovery if it's one BAD GPU, and remind
+				 * user to retire this GPU.
+				 */
+				if (!amdgpu_ras_check_bad_gpu(tmp_adev)) {
+					/* must succeed. */
+					amdgpu_ras_resume(tmp_adev);
+				} else {
+					r = -EINVAL;
+					goto out;
+				}
 
 				/* Update PSP FW topology after reset */
 				if (hive && tmp_adev->gmc.xgmi.num_physical_nodes > 1)
@@ -4148,7 +4160,6 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 			}
 		}
 
-
 out:
 		if (!r) {
 			amdgpu_irq_gpu_reset_resume_helper(tmp_adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 4c86c7a64bcc..acb8231f2052 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2200,3 +2200,16 @@ bool amdgpu_ras_need_emergency_restart(struct amdgpu_device *adev)
 
 	return false;
 }
+
+bool amdgpu_ras_check_bad_gpu(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	bool bad_gpu = false;
+
+	if (con && (con->bad_page_cnt_threshold != 0xFFFFFFFF))
+		amdgpu_ras_eeprom_check_bad_gpu(&con->eeprom_control,
+						&bad_gpu);
+
+	/* We are only interested in variable bad_gpu. */
+	return bad_gpu;
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index cf9f60202334..95918d355fa9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -497,6 +497,8 @@ void amdgpu_ras_suspend(struct amdgpu_device *adev);
 unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
 		bool is_ce);
 
+bool amdgpu_ras_check_bad_gpu(struct amdgpu_device *adev);
+
 /* error handling functions */
 int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
 		struct eeprom_table_record *bps, int pages);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 0cd594c74bff..d27cd5ae431a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -387,6 +387,49 @@ static uint32_t __correct_eeprom_dest_address(uint32_t curr_address)
 	return curr_address;
 }
 
+int amdgpu_ras_eeprom_check_bad_gpu(struct amdgpu_ras_eeprom_control *control,
+				bool *bad_gpu)
+{
+	struct amdgpu_device *adev = to_amdgpu_device(control);
+	unsigned char buff[EEPROM_ADDRESS_SIZE +
+			EEPROM_TABLE_HEADER_SIZE] = { 0 };
+	struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr;
+	struct i2c_msg msg = {
+			.addr = control->i2c_address,
+			.flags = I2C_M_RD,
+			.len = EEPROM_ADDRESS_SIZE + EEPROM_TABLE_HEADER_SIZE,
+			.buf = buff,
+	};
+	int ret;
+
+	*bad_gpu = false;
+
+	/* read EEPROM table header */
+	mutex_lock(&control->tbl_mutex);
+	ret = i2c_transfer(&adev->pm.smu_i2c, &msg, 1);
+	if (ret < 1) {
+		dev_err(adev->dev, "Failed to read EEPROM table header.\n");
+		goto err;
+	}
+
+	__decode_table_header_from_buff(hdr, &buff[2]);
+
+	if (hdr->header == EEPROM_TABLE_HDR_BAD) {
+		dev_warn(adev->dev, "Current GPU is BAD and should be retired.\n");
+		*bad_gpu = true;
+	}
+	__decode_table_header_from_buff(hdr, &buff[2]);
+
+	if (hdr->header == EEPROM_TABLE_HDR_BAD) {
+		dev_warn(adev->dev, "Current GPU is BAD and should be retired.\n");
+		*bad_gpu = true;
+	}
+
+err:
+	mutex_unlock(&control->tbl_mutex);
+	return 0;
+}
+
 int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
 					    struct eeprom_table_record *records,
 					    bool write,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
index a2de243da31d..82757c88db9e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
@@ -81,6 +81,9 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
 			bool *bad_gpu);
 int amdgpu_ras_eeprom_reset_table(struct amdgpu_ras_eeprom_control *control);
 
+int amdgpu_ras_eeprom_check_bad_gpu(struct amdgpu_ras_eeprom_control *control,
+				bool *bad_gpu);
+
 int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
 					    struct eeprom_table_record *records,
 					    bool write,
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

* [PATCH 8/9] drm/amdgpu: restore ras flags when user resets eeprom
  2020-07-23  8:33 [PATCH 0/9] BAD GPU retirement policy by total bad pages Guchun Chen
                   ` (6 preceding siblings ...)
  2020-07-23  8:33 ` [PATCH 7/9] drm/amdgpu: break GPU recovery once it's bad Guchun Chen
@ 2020-07-23  8:33 ` Guchun Chen
  2020-07-23  8:33 ` [PATCH 9/9] drm/amdgpu: calculate actual size instead of hardcode size Guchun Chen
  8 siblings, 0 replies; 10+ messages in thread
From: Guchun Chen @ 2020-07-23  8:33 UTC (permalink / raw)
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	Stanley.Yang, Tao.Zhou1, John.Clements, lijo.lazar
  Cc: Guchun Chen

RAS flags need to be cleared as well when the user requests
a clean EEPROM.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index acb8231f2052..003bbd023c23 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -371,6 +371,8 @@ static ssize_t amdgpu_ras_debugfs_eeprom_write(struct file *f, const char __user
 	struct amdgpu_device *adev = (struct amdgpu_device *)file_inode(f)->i_private;
 	int ret;
 
+	amdgpu_ras_get_context(adev)->flags = RAS_DEFAULT_FLAGS;
+
 	ret = amdgpu_ras_eeprom_reset_table(&adev->psp.ras.ras->eeprom_control);
 
 	return ret == 1 ? size : -EIO;
-- 
2.17.1

* [PATCH 9/9] drm/amdgpu: calculate actual size instead of hardcode size
  2020-07-23  8:33 [PATCH 0/9] BAD GPU retirement policy by total bad pages Guchun Chen
                   ` (7 preceding siblings ...)
  2020-07-23  8:33 ` [PATCH 8/9] drm/amdgpu: restore ras flags when user resets eeprom Guchun Chen
@ 2020-07-23  8:33 ` Guchun Chen
  8 siblings, 0 replies; 10+ messages in thread
From: Guchun Chen @ 2020-07-23  8:33 UTC (permalink / raw)
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	Stanley.Yang, Tao.Zhou1, John.Clements, lijo.lazar
  Cc: Guchun Chen

Use sizeof to get actual size.

v2: correct the other confusing comment about header and record size.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index d27cd5ae431a..12ae8eb3b53e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -31,14 +31,9 @@
 #define EEPROM_I2C_TARGET_ADDR_ARCTURUS  	0xA8
 #define EEPROM_I2C_TARGET_ADDR_ARCTURUS_D342  	0xA0
 
-/*
- * The 2 macros bellow represent the actual size in bytes that
- * those entities occupy in the EEPROM memory.
- * EEPROM_TABLE_RECORD_SIZE is different than sizeof(eeprom_table_record) which
- * uses uint64 to store 6b fields such as retired_page.
- */
-#define EEPROM_TABLE_HEADER_SIZE 20
-#define EEPROM_TABLE_RECORD_SIZE 24
+/* Define header and record sizes in EEPROM memory. */
+#define EEPROM_TABLE_HEADER_SIZE (sizeof(struct amdgpu_ras_eeprom_table_header))
+#define EEPROM_TABLE_RECORD_SIZE (sizeof(struct eeprom_table_record))
 
 #define EEPROM_ADDRESS_SIZE 0x2
 
-- 
2.17.1
