All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] BAD GPU retirement policy by total bad pages
@ 2020-07-22  3:14 Guchun Chen
  2020-07-22  3:14 ` [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter Guchun Chen
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Guchun Chen @ 2020-07-22  3:14 UTC (permalink / raw)
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	Stanley.Yang, Tao.Zhou1, John.Clements
  Cc: Guchun Chen

The series is to enable the feature of GPU RMA(Return Merchandise
Authorization) which is trigged when bad pages detected by RAS ECC
exceed the threshold value.

When the saved bad pages written to eeprom reach the threshold,
one ras recovery will be issued immediately and the recovery will
fail to tell user that the GPU is BAD and needs to be retired for
further check.

During bootup, similar BAD GPU check is conducted as well when
eeprom get initialized, and it will break boot up for user's
awareness.

User could set bad_page_threshold=0 when probing amdgpu driver to
disable this feature to bring up GPU, and reset eeprom later.

Guchun Chen (5):
  drm/amdgpu: add bad page count threshold in module parameter
  drm/amdgpu: validate bad page threshold in ras
  drm/amdgpu: conduct bad gpu check during bootup/reset
  drm/amdgpu: restore ras flags when user resets eeprom
  drm/amdgpu: calculate actual size instead of hardcode size

 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 21 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       | 11 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       | 70 ++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       | 19 +++-
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    | 98 ++++++++++++++++++-
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h    |  8 +-
 7 files changed, 211 insertions(+), 17 deletions(-)

-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
  2020-07-22  3:14 [PATCH 0/5] BAD GPU retirement policy by total bad pages Guchun Chen
@ 2020-07-22  3:14 ` Guchun Chen
  2020-07-23  0:31   ` Li, Dennis
  2020-07-22  3:14 ` [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras Guchun Chen
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Guchun Chen @ 2020-07-22  3:14 UTC (permalink / raw)
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	Stanley.Yang, Tao.Zhou1, John.Clements
  Cc: Guchun Chen

bad_page_threshold could be specified to detect and retire
bad GPU if faulty bad pages exceed it.

When it's -1, ras will use typical bad page failure value.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 06bfb8658dec..bb83ffb5e26a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -181,6 +181,7 @@ extern uint amdgpu_dm_abm_level;
 extern struct amdgpu_mgpu_info mgpu_info;
 extern int amdgpu_ras_enable;
 extern uint amdgpu_ras_mask;
+extern int amdgpu_bad_page_threshold;
 extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
 extern int amdgpu_discovery;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index d28b95f721c4..f99671101746 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -161,6 +161,7 @@ struct amdgpu_mgpu_info mgpu_info = {
 };
 int amdgpu_ras_enable = -1;
 uint amdgpu_ras_mask = 0xffffffff;
+int amdgpu_bad_page_threshold = -1;
 
 /**
  * DOC: vramlimit (int)
@@ -801,6 +802,16 @@ module_param_named(tmz, amdgpu_tmz, int, 0444);
 MODULE_PARM_DESC(reset_method, "GPU reset method (-1 = auto (default), 0 = legacy, 1 = mode0, 2 = mode1, 3 = mode2, 4 = baco)");
 module_param_named(reset_method, amdgpu_reset_method, int, 0444);
 
+/**
+ * DOC: bad_page_threshold (int)
+ * Bad page threshold configuration is driven by RMA(Return Merchandise
+ * Authorization) policy, which is to specify the threshold value of faulty
+ * pages detected by ECC, which may result in GPU's retirement if total
+ * faulty pages by ECC exceed threshold value.
+ */
+MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default typical value))");
+module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
+
 static const struct pci_device_id pciidlist[] = {
 #ifdef  CONFIG_DRM_AMDGPU_SI
 	{0x1002, 0x6780, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_TAHITI},
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras
  2020-07-22  3:14 [PATCH 0/5] BAD GPU retirement policy by total bad pages Guchun Chen
  2020-07-22  3:14 ` [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter Guchun Chen
@ 2020-07-22  3:14 ` Guchun Chen
  2020-07-22  7:51   ` Yang, Stanley
  2020-07-22  3:14 ` [PATCH 3/5] drm/amdgpu: conduct bad gpu check during bootup/reset Guchun Chen
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Guchun Chen @ 2020-07-22  3:14 UTC (permalink / raw)
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	Stanley.Yang, Tao.Zhou1, John.Clements
  Cc: Guchun Chen

Bad page threshold value should be valid in the range between
-1 and max records length of eeprom. It could determine when
the GPU should be retired.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       | 43 +++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       |  3 ++
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    |  5 +++
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h    |  2 +
 4 files changed, 53 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 6f06e1214622..e3d67d85c55f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -69,6 +69,9 @@ const char *ras_block_string[] = {
 /* inject address is 52 bits */
 #define	RAS_UMC_INJECT_ADDR_LIMIT	(0x1ULL << 52)
 
+/* typical ECC bad page rate(1 bad page per 100MB VRAM) */
+#define RAS_BAD_PAGE_RATE		(100 * 1024 * 1024ULL)
+
 enum amdgpu_ras_retire_page_reservation {
 	AMDGPU_RAS_RETIRE_PAGE_RESERVED,
 	AMDGPU_RAS_RETIRE_PAGE_PENDING,
@@ -1700,6 +1703,42 @@ static bool amdgpu_ras_check_bad_page(struct amdgpu_device *adev,
 	return ret;
 }
 
+static void amdgpu_ras_validate_threshold(struct amdgpu_device *adev,
+					uint32_t max_length)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	int tmp_threshold = amdgpu_bad_page_threshold;
+	u64 val;
+
+	/*
+	 * Justification of value bad_page_cnt_threshold in ras structure
+	 *
+	 * 1. -1 <= amdgpu_bad_page_threshold <= max record length in eeprom
+	 * 2. if amdgpu_bad_page_threshold = -1,
+	 *    bad_page_cnt_threshold = typical value by formula.
+	 * 3. if amdgpu_bad_page_threshold = 0,
+	 *    bad_page_cnt_threshold = 0xFFFFFFFF,
+	 *    and disable RMA feature accordingly.
+	 * 4. use the value specified from user when (amdgpu_bad_page_threshold
+	 *    > 0 && < max record length in eeprom).
+	 */
+
+	if (tmp_threshold < -1)
+		tmp_threshold = -1;
+	else if (tmp_threshold > max_length)
+		tmp_threshold = max_length;
+
+	if (tmp_threshold == -1) {
+		val = adev->gmc.mc_vram_size;
+		do_div(val, RAS_BAD_PAGE_RATE);
+		con->bad_page_cnt_threshold = lower_32_bits(val);
+	} else if (tmp_threshold == 0) {
+		con->bad_page_cnt_threshold = 0xFFFFFFFF;
+	} else {
+		con->bad_page_cnt_threshold = tmp_threshold;
+	}
+}
+
 /* called in gpu recovery/init */
 int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev)
 {
@@ -1777,6 +1816,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
 {
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 	struct ras_err_handler_data **data;
+	uint32_t max_eeprom_records_len = 0;
 	int ret;
 
 	if (con)
@@ -1795,6 +1835,9 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
 	atomic_set(&con->in_recovery, 0);
 	con->adev = adev;
 
+	max_eeprom_records_len = amdgpu_ras_eeprom_get_record_max_length();
+	amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
+
 	ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
 	if (ret)
 		goto free;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index b2667342cf67..4672649a9293 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -336,6 +336,9 @@ struct amdgpu_ras {
 	struct amdgpu_ras_eeprom_control eeprom_control;
 
 	bool error_query_ready;
+
+	/* bad page count threshold */
+	uint32_t bad_page_cnt_threshold;
 };
 
 struct ras_fs_data {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index c0096097bbcf..a2c982b1eac6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -499,6 +499,11 @@ int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
 	return ret == num ? 0 : -EIO;
 }
 
+inline uint32_t amdgpu_ras_eeprom_get_record_max_length(void)
+{
+	return EEPROM_MAX_RECORD_NUM;
+}
+
 /* Used for testing if bugs encountered */
 #if 0
 void amdgpu_ras_eeprom_test(struct amdgpu_ras_eeprom_control *control)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
index 7e8647a05df7..b272840cb069 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
@@ -85,6 +85,8 @@ int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
 					    bool write,
 					    int num);
 
+inline uint32_t amdgpu_ras_eeprom_get_record_max_length(void);
+
 void amdgpu_ras_eeprom_test(struct amdgpu_ras_eeprom_control *control);
 
 #endif // _AMDGPU_RAS_EEPROM_H
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/5] drm/amdgpu: conduct bad gpu check during bootup/reset
  2020-07-22  3:14 [PATCH 0/5] BAD GPU retirement policy by total bad pages Guchun Chen
  2020-07-22  3:14 ` [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter Guchun Chen
  2020-07-22  3:14 ` [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras Guchun Chen
@ 2020-07-22  3:14 ` Guchun Chen
  2020-07-23  2:51   ` Zhou1, Tao
  2020-07-22  3:14 ` [PATCH 4/5] drm/amdgpu: restore ras flags when user resets eeprom Guchun Chen
  2020-07-22  3:14 ` [PATCH 5/5] drm/amdgpu: calculate actual size instead of hardcode size Guchun Chen
  4 siblings, 1 reply; 17+ messages in thread
From: Guchun Chen @ 2020-07-22  3:14 UTC (permalink / raw)
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	Stanley.Yang, Tao.Zhou1, John.Clements
  Cc: Guchun Chen

During boot up, when init eeprom, RAS will check if the BAD
GPU tag is saved or not. And will break boot up and notify user
to RMA it.

At the meanwhile, when saved bad page count to eeprom exceeds
threshold, one ras recovery will be triggered immediately, and
bad gpu check will be executed and reset will fail as well to
remind user.

User could set bad_page_threshold = 0 as module parameter when
probing driver to disable the bad feature check.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 21 ++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       | 25 +++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       | 16 +++-
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    | 87 ++++++++++++++++++-
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h    |  6 +-
 5 files changed, 142 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 2662cd7c8685..d529bf7917f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2065,7 +2065,9 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
 	 * Note: theoretically, this should be called before all vram allocations
 	 * to protect retired page from abusing
 	 */
-	amdgpu_ras_recovery_init(adev);
+	r = amdgpu_ras_recovery_init(adev);
+	if (r)
+		goto init_failed;
 
 	if (adev->gmc.xgmi.num_physical_nodes > 1)
 		amdgpu_xgmi_add_device(adev);
@@ -4133,8 +4135,20 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 
 				amdgpu_fbdev_set_suspend(tmp_adev, 0);
 
-				/* must succeed. */
-				amdgpu_ras_resume(tmp_adev);
+				/*
+				 * The CPU is BAD once faulty pages by ECC has
+				 * reached the threshold, and ras recovery is
+				 * scheduled. So add one check here to prevent
+				 * recovery if it's one BAD GPU, and remind
+				 * user to RMA this GPU.
+				 */
+				if (!amdgpu_ras_check_bad_gpu(tmp_adev)) {
+					/* must succeed. */
+					amdgpu_ras_resume(tmp_adev);
+				} else {
+					r = -EINVAL;
+					goto out;
+				}
 
 				/* Update PSP FW topology after reset */
 				if (hive && tmp_adev->gmc.xgmi.num_physical_nodes > 1)
@@ -4142,7 +4156,6 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 			}
 		}
 
-
 out:
 		if (!r) {
 			amdgpu_irq_gpu_reset_resume_helper(tmp_adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index e3d67d85c55f..818d4154e4a3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -62,8 +62,6 @@ const char *ras_block_string[] = {
 #define ras_err_str(i) (ras_error_string[ffs(i)])
 #define ras_block_str(i) (ras_block_string[i])
 
-#define AMDGPU_RAS_FLAG_INIT_BY_VBIOS		1
-#define AMDGPU_RAS_FLAG_INIT_NEED_RESET		2
 #define RAS_DEFAULT_FLAGS (AMDGPU_RAS_FLAG_INIT_BY_VBIOS)
 
 /* inject address is 52 bits */
@@ -1817,6 +1815,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 	struct ras_err_handler_data **data;
 	uint32_t max_eeprom_records_len = 0;
+	bool bad_gpu = false;
 	int ret;
 
 	if (con)
@@ -1838,9 +1837,14 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
 	max_eeprom_records_len = amdgpu_ras_eeprom_get_record_max_length();
 	amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
 
-	ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
-	if (ret)
+	ret = amdgpu_ras_eeprom_init(&con->eeprom_control, &bad_gpu);
+	/*
+	 * we only fail this calling and halt booting when bad_gpu is true.
+	 */
+	if (bad_gpu) {
+		ret = -EINVAL;
 		goto free;
+	}
 
 	if (con->eeprom_control.num_recs) {
 		ret = amdgpu_ras_load_bad_pages(adev);
@@ -2189,3 +2193,16 @@ bool amdgpu_ras_need_emergency_restart(struct amdgpu_device *adev)
 
 	return false;
 }
+
+bool amdgpu_ras_check_bad_gpu(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	bool bad_gpu = false;
+
+	if (con && (con->bad_page_cnt_threshold != 0xFFFFFFFF))
+		amdgpu_ras_eeprom_check_bad_gpu(&con->eeprom_control,
+						&bad_gpu);
+
+	/* We are only interested in variable bad_gpu. */
+	return bad_gpu;
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 4672649a9293..d7a363b37feb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -31,6 +31,10 @@
 #include "ta_ras_if.h"
 #include "amdgpu_ras_eeprom.h"
 
+#define AMDGPU_RAS_FLAG_INIT_BY_VBIOS		(0x1 << 0)
+#define AMDGPU_RAS_FLAG_INIT_NEED_RESET		(0x1 << 1)
+#define AMDGPU_RAS_FLAG_SKIP_BAD_PAGE_RESV	(0x1 << 2)
+
 enum amdgpu_ras_block {
 	AMDGPU_RAS_BLOCK__UMC = 0,
 	AMDGPU_RAS_BLOCK__SDMA,
@@ -493,6 +497,8 @@ void amdgpu_ras_suspend(struct amdgpu_device *adev);
 unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
 		bool is_ce);
 
+bool amdgpu_ras_check_bad_gpu(struct amdgpu_device *adev);
+
 /* error handling functions */
 int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
 		struct eeprom_table_record *bps, int pages);
@@ -503,10 +509,14 @@ static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev)
 {
 	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
 
-	/* save bad page to eeprom before gpu reset,
-	 * i2c may be unstable in gpu reset
+	/*
+	 * Save bad page to eeprom before gpu reset, i2c may be unstable
+	 * in gpu reset.
+	 *
+	 * Also, exclude the case when ras recovery issuer is
+	 * eerprom page write.
 	 */
-	if (in_task())
+	if (!(ras->flags & AMDGPU_RAS_FLAG_SKIP_BAD_PAGE_RESV) && in_task())
 		amdgpu_ras_reserve_bad_pages(adev);
 
 	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index a2c982b1eac6..96b63c026bad 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -46,6 +46,9 @@
 #define EEPROM_TABLE_HDR_VAL 0x414d4452
 #define EEPROM_TABLE_VER 0x00010000
 
+/* Bad GPU tag ‘BADG’ */
+#define EEPROM_TABLE_HDR_BAD 0x42414447
+
 /* Assume 2 Mbit size */
 #define EEPROM_SIZE_BYTES 256000
 #define EEPROM_PAGE__SIZE_BYTES 256
@@ -238,12 +241,14 @@ int amdgpu_ras_eeprom_reset_table(struct amdgpu_ras_eeprom_control *control)
 
 }
 
-int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control)
+int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
+			bool *bad_gpu)
 {
 	int ret = 0;
 	struct amdgpu_device *adev = to_amdgpu_device(control);
 	unsigned char buff[EEPROM_ADDRESS_SIZE + EEPROM_TABLE_HEADER_SIZE] = { 0 };
 	struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr;
+	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
 	struct i2c_msg msg = {
 			.addr	= 0,
 			.flags	= I2C_M_RD,
@@ -251,6 +256,8 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control)
 			.buf	= buff,
 	};
 
+	*bad_gpu = false;
+
 	/* Verify i2c adapter is initialized */
 	if (!adev->pm.smu_i2c.algo)
 		return -ENOENT;
@@ -279,6 +286,11 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control)
 		DRM_DEBUG_DRIVER("Found existing EEPROM table with %d records",
 				 control->num_recs);
 
+	} else if ((ras->bad_page_cnt_threshold != 0xFFFFFFFF) && (
+			hdr->header == EEPROM_TABLE_HDR_BAD)) {
+		*bad_gpu = true;
+		DRM_ERROR("Detect BAD GPU TAG in eeprom table and "
+				"GPU shall be RMAed.\n");
 	} else {
 		DRM_INFO("Creating new EEPROM table");
 
@@ -375,6 +387,44 @@ static uint32_t __correct_eeprom_dest_address(uint32_t curr_address)
 	return curr_address;
 }
 
+int amdgpu_ras_eeprom_check_bad_gpu(struct amdgpu_ras_eeprom_control *control,
+					bool *bad_gpu)
+{
+	struct amdgpu_device *adev = to_amdgpu_device(control);
+	unsigned char buff[EEPROM_ADDRESS_SIZE +
+			EEPROM_TABLE_HEADER_SIZE] = { 0 };
+	struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr;
+	struct i2c_msg msg = {
+			.addr = control->i2c_address,
+			.flags = I2C_M_RD,
+			.len = EEPROM_ADDRESS_SIZE + EEPROM_TABLE_HEADER_SIZE,
+			.buf = buff,
+	};
+	int ret;
+
+	*bad_gpu = false;
+
+	/* read EEPROM table header */
+	mutex_lock(&control->tbl_mutex);
+	ret = i2c_transfer(&adev->pm.smu_i2c, &msg, 1);
+	if (ret < 1) {
+		dev_err(adev->dev, "Failed to read EEPROM table header.\n");
+	goto err;
+	}
+
+	__decode_table_header_from_buff(hdr, &buff[2]);
+
+	if (hdr->header == EEPROM_TABLE_HDR_BAD) {
+		dev_warn(adev->dev, "Current GPU is BAD and should be RMAed.\n");
+		*bad_gpu = true;
+	}
+
+	ret = 0;
+err:
+	mutex_unlock(&control->tbl_mutex);
+	return ret;
+}
+
 int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
 					    struct eeprom_table_record *records,
 					    bool write,
@@ -383,8 +433,10 @@ int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
 	int i, ret = 0;
 	struct i2c_msg *msgs, *msg;
 	unsigned char *buffs, *buff;
+	bool sched_ras_recovery = false;
 	struct eeprom_table_record *record;
 	struct amdgpu_device *adev = to_amdgpu_device(control);
+	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
 
 	if (adev->asic_type != CHIP_VEGA20 && adev->asic_type != CHIP_ARCTURUS)
 		return 0;
@@ -402,6 +454,25 @@ int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
 		goto free_buff;
 	}
 
+	/*
+	 * If saved bad pages number exceeds the bad page threshod for
+	 * the whole VRAM, update table header to mark one BAD GPU and
+	 * schedule one ras recovery after eeprom write is done, this
+	 * can avoid the missing for latest records.
+	 *
+	 * This new header will be picked up and checked in the bootup by
+	 * ras recovery, which may break bootup process to notify user this
+	 * GPU is bad and to RMA(Return Merchandise Authorization) this GPU.
+	 */
+	if (write && (ras->bad_page_cnt_threshold != 0xFFFFFFFF) &&
+		((control->num_recs + num) >= ras->bad_page_cnt_threshold)) {
+		dev_warn(adev->dev,
+			"Saved bad pages(%d) reaches threshold value(%d).\n",
+			control->num_recs + num, ras->bad_page_cnt_threshold);
+			control->tbl_hdr.header = EEPROM_TABLE_HDR_BAD;
+		sched_ras_recovery = true;
+	}
+
 	/* In case of overflow just start from beginning to not lose newest records */
 	if (write && (control->next_addr + EEPROM_TABLE_RECORD_SIZE * num > EEPROM_SIZE_BYTES))
 		control->next_addr = EEPROM_RECORD_START;
@@ -482,6 +553,20 @@ int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
 		__update_tbl_checksum(control, records, num, old_hdr_byte_sum);
 
 		__update_table_header(control, buffs);
+
+		if (sched_ras_recovery) {
+			/*
+			 * Before scheduling ras recovery, assert the related
+			 * flag first, which shall bypass common bad page
+			 * reservation execution in amdgpu_ras_reset_gpu.
+			 */
+			amdgpu_ras_get_context(adev)->flags |=
+				AMDGPU_RAS_FLAG_SKIP_BAD_PAGE_RESV;
+
+			dev_warn(adev->dev, "Conduct ras recovery due to bad "
+					"page threshold reached.\n");
+			amdgpu_ras_reset_gpu(adev);
+		}
 	} else if (!__validate_tbl_checksum(control, records, num)) {
 		DRM_WARN("EEPROM Table checksum mismatch!");
 		/* TODO Uncomment when EEPROM read/write is relliable */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
index b272840cb069..4ccd139248a9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
@@ -77,9 +77,13 @@ struct eeprom_table_record {
 	unsigned char mcumc_id;
 }__attribute__((__packed__));
 
-int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control);
+int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
+				bool *bad_gpu);
 int amdgpu_ras_eeprom_reset_table(struct amdgpu_ras_eeprom_control *control);
 
+int amdgpu_ras_eeprom_check_bad_gpu(struct amdgpu_ras_eeprom_control *control,
+				bool *bad_gpu);
+
 int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
 					    struct eeprom_table_record *records,
 					    bool write,
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/5] drm/amdgpu: restore ras flags when user resets eeprom
  2020-07-22  3:14 [PATCH 0/5] BAD GPU retirement policy by total bad pages Guchun Chen
                   ` (2 preceding siblings ...)
  2020-07-22  3:14 ` [PATCH 3/5] drm/amdgpu: conduct bad gpu check during bootup/reset Guchun Chen
@ 2020-07-22  3:14 ` Guchun Chen
  2020-07-22  3:14 ` [PATCH 5/5] drm/amdgpu: calculate actual size instead of hardcode size Guchun Chen
  4 siblings, 0 replies; 17+ messages in thread
From: Guchun Chen @ 2020-07-22  3:14 UTC (permalink / raw)
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	Stanley.Yang, Tao.Zhou1, John.Clements
  Cc: Guchun Chen

Ras flags needs to be cleaned as well when user requires
one clean eeprom.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 818d4154e4a3..964384b5fe4a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -371,6 +371,8 @@ static ssize_t amdgpu_ras_debugfs_eeprom_write(struct file *f, const char __user
 	struct amdgpu_device *adev = (struct amdgpu_device *)file_inode(f)->i_private;
 	int ret;
 
+	amdgpu_ras_get_context(adev)->flags = RAS_DEFAULT_FLAGS;
+
 	ret = amdgpu_ras_eeprom_reset_table(&adev->psp.ras.ras->eeprom_control);
 
 	return ret == 1 ? size : -EIO;
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 5/5] drm/amdgpu: calculate actual size instead of hardcode size
  2020-07-22  3:14 [PATCH 0/5] BAD GPU retirement policy by total bad pages Guchun Chen
                   ` (3 preceding siblings ...)
  2020-07-22  3:14 ` [PATCH 4/5] drm/amdgpu: restore ras flags when user resets eeprom Guchun Chen
@ 2020-07-22  3:14 ` Guchun Chen
  2020-07-22 14:26   ` Andrey Grodzovsky
  4 siblings, 1 reply; 17+ messages in thread
From: Guchun Chen @ 2020-07-22  3:14 UTC (permalink / raw)
  To: amd-gfx, alexander.deucher, Hawking.Zhang, Dennis.Li,
	Stanley.Yang, Tao.Zhou1, John.Clements
  Cc: Guchun Chen

Use sizeof to get actual size.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 96b63c026bad..a5da108e43c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -34,11 +34,9 @@
 /*
  * The 2 macros bellow represent the actual size in bytes that
  * those entities occupy in the EEPROM memory.
- * EEPROM_TABLE_RECORD_SIZE is different than sizeof(eeprom_table_record) which
- * uses uint64 to store 6b fields such as retired_page.
  */
-#define EEPROM_TABLE_HEADER_SIZE 20
-#define EEPROM_TABLE_RECORD_SIZE 24
+#define EEPROM_TABLE_HEADER_SIZE (sizeof(struct amdgpu_ras_eeprom_table_header))
+#define EEPROM_TABLE_RECORD_SIZE (sizeof(struct eeprom_table_record))
 
 #define EEPROM_ADDRESS_SIZE 0x2
 
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* RE: [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras
  2020-07-22  3:14 ` [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras Guchun Chen
@ 2020-07-22  7:51   ` Yang, Stanley
  2020-07-23  3:40     ` Chen, Guchun
  0 siblings, 1 reply; 17+ messages in thread
From: Yang, Stanley @ 2020-07-22  7:51 UTC (permalink / raw)
  To: Chen, Guchun, amd-gfx, Deucher, Alexander, Zhang, Hawking, Li,
	Dennis, Zhou1, Tao, Clements, John

[AMD Official Use Only - Internal Distribution Only]

Hi Guchun,

Please see my comment inline.

> -----Original Message-----
> From: Chen, Guchun <Guchun.Chen@amd.com>
> Sent: Wednesday, July 22, 2020 11:14 AM
> To: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> <Alexander.Deucher@amd.com>; Zhang, Hawking
> <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; Yang,
> Stanley <Stanley.Yang@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>;
> Clements, John <John.Clements@amd.com>
> Cc: Chen, Guchun <Guchun.Chen@amd.com>
> Subject: [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras
> 
> Bad page threshold value should be valid in the range between
> -1 and max records length of eeprom. It could determine when the GPU
> should be retired.
> 
> Signed-off-by: Guchun Chen <guchun.chen@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       | 43
> +++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       |  3 ++
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    |  5 +++
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h    |  2 +
>  4 files changed, 53 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 6f06e1214622..e3d67d85c55f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -69,6 +69,9 @@ const char *ras_block_string[] = {
>  /* inject address is 52 bits */
>  #define	RAS_UMC_INJECT_ADDR_LIMIT	(0x1ULL << 52)
> 
> +/* typical ECC bad page rate(1 bad page per 100MB VRAM) */
> +#define RAS_BAD_PAGE_RATE		(100 * 1024 * 1024ULL)
> +
>  enum amdgpu_ras_retire_page_reservation {
>  	AMDGPU_RAS_RETIRE_PAGE_RESERVED,
>  	AMDGPU_RAS_RETIRE_PAGE_PENDING,
> @@ -1700,6 +1703,42 @@ static bool amdgpu_ras_check_bad_page(struct
> amdgpu_device *adev,
>  	return ret;
>  }
> 
> +static void amdgpu_ras_validate_threshold(struct amdgpu_device *adev,
> +					uint32_t max_length)
> +{
> +	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +	int tmp_threshold = amdgpu_bad_page_threshold;
> +	u64 val;
> +
> +	/*
> +	 * Justification of value bad_page_cnt_threshold in ras structure
> +	 *
> +	 * 1. -1 <= amdgpu_bad_page_threshold <= max record length in
> eeprom
> +	 * 2. if amdgpu_bad_page_threshold = -1,
> +	 *    bad_page_cnt_threshold = typical value by formula.
> +	 * 3. if amdgpu_bad_page_threshold = 0,
> +	 *    bad_page_cnt_threshold = 0xFFFFFFFF,
> +	 *    and disable RMA feature accordingly.
> +	 * 4. use the value specified from user when
> (amdgpu_bad_page_threshold
> +	 *    > 0 && < max record length in eeprom).
> +	 */
> +
> +	if (tmp_threshold < -1)
> +		tmp_threshold = -1;
> +	else if (tmp_threshold > max_length)
> +		tmp_threshold = max_length;
> +
> +	if (tmp_threshold == -1) {
> +		val = adev->gmc.mc_vram_size;
> +		do_div(val, RAS_BAD_PAGE_RATE);
> +		con->bad_page_cnt_threshold = lower_32_bits(val);
[Yang, Stanley] : It's better to compare con->bad_page_cnt_threshold with max_length,
the value of bad_page_cnt_threshold should not exceed max_length.

> +	} else if (tmp_threshold == 0) {
> +		con->bad_page_cnt_threshold = 0xFFFFFFFF;
> +	} else {
> +		con->bad_page_cnt_threshold = tmp_threshold;
> +	}
> +}
> +
>  /* called in gpu recovery/init */
>  int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev)  { @@ -
> 1777,6 +1816,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device
> *adev)  {
>  	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>  	struct ras_err_handler_data **data;
> +	uint32_t max_eeprom_records_len = 0;
>  	int ret;
> 
>  	if (con)
> @@ -1795,6 +1835,9 @@ int amdgpu_ras_recovery_init(struct
> amdgpu_device *adev)
>  	atomic_set(&con->in_recovery, 0);
>  	con->adev = adev;
> 
> +	max_eeprom_records_len =
> amdgpu_ras_eeprom_get_record_max_length();
> +	amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
> +
>  	ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
>  	if (ret)
>  		goto free;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index b2667342cf67..4672649a9293 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -336,6 +336,9 @@ struct amdgpu_ras {
>  	struct amdgpu_ras_eeprom_control eeprom_control;
> 
>  	bool error_query_ready;
> +
> +	/* bad page count threshold */
> +	uint32_t bad_page_cnt_threshold;
>  };
> 
>  struct ras_fs_data {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index c0096097bbcf..a2c982b1eac6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -499,6 +499,11 @@ int amdgpu_ras_eeprom_process_recods(struct
> amdgpu_ras_eeprom_control *control,
>  	return ret == num ? 0 : -EIO;
>  }
> 
> +inline uint32_t amdgpu_ras_eeprom_get_record_max_length(void)
> +{
> +	return EEPROM_MAX_RECORD_NUM;
> +}
> +
>  /* Used for testing if bugs encountered */  #if 0  void
> amdgpu_ras_eeprom_test(struct amdgpu_ras_eeprom_control *control)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> index 7e8647a05df7..b272840cb069 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> @@ -85,6 +85,8 @@ int amdgpu_ras_eeprom_process_recods(struct
> amdgpu_ras_eeprom_control *control,
>  					    bool write,
>  					    int num);
> 
> +inline uint32_t amdgpu_ras_eeprom_get_record_max_length(void);
> +
>  void amdgpu_ras_eeprom_test(struct amdgpu_ras_eeprom_control
> *control);
> 
>  #endif // _AMDGPU_RAS_EEPROM_H
> --
> 2.17.1

Regards,
Stanley
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] drm/amdgpu: calculate actual size instead of hardcode size
  2020-07-22  3:14 ` [PATCH 5/5] drm/amdgpu: calculate actual size instead of hardcode size Guchun Chen
@ 2020-07-22 14:26   ` Andrey Grodzovsky
  2020-07-22 14:29     ` Chen, Guchun
  0 siblings, 1 reply; 17+ messages in thread
From: Andrey Grodzovsky @ 2020-07-22 14:26 UTC (permalink / raw)
  To: Guchun Chen, amd-gfx, alexander.deucher, Hawking.Zhang,
	Dennis.Li, Stanley.Yang, Tao.Zhou1, John.Clements


On 7/21/20 11:14 PM, Guchun Chen wrote:
> Use sizeof to get actual size.
>
> Signed-off-by: Guchun Chen <guchun.chen@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++----
>   1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 96b63c026bad..a5da108e43c3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -34,11 +34,9 @@
>   /*
>    * The 2 macros bellow represent the actual size in bytes that
>    * those entities occupy in the EEPROM memory.
> - * EEPROM_TABLE_RECORD_SIZE is different than sizeof(eeprom_table_record) which
> - * uses uint64 to store 6b fields such as retired_page.


Did you find out the comment above was wrong ?

Andrey


>    */
> -#define EEPROM_TABLE_HEADER_SIZE 20
> -#define EEPROM_TABLE_RECORD_SIZE 24
> +#define EEPROM_TABLE_HEADER_SIZE (sizeof(struct amdgpu_ras_eeprom_table_header))
> +#define EEPROM_TABLE_RECORD_SIZE (sizeof(struct eeprom_table_record))
>   
>   #define EEPROM_ADDRESS_SIZE 0x2
>   
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH 5/5] drm/amdgpu: calculate actual size instead of hardcode size
  2020-07-22 14:26   ` Andrey Grodzovsky
@ 2020-07-22 14:29     ` Chen, Guchun
  0 siblings, 0 replies; 17+ messages in thread
From: Chen, Guchun @ 2020-07-22 14:29 UTC (permalink / raw)
  To: Grodzovsky, Andrey, amd-gfx, Deucher, Alexander, Zhang, Hawking,
	Li, Dennis, Yang, Stanley, Zhou1, Tao, Clements, John

[AMD Public Use]

Hi Andrey,

Aha, thanks for your reminding, I ignore that comment. Let me update it later.

Regards,
Guchun

-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com> 
Sent: Wednesday, July 22, 2020 10:26 PM
To: Chen, Guchun <Guchun.Chen@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John <John.Clements@amd.com>
Subject: Re: [PATCH 5/5] drm/amdgpu: calculate actual size instead of hardcode size


On 7/21/20 11:14 PM, Guchun Chen wrote:
> Use sizeof to get actual size.
>
> Signed-off-by: Guchun Chen <guchun.chen@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++----
>   1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 96b63c026bad..a5da108e43c3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -34,11 +34,9 @@
>   /*
>    * The 2 macros bellow represent the actual size in bytes that
>    * those entities occupy in the EEPROM memory.
> - * EEPROM_TABLE_RECORD_SIZE is different than 
> sizeof(eeprom_table_record) which
> - * uses uint64 to store 6b fields such as retired_page.


Did you find out the comment above was wrong ?

Andrey


>    */
> -#define EEPROM_TABLE_HEADER_SIZE 20
> -#define EEPROM_TABLE_RECORD_SIZE 24
> +#define EEPROM_TABLE_HEADER_SIZE (sizeof(struct 
> +amdgpu_ras_eeprom_table_header)) #define EEPROM_TABLE_RECORD_SIZE 
> +(sizeof(struct eeprom_table_record))
>   
>   #define EEPROM_ADDRESS_SIZE 0x2
>   
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
  2020-07-22  3:14 ` [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter Guchun Chen
@ 2020-07-23  0:31   ` Li, Dennis
  2020-07-23  3:47     ` Chen, Guchun
  0 siblings, 1 reply; 17+ messages in thread
From: Li, Dennis @ 2020-07-23  0:31 UTC (permalink / raw)
  To: Chen, Guchun, amd-gfx, Deucher, Alexander, Zhang, Hawking, Yang,
	 Stanley, Zhou1, Tao, Clements,  John

[AMD Official Use Only - Internal Distribution Only]

Hi, Guchun,
      It is better to let user be able to change amdgpu_bad_page_threshold with sysfs, so that users no need to reboot system when they want to change their strategy.  

Best Regards
Dennis Li
-----Original Message-----
From: Chen, Guchun <Guchun.Chen@amd.com> 
Sent: Wednesday, July 22, 2020 11:14 AM
To: amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John <John.Clements@amd.com>
Cc: Chen, Guchun <Guchun.Chen@amd.com>
Subject: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter

bad_page_threshold could be specified to detect and retire bad GPU if faulty bad pages exceed it.

When it's -1, ras will use typical bad page failure value.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 06bfb8658dec..bb83ffb5e26a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -181,6 +181,7 @@ extern uint amdgpu_dm_abm_level;  extern struct amdgpu_mgpu_info mgpu_info;  extern int amdgpu_ras_enable;  extern uint amdgpu_ras_mask;
+extern int amdgpu_bad_page_threshold;
 extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
 extern int amdgpu_discovery;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index d28b95f721c4..f99671101746 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -161,6 +161,7 @@ struct amdgpu_mgpu_info mgpu_info = {  };  int amdgpu_ras_enable = -1;  uint amdgpu_ras_mask = 0xffffffff;
+int amdgpu_bad_page_threshold = -1;
 
 /**
  * DOC: vramlimit (int)
@@ -801,6 +802,16 @@ module_param_named(tmz, amdgpu_tmz, int, 0444);  MODULE_PARM_DESC(reset_method, "GPU reset method (-1 = auto (default), 0 = legacy, 1 = mode0, 2 = mode1, 3 = mode2, 4 = baco)");  module_param_named(reset_method, amdgpu_reset_method, int, 0444);
 
+/**
+ * DOC: bad_page_threshold (int)
+ * Bad page threshold configuration is driven by RMA(Return Merchandise
+ * Authorization) policy, which is to specify the threshold value of 
+faulty
+ * pages detected by ECC, which may result in GPU's retirement if total
+ * faulty pages by ECC exceed threshold value.
+ */
+MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = 
+auto(default typical value))"); module_param_named(bad_page_threshold, 
+amdgpu_bad_page_threshold, int, 0444);
+
 static const struct pci_device_id pciidlist[] = {  #ifdef  CONFIG_DRM_AMDGPU_SI
 	{0x1002, 0x6780, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_TAHITI},
--
2.17.1
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* RE: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during bootup/reset
  2020-07-22  3:14 ` [PATCH 3/5] drm/amdgpu: conduct bad gpu check during bootup/reset Guchun Chen
@ 2020-07-23  2:51   ` Zhou1, Tao
  2020-07-23  3:38     ` Chen, Guchun
  0 siblings, 1 reply; 17+ messages in thread
From: Zhou1, Tao @ 2020-07-23  2:51 UTC (permalink / raw)
  To: Chen, Guchun, amd-gfx, Deucher, Alexander, Zhang, Hawking, Li,
	Dennis, Yang, Stanley, Clements, John

[AMD Official Use Only - Internal Distribution Only]

> -----Original Message-----
> From: Chen, Guchun <Guchun.Chen@amd.com>
> Sent: Wednesday, July 22, 2020 11:14 AM
> To: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>;
> Li, Dennis <Dennis.Li@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>;
> Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John
> <John.Clements@amd.com>
> Cc: Chen, Guchun <Guchun.Chen@amd.com>
> Subject: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during bootup/reset
>
> During boot up, when init eeprom, RAS will check if the BAD GPU tag is saved or
> not. And will break boot up and notify user to RMA it.
>
> At the meanwhile, when saved bad page count to eeprom exceeds threshold,
> one ras recovery will be triggered immediately, and bad gpu check will be
> executed and reset will fail as well to remind user.
>
> User could set bad_page_threshold = 0 as module parameter when probing
> driver to disable the bad feature check.
>
> Signed-off-by: Guchun Chen <guchun.chen@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 21 ++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       | 25 +++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       | 16 +++-
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    | 87 ++++++++++++++++++-
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h    |  6 +-
>  5 files changed, 142 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 2662cd7c8685..d529bf7917f5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2065,7 +2065,9 @@ static int amdgpu_device_ip_init(struct
> amdgpu_device *adev)
>   * Note: theoretically, this should be called before all vram allocations
>   * to protect retired page from abusing
>   */

[Tao] The comment above should be also updated

> -amdgpu_ras_recovery_init(adev);
> +r = amdgpu_ras_recovery_init(adev);
> +if (r)
> +goto init_failed;

[Tao] Are you sure "r != 0" equals to "bad gpu"? Other errors of recovery_init should not block gpu init.

>
>  if (adev->gmc.xgmi.num_physical_nodes > 1)
>  amdgpu_xgmi_add_device(adev);
> @@ -4133,8 +4135,20 @@ static int amdgpu_do_asic_reset(struct
> amdgpu_hive_info *hive,
>
>  amdgpu_fbdev_set_suspend(tmp_adev, 0);
>
> -/* must succeed. */
> -amdgpu_ras_resume(tmp_adev);
> +/*
> + * The CPU is BAD once faulty pages by ECC has
> + * reached the threshold, and ras recovery is
> + * scheduled. So add one check here to prevent
> + * recovery if it's one BAD GPU, and remind
> + * user to RMA this GPU.
> + */
> +if (!amdgpu_ras_check_bad_gpu(tmp_adev)) {
> +/* must succeed. */
> +amdgpu_ras_resume(tmp_adev);
> +} else {
> +r = -EINVAL;
> +goto out;
> +}
>
>  /* Update PSP FW topology after reset */
>  if (hive && tmp_adev-
> >gmc.xgmi.num_physical_nodes > 1) @@ -4142,7 +4156,6 @@ static int
> amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
>  }
>  }
>
> -
>  out:
>  if (!r) {
>  amdgpu_irq_gpu_reset_resume_helper(tmp_adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index e3d67d85c55f..818d4154e4a3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -62,8 +62,6 @@ const char *ras_block_string[] = {  #define ras_err_str(i)
> (ras_error_string[ffs(i)])  #define ras_block_str(i) (ras_block_string[i])
>
> -#define AMDGPU_RAS_FLAG_INIT_BY_VBIOS1
> -#define AMDGPU_RAS_FLAG_INIT_NEED_RESET2
>  #define RAS_DEFAULT_FLAGS (AMDGPU_RAS_FLAG_INIT_BY_VBIOS)
>
>  /* inject address is 52 bits */
> @@ -1817,6 +1815,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device
> *adev)
>  struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>  struct ras_err_handler_data **data;
>  uint32_t max_eeprom_records_len = 0;
> +bool bad_gpu = false;
>  int ret;
>
>  if (con)
> @@ -1838,9 +1837,14 @@ int amdgpu_ras_recovery_init(struct amdgpu_device
> *adev)
>  max_eeprom_records_len =
> amdgpu_ras_eeprom_get_record_max_length();
>  amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
>
> -ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
> -if (ret)
> +ret = amdgpu_ras_eeprom_init(&con->eeprom_control, &bad_gpu);
> +/*
> + * we only fail this calling and halt booting when bad_gpu is true.
> + */
> +if (bad_gpu) {
> +ret = -EINVAL;
>  goto free;
> +}
>
>  if (con->eeprom_control.num_recs) {
>  ret = amdgpu_ras_load_bad_pages(adev); @@ -2189,3
> +2193,16 @@ bool amdgpu_ras_need_emergency_restart(struct
> amdgpu_device *adev)
>
>  return false;
>  }
> +
> +bool amdgpu_ras_check_bad_gpu(struct amdgpu_device *adev) {
> +struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +bool bad_gpu = false;
> +
> +if (con && (con->bad_page_cnt_threshold != 0xFFFFFFFF))
> +amdgpu_ras_eeprom_check_bad_gpu(&con->eeprom_control,
> +&bad_gpu);
> +
> +/* We are only interested in variable bad_gpu. */
> +return bad_gpu;
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 4672649a9293..d7a363b37feb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -31,6 +31,10 @@
>  #include "ta_ras_if.h"
>  #include "amdgpu_ras_eeprom.h"
>
> +#define AMDGPU_RAS_FLAG_INIT_BY_VBIOS(0x1 << 0)
> +#define AMDGPU_RAS_FLAG_INIT_NEED_RESET(0x1 << 1)
> +#define AMDGPU_RAS_FLAG_SKIP_BAD_PAGE_RESV(0x1 << 2)
> +
>  enum amdgpu_ras_block {
>  AMDGPU_RAS_BLOCK__UMC = 0,
>  AMDGPU_RAS_BLOCK__SDMA,
> @@ -493,6 +497,8 @@ void amdgpu_ras_suspend(struct amdgpu_device
> *adev);  unsigned long amdgpu_ras_query_error_count(struct amdgpu_device
> *adev,
>  bool is_ce);
>
> +bool amdgpu_ras_check_bad_gpu(struct amdgpu_device *adev);
> +
>  /* error handling functions */
>  int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>  struct eeprom_table_record *bps, int pages); @@ -503,10
> +509,14 @@ static inline int amdgpu_ras_reset_gpu(struct amdgpu_device
> *adev)  {
>  struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
>
> -/* save bad page to eeprom before gpu reset,
> - * i2c may be unstable in gpu reset
> +/*
> + * Save bad page to eeprom before gpu reset, i2c may be unstable
> + * in gpu reset.
> + *
> + * Also, exclude the case when ras recovery issuer is
> + * eerprom page write.
>   */
> -if (in_task())
> +if (!(ras->flags & AMDGPU_RAS_FLAG_SKIP_BAD_PAGE_RESV) &&
> in_task())
>  amdgpu_ras_reserve_bad_pages(adev);
>
>  if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0) diff --git
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index a2c982b1eac6..96b63c026bad 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c

[Tao] It's better to split it into two patches, one is for ras and another one is for eeprom

> @@ -46,6 +46,9 @@
>  #define EEPROM_TABLE_HDR_VAL 0x414d4452  #define EEPROM_TABLE_VER
> 0x00010000
>
> +/* Bad GPU tag ‘BADG’ */
> +#define EEPROM_TABLE_HDR_BAD 0x42414447
> +
>  /* Assume 2 Mbit size */
>  #define EEPROM_SIZE_BYTES 256000
>  #define EEPROM_PAGE__SIZE_BYTES 256
> @@ -238,12 +241,14 @@ int amdgpu_ras_eeprom_reset_table(struct
> amdgpu_ras_eeprom_control *control)
>
>  }
>
> -int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control)
> +int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
> +bool *bad_gpu)
>  {
>  int ret = 0;
>  struct amdgpu_device *adev = to_amdgpu_device(control);
>  unsigned char buff[EEPROM_ADDRESS_SIZE +
> EEPROM_TABLE_HEADER_SIZE] = { 0 };
>  struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr;
> +struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
>  struct i2c_msg msg = {
>  .addr= 0,
>  .flags= I2C_M_RD,
> @@ -251,6 +256,8 @@ int amdgpu_ras_eeprom_init(struct
> amdgpu_ras_eeprom_control *control)
>  .buf= buff,
>  };
>
> +*bad_gpu = false;
> +
>  /* Verify i2c adapter is initialized */
>  if (!adev->pm.smu_i2c.algo)
>  return -ENOENT;
> @@ -279,6 +286,11 @@ int amdgpu_ras_eeprom_init(struct
> amdgpu_ras_eeprom_control *control)
>  DRM_DEBUG_DRIVER("Found existing EEPROM table with %d
> records",
>   control->num_recs);
>
> +} else if ((ras->bad_page_cnt_threshold != 0xFFFFFFFF) && (
> +hdr->header == EEPROM_TABLE_HDR_BAD)) {
> +*bad_gpu = true;
> +DRM_ERROR("Detect BAD GPU TAG in eeprom table and "
> +"GPU shall be RMAed.\n");
>  } else {
>  DRM_INFO("Creating new EEPROM table");
>
> @@ -375,6 +387,44 @@ static uint32_t
> __correct_eeprom_dest_address(uint32_t curr_address)
>  return curr_address;
>  }
>
> +int amdgpu_ras_eeprom_check_bad_gpu(struct amdgpu_ras_eeprom_control
> *control,
> +bool *bad_gpu)
> +{
> +struct amdgpu_device *adev = to_amdgpu_device(control);
> +unsigned char buff[EEPROM_ADDRESS_SIZE +
> +EEPROM_TABLE_HEADER_SIZE] = { 0 };
> +struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr;
> +struct i2c_msg msg = {
> +.addr = control->i2c_address,
> +.flags = I2C_M_RD,
> +.len = EEPROM_ADDRESS_SIZE +
> EEPROM_TABLE_HEADER_SIZE,
> +.buf = buff,
> +};
> +int ret;
> +
> +*bad_gpu = false;
> +
> +/* read EEPROM table header */
> +mutex_lock(&control->tbl_mutex);
> +ret = i2c_transfer(&adev->pm.smu_i2c, &msg, 1);
> +if (ret < 1) {
> +dev_err(adev->dev, "Failed to read EEPROM table header.\n");
> +goto err;

[Tao] One tab is missed

> +}
> +
> +__decode_table_header_from_buff(hdr, &buff[2]);
> +
> +if (hdr->header == EEPROM_TABLE_HDR_BAD) {
> +dev_warn(adev->dev, "Current GPU is BAD and should be
> RMAed.\n");
> +*bad_gpu = true;
> +}
> +
> +ret = 0;
> +err:
> +mutex_unlock(&control->tbl_mutex);
> +return ret;
> +}
> +
>  int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control
> *control,
>      struct eeprom_table_record
> *records,
>      bool write,
> @@ -383,8 +433,10 @@ int amdgpu_ras_eeprom_process_recods(struct
> amdgpu_ras_eeprom_control *control,
>  int i, ret = 0;
>  struct i2c_msg *msgs, *msg;
>  unsigned char *buffs, *buff;
> +bool sched_ras_recovery = false;
>  struct eeprom_table_record *record;
>  struct amdgpu_device *adev = to_amdgpu_device(control);
> +struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
>
>  if (adev->asic_type != CHIP_VEGA20 && adev->asic_type !=
> CHIP_ARCTURUS)
>  return 0;
> @@ -402,6 +454,25 @@ int amdgpu_ras_eeprom_process_recods(struct
> amdgpu_ras_eeprom_control *control,
>  goto free_buff;
>  }
>
> +/*
> + * If saved bad pages number exceeds the bad page threshod for
> + * the whole VRAM, update table header to mark one BAD GPU and
> + * schedule one ras recovery after eeprom write is done, this
> + * can avoid the missing for latest records.
> + *
> + * This new header will be picked up and checked in the bootup by
> + * ras recovery, which may break bootup process to notify user this
> + * GPU is bad and to RMA(Return Merchandise Authorization) this GPU.
> + */
> +if (write && (ras->bad_page_cnt_threshold != 0xFFFFFFFF) &&
> +((control->num_recs + num) >= ras->bad_page_cnt_threshold)) {
> +dev_warn(adev->dev,
> +"Saved bad pages(%d) reaches threshold value(%d).\n",
> +control->num_recs + num, ras-
> >bad_page_cnt_threshold);
> +control->tbl_hdr.header = EEPROM_TABLE_HDR_BAD;
> +sched_ras_recovery = true;
> +}
> +
>  /* In case of overflow just start from beginning to not lose newest
> records */
>  if (write && (control->next_addr + EEPROM_TABLE_RECORD_SIZE *
> num > EEPROM_SIZE_BYTES))
>  control->next_addr = EEPROM_RECORD_START; @@ -482,6
> +553,20 @@ int amdgpu_ras_eeprom_process_recods(struct
> amdgpu_ras_eeprom_control *control,
>  __update_tbl_checksum(control, records, num,
> old_hdr_byte_sum);
>
>  __update_table_header(control, buffs);
> +
> +if (sched_ras_recovery) {
> +/*
> + * Before scheduling ras recovery, assert the related
> + * flag first, which shall bypass common bad page
> + * reservation execution in amdgpu_ras_reset_gpu.
> + */
> +amdgpu_ras_get_context(adev)->flags |=
> +AMDGPU_RAS_FLAG_SKIP_BAD_PAGE_RESV;
> +
> +dev_warn(adev->dev, "Conduct ras recovery due to bad
> "
> +"page threshold reached.\n");
> +amdgpu_ras_reset_gpu(adev);
> +}
>  } else if (!__validate_tbl_checksum(control, records, num)) {
>  DRM_WARN("EEPROM Table checksum mismatch!");
>  /* TODO Uncomment when EEPROM read/write is relliable */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> index b272840cb069..4ccd139248a9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> @@ -77,9 +77,13 @@ struct eeprom_table_record {
>  unsigned char mcumc_id;
>  }__attribute__((__packed__));
>
> -int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control);
> +int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
> +bool *bad_gpu);
>  int amdgpu_ras_eeprom_reset_table(struct amdgpu_ras_eeprom_control
> *control);
>
> +int amdgpu_ras_eeprom_check_bad_gpu(struct amdgpu_ras_eeprom_control
> *control,
> +bool *bad_gpu);
> +
>  int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control
> *control,
>      struct eeprom_table_record
> *records,
>      bool write,
> --
> 2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during bootup/reset
  2020-07-23  2:51   ` Zhou1, Tao
@ 2020-07-23  3:38     ` Chen, Guchun
  2020-07-23  4:03       ` Zhou1, Tao
  0 siblings, 1 reply; 17+ messages in thread
From: Chen, Guchun @ 2020-07-23  3:38 UTC (permalink / raw)
  To: Zhou1, Tao, amd-gfx, Deucher, Alexander, Zhang, Hawking, Li,
	Dennis, Yang, Stanley, Clements, John

[AMD Public Use]

Thanks for your review, Tao.
Please check my comments after yours.

Regards,
Guchun

-----Original Message-----
From: Zhou1, Tao <Tao.Zhou1@amd.com> 
Sent: Thursday, July 23, 2020 10:51 AM
To: Chen, Guchun <Guchun.Chen@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; Clements, John <John.Clements@amd.com>
Subject: RE: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during bootup/reset

[AMD Official Use Only - Internal Distribution Only]

> -----Original Message-----
> From: Chen, Guchun <Guchun.Chen@amd.com>
> Sent: Wednesday, July 22, 2020 11:14 AM
> To: amd-gfx@lists.freedesktop.org; Deucher, Alexander 
> <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; 
> Li, Dennis <Dennis.Li@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; 
> Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John <John.Clements@amd.com>
> Cc: Chen, Guchun <Guchun.Chen@amd.com>
> Subject: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during 
> bootup/reset
>
> During boot up, when init eeprom, RAS will check if the BAD GPU tag is 
> saved or not. And will break boot up and notify user to RMA it.
>
> At the meanwhile, when saved bad page count to eeprom exceeds 
> threshold, one ras recovery will be triggered immediately, and bad gpu 
> check will be executed and reset will fail as well to remind user.
>
> User could set bad_page_threshold = 0 as module parameter when probing 
> driver to disable the bad feature check.
>
> Signed-off-by: Guchun Chen <guchun.chen@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 21 ++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       | 25 +++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       | 16 +++-
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    | 87 ++++++++++++++++++-
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h    |  6 +-
>  5 files changed, 142 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 2662cd7c8685..d529bf7917f5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2065,7 +2065,9 @@ static int amdgpu_device_ip_init(struct 
> amdgpu_device *adev)
>   * Note: theoretically, this should be called before all vram allocations
>   * to protect retired page from abusing
>   */

[Tao] The comment above should be also updated
[Guchun]yeah, will update it later.

> -amdgpu_ras_recovery_init(adev);
> +r = amdgpu_ras_recovery_init(adev);
> +if (r)
> +goto init_failed;

[Tao] Are you sure "r != 0" equals to "bad gpu"? Other errors of recovery_init should not block gpu init.
[Guchun]Good point. This should be addressed carefully.
>
>  if (adev->gmc.xgmi.num_physical_nodes > 1)  
> amdgpu_xgmi_add_device(adev); @@ -4133,8 +4135,20 @@ static int 
> amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
>
>  amdgpu_fbdev_set_suspend(tmp_adev, 0);
>
> -/* must succeed. */
> -amdgpu_ras_resume(tmp_adev);
> +/*
> + * The CPU is BAD once faulty pages by ECC has
> + * reached the threshold, and ras recovery is
> + * scheduled. So add one check here to prevent
> + * recovery if it's one BAD GPU, and remind
> + * user to RMA this GPU.
> + */
> +if (!amdgpu_ras_check_bad_gpu(tmp_adev)) {
> +/* must succeed. */
> +amdgpu_ras_resume(tmp_adev);
> +} else {
> +r = -EINVAL;
> +goto out;
> +}
>
>  /* Update PSP FW topology after reset */  if (hive && tmp_adev-
> >gmc.xgmi.num_physical_nodes > 1) @@ -4142,7 +4156,6 @@ static int
> amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,  }  }
>
> -
>  out:
>  if (!r) {
>  amdgpu_irq_gpu_reset_resume_helper(tmp_adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index e3d67d85c55f..818d4154e4a3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -62,8 +62,6 @@ const char *ras_block_string[] = {  #define ras_err_str(i)
> (ras_error_string[ffs(i)])  #define ras_block_str(i) (ras_block_string[i])
>
> -#define AMDGPU_RAS_FLAG_INIT_BY_VBIOS1
> -#define AMDGPU_RAS_FLAG_INIT_NEED_RESET2
>  #define RAS_DEFAULT_FLAGS (AMDGPU_RAS_FLAG_INIT_BY_VBIOS)
>
>  /* inject address is 52 bits */
> @@ -1817,6 +1815,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device
> *adev)
>  struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>  struct ras_err_handler_data **data;
>  uint32_t max_eeprom_records_len = 0;
> +bool bad_gpu = false;
>  int ret;
>
>  if (con)
> @@ -1838,9 +1837,14 @@ int amdgpu_ras_recovery_init(struct amdgpu_device
> *adev)
>  max_eeprom_records_len =
> amdgpu_ras_eeprom_get_record_max_length();
>  amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
>
> -ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
> -if (ret)
> +ret = amdgpu_ras_eeprom_init(&con->eeprom_control, &bad_gpu);
> +/*
> + * we only fail this calling and halt booting when bad_gpu is true.
> + */
> +if (bad_gpu) {
> +ret = -EINVAL;
>  goto free;
> +}
>
>  if (con->eeprom_control.num_recs) {
>  ret = amdgpu_ras_load_bad_pages(adev); @@ -2189,3
> +2193,16 @@ bool amdgpu_ras_need_emergency_restart(struct
> amdgpu_device *adev)
>
>  return false;
>  }
> +
> +bool amdgpu_ras_check_bad_gpu(struct amdgpu_device *adev) {
> +struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +bool bad_gpu = false;
> +
> +if (con && (con->bad_page_cnt_threshold != 0xFFFFFFFF))
> +amdgpu_ras_eeprom_check_bad_gpu(&con->eeprom_control,
> +&bad_gpu);
> +
> +/* We are only interested in variable bad_gpu. */
> +return bad_gpu;
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 4672649a9293..d7a363b37feb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -31,6 +31,10 @@
>  #include "ta_ras_if.h"
>  #include "amdgpu_ras_eeprom.h"
>
> +#define AMDGPU_RAS_FLAG_INIT_BY_VBIOS(0x1 << 0)
> +#define AMDGPU_RAS_FLAG_INIT_NEED_RESET(0x1 << 1)
> +#define AMDGPU_RAS_FLAG_SKIP_BAD_PAGE_RESV(0x1 << 2)
> +
>  enum amdgpu_ras_block {
>  AMDGPU_RAS_BLOCK__UMC = 0,
>  AMDGPU_RAS_BLOCK__SDMA,
> @@ -493,6 +497,8 @@ void amdgpu_ras_suspend(struct amdgpu_device
> *adev);  unsigned long amdgpu_ras_query_error_count(struct amdgpu_device
> *adev,
>  bool is_ce);
>
> +bool amdgpu_ras_check_bad_gpu(struct amdgpu_device *adev);
> +
>  /* error handling functions */
>  int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>  struct eeprom_table_record *bps, int pages); @@ -503,10
> +509,14 @@ static inline int amdgpu_ras_reset_gpu(struct amdgpu_device
> *adev)  {
>  struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
>
> -/* save bad page to eeprom before gpu reset,
> - * i2c may be unstable in gpu reset
> +/*
> + * Save bad page to eeprom before gpu reset, i2c may be unstable
> + * in gpu reset.
> + *
> + * Also, exclude the case when ras recovery issuer is
> + * eerprom page write.
>   */
> -if (in_task())
> +if (!(ras->flags & AMDGPU_RAS_FLAG_SKIP_BAD_PAGE_RESV) &&
> in_task())
>  amdgpu_ras_reserve_bad_pages(adev);
>
>  if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0) diff --git
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index a2c982b1eac6..96b63c026bad 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c

[Tao] It's better to split it into two patches, one is for ras and another one is for eeprom
[Guchun]I tried this, however, even if for ras change, eeprom change is needed as well. So I merged them both into one patch.

> @@ -46,6 +46,9 @@
>  #define EEPROM_TABLE_HDR_VAL 0x414d4452  #define EEPROM_TABLE_VER
> 0x00010000
>
> +/* Bad GPU tag ‘BADG’ */
> +#define EEPROM_TABLE_HDR_BAD 0x42414447
> +
>  /* Assume 2 Mbit size */
>  #define EEPROM_SIZE_BYTES 256000
>  #define EEPROM_PAGE__SIZE_BYTES 256
> @@ -238,12 +241,14 @@ int amdgpu_ras_eeprom_reset_table(struct
> amdgpu_ras_eeprom_control *control)
>
>  }
>
> -int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control)
> +int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
> +bool *bad_gpu)
>  {
>  int ret = 0;
>  struct amdgpu_device *adev = to_amdgpu_device(control);
>  unsigned char buff[EEPROM_ADDRESS_SIZE +
> EEPROM_TABLE_HEADER_SIZE] = { 0 };
>  struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr;
> +struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
>  struct i2c_msg msg = {
>  .addr= 0,
>  .flags= I2C_M_RD,
> @@ -251,6 +256,8 @@ int amdgpu_ras_eeprom_init(struct
> amdgpu_ras_eeprom_control *control)
>  .buf= buff,
>  };
>
> +*bad_gpu = false;
> +
>  /* Verify i2c adapter is initialized */
>  if (!adev->pm.smu_i2c.algo)
>  return -ENOENT;
> @@ -279,6 +286,11 @@ int amdgpu_ras_eeprom_init(struct
> amdgpu_ras_eeprom_control *control)
>  DRM_DEBUG_DRIVER("Found existing EEPROM table with %d
> records",
>   control->num_recs);
>
> +} else if ((ras->bad_page_cnt_threshold != 0xFFFFFFFF) && (
> +hdr->header == EEPROM_TABLE_HDR_BAD)) {
> +*bad_gpu = true;
> +DRM_ERROR("Detect BAD GPU TAG in eeprom table and "
> +"GPU shall be RMAed.\n");
>  } else {
>  DRM_INFO("Creating new EEPROM table");
>
> @@ -375,6 +387,44 @@ static uint32_t
> __correct_eeprom_dest_address(uint32_t curr_address)
>  return curr_address;
>  }
>
> +int amdgpu_ras_eeprom_check_bad_gpu(struct amdgpu_ras_eeprom_control
> *control,
> +bool *bad_gpu)
> +{
> +struct amdgpu_device *adev = to_amdgpu_device(control);
> +unsigned char buff[EEPROM_ADDRESS_SIZE +
> +EEPROM_TABLE_HEADER_SIZE] = { 0 };
> +struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr;
> +struct i2c_msg msg = {
> +.addr = control->i2c_address,
> +.flags = I2C_M_RD,
> +.len = EEPROM_ADDRESS_SIZE +
> EEPROM_TABLE_HEADER_SIZE,
> +.buf = buff,
> +};
> +int ret;
> +
> +*bad_gpu = false;
> +
> +/* read EEPROM table header */
> +mutex_lock(&control->tbl_mutex);
> +ret = i2c_transfer(&adev->pm.smu_i2c, &msg, 1);
> +if (ret < 1) {
> +dev_err(adev->dev, "Failed to read EEPROM table header.\n");
> +goto err;

[Tao] One tab is missed
[Guchun]Correct this later.

> +}
> +
> +__decode_table_header_from_buff(hdr, &buff[2]);
> +
> +if (hdr->header == EEPROM_TABLE_HDR_BAD) {
> +dev_warn(adev->dev, "Current GPU is BAD and should be
> RMAed.\n");
> +*bad_gpu = true;
> +}
> +
> +ret = 0;
> +err:
> +mutex_unlock(&control->tbl_mutex);
> +return ret;
> +}
> +
>  int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control
> *control,
>      struct eeprom_table_record
> *records,
>      bool write,
> @@ -383,8 +433,10 @@ int amdgpu_ras_eeprom_process_recods(struct
> amdgpu_ras_eeprom_control *control,
>  int i, ret = 0;
>  struct i2c_msg *msgs, *msg;
>  unsigned char *buffs, *buff;
> +bool sched_ras_recovery = false;
>  struct eeprom_table_record *record;
>  struct amdgpu_device *adev = to_amdgpu_device(control);
> +struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
>
>  if (adev->asic_type != CHIP_VEGA20 && adev->asic_type !=
> CHIP_ARCTURUS)
>  return 0;
> @@ -402,6 +454,25 @@ int amdgpu_ras_eeprom_process_recods(struct
> amdgpu_ras_eeprom_control *control,
>  goto free_buff;
>  }
>
> +/*
> + * If saved bad pages number exceeds the bad page threshod for
> + * the whole VRAM, update table header to mark one BAD GPU and
> + * schedule one ras recovery after eeprom write is done, this
> + * can avoid the missing for latest records.
> + *
> + * This new header will be picked up and checked in the bootup by
> + * ras recovery, which may break bootup process to notify user this
> + * GPU is bad and to RMA(Return Merchandise Authorization) this GPU.
> + */
> +if (write && (ras->bad_page_cnt_threshold != 0xFFFFFFFF) &&
> +((control->num_recs + num) >= ras->bad_page_cnt_threshold)) {
> +dev_warn(adev->dev,
> +"Saved bad pages(%d) reaches threshold value(%d).\n",
> +control->num_recs + num, ras-
> >bad_page_cnt_threshold);
> +control->tbl_hdr.header = EEPROM_TABLE_HDR_BAD;
> +sched_ras_recovery = true;
> +}
> +
>  /* In case of overflow just start from beginning to not lose newest
> records */
>  if (write && (control->next_addr + EEPROM_TABLE_RECORD_SIZE *
> num > EEPROM_SIZE_BYTES))
>  control->next_addr = EEPROM_RECORD_START; @@ -482,6
> +553,20 @@ int amdgpu_ras_eeprom_process_recods(struct
> amdgpu_ras_eeprom_control *control,
>  __update_tbl_checksum(control, records, num,
> old_hdr_byte_sum);
>
>  __update_table_header(control, buffs);
> +
> +if (sched_ras_recovery) {
> +/*
> + * Before scheduling ras recovery, assert the related
> + * flag first, which shall bypass common bad page
> + * reservation execution in amdgpu_ras_reset_gpu.
> + */
> +amdgpu_ras_get_context(adev)->flags |=
> +AMDGPU_RAS_FLAG_SKIP_BAD_PAGE_RESV;
> +
> +dev_warn(adev->dev, "Conduct ras recovery due to bad
> "
> +"page threshold reached.\n");
> +amdgpu_ras_reset_gpu(adev);
> +}
>  } else if (!__validate_tbl_checksum(control, records, num)) {
>  DRM_WARN("EEPROM Table checksum mismatch!");
>  /* TODO Uncomment when EEPROM read/write is relliable */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> index b272840cb069..4ccd139248a9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> @@ -77,9 +77,13 @@ struct eeprom_table_record {
>  unsigned char mcumc_id;
>  }__attribute__((__packed__));
>
> -int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control);
> +int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
> +bool *bad_gpu);
>  int amdgpu_ras_eeprom_reset_table(struct amdgpu_ras_eeprom_control
> *control);
>
> +int amdgpu_ras_eeprom_check_bad_gpu(struct amdgpu_ras_eeprom_control
> *control,
> +bool *bad_gpu);
> +
>  int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control
> *control,
>      struct eeprom_table_record
> *records,
>      bool write,
> --
> 2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras
  2020-07-22  7:51   ` Yang, Stanley
@ 2020-07-23  3:40     ` Chen, Guchun
  0 siblings, 0 replies; 17+ messages in thread
From: Chen, Guchun @ 2020-07-23  3:40 UTC (permalink / raw)
  To: Yang, Stanley, amd-gfx, Deucher, Alexander, Zhang, Hawking, Li,
	Dennis, Zhou1, Tao, Clements, John

[AMD Public Use]

Thanks for review, Stanley.

Re: [Yang, Stanley] : It's better to compare con->bad_page_cnt_threshold with max_length, the value of bad_page_cnt_threshold should not exceed max_length.

Correct, one guard is necessary. It will be patch v2.

Regards,
Guchun

-----Original Message-----
From: Yang, Stanley <Stanley.Yang@amd.com> 
Sent: Wednesday, July 22, 2020 3:52 PM
To: Chen, Guchun <Guchun.Chen@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John <John.Clements@amd.com>
Subject: RE: [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras

[AMD Official Use Only - Internal Distribution Only]

Hi Guchun,

Please see my comment inline.

> -----Original Message-----
> From: Chen, Guchun <Guchun.Chen@amd.com>
> Sent: Wednesday, July 22, 2020 11:14 AM
> To: amd-gfx@lists.freedesktop.org; Deucher, Alexander 
> <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; 
> Li, Dennis <Dennis.Li@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; 
> Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John <John.Clements@amd.com>
> Cc: Chen, Guchun <Guchun.Chen@amd.com>
> Subject: [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras
> 
> Bad page threshold value should be valid in the range between
> -1 and max records length of eeprom. It could determine when the GPU 
> should be retired.
> 
> Signed-off-by: Guchun Chen <guchun.chen@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       | 43
> +++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       |  3 ++
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    |  5 +++
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h    |  2 +
>  4 files changed, 53 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 6f06e1214622..e3d67d85c55f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -69,6 +69,9 @@ const char *ras_block_string[] = {
>  /* inject address is 52 bits */
>  #define	RAS_UMC_INJECT_ADDR_LIMIT	(0x1ULL << 52)
> 
> +/* typical ECC bad page rate(1 bad page per 100MB VRAM) */
> +#define RAS_BAD_PAGE_RATE		(100 * 1024 * 1024ULL)
> +
>  enum amdgpu_ras_retire_page_reservation {
>  	AMDGPU_RAS_RETIRE_PAGE_RESERVED,
>  	AMDGPU_RAS_RETIRE_PAGE_PENDING,
> @@ -1700,6 +1703,42 @@ static bool amdgpu_ras_check_bad_page(struct 
> amdgpu_device *adev,
>  	return ret;
>  }
> 
> +static void amdgpu_ras_validate_threshold(struct amdgpu_device *adev,
> +					uint32_t max_length)
> +{
> +	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +	int tmp_threshold = amdgpu_bad_page_threshold;
> +	u64 val;
> +
> +	/*
> +	 * Justification of value bad_page_cnt_threshold in ras structure
> +	 *
> +	 * 1. -1 <= amdgpu_bad_page_threshold <= max record length in
> eeprom
> +	 * 2. if amdgpu_bad_page_threshold = -1,
> +	 *    bad_page_cnt_threshold = typical value by formula.
> +	 * 3. if amdgpu_bad_page_threshold = 0,
> +	 *    bad_page_cnt_threshold = 0xFFFFFFFF,
> +	 *    and disable RMA feature accordingly.
> +	 * 4. use the value specified from user when
> (amdgpu_bad_page_threshold
> +	 *    > 0 && < max record length in eeprom).
> +	 */
> +
> +	if (tmp_threshold < -1)
> +		tmp_threshold = -1;
> +	else if (tmp_threshold > max_length)
> +		tmp_threshold = max_length;
> +
> +	if (tmp_threshold == -1) {
> +		val = adev->gmc.mc_vram_size;
> +		do_div(val, RAS_BAD_PAGE_RATE);
> +		con->bad_page_cnt_threshold = lower_32_bits(val);
[Yang, Stanley] : It's better to compare con->bad_page_cnt_threshold with max_length, the value of bad_page_cnt_threshold should not exceed max_length.

> +	} else if (tmp_threshold == 0) {
> +		con->bad_page_cnt_threshold = 0xFFFFFFFF;
> +	} else {
> +		con->bad_page_cnt_threshold = tmp_threshold;
> +	}
> +}
> +
>  /* called in gpu recovery/init */
>  int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev)  { @@ -
> 1777,6 +1816,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device
> *adev)  {
>  	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>  	struct ras_err_handler_data **data;
> +	uint32_t max_eeprom_records_len = 0;
>  	int ret;
> 
>  	if (con)
> @@ -1795,6 +1835,9 @@ int amdgpu_ras_recovery_init(struct 
> amdgpu_device *adev)
>  	atomic_set(&con->in_recovery, 0);
>  	con->adev = adev;
> 
> +	max_eeprom_records_len =
> amdgpu_ras_eeprom_get_record_max_length();
> +	amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
> +
>  	ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
>  	if (ret)
>  		goto free;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index b2667342cf67..4672649a9293 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -336,6 +336,9 @@ struct amdgpu_ras {
>  	struct amdgpu_ras_eeprom_control eeprom_control;
> 
>  	bool error_query_ready;
> +
> +	/* bad page count threshold */
> +	uint32_t bad_page_cnt_threshold;
>  };
> 
>  struct ras_fs_data {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index c0096097bbcf..a2c982b1eac6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -499,6 +499,11 @@ int amdgpu_ras_eeprom_process_recods(struct
> amdgpu_ras_eeprom_control *control,
>  	return ret == num ? 0 : -EIO;
>  }
> 
> +inline uint32_t amdgpu_ras_eeprom_get_record_max_length(void)
> +{
> +	return EEPROM_MAX_RECORD_NUM;
> +}
> +
>  /* Used for testing if bugs encountered */  #if 0  void 
> amdgpu_ras_eeprom_test(struct amdgpu_ras_eeprom_control *control) diff 
> --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> index 7e8647a05df7..b272840cb069 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> @@ -85,6 +85,8 @@ int amdgpu_ras_eeprom_process_recods(struct
> amdgpu_ras_eeprom_control *control,
>  					    bool write,
>  					    int num);
> 
> +inline uint32_t amdgpu_ras_eeprom_get_record_max_length(void);
> +
>  void amdgpu_ras_eeprom_test(struct amdgpu_ras_eeprom_control 
> *control);
> 
>  #endif // _AMDGPU_RAS_EEPROM_H
> --
> 2.17.1

Regards,
Stanley
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
  2020-07-23  0:31   ` Li, Dennis
@ 2020-07-23  3:47     ` Chen, Guchun
  2020-07-23 13:10       ` Christian König
  0 siblings, 1 reply; 17+ messages in thread
From: Chen, Guchun @ 2020-07-23  3:47 UTC (permalink / raw)
  To: Li, Dennis, amd-gfx, Deucher, Alexander, Zhang, Hawking, Yang,
	 Stanley, Zhou1, Tao, Clements,  John

[AMD Public Use]

Hi Dennis,

To be honest, your suggestion is considered when I start the design. My thought is in actual world, bad page threshold is one static configuration, it should be set once when probing.
So module parameter is one ideal choice for this.

Regards,
Guchun

-----Original Message-----
From: Li, Dennis <Dennis.Li@amd.com> 
Sent: Thursday, July 23, 2020 8:32 AM
To: Chen, Guchun <Guchun.Chen@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John <John.Clements@amd.com>
Subject: RE: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter

[AMD Official Use Only - Internal Distribution Only]

Hi, Guchun,
      It is better to let user be able to change amdgpu_bad_page_threshold with sysfs, so that users no need to reboot system when they want to change their strategy.  

Best Regards
Dennis Li
-----Original Message-----
From: Chen, Guchun <Guchun.Chen@amd.com>
Sent: Wednesday, July 22, 2020 11:14 AM
To: amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John <John.Clements@amd.com>
Cc: Chen, Guchun <Guchun.Chen@amd.com>
Subject: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter

bad_page_threshold could be specified to detect and retire bad GPU if faulty bad pages exceed it.

When it's -1, ras will use typical bad page failure value.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 06bfb8658dec..bb83ffb5e26a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -181,6 +181,7 @@ extern uint amdgpu_dm_abm_level;  extern struct amdgpu_mgpu_info mgpu_info;  extern int amdgpu_ras_enable;  extern uint amdgpu_ras_mask;
+extern int amdgpu_bad_page_threshold;
 extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
 extern int amdgpu_discovery;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index d28b95f721c4..f99671101746 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -161,6 +161,7 @@ struct amdgpu_mgpu_info mgpu_info = {  };  int amdgpu_ras_enable = -1;  uint amdgpu_ras_mask = 0xffffffff;
+int amdgpu_bad_page_threshold = -1;
 
 /**
  * DOC: vramlimit (int)
@@ -801,6 +802,16 @@ module_param_named(tmz, amdgpu_tmz, int, 0444);  MODULE_PARM_DESC(reset_method, "GPU reset method (-1 = auto (default), 0 = legacy, 1 = mode0, 2 = mode1, 3 = mode2, 4 = baco)");  module_param_named(reset_method, amdgpu_reset_method, int, 0444);
 
+/**
+ * DOC: bad_page_threshold (int)
+ * Bad page threshold configuration is driven by RMA(Return Merchandise
+ * Authorization) policy, which is to specify the threshold value of 
+faulty
+ * pages detected by ECC, which may result in GPU's retirement if total
+ * faulty pages by ECC exceed threshold value.
+ */
+MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = 
+auto(default typical value))"); module_param_named(bad_page_threshold,
+amdgpu_bad_page_threshold, int, 0444);
+
 static const struct pci_device_id pciidlist[] = {  #ifdef  CONFIG_DRM_AMDGPU_SI
 	{0x1002, 0x6780, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_TAHITI},
--
2.17.1
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* RE: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during bootup/reset
  2020-07-23  3:38     ` Chen, Guchun
@ 2020-07-23  4:03       ` Zhou1, Tao
  0 siblings, 0 replies; 17+ messages in thread
From: Zhou1, Tao @ 2020-07-23  4:03 UTC (permalink / raw)
  To: Chen, Guchun, amd-gfx, Deucher, Alexander, Zhang, Hawking, Li,
	Dennis, Yang, Stanley, Clements, John

Please see my comment about the patch organization

> -----Original Message-----
> From: Chen, Guchun <Guchun.Chen@amd.com>
> Sent: Thursday, July 23, 2020 11:39 AM
> To: Zhou1, Tao <Tao.Zhou1@amd.com>; amd-gfx@lists.freedesktop.org;
> Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking
> <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; Yang, Stanley
> <Stanley.Yang@amd.com>; Clements, John <John.Clements@amd.com>
> Subject: RE: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during
> bootup/reset
> 
> [AMD Public Use]
> 
> Thanks for your review, Tao.
> Please check my comments after yours.
> 
> Regards,
> Guchun
> 
> -----Original Message-----
> From: Zhou1, Tao <Tao.Zhou1@amd.com>
> Sent: Thursday, July 23, 2020 10:51 AM
> To: Chen, Guchun <Guchun.Chen@amd.com>; amd-gfx@lists.freedesktop.org;
> Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking
> <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; Yang, Stanley
> <Stanley.Yang@amd.com>; Clements, John <John.Clements@amd.com>
> Subject: RE: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during
> bootup/reset
> 
> [AMD Official Use Only - Internal Distribution Only]
> 
> > -----Original Message-----
> > From: Chen, Guchun <Guchun.Chen@amd.com>
> > Sent: Wednesday, July 22, 2020 11:14 AM
> > To: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> > <Alexander.Deucher@amd.com>; Zhang, Hawking
> <Hawking.Zhang@amd.com>;
> > Li, Dennis <Dennis.Li@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>;
> > Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John
> <John.Clements@amd.com>
> > Cc: Chen, Guchun <Guchun.Chen@amd.com>
> > Subject: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during
> > bootup/reset
> >
> > During boot up, when init eeprom, RAS will check if the BAD GPU tag is
> > saved or not. And will break boot up and notify user to RMA it.
> >
> > At the meanwhile, when saved bad page count to eeprom exceeds
> > threshold, one ras recovery will be triggered immediately, and bad gpu
> > check will be executed and reset will fail as well to remind user.
> >
> > User could set bad_page_threshold = 0 as module parameter when probing
> > driver to disable the bad feature check.
> >
> > Signed-off-by: Guchun Chen <guchun.chen@amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 21 ++++-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       | 25 +++++-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       | 16 +++-
> >  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    | 87
> ++++++++++++++++++-
> >  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h    |  6 +-
> >  5 files changed, 142 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 2662cd7c8685..d529bf7917f5 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2065,7 +2065,9 @@ static int amdgpu_device_ip_init(struct
> > amdgpu_device *adev)
> >   * Note: theoretically, this should be called before all vram allocations
> >   * to protect retired page from abusing
> >   */
> 
> [Tao] The comment above should be also updated [Guchun]yeah, will update it
> later.
> 
> > -amdgpu_ras_recovery_init(adev);
> > +r = amdgpu_ras_recovery_init(adev);
> > +if (r)
> > +goto init_failed;
> 
> [Tao] Are you sure "r != 0" equals to "bad gpu"? Other errors of recovery_init
> should not block gpu init.
> [Guchun]Good point. This should be addressed carefully.
> >
> >  if (adev->gmc.xgmi.num_physical_nodes > 1)
> > amdgpu_xgmi_add_device(adev); @@ -4133,8 +4135,20 @@ static int
> > amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
> >
> >  amdgpu_fbdev_set_suspend(tmp_adev, 0);
> >
> > -/* must succeed. */
> > -amdgpu_ras_resume(tmp_adev);
> > +/*
> > + * The CPU is BAD once faulty pages by ECC has
> > + * reached the threshold, and ras recovery is
> > + * scheduled. So add one check here to prevent
> > + * recovery if it's one BAD GPU, and remind
> > + * user to RMA this GPU.
> > + */
> > +if (!amdgpu_ras_check_bad_gpu(tmp_adev)) {
> > +/* must succeed. */
> > +amdgpu_ras_resume(tmp_adev);
> > +} else {
> > +r = -EINVAL;
> > +goto out;
> > +}
> >
> >  /* Update PSP FW topology after reset */  if (hive && tmp_adev-
> > >gmc.xgmi.num_physical_nodes > 1) @@ -4142,7 +4156,6 @@ static int
> > amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,  }  }
> >
> > -
> >  out:
> >  if (!r) {
> >  amdgpu_irq_gpu_reset_resume_helper(tmp_adev);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index e3d67d85c55f..818d4154e4a3 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -62,8 +62,6 @@ const char *ras_block_string[] = {  #define
> > ras_err_str(i)
> > (ras_error_string[ffs(i)])  #define ras_block_str(i)
> > (ras_block_string[i])
> >
> > -#define AMDGPU_RAS_FLAG_INIT_BY_VBIOS1 -#define
> > AMDGPU_RAS_FLAG_INIT_NEED_RESET2  #define RAS_DEFAULT_FLAGS
> > (AMDGPU_RAS_FLAG_INIT_BY_VBIOS)
> >
> >  /* inject address is 52 bits */
> > @@ -1817,6 +1815,7 @@ int amdgpu_ras_recovery_init(struct
> > amdgpu_device
> > *adev)
> >  struct amdgpu_ras *con = amdgpu_ras_get_context(adev);  struct
> > ras_err_handler_data **data;  uint32_t max_eeprom_records_len = 0;
> > +bool bad_gpu = false;
> >  int ret;
> >
> >  if (con)
> > @@ -1838,9 +1837,14 @@ int amdgpu_ras_recovery_init(struct
> > amdgpu_device
> > *adev)
> >  max_eeprom_records_len =
> > amdgpu_ras_eeprom_get_record_max_length();
> >  amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
> >
> > -ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
> > -if (ret)
> > +ret = amdgpu_ras_eeprom_init(&con->eeprom_control, &bad_gpu);
> > +/*
> > + * we only fail this calling and halt booting when bad_gpu is true.
> > + */
> > +if (bad_gpu) {
> > +ret = -EINVAL;
> >  goto free;
> > +}
> >
> >  if (con->eeprom_control.num_recs) {
> >  ret = amdgpu_ras_load_bad_pages(adev); @@ -2189,3
> > +2193,16 @@ bool amdgpu_ras_need_emergency_restart(struct
> > amdgpu_device *adev)
> >
> >  return false;
> >  }
> > +
> > +bool amdgpu_ras_check_bad_gpu(struct amdgpu_device *adev) { struct
> > +amdgpu_ras *con = amdgpu_ras_get_context(adev); bool bad_gpu = false;
> > +
> > +if (con && (con->bad_page_cnt_threshold != 0xFFFFFFFF))
> > +amdgpu_ras_eeprom_check_bad_gpu(&con->eeprom_control,
> > +&bad_gpu);
> > +
> > +/* We are only interested in variable bad_gpu. */ return bad_gpu; }
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > index 4672649a9293..d7a363b37feb 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > @@ -31,6 +31,10 @@
> >  #include "ta_ras_if.h"
> >  #include "amdgpu_ras_eeprom.h"
> >
> > +#define AMDGPU_RAS_FLAG_INIT_BY_VBIOS(0x1 << 0) #define
> > +AMDGPU_RAS_FLAG_INIT_NEED_RESET(0x1 << 1) #define
> > +AMDGPU_RAS_FLAG_SKIP_BAD_PAGE_RESV(0x1 << 2)
> > +
> >  enum amdgpu_ras_block {
> >  AMDGPU_RAS_BLOCK__UMC = 0,
> >  AMDGPU_RAS_BLOCK__SDMA,
> > @@ -493,6 +497,8 @@ void amdgpu_ras_suspend(struct amdgpu_device
> > *adev);  unsigned long amdgpu_ras_query_error_count(struct
> > amdgpu_device *adev,  bool is_ce);
> >
> > +bool amdgpu_ras_check_bad_gpu(struct amdgpu_device *adev);
> > +
> >  /* error handling functions */
> >  int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,  struct
> > eeprom_table_record *bps, int pages); @@ -503,10
> > +509,14 @@ static inline int amdgpu_ras_reset_gpu(struct amdgpu_device
> > *adev)  {
> >  struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> >
> > -/* save bad page to eeprom before gpu reset,
> > - * i2c may be unstable in gpu reset
> > +/*
> > + * Save bad page to eeprom before gpu reset, i2c may be unstable
> > + * in gpu reset.
> > + *
> > + * Also, exclude the case when ras recovery issuer is
> > + * eerprom page write.
> >   */
> > -if (in_task())
> > +if (!(ras->flags & AMDGPU_RAS_FLAG_SKIP_BAD_PAGE_RESV) &&
> > in_task())
> >  amdgpu_ras_reserve_bad_pages(adev);
> >
> >  if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0) diff --git
> > a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index a2c982b1eac6..96b63c026bad 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> 
> [Tao] It's better to split it into two patches, one is for ras and another one is for
> eeprom [Guchun]I tried this, however, even if for ras change, eeprom change is
> needed as well. So I merged them both into one patch.

[Tao] We may need to split it into more parts like this:

Part1: preparation for ras
Part2: preparation for eeprom
Part3: main part for ras
Part4: main part for eeprom

> 
> > @@ -46,6 +46,9 @@
> >  #define EEPROM_TABLE_HDR_VAL 0x414d4452  #define
> EEPROM_TABLE_VER
> > 0x00010000
> >
> > +/* Bad GPU tag ‘BADG’ */
> > +#define EEPROM_TABLE_HDR_BAD 0x42414447
> > +
> >  /* Assume 2 Mbit size */
> >  #define EEPROM_SIZE_BYTES 256000
> >  #define EEPROM_PAGE__SIZE_BYTES 256
> > @@ -238,12 +241,14 @@ int amdgpu_ras_eeprom_reset_table(struct
> > amdgpu_ras_eeprom_control *control)
> >
> >  }
> >
> > -int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control)
> > +int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
> > +bool *bad_gpu)
> >  {
> >  int ret = 0;
> >  struct amdgpu_device *adev = to_amdgpu_device(control);  unsigned
> > char buff[EEPROM_ADDRESS_SIZE + EEPROM_TABLE_HEADER_SIZE] = { 0 };
> > struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr;
> > +struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> >  struct i2c_msg msg = {
> >  .addr= 0,
> >  .flags= I2C_M_RD,
> > @@ -251,6 +256,8 @@ int amdgpu_ras_eeprom_init(struct
> > amdgpu_ras_eeprom_control *control)  .buf= buff,  };
> >
> > +*bad_gpu = false;
> > +
> >  /* Verify i2c adapter is initialized */  if (!adev->pm.smu_i2c.algo)
> > return -ENOENT; @@ -279,6 +286,11 @@ int amdgpu_ras_eeprom_init(struct
> > amdgpu_ras_eeprom_control *control)  DRM_DEBUG_DRIVER("Found
> existing
> > EEPROM table with %d records",
> >   control->num_recs);
> >
> > +} else if ((ras->bad_page_cnt_threshold != 0xFFFFFFFF) && (
> > +hdr->header == EEPROM_TABLE_HDR_BAD)) {
> > +*bad_gpu = true;
> > +DRM_ERROR("Detect BAD GPU TAG in eeprom table and "
> > +"GPU shall be RMAed.\n");
> >  } else {
> >  DRM_INFO("Creating new EEPROM table");
> >
> > @@ -375,6 +387,44 @@ static uint32_t
> > __correct_eeprom_dest_address(uint32_t curr_address)  return
> > curr_address;  }
> >
> > +int amdgpu_ras_eeprom_check_bad_gpu(struct
> amdgpu_ras_eeprom_control
> > *control,
> > +bool *bad_gpu)
> > +{
> > +struct amdgpu_device *adev = to_amdgpu_device(control); unsigned char
> > +buff[EEPROM_ADDRESS_SIZE + EEPROM_TABLE_HEADER_SIZE] = { 0 }; struct
> > +amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr; struct
> > +i2c_msg msg = { .addr = control->i2c_address, .flags = I2C_M_RD, .len
> > += EEPROM_ADDRESS_SIZE +
> > EEPROM_TABLE_HEADER_SIZE,
> > +.buf = buff,
> > +};
> > +int ret;
> > +
> > +*bad_gpu = false;
> > +
> > +/* read EEPROM table header */
> > +mutex_lock(&control->tbl_mutex);
> > +ret = i2c_transfer(&adev->pm.smu_i2c, &msg, 1); if (ret < 1) {
> > +dev_err(adev->dev, "Failed to read EEPROM table header.\n"); goto
> > +err;
> 
> [Tao] One tab is missed
> [Guchun]Correct this later.
> 
> > +}
> > +
> > +__decode_table_header_from_buff(hdr, &buff[2]);
> > +
> > +if (hdr->header == EEPROM_TABLE_HDR_BAD) { dev_warn(adev->dev,
> > +"Current GPU is BAD and should be
> > RMAed.\n");
> > +*bad_gpu = true;
> > +}
> > +
> > +ret = 0;
> > +err:
> > +mutex_unlock(&control->tbl_mutex);
> > +return ret;
> > +}
> > +
> >  int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control
> > *control,
> >      struct eeprom_table_record
> > *records,
> >      bool write,
> > @@ -383,8 +433,10 @@ int amdgpu_ras_eeprom_process_recods(struct
> > amdgpu_ras_eeprom_control *control,
> >  int i, ret = 0;
> >  struct i2c_msg *msgs, *msg;
> >  unsigned char *buffs, *buff;
> > +bool sched_ras_recovery = false;
> >  struct eeprom_table_record *record;
> >  struct amdgpu_device *adev = to_amdgpu_device(control);
> > +struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> >
> >  if (adev->asic_type != CHIP_VEGA20 && adev->asic_type !=
> > CHIP_ARCTURUS)
> >  return 0;
> > @@ -402,6 +454,25 @@ int amdgpu_ras_eeprom_process_recods(struct
> > amdgpu_ras_eeprom_control *control,
> >  goto free_buff;
> >  }
> >
> > +/*
> > + * If saved bad pages number exceeds the bad page threshod for
> > + * the whole VRAM, update table header to mark one BAD GPU and
> > + * schedule one ras recovery after eeprom write is done, this
> > + * can avoid the missing for latest records.
> > + *
> > + * This new header will be picked up and checked in the bootup by
> > + * ras recovery, which may break bootup process to notify user this
> > + * GPU is bad and to RMA(Return Merchandise Authorization) this GPU.
> > + */
> > +if (write && (ras->bad_page_cnt_threshold != 0xFFFFFFFF) &&
> > +((control->num_recs + num) >= ras->bad_page_cnt_threshold)) {
> > +dev_warn(adev->dev, "Saved bad pages(%d) reaches threshold
> > +value(%d).\n",
> > +control->num_recs + num, ras-
> > >bad_page_cnt_threshold);
> > +control->tbl_hdr.header = EEPROM_TABLE_HDR_BAD;
> > +sched_ras_recovery = true;
> > +}
> > +
> >  /* In case of overflow just start from beginning to not lose newest
> > records */  if (write && (control->next_addr +
> > EEPROM_TABLE_RECORD_SIZE * num > EEPROM_SIZE_BYTES))
> > control->next_addr = EEPROM_RECORD_START; @@ -482,6
> > +553,20 @@ int amdgpu_ras_eeprom_process_recods(struct
> > amdgpu_ras_eeprom_control *control,
> >  __update_tbl_checksum(control, records, num, old_hdr_byte_sum);
> >
> >  __update_table_header(control, buffs);
> > +
> > +if (sched_ras_recovery) {
> > +/*
> > + * Before scheduling ras recovery, assert the related
> > + * flag first, which shall bypass common bad page
> > + * reservation execution in amdgpu_ras_reset_gpu.
> > + */
> > +amdgpu_ras_get_context(adev)->flags |=
> > +AMDGPU_RAS_FLAG_SKIP_BAD_PAGE_RESV;
> > +
> > +dev_warn(adev->dev, "Conduct ras recovery due to bad
> > "
> > +"page threshold reached.\n");
> > +amdgpu_ras_reset_gpu(adev);
> > +}
> >  } else if (!__validate_tbl_checksum(control, records, num)) {
> > DRM_WARN("EEPROM Table checksum mismatch!");
> >  /* TODO Uncomment when EEPROM read/write is relliable */ diff --git
> > a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> > index b272840cb069..4ccd139248a9 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> > @@ -77,9 +77,13 @@ struct eeprom_table_record {  unsigned char
> > mcumc_id;  }__attribute__((__packed__));
> >
> > -int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control
> > *control);
> > +int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
> > +bool *bad_gpu);
> >  int amdgpu_ras_eeprom_reset_table(struct amdgpu_ras_eeprom_control
> > *control);
> >
> > +int amdgpu_ras_eeprom_check_bad_gpu(struct
> amdgpu_ras_eeprom_control
> > *control,
> > +bool *bad_gpu);
> > +
> >  int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control
> > *control,
> >      struct eeprom_table_record
> > *records,
> >      bool write,
> > --
> > 2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
  2020-07-23  3:47     ` Chen, Guchun
@ 2020-07-23 13:10       ` Christian König
  2020-07-23 13:39         ` Alex Deucher
  0 siblings, 1 reply; 17+ messages in thread
From: Christian König @ 2020-07-23 13:10 UTC (permalink / raw)
  To: Chen, Guchun, Li, Dennis, amd-gfx, Deucher, Alexander, Zhang,
	Hawking, Yang, Stanley, Zhou1, Tao, Clements, John

I agree with Guchun as well.

When you have a dynamic module parameter and change the bad page 
threshold the GPU might just stop working suddenly.

That is not a good idea as far as I can see.

Regards,
Christian.

Am 23.07.20 um 05:47 schrieb Chen, Guchun:
> [AMD Public Use]
>
> Hi Dennis,
>
> To be honest, your suggestion is considered when I start the design. My thought is in actual world, bad page threshold is one static configuration, it should be set once when probing.
> So module parameter is one ideal choice for this.
>
> Regards,
> Guchun
>
> -----Original Message-----
> From: Li, Dennis <Dennis.Li@amd.com>
> Sent: Thursday, July 23, 2020 8:32 AM
> To: Chen, Guchun <Guchun.Chen@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John <John.Clements@amd.com>
> Subject: RE: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
>
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi, Guchun,
>        It is better to let user be able to change amdgpu_bad_page_threshold with sysfs, so that users no need to reboot system when they want to change their strategy.
>
> Best Regards
> Dennis Li
> -----Original Message-----
> From: Chen, Guchun <Guchun.Chen@amd.com>
> Sent: Wednesday, July 22, 2020 11:14 AM
> To: amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John <John.Clements@amd.com>
> Cc: Chen, Guchun <Guchun.Chen@amd.com>
> Subject: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
>
> bad_page_threshold could be specified to detect and retire bad GPU if faulty bad pages exceed it.
>
> When it's -1, ras will use typical bad page failure value.
>
> Signed-off-by: Guchun Chen <guchun.chen@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 11 +++++++++++
>   2 files changed, 12 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 06bfb8658dec..bb83ffb5e26a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -181,6 +181,7 @@ extern uint amdgpu_dm_abm_level;  extern struct amdgpu_mgpu_info mgpu_info;  extern int amdgpu_ras_enable;  extern uint amdgpu_ras_mask;
> +extern int amdgpu_bad_page_threshold;
>   extern int amdgpu_async_gfx_ring;
>   extern int amdgpu_mcbp;
>   extern int amdgpu_discovery;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index d28b95f721c4..f99671101746 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -161,6 +161,7 @@ struct amdgpu_mgpu_info mgpu_info = {  };  int amdgpu_ras_enable = -1;  uint amdgpu_ras_mask = 0xffffffff;
> +int amdgpu_bad_page_threshold = -1;
>   
>   /**
>    * DOC: vramlimit (int)
> @@ -801,6 +802,16 @@ module_param_named(tmz, amdgpu_tmz, int, 0444);  MODULE_PARM_DESC(reset_method, "GPU reset method (-1 = auto (default), 0 = legacy, 1 = mode0, 2 = mode1, 3 = mode2, 4 = baco)");  module_param_named(reset_method, amdgpu_reset_method, int, 0444);
>   
> +/**
> + * DOC: bad_page_threshold (int)
> + * Bad page threshold configuration is driven by RMA(Return Merchandise
> + * Authorization) policy, which is to specify the threshold value of
> +faulty
> + * pages detected by ECC, which may result in GPU's retirement if total
> + * faulty pages by ECC exceed threshold value.
> + */
> +MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 =
> +auto(default typical value))"); module_param_named(bad_page_threshold,
> +amdgpu_bad_page_threshold, int, 0444);
> +
>   static const struct pci_device_id pciidlist[] = {  #ifdef  CONFIG_DRM_AMDGPU_SI
>   	{0x1002, 0x6780, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_TAHITI},
> --
> 2.17.1
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
  2020-07-23 13:10       ` Christian König
@ 2020-07-23 13:39         ` Alex Deucher
  0 siblings, 0 replies; 17+ messages in thread
From: Alex Deucher @ 2020-07-23 13:39 UTC (permalink / raw)
  To: Christian Koenig
  Cc: Chen, Guchun, Zhou1, Tao, amd-gfx, Yang, Stanley, Deucher,
	Alexander, Clements, John, Li, Dennis, Zhang, Hawking

Also note that module parameters are global.  If you change the
parameter, it changes it for all GPUs in the system.  That may not be
what the customer wants.

Alex

On Thu, Jul 23, 2020 at 9:10 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> I agree with Guchun as well.
>
> When you have a dynamic module parameter and change the bad page
> threshold the GPU might just stop working suddenly.
>
> That is not a good idea as far as I can see.
>
> Regards,
> Christian.
>
> Am 23.07.20 um 05:47 schrieb Chen, Guchun:
> > [AMD Public Use]
> >
> > Hi Dennis,
> >
> > To be honest, your suggestion is considered when I start the design. My thought is in actual world, bad page threshold is one static configuration, it should be set once when probing.
> > So module parameter is one ideal choice for this.
> >
> > Regards,
> > Guchun
> >
> > -----Original Message-----
> > From: Li, Dennis <Dennis.Li@amd.com>
> > Sent: Thursday, July 23, 2020 8:32 AM
> > To: Chen, Guchun <Guchun.Chen@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John <John.Clements@amd.com>
> > Subject: RE: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
> >
> > [AMD Official Use Only - Internal Distribution Only]
> >
> > Hi, Guchun,
> >        It is better to let user be able to change amdgpu_bad_page_threshold with sysfs, so that users no need to reboot system when they want to change their strategy.
> >
> > Best Regards
> > Dennis Li
> > -----Original Message-----
> > From: Chen, Guchun <Guchun.Chen@amd.com>
> > Sent: Wednesday, July 22, 2020 11:14 AM
> > To: amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John <John.Clements@amd.com>
> > Cc: Chen, Guchun <Guchun.Chen@amd.com>
> > Subject: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
> >
> > bad_page_threshold could be specified to detect and retire bad GPU if faulty bad pages exceed it.
> >
> > When it's -1, ras will use typical bad page failure value.
> >
> > Signed-off-by: Guchun Chen <guchun.chen@amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 11 +++++++++++
> >   2 files changed, 12 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index 06bfb8658dec..bb83ffb5e26a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -181,6 +181,7 @@ extern uint amdgpu_dm_abm_level;  extern struct amdgpu_mgpu_info mgpu_info;  extern int amdgpu_ras_enable;  extern uint amdgpu_ras_mask;
> > +extern int amdgpu_bad_page_threshold;
> >   extern int amdgpu_async_gfx_ring;
> >   extern int amdgpu_mcbp;
> >   extern int amdgpu_discovery;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index d28b95f721c4..f99671101746 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -161,6 +161,7 @@ struct amdgpu_mgpu_info mgpu_info = {  };  int amdgpu_ras_enable = -1;  uint amdgpu_ras_mask = 0xffffffff;
> > +int amdgpu_bad_page_threshold = -1;
> >
> >   /**
> >    * DOC: vramlimit (int)
> > @@ -801,6 +802,16 @@ module_param_named(tmz, amdgpu_tmz, int, 0444);  MODULE_PARM_DESC(reset_method, "GPU reset method (-1 = auto (default), 0 = legacy, 1 = mode0, 2 = mode1, 3 = mode2, 4 = baco)");  module_param_named(reset_method, amdgpu_reset_method, int, 0444);
> >
> > +/**
> > + * DOC: bad_page_threshold (int)
> > + * Bad page threshold configuration is driven by RMA(Return Merchandise
> > + * Authorization) policy, which is to specify the threshold value of
> > +faulty
> > + * pages detected by ECC, which may result in GPU's retirement if total
> > + * faulty pages by ECC exceed threshold value.
> > + */
> > +MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 =
> > +auto(default typical value))"); module_param_named(bad_page_threshold,
> > +amdgpu_bad_page_threshold, int, 0444);
> > +
> >   static const struct pci_device_id pciidlist[] = {  #ifdef  CONFIG_DRM_AMDGPU_SI
> >       {0x1002, 0x6780, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_TAHITI},
> > --
> > 2.17.1
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-07-23 13:39 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-22  3:14 [PATCH 0/5] BAD GPU retirement policy by total bad pages Guchun Chen
2020-07-22  3:14 ` [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter Guchun Chen
2020-07-23  0:31   ` Li, Dennis
2020-07-23  3:47     ` Chen, Guchun
2020-07-23 13:10       ` Christian König
2020-07-23 13:39         ` Alex Deucher
2020-07-22  3:14 ` [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras Guchun Chen
2020-07-22  7:51   ` Yang, Stanley
2020-07-23  3:40     ` Chen, Guchun
2020-07-22  3:14 ` [PATCH 3/5] drm/amdgpu: conduct bad gpu check during bootup/reset Guchun Chen
2020-07-23  2:51   ` Zhou1, Tao
2020-07-23  3:38     ` Chen, Guchun
2020-07-23  4:03       ` Zhou1, Tao
2020-07-22  3:14 ` [PATCH 4/5] drm/amdgpu: restore ras flags when user resets eeprom Guchun Chen
2020-07-22  3:14 ` [PATCH 5/5] drm/amdgpu: calculate actual size instead of hardcode size Guchun Chen
2020-07-22 14:26   ` Andrey Grodzovsky
2020-07-22 14:29     ` Chen, Guchun

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.