All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Chen, Guchun" <Guchun.Chen@amd.com>
To: "Li, Dennis" <Dennis.Li@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Zhang, Hawking" <Hawking.Zhang@amd.com>,
	"Grodzovsky, Andrey" <Andrey.Grodzovsky@amd.com>,
	"Zhou1, Tao" <Tao.Zhou1@amd.com>,
	"Clements, John" <John.Clements@amd.com>,
	"Lazar, Lijo" <Lijo.Lazar@amd.com>,
	"Koenig, Christian" <Christian.Koenig@amd.com>,
	"Yang, Stanley" <Stanley.Yang@amd.com>
Subject: RE: [PATCH 04/12] drm/amdgpu: break driver init process when it's bad GPU
Date: Tue, 28 Jul 2020 14:11:39 +0000	[thread overview]
Message-ID: <CY4PR12MB128712A8799EAC4B9752E0D4F1730@CY4PR12MB1287.namprd12.prod.outlook.com> (raw)
In-Reply-To: <DM5PR12MB2533F35E3050B8B3F2F665AEED730@DM5PR12MB2533.namprd12.prod.outlook.com>

[AMD Public Use]

Hi Dennis,

Please check my response after yours.

Regards,
Guchun

-----Original Message-----
From: Li, Dennis <Dennis.Li@amd.com> 
Sent: Tuesday, July 28, 2020 5:43 PM
To: Chen, Guchun <Guchun.Chen@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John <John.Clements@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>
Subject: RE: [PATCH 04/12] drm/amdgpu: break driver init process when it's bad GPU

[AMD Official Use Only - Internal Distribution Only]

Hi, Guchun,
      Please see my below comments.

Best Regards
Dennis Li
-----Original Message-----
From: Chen, Guchun <Guchun.Chen@amd.com> 
Sent: Tuesday, July 28, 2020 3:49 PM
To: amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John <John.Clements@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>
Cc: Chen, Guchun <Guchun.Chen@amd.com>
Subject: [PATCH 04/12] drm/amdgpu: break driver init process when it's bad GPU

When retrieving bad gpu tag from eeprom, GPU init should fail as the GPU needs to be retired for further check.

v2: Fix spelling typo, correct the condition to detect
    bad gpu tag and refine error message.

v3: Refine function argument name.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c     | 12 +++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c        | 18 ++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 10 +++++++++-  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h |  3 ++-
 4 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 2662cd7c8685..30af0dfee1a1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2059,13 +2059,19 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
 	 * it should be called after amdgpu_device_ip_hw_init_phase2  since
 	 * for some ASICs the RAS EEPROM code relies on SMU fully functioning
 	 * for I2C communication which only true at this point.
-	 * recovery_init may fail, but it can free all resources allocated by
-	 * itself and its failure should not stop amdgpu init process.
+	 *
+	 * amdgpu_ras_recovery_init may fail, but the upper only cares the
+	 * failure from bad gpu situation and stop amdgpu init process
+	 * accordingly. For other failed cases, it will still release all
+	 * the resource and print error message, rather than returning one
+	 * negative value to upper level.
 	 *
 	 * Note: theoretically, this should be called before all vram allocations
 	 * to protect retired page from abusing
 	 */
-	amdgpu_ras_recovery_init(adev);
+	r = amdgpu_ras_recovery_init(adev);
+	if (r)
+		goto init_failed;
 
 	if (adev->gmc.xgmi.num_physical_nodes > 1)
 		amdgpu_xgmi_add_device(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 3c4c142e9d8a..56e1aeba2d64 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1822,6 +1822,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 	struct ras_err_handler_data **data;
 	uint32_t max_eeprom_records_len = 0;
+	bool exc_err_limit = false;
 	int ret;
 
 	if (con)
@@ -1843,9 +1844,15 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
 	max_eeprom_records_len = amdgpu_ras_eeprom_get_record_max_length();
 	amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
 
-	ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
-	if (ret)
+	ret = amdgpu_ras_eeprom_init(&con->eeprom_control, &exc_err_limit);
+	/*
+	 * We only fail this calling and halt booting up
+	 * when exc_err_limit is true.
+	 */
+	if (exc_err_limit) {
+		ret = -EINVAL;
 		goto free;
+	}

[Dennis Li] Compared with old codes,  new change miss checking ret.
[Guchun] Yeah, this hits me that another if condition is that ret should be checked as well when exc_err_limit is false,
that means there is some problem with eeprom i2c functionality.
It will be addressed in next patch set.
 
 	if (con->eeprom_control.num_recs) {
 		ret = amdgpu_ras_load_bad_pages(adev); @@ -1868,6 +1875,13 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
 out:
 	dev_warn(adev->dev, "Failed to initialize ras recovery!\n");
 
+	/*
+	 * Except error threshold exceeding case, other failure cases in this
+	 * function would not fail amdgpu driver init.
+	 */
+	if (!exc_err_limit)
+		ret = 0;
+
 	return ret;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 35c0c849d49b..67995b66d7d4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -241,7 +241,8 @@ int amdgpu_ras_eeprom_reset_table(struct amdgpu_ras_eeprom_control *control)
 
 }
 
-int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control)
+int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
+			bool *exceed_err_limit)

 {
 	int ret = 0;
 	struct amdgpu_device *adev = to_amdgpu_device(control); @@ -254,6 +255,8 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control)
 			.buf	= buff,
 	};
 
+	*exceed_err_limit = false;
+
 	/* Verify i2c adapter is initialized */
 	if (!adev->pm.smu_i2c.algo)
 		return -ENOENT;
@@ -282,6 +285,11 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control)
 		DRM_DEBUG_DRIVER("Found existing EEPROM table with %d records",
 				 control->num_recs);
 
+	} else if ((hdr->header == EEPROM_TABLE_HDR_BAD) &&
+			(amdgpu_bad_page_threshold != 0)) {
+		*exceed_err_limit = true;
+		DRM_ERROR("Exceeding the bad_page_threshold parameter, "
+				"disabling the GPU.\n");

[Dennis Li] Why must introduce a new parameter exceed_err_limit?  I think it can return -EINVAL directly here.
[Guchun]We need to definitely know what's the error case and decide next step concisely. When this variable exceed_err_limit is true, that means
GPU bad tag is detected, and consequently, this scenario will be returned to upper layer to halt driver's boot up. For other errors returned by this
function, they may be caused by eeprom i2c functionality, in such case, amdgpu driver's probe will not be impacted, as generally, eeprom is
one external device only.

 	} else {
 		DRM_INFO("Creating new EEPROM table");
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
index b272840cb069..f245b96d9599 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
@@ -77,7 +77,8 @@ struct eeprom_table_record {
 	unsigned char mcumc_id;
 }__attribute__((__packed__));
 
-int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control);
+int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
+			bool *exceed_err_limit);
 int amdgpu_ras_eeprom_reset_table(struct amdgpu_ras_eeprom_control *control);
 
 int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
--
2.17.1
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

  reply	other threads:[~2020-07-28 14:11 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-28  7:49 [PATCH 00/12] BAD GPU retirement policy by total bad pages Guchun Chen
2020-07-28  7:49 ` [PATCH 01/12] drm/amdgpu: add bad page count threshold in module parameter Guchun Chen
2020-07-28  7:49 ` [PATCH 02/12] drm/amdgpu: validate bad page threshold in ras Guchun Chen
2020-07-28  7:49 ` [PATCH 03/12] drm/amdgpu: add bad gpu tag definition Guchun Chen
2020-07-28  7:49 ` [PATCH 04/12] drm/amdgpu: break driver init process when it's bad GPU Guchun Chen
2020-07-28  9:43   ` Li, Dennis
2020-07-28 14:11     ` Chen, Guchun [this message]
2020-07-28  7:49 ` [PATCH 05/12] drm/amdgpu: skip bad page reservation once issuing from eeprom write Guchun Chen
2020-07-28  7:49 ` [PATCH 06/12] drm/amdgpu: schedule ras recovery when reaching bad page threshold Guchun Chen
2020-07-28  7:49 ` [PATCH 07/12] drm/amdgpu: break GPU recovery once it's in bad state Guchun Chen
2020-07-28  7:49 ` [PATCH 08/12] drm/amdgpu: restore ras flags when user resets eeprom Guchun Chen
2020-07-28  7:49 ` [PATCH 09/12] drm/amdgpu: define one macro for RAS's sysfs/debugfs name Guchun Chen
2020-07-28  7:55   ` Christian König
2020-07-28  8:00     ` Chen, Guchun
2020-07-28  7:49 ` [PATCH 10/12] drm/amdgpu: decouple sysfs creating of bad page node Guchun Chen
2020-07-28  7:49 ` [PATCH 11/12] drm/amdgpu: disable page reservation when amdgpu_bad_page_threshold = 0 Guchun Chen
2020-07-28  7:49 ` [PATCH 12/12] drm/amdgpu: reset eeprom once specifying one bigger threshold Guchun Chen
2020-07-29  2:56 [PATCH 00/12] BAD GPU retirement policy by total bad pages Guchun Chen
2020-07-29  2:56 ` [PATCH 04/12] drm/amdgpu: break driver init process when it's bad GPU Guchun Chen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CY4PR12MB128712A8799EAC4B9752E0D4F1730@CY4PR12MB1287.namprd12.prod.outlook.com \
    --to=guchun.chen@amd.com \
    --cc=Alexander.Deucher@amd.com \
    --cc=Andrey.Grodzovsky@amd.com \
    --cc=Christian.Koenig@amd.com \
    --cc=Dennis.Li@amd.com \
    --cc=Hawking.Zhang@amd.com \
    --cc=John.Clements@amd.com \
    --cc=Lijo.Lazar@amd.com \
    --cc=Stanley.Yang@amd.com \
    --cc=Tao.Zhou1@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.