amd-gfx.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] drm/amdgpu: Fix checking return result of retire page
@ 2021-04-12  1:15 Luben Tuikov
  2021-04-12 20:22 ` Deucher, Alexander
  2021-04-14  8:19 ` Clements, John
  0 siblings, 2 replies; 5+ messages in thread
From: Luben Tuikov @ 2021-04-12  1:15 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander Deucher, Luben Tuikov, John Clements, Hawking Zhang

* Remove double-sscanf to scan for %llu and
  0x%llx, as that is not going to work--the %llu
  will consume the "0" in "0x" of your input,
  and a hex value can never be consumed. And just
  entering a hex number without leading 0x will
  either be scanned as a string and not match,
  or the leading decimal portion is scanned
  which is not correct. Thus remove the first
  %llu scan and leave only the %llx scan,
  removing the leading 0x since %llx can scan
  either.

* Fix logic the check of the result of
  amdgpu_reserve_page_direct()--it is 0
  on success, and non-zero on error.

* Add bad_page_cnt_threshold to debugfs for
  reporting purposes only--it usually matches the
  size of EEPROM but may be different depending on
  module option. Small other improvements.

* Improve kernel-doc for the sysfs interface.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 67 ++++++++++++-------------
 1 file changed, 32 insertions(+), 35 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 0541196ae1ed..c4b94b444b90 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -114,7 +114,7 @@ static int amdgpu_reserve_page_direct(struct amdgpu_device *adev, uint64_t addre
 
 	if (amdgpu_ras_check_bad_page(adev, address)) {
 		dev_warn(adev->dev,
-			 "RAS WARN: 0x%llx has been marked as bad page!\n",
+			 "RAS WARN: 0x%llx has already been marked as bad page!\n",
 			 address);
 		return 0;
 	}
@@ -228,11 +228,9 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
 		return -EINVAL;
 
 	if (op != -1) {
-
 		if (op == 3) {
-			if (sscanf(str, "%*s %llu", &address) != 1)
-				if (sscanf(str, "%*s 0x%llx", &address) != 1)
-					return -EINVAL;
+			if (sscanf(str, "%*s %llx", &address) != 1)
+				return -EINVAL;
 
 			data->op = op;
 			data->inject.address = address;
@@ -255,11 +253,9 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
 		data->op = op;
 
 		if (op == 2) {
-			if (sscanf(str, "%*s %*s %*s %u %llu %llu",
-						&sub_block, &address, &value) != 3)
-				if (sscanf(str, "%*s %*s %*s 0x%x 0x%llx 0x%llx",
-							&sub_block, &address, &value) != 3)
-					return -EINVAL;
+			if (sscanf(str, "%*s %*s %*s %x %llx %llx",
+				   &sub_block, &address, &value) != 3)
+				return -EINVAL;
 			data->head.sub_block_index = sub_block;
 			data->inject.address = address;
 			data->inject.value = value;
@@ -278,7 +274,7 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
 /**
  * DOC: AMDGPU RAS debugfs control interface
  *
- * It accepts struct ras_debug_if who has two members.
+ * The control interface accepts struct ras_debug_if which has two members.
  *
  * First member: ras_debug_if::head or ras_debug_if::inject.
  *
@@ -303,32 +299,33 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
  *
  * How to use the interface?
  *
- * Programs
+ * Program
  *
  * Copy the struct ras_debug_if in your codes and initialize it.
  * Write the struct to the control node.
  *
- * Shells
+ * Shell
  *
  * .. code-block:: bash
  *
- *	echo op block [error [sub_block address value]] > .../ras/ras_ctrl
+ *	echo "disable <block>" > /sys/kernel/debug/dri/<N>/ras/ras_ctrl
+ *	echo "enable  <block> <error>" > /sys/kernel/debug/dri/<N>/ras/ras_ctrl
+ *	echo "inject  <block> <error> <sub-block> <address> <value> > /sys/kernel/debug/dri/<N>/ras/ras_ctrl
  *
- * Parameters:
+ * Where N, is the card which you want to affect.
  *
- * op: disable, enable, inject
- *	disable: only block is needed
- *	enable: block and error are needed
- *	inject: error, address, value are needed
- * block: umc, sdma, gfx, .........
+ * "disable" requires only the block.
+ * "enable" requires the block and error type.
+ * "inject" requires the block, error type, address, and value.
+ * The block is one of: umc, sdma, gfx, etc.
  *	see ras_block_string[] for details
- * error: ue, ce
- *	ue: multi_uncorrectable
- *	ce: single_correctable
- * sub_block:
- *	sub block index, pass 0 if there is no sub block
+ * The error type is one of: ue, ce, where,
+ *	ue is multi-uncorrectable
+ *	ce is single-correctable
+ * The sub-block is a the sub-block index, pass 0 if there is no sub-block.
+ * The address and value are hexadecimal numbers, leading 0x is optional.
  *
- * here are some examples for bash commands:
+ * For instance,
  *
  * .. code-block:: bash
  *
@@ -336,17 +333,17 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
  *	echo inject umc ce 0 0 0 > /sys/kernel/debug/dri/0/ras/ras_ctrl
  *	echo disable umc > /sys/kernel/debug/dri/0/ras/ras_ctrl
  *
- * How to check the result?
+ * How to check the result of the operation?
  *
- * For disable/enable, please check ras features at
+ * To check disable/enable, see "ras" features at,
  * /sys/class/drm/card[0/1/2...]/device/ras/features
  *
- * For inject, please check corresponding err count at
- * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
+ * To check inject, see the corresponding error count at,
+ * /sys/class/drm/card[0/1/2...]/device/ras/[gfx|sdma|umc|...]_err_count
  *
  * .. note::
  *	Operations are only allowed on blocks which are supported.
- *	Please check ras mask at /sys/module/amdgpu/parameters/ras_mask
+ *	Check the "ras" mask at /sys/module/amdgpu/parameters/ras_mask
  *	to see which blocks support RAS on a particular asic.
  *
  */
@@ -367,11 +364,9 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
 	if (ret)
 		return -EINVAL;
 
-	if (data.op == 3)
-	{
+	if (data.op == 3) {
 		ret = amdgpu_reserve_page_direct(adev, data.inject.address);
-
-		if (ret)
+		if (!ret)
 			return size;
 		else
 			return ret;
@@ -1269,6 +1264,8 @@ static struct dentry *amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *
 			    &amdgpu_ras_debugfs_ctrl_ops);
 	debugfs_create_file("ras_eeprom_reset", S_IWUGO | S_IRUGO, dir, adev,
 			    &amdgpu_ras_debugfs_eeprom_ops);
+	debugfs_create_u32("bad_page_cnt_threshold", S_IRUGO, dir,
+			   &con->bad_page_cnt_threshold);
 
 	/*
 	 * After one uncorrectable error happens, usually GPU recovery will
-- 
2.31.0.97.g1424303384

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* RE: [PATCH] drm/amdgpu: Fix checking return result of retire page
  2021-04-12  1:15 [PATCH] drm/amdgpu: Fix checking return result of retire page Luben Tuikov
@ 2021-04-12 20:22 ` Deucher, Alexander
  2021-04-14  8:19 ` Clements, John
  1 sibling, 0 replies; 5+ messages in thread
From: Deucher, Alexander @ 2021-04-12 20:22 UTC (permalink / raw)
  To: Tuikov, Luben, amd-gfx; +Cc: Clements, John, Zhang, Hawking

[AMD Public Use]

> -----Original Message-----
> From: Tuikov, Luben <Luben.Tuikov@amd.com>
> Sent: Sunday, April 11, 2021 9:15 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Tuikov, Luben <Luben.Tuikov@amd.com>; Deucher, Alexander
> <Alexander.Deucher@amd.com>; Clements, John
> <John.Clements@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
> Subject: [PATCH] drm/amdgpu: Fix checking return result of retire page
> 
> * Remove double-sscanf to scan for %llu and
>   0x%llx, as that is not going to work--the %llu
>   will consume the "0" in "0x" of your input,
>   and a hex value can never be consumed. And just
>   entering a hex number without leading 0x will
>   either be scanned as a string and not match,
>   or the leading decimal portion is scanned
>   which is not correct. Thus remove the first
>   %llu scan and leave only the %llx scan,
>   removing the leading 0x since %llx can scan
>   either.
> 
> * Fix logic the check of the result of
>   amdgpu_reserve_page_direct()--it is 0
>   on success, and non-zero on error.
> 
> * Add bad_page_cnt_threshold to debugfs for
>   reporting purposes only--it usually matches the
>   size of EEPROM but may be different depending on
>   module option. Small other improvements.
> 
> * Improve kernel-doc for the sysfs interface.

I think these should be split into 4 separate patches.  The first two are bug fixes, the latter two are general improvements.

Thanks,

Alex

> 
> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
> Cc: John Clements <john.clements@amd.com>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 67 ++++++++++++-----------
> --
>  1 file changed, 32 insertions(+), 35 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 0541196ae1ed..c4b94b444b90 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -114,7 +114,7 @@ static int amdgpu_reserve_page_direct(struct
> amdgpu_device *adev, uint64_t addre
> 
>  	if (amdgpu_ras_check_bad_page(adev, address)) {
>  		dev_warn(adev->dev,
> -			 "RAS WARN: 0x%llx has been marked as bad
> page!\n",
> +			 "RAS WARN: 0x%llx has already been marked as bad
> page!\n",
>  			 address);
>  		return 0;
>  	}
> @@ -228,11 +228,9 @@ static int
> amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>  		return -EINVAL;
> 
>  	if (op != -1) {
> -
>  		if (op == 3) {
> -			if (sscanf(str, "%*s %llu", &address) != 1)
> -				if (sscanf(str, "%*s 0x%llx", &address) != 1)
> -					return -EINVAL;
> +			if (sscanf(str, "%*s %llx", &address) != 1)
> +				return -EINVAL;
> 
>  			data->op = op;
>  			data->inject.address = address;
> @@ -255,11 +253,9 @@ static int
> amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>  		data->op = op;
> 
>  		if (op == 2) {
> -			if (sscanf(str, "%*s %*s %*s %u %llu %llu",
> -						&sub_block, &address,
> &value) != 3)
> -				if (sscanf(str, "%*s %*s %*s 0x%x 0x%llx
> 0x%llx",
> -							&sub_block,
> &address, &value) != 3)
> -					return -EINVAL;
> +			if (sscanf(str, "%*s %*s %*s %x %llx %llx",
> +				   &sub_block, &address, &value) != 3)
> +				return -EINVAL;
>  			data->head.sub_block_index = sub_block;
>  			data->inject.address = address;
>  			data->inject.value = value;
> @@ -278,7 +274,7 @@ static int
> amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>  /**
>   * DOC: AMDGPU RAS debugfs control interface
>   *
> - * It accepts struct ras_debug_if who has two members.
> + * The control interface accepts struct ras_debug_if which has two
> members.
>   *
>   * First member: ras_debug_if::head or ras_debug_if::inject.
>   *
> @@ -303,32 +299,33 @@ static int
> amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>   *
>   * How to use the interface?
>   *
> - * Programs
> + * Program
>   *
>   * Copy the struct ras_debug_if in your codes and initialize it.
>   * Write the struct to the control node.
>   *
> - * Shells
> + * Shell
>   *
>   * .. code-block:: bash
>   *
> - *	echo op block [error [sub_block address value]] > .../ras/ras_ctrl
> + *	echo "disable <block>" > /sys/kernel/debug/dri/<N>/ras/ras_ctrl
> + *	echo "enable  <block> <error>" >
> /sys/kernel/debug/dri/<N>/ras/ras_ctrl
> + *	echo "inject  <block> <error> <sub-block> <address> <value> >
> /sys/kernel/debug/dri/<N>/ras/ras_ctrl
>   *
> - * Parameters:
> + * Where N, is the card which you want to affect.
>   *
> - * op: disable, enable, inject
> - *	disable: only block is needed
> - *	enable: block and error are needed
> - *	inject: error, address, value are needed
> - * block: umc, sdma, gfx, .........
> + * "disable" requires only the block.
> + * "enable" requires the block and error type.
> + * "inject" requires the block, error type, address, and value.
> + * The block is one of: umc, sdma, gfx, etc.
>   *	see ras_block_string[] for details
> - * error: ue, ce
> - *	ue: multi_uncorrectable
> - *	ce: single_correctable
> - * sub_block:
> - *	sub block index, pass 0 if there is no sub block
> + * The error type is one of: ue, ce, where,
> + *	ue is multi-uncorrectable
> + *	ce is single-correctable
> + * The sub-block is a the sub-block index, pass 0 if there is no sub-block.
> + * The address and value are hexadecimal numbers, leading 0x is optional.
>   *
> - * here are some examples for bash commands:
> + * For instance,
>   *
>   * .. code-block:: bash
>   *
> @@ -336,17 +333,17 @@ static int
> amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>   *	echo inject umc ce 0 0 0 > /sys/kernel/debug/dri/0/ras/ras_ctrl
>   *	echo disable umc > /sys/kernel/debug/dri/0/ras/ras_ctrl
>   *
> - * How to check the result?
> + * How to check the result of the operation?
>   *
> - * For disable/enable, please check ras features at
> + * To check disable/enable, see "ras" features at,
>   * /sys/class/drm/card[0/1/2...]/device/ras/features
>   *
> - * For inject, please check corresponding err count at
> - * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
> + * To check inject, see the corresponding error count at,
> + * /sys/class/drm/card[0/1/2...]/device/ras/[gfx|sdma|umc|...]_err_count
>   *
>   * .. note::
>   *	Operations are only allowed on blocks which are supported.
> - *	Please check ras mask at
> /sys/module/amdgpu/parameters/ras_mask
> + *	Check the "ras" mask at /sys/module/amdgpu/parameters/ras_mask
>   *	to see which blocks support RAS on a particular asic.
>   *
>   */
> @@ -367,11 +364,9 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct
> file *f, const char __user *
>  	if (ret)
>  		return -EINVAL;
> 
> -	if (data.op == 3)
> -	{
> +	if (data.op == 3) {
>  		ret = amdgpu_reserve_page_direct(adev,
> data.inject.address);
> -
> -		if (ret)
> +		if (!ret)
>  			return size;
>  		else
>  			return ret;
> @@ -1269,6 +1264,8 @@ static struct dentry
> *amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *
>  			    &amdgpu_ras_debugfs_ctrl_ops);
>  	debugfs_create_file("ras_eeprom_reset", S_IWUGO | S_IRUGO, dir,
> adev,
>  			    &amdgpu_ras_debugfs_eeprom_ops);
> +	debugfs_create_u32("bad_page_cnt_threshold", S_IRUGO, dir,
> +			   &con->bad_page_cnt_threshold);
> 
>  	/*
>  	 * After one uncorrectable error happens, usually GPU recovery will
> --
> 2.31.0.97.g1424303384
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [PATCH] drm/amdgpu: Fix checking return result of retire page
  2021-04-12  1:15 [PATCH] drm/amdgpu: Fix checking return result of retire page Luben Tuikov
  2021-04-12 20:22 ` Deucher, Alexander
@ 2021-04-14  8:19 ` Clements, John
  2021-04-14 13:58   ` Luben Tuikov
  1 sibling, 1 reply; 5+ messages in thread
From: Clements, John @ 2021-04-14  8:19 UTC (permalink / raw)
  To: Tuikov, Luben, amd-gfx; +Cc: Deucher, Alexander, Zhang, Hawking

[AMD Official Use Only - Internal Distribution Only]

Thank you Luben for re-organizing this source file and fixing those bugs.

Please add back support for decimal input parameter values, maybe something like this:
+			if (sscanf(str, "%*s 0x%llx", &address) != 1) &&  (sscanf(str, "%*s %llu", &address) != 1))
+				return -EINVAL; 

My concern is that there are tools out there that use that interface, so I wouldn't want any side effects there.

Reviewed-by: John Clements <John.Clements@amd.com>

-----Original Message-----
From: Tuikov, Luben <Luben.Tuikov@amd.com> 
Sent: Monday, April 12, 2021 9:15 AM
To: amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben <Luben.Tuikov@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Clements, John <John.Clements@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
Subject: [PATCH] drm/amdgpu: Fix checking return result of retire page

* Remove double-sscanf to scan for %llu and
  0x%llx, as that is not going to work--the %llu
  will consume the "0" in "0x" of your input,
  and a hex value can never be consumed. And just
  entering a hex number without leading 0x will
  either be scanned as a string and not match,
  or the leading decimal portion is scanned
  which is not correct. Thus remove the first
  %llu scan and leave only the %llx scan,
  removing the leading 0x since %llx can scan
  either.

* Fix logic the check of the result of
  amdgpu_reserve_page_direct()--it is 0
  on success, and non-zero on error.

* Add bad_page_cnt_threshold to debugfs for
  reporting purposes only--it usually matches the
  size of EEPROM but may be different depending on
  module option. Small other improvements.

* Improve kernel-doc for the sysfs interface.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 67 ++++++++++++-------------
 1 file changed, 32 insertions(+), 35 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 0541196ae1ed..c4b94b444b90 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -114,7 +114,7 @@ static int amdgpu_reserve_page_direct(struct amdgpu_device *adev, uint64_t addre
 
 	if (amdgpu_ras_check_bad_page(adev, address)) {
 		dev_warn(adev->dev,
-			 "RAS WARN: 0x%llx has been marked as bad page!\n",
+			 "RAS WARN: 0x%llx has already been marked as bad page!\n",
 			 address);
 		return 0;
 	}
@@ -228,11 +228,9 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
 		return -EINVAL;
 
 	if (op != -1) {
-
 		if (op == 3) {
-			if (sscanf(str, "%*s %llu", &address) != 1)
-				if (sscanf(str, "%*s 0x%llx", &address) != 1)
-					return -EINVAL;
+			if (sscanf(str, "%*s %llx", &address) != 1)
+				return -EINVAL;
 
 			data->op = op;
 			data->inject.address = address;
@@ -255,11 +253,9 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
 		data->op = op;
 
 		if (op == 2) {
-			if (sscanf(str, "%*s %*s %*s %u %llu %llu",
-						&sub_block, &address, &value) != 3)
-				if (sscanf(str, "%*s %*s %*s 0x%x 0x%llx 0x%llx",
-							&sub_block, &address, &value) != 3)
-					return -EINVAL;
+			if (sscanf(str, "%*s %*s %*s %x %llx %llx",
+				   &sub_block, &address, &value) != 3)
+				return -EINVAL;
 			data->head.sub_block_index = sub_block;
 			data->inject.address = address;
 			data->inject.value = value;
@@ -278,7 +274,7 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
 /**
  * DOC: AMDGPU RAS debugfs control interface
  *
- * It accepts struct ras_debug_if who has two members.
+ * The control interface accepts struct ras_debug_if which has two members.
  *
  * First member: ras_debug_if::head or ras_debug_if::inject.
  *
@@ -303,32 +299,33 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
  *
  * How to use the interface?
  *
- * Programs
+ * Program
  *
  * Copy the struct ras_debug_if in your codes and initialize it.
  * Write the struct to the control node.
  *
- * Shells
+ * Shell
  *
  * .. code-block:: bash
  *
- *	echo op block [error [sub_block address value]] > .../ras/ras_ctrl
+ *	echo "disable <block>" > /sys/kernel/debug/dri/<N>/ras/ras_ctrl
+ *	echo "enable  <block> <error>" > /sys/kernel/debug/dri/<N>/ras/ras_ctrl
+ *	echo "inject  <block> <error> <sub-block> <address> <value> > /sys/kernel/debug/dri/<N>/ras/ras_ctrl
  *
- * Parameters:
+ * Where N, is the card which you want to affect.
  *
- * op: disable, enable, inject
- *	disable: only block is needed
- *	enable: block and error are needed
- *	inject: error, address, value are needed
- * block: umc, sdma, gfx, .........
+ * "disable" requires only the block.
+ * "enable" requires the block and error type.
+ * "inject" requires the block, error type, address, and value.
+ * The block is one of: umc, sdma, gfx, etc.
  *	see ras_block_string[] for details
- * error: ue, ce
- *	ue: multi_uncorrectable
- *	ce: single_correctable
- * sub_block:
- *	sub block index, pass 0 if there is no sub block
+ * The error type is one of: ue, ce, where,
+ *	ue is multi-uncorrectable
+ *	ce is single-correctable
+ * The sub-block is a the sub-block index, pass 0 if there is no sub-block.
+ * The address and value are hexadecimal numbers, leading 0x is optional.
  *
- * here are some examples for bash commands:
+ * For instance,
  *
  * .. code-block:: bash
  *
@@ -336,17 +333,17 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
  *	echo inject umc ce 0 0 0 > /sys/kernel/debug/dri/0/ras/ras_ctrl
  *	echo disable umc > /sys/kernel/debug/dri/0/ras/ras_ctrl
  *
- * How to check the result?
+ * How to check the result of the operation?
  *
- * For disable/enable, please check ras features at
+ * To check disable/enable, see "ras" features at,
  * /sys/class/drm/card[0/1/2...]/device/ras/features
  *
- * For inject, please check corresponding err count at
- * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
+ * To check inject, see the corresponding error count at,
+ * /sys/class/drm/card[0/1/2...]/device/ras/[gfx|sdma|umc|...]_err_count
  *
  * .. note::
  *	Operations are only allowed on blocks which are supported.
- *	Please check ras mask at /sys/module/amdgpu/parameters/ras_mask
+ *	Check the "ras" mask at /sys/module/amdgpu/parameters/ras_mask
  *	to see which blocks support RAS on a particular asic.
  *
  */
@@ -367,11 +364,9 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
 	if (ret)
 		return -EINVAL;
 
-	if (data.op == 3)
-	{
+	if (data.op == 3) {
 		ret = amdgpu_reserve_page_direct(adev, data.inject.address);
-
-		if (ret)
+		if (!ret)
 			return size;
 		else
 			return ret;
@@ -1269,6 +1264,8 @@ static struct dentry *amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *
 			    &amdgpu_ras_debugfs_ctrl_ops);
 	debugfs_create_file("ras_eeprom_reset", S_IWUGO | S_IRUGO, dir, adev,
 			    &amdgpu_ras_debugfs_eeprom_ops);
+	debugfs_create_u32("bad_page_cnt_threshold", S_IRUGO, dir,
+			   &con->bad_page_cnt_threshold);
 
 	/*
 	 * After one uncorrectable error happens, usually GPU recovery will
-- 
2.31.0.97.g1424303384
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix checking return result of retire page
  2021-04-14  8:19 ` Clements, John
@ 2021-04-14 13:58   ` Luben Tuikov
  2021-04-14 14:02     ` Luben Tuikov
  0 siblings, 1 reply; 5+ messages in thread
From: Luben Tuikov @ 2021-04-14 13:58 UTC (permalink / raw)
  To: Clements, John, amd-gfx; +Cc: Deucher, Alexander, Zhang, Hawking

I'll take a look.

For the time being, you don't need parenthesis around != conditional as && has lower
priority, i.e. the parenthesis are superfluous.

Regards,
Luben

On 2021-04-14 4:19 a.m., Clements, John wrote:
> [AMD Official Use Only - Internal Distribution Only]
> 
> Thank you Luben for re-organizing this source file and fixing those bugs.
> 
> Please add back support for decimal input parameter values, maybe something like this:
> +			if (sscanf(str, "%*s 0x%llx", &address) != 1) &&  (sscanf(str, "%*s %llu", &address) != 1))
> +				return -EINVAL; 
> 
> My concern is that there are tools out there that use that interface, so I wouldn't want any side effects there.
> 
> Reviewed-by: John Clements <John.Clements@amd.com>
> 
> -----Original Message-----
> From: Tuikov, Luben <Luben.Tuikov@amd.com> 
> Sent: Monday, April 12, 2021 9:15 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Tuikov, Luben <Luben.Tuikov@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Clements, John <John.Clements@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
> Subject: [PATCH] drm/amdgpu: Fix checking return result of retire page
> 
> * Remove double-sscanf to scan for %llu and
>   0x%llx, as that is not going to work--the %llu
>   will consume the "0" in "0x" of your input,
>   and a hex value can never be consumed. And just
>   entering a hex number without leading 0x will
>   either be scanned as a string and not match,
>   or the leading decimal portion is scanned
>   which is not correct. Thus remove the first
>   %llu scan and leave only the %llx scan,
>   removing the leading 0x since %llx can scan
>   either.
> 
> * Fix logic the check of the result of
>   amdgpu_reserve_page_direct()--it is 0
>   on success, and non-zero on error.
> 
> * Add bad_page_cnt_threshold to debugfs for
>   reporting purposes only--it usually matches the
>   size of EEPROM but may be different depending on
>   module option. Small other improvements.
> 
> * Improve kernel-doc for the sysfs interface.
> 
> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
> Cc: John Clements <john.clements@amd.com>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 67 ++++++++++++-------------
>  1 file changed, 32 insertions(+), 35 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 0541196ae1ed..c4b94b444b90 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -114,7 +114,7 @@ static int amdgpu_reserve_page_direct(struct amdgpu_device *adev, uint64_t addre
>  
>  	if (amdgpu_ras_check_bad_page(adev, address)) {
>  		dev_warn(adev->dev,
> -			 "RAS WARN: 0x%llx has been marked as bad page!\n",
> +			 "RAS WARN: 0x%llx has already been marked as bad page!\n",
>  			 address);
>  		return 0;
>  	}
> @@ -228,11 +228,9 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>  		return -EINVAL;
>  
>  	if (op != -1) {
> -
>  		if (op == 3) {
> -			if (sscanf(str, "%*s %llu", &address) != 1)
> -				if (sscanf(str, "%*s 0x%llx", &address) != 1)
> -					return -EINVAL;
> +			if (sscanf(str, "%*s %llx", &address) != 1)
> +				return -EINVAL;
>  
>  			data->op = op;
>  			data->inject.address = address;
> @@ -255,11 +253,9 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>  		data->op = op;
>  
>  		if (op == 2) {
> -			if (sscanf(str, "%*s %*s %*s %u %llu %llu",
> -						&sub_block, &address, &value) != 3)
> -				if (sscanf(str, "%*s %*s %*s 0x%x 0x%llx 0x%llx",
> -							&sub_block, &address, &value) != 3)
> -					return -EINVAL;
> +			if (sscanf(str, "%*s %*s %*s %x %llx %llx",
> +				   &sub_block, &address, &value) != 3)
> +				return -EINVAL;
>  			data->head.sub_block_index = sub_block;
>  			data->inject.address = address;
>  			data->inject.value = value;
> @@ -278,7 +274,7 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>  /**
>   * DOC: AMDGPU RAS debugfs control interface
>   *
> - * It accepts struct ras_debug_if who has two members.
> + * The control interface accepts struct ras_debug_if which has two members.
>   *
>   * First member: ras_debug_if::head or ras_debug_if::inject.
>   *
> @@ -303,32 +299,33 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>   *
>   * How to use the interface?
>   *
> - * Programs
> + * Program
>   *
>   * Copy the struct ras_debug_if in your codes and initialize it.
>   * Write the struct to the control node.
>   *
> - * Shells
> + * Shell
>   *
>   * .. code-block:: bash
>   *
> - *	echo op block [error [sub_block address value]] > .../ras/ras_ctrl
> + *	echo "disable <block>" > /sys/kernel/debug/dri/<N>/ras/ras_ctrl
> + *	echo "enable  <block> <error>" > /sys/kernel/debug/dri/<N>/ras/ras_ctrl
> + *	echo "inject  <block> <error> <sub-block> <address> <value> > /sys/kernel/debug/dri/<N>/ras/ras_ctrl
>   *
> - * Parameters:
> + * Where N, is the card which you want to affect.
>   *
> - * op: disable, enable, inject
> - *	disable: only block is needed
> - *	enable: block and error are needed
> - *	inject: error, address, value are needed
> - * block: umc, sdma, gfx, .........
> + * "disable" requires only the block.
> + * "enable" requires the block and error type.
> + * "inject" requires the block, error type, address, and value.
> + * The block is one of: umc, sdma, gfx, etc.
>   *	see ras_block_string[] for details
> - * error: ue, ce
> - *	ue: multi_uncorrectable
> - *	ce: single_correctable
> - * sub_block:
> - *	sub block index, pass 0 if there is no sub block
> + * The error type is one of: ue, ce, where,
> + *	ue is multi-uncorrectable
> + *	ce is single-correctable
> + * The sub-block is a the sub-block index, pass 0 if there is no sub-block.
> + * The address and value are hexadecimal numbers, leading 0x is optional.
>   *
> - * here are some examples for bash commands:
> + * For instance,
>   *
>   * .. code-block:: bash
>   *
> @@ -336,17 +333,17 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>   *	echo inject umc ce 0 0 0 > /sys/kernel/debug/dri/0/ras/ras_ctrl
>   *	echo disable umc > /sys/kernel/debug/dri/0/ras/ras_ctrl
>   *
> - * How to check the result?
> + * How to check the result of the operation?
>   *
> - * For disable/enable, please check ras features at
> + * To check disable/enable, see "ras" features at,
>   * /sys/class/drm/card[0/1/2...]/device/ras/features
>   *
> - * For inject, please check corresponding err count at
> - * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
> + * To check inject, see the corresponding error count at,
> + * /sys/class/drm/card[0/1/2...]/device/ras/[gfx|sdma|umc|...]_err_count
>   *
>   * .. note::
>   *	Operations are only allowed on blocks which are supported.
> - *	Please check ras mask at /sys/module/amdgpu/parameters/ras_mask
> + *	Check the "ras" mask at /sys/module/amdgpu/parameters/ras_mask
>   *	to see which blocks support RAS on a particular asic.
>   *
>   */
> @@ -367,11 +364,9 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
>  	if (ret)
>  		return -EINVAL;
>  
> -	if (data.op == 3)
> -	{
> +	if (data.op == 3) {
>  		ret = amdgpu_reserve_page_direct(adev, data.inject.address);
> -
> -		if (ret)
> +		if (!ret)
>  			return size;
>  		else
>  			return ret;
> @@ -1269,6 +1264,8 @@ static struct dentry *amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *
>  			    &amdgpu_ras_debugfs_ctrl_ops);
>  	debugfs_create_file("ras_eeprom_reset", S_IWUGO | S_IRUGO, dir, adev,
>  			    &amdgpu_ras_debugfs_eeprom_ops);
> +	debugfs_create_u32("bad_page_cnt_threshold", S_IRUGO, dir,
> +			   &con->bad_page_cnt_threshold);
>  
>  	/*
>  	 * After one uncorrectable error happens, usually GPU recovery will
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix checking return result of retire page
  2021-04-14 13:58   ` Luben Tuikov
@ 2021-04-14 14:02     ` Luben Tuikov
  0 siblings, 0 replies; 5+ messages in thread
From: Luben Tuikov @ 2021-04-14 14:02 UTC (permalink / raw)
  To: Clements, John, amd-gfx; +Cc: Deucher, Alexander, Zhang, Hawking

You also seem to be missing a leading parenthesis.

Regards,
Luben

On 2021-04-14 9:58 a.m., Luben Tuikov wrote:
> I'll take a look.
> 
> For the time being, you don't need parenthesis around != conditional as && has lower
> priority, i.e. the parenthesis are superfluous.
> 
> Regards,
> Luben
> 
> On 2021-04-14 4:19 a.m., Clements, John wrote:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> Thank you Luben for re-organizing this source file and fixing those bugs.
>>
>> Please add back support for decimal input parameter values, maybe something like this:
>> +			if (sscanf(str, "%*s 0x%llx", &address) != 1) &&  (sscanf(str, "%*s %llu", &address) != 1))
>> +				return -EINVAL; 
>>
>> My concern is that there are tools out there that use that interface, so I wouldn't want any side effects there.
>>
>> Reviewed-by: John Clements <John.Clements@amd.com>
>>
>> -----Original Message-----
>> From: Tuikov, Luben <Luben.Tuikov@amd.com> 
>> Sent: Monday, April 12, 2021 9:15 AM
>> To: amd-gfx@lists.freedesktop.org
>> Cc: Tuikov, Luben <Luben.Tuikov@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Clements, John <John.Clements@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
>> Subject: [PATCH] drm/amdgpu: Fix checking return result of retire page
>>
>> * Remove double-sscanf to scan for %llu and
>>   0x%llx, as that is not going to work--the %llu
>>   will consume the "0" in "0x" of your input,
>>   and a hex value can never be consumed. And just
>>   entering a hex number without leading 0x will
>>   either be scanned as a string and not match,
>>   or the leading decimal portion is scanned
>>   which is not correct. Thus remove the first
>>   %llu scan and leave only the %llx scan,
>>   removing the leading 0x since %llx can scan
>>   either.
>>
>> * Fix logic the check of the result of
>>   amdgpu_reserve_page_direct()--it is 0
>>   on success, and non-zero on error.
>>
>> * Add bad_page_cnt_threshold to debugfs for
>>   reporting purposes only--it usually matches the
>>   size of EEPROM but may be different depending on
>>   module option. Small other improvements.
>>
>> * Improve kernel-doc for the sysfs interface.
>>
>> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
>> Cc: John Clements <john.clements@amd.com>
>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 67 ++++++++++++-------------
>>  1 file changed, 32 insertions(+), 35 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> index 0541196ae1ed..c4b94b444b90 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> @@ -114,7 +114,7 @@ static int amdgpu_reserve_page_direct(struct amdgpu_device *adev, uint64_t addre
>>  
>>  	if (amdgpu_ras_check_bad_page(adev, address)) {
>>  		dev_warn(adev->dev,
>> -			 "RAS WARN: 0x%llx has been marked as bad page!\n",
>> +			 "RAS WARN: 0x%llx has already been marked as bad page!\n",
>>  			 address);
>>  		return 0;
>>  	}
>> @@ -228,11 +228,9 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>>  		return -EINVAL;
>>  
>>  	if (op != -1) {
>> -
>>  		if (op == 3) {
>> -			if (sscanf(str, "%*s %llu", &address) != 1)
>> -				if (sscanf(str, "%*s 0x%llx", &address) != 1)
>> -					return -EINVAL;
>> +			if (sscanf(str, "%*s %llx", &address) != 1)
>> +				return -EINVAL;
>>  
>>  			data->op = op;
>>  			data->inject.address = address;
>> @@ -255,11 +253,9 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>>  		data->op = op;
>>  
>>  		if (op == 2) {
>> -			if (sscanf(str, "%*s %*s %*s %u %llu %llu",
>> -						&sub_block, &address, &value) != 3)
>> -				if (sscanf(str, "%*s %*s %*s 0x%x 0x%llx 0x%llx",
>> -							&sub_block, &address, &value) != 3)
>> -					return -EINVAL;
>> +			if (sscanf(str, "%*s %*s %*s %x %llx %llx",
>> +				   &sub_block, &address, &value) != 3)
>> +				return -EINVAL;
>>  			data->head.sub_block_index = sub_block;
>>  			data->inject.address = address;
>>  			data->inject.value = value;
>> @@ -278,7 +274,7 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>>  /**
>>   * DOC: AMDGPU RAS debugfs control interface
>>   *
>> - * It accepts struct ras_debug_if who has two members.
>> + * The control interface accepts struct ras_debug_if which has two members.
>>   *
>>   * First member: ras_debug_if::head or ras_debug_if::inject.
>>   *
>> @@ -303,32 +299,33 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>>   *
>>   * How to use the interface?
>>   *
>> - * Programs
>> + * Program
>>   *
>>   * Copy the struct ras_debug_if in your codes and initialize it.
>>   * Write the struct to the control node.
>>   *
>> - * Shells
>> + * Shell
>>   *
>>   * .. code-block:: bash
>>   *
>> - *	echo op block [error [sub_block address value]] > .../ras/ras_ctrl
>> + *	echo "disable <block>" > /sys/kernel/debug/dri/<N>/ras/ras_ctrl
>> + *	echo "enable  <block> <error>" > /sys/kernel/debug/dri/<N>/ras/ras_ctrl
>> + *	echo "inject  <block> <error> <sub-block> <address> <value> > /sys/kernel/debug/dri/<N>/ras/ras_ctrl
>>   *
>> - * Parameters:
>> + * Where N, is the card which you want to affect.
>>   *
>> - * op: disable, enable, inject
>> - *	disable: only block is needed
>> - *	enable: block and error are needed
>> - *	inject: error, address, value are needed
>> - * block: umc, sdma, gfx, .........
>> + * "disable" requires only the block.
>> + * "enable" requires the block and error type.
>> + * "inject" requires the block, error type, address, and value.
>> + * The block is one of: umc, sdma, gfx, etc.
>>   *	see ras_block_string[] for details
>> - * error: ue, ce
>> - *	ue: multi_uncorrectable
>> - *	ce: single_correctable
>> - * sub_block:
>> - *	sub block index, pass 0 if there is no sub block
>> + * The error type is one of: ue, ce, where,
>> + *	ue is multi-uncorrectable
>> + *	ce is single-correctable
>> + * The sub-block is a the sub-block index, pass 0 if there is no sub-block.
>> + * The address and value are hexadecimal numbers, leading 0x is optional.
>>   *
>> - * here are some examples for bash commands:
>> + * For instance,
>>   *
>>   * .. code-block:: bash
>>   *
>> @@ -336,17 +333,17 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>>   *	echo inject umc ce 0 0 0 > /sys/kernel/debug/dri/0/ras/ras_ctrl
>>   *	echo disable umc > /sys/kernel/debug/dri/0/ras/ras_ctrl
>>   *
>> - * How to check the result?
>> + * How to check the result of the operation?
>>   *
>> - * For disable/enable, please check ras features at
>> + * To check disable/enable, see "ras" features at,
>>   * /sys/class/drm/card[0/1/2...]/device/ras/features
>>   *
>> - * For inject, please check corresponding err count at
>> - * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
>> + * To check inject, see the corresponding error count at,
>> + * /sys/class/drm/card[0/1/2...]/device/ras/[gfx|sdma|umc|...]_err_count
>>   *
>>   * .. note::
>>   *	Operations are only allowed on blocks which are supported.
>> - *	Please check ras mask at /sys/module/amdgpu/parameters/ras_mask
>> + *	Check the "ras" mask at /sys/module/amdgpu/parameters/ras_mask
>>   *	to see which blocks support RAS on a particular asic.
>>   *
>>   */
>> @@ -367,11 +364,9 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
>>  	if (ret)
>>  		return -EINVAL;
>>  
>> -	if (data.op == 3)
>> -	{
>> +	if (data.op == 3) {
>>  		ret = amdgpu_reserve_page_direct(adev, data.inject.address);
>> -
>> -		if (ret)
>> +		if (!ret)
>>  			return size;
>>  		else
>>  			return ret;
>> @@ -1269,6 +1264,8 @@ static struct dentry *amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *
>>  			    &amdgpu_ras_debugfs_ctrl_ops);
>>  	debugfs_create_file("ras_eeprom_reset", S_IWUGO | S_IRUGO, dir, adev,
>>  			    &amdgpu_ras_debugfs_eeprom_ops);
>> +	debugfs_create_u32("bad_page_cnt_threshold", S_IRUGO, dir,
>> +			   &con->bad_page_cnt_threshold);
>>  
>>  	/*
>>  	 * After one uncorrectable error happens, usually GPU recovery will
>>
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-04-14 14:02 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-12  1:15 [PATCH] drm/amdgpu: Fix checking return result of retire page Luben Tuikov
2021-04-12 20:22 ` Deucher, Alexander
2021-04-14  8:19 ` Clements, John
2021-04-14 13:58   ` Luben Tuikov
2021-04-14 14:02     ` Luben Tuikov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).