* [PATCH] drm/amdgpu/gpuvm: add some additional comments in amdgpu_vm_update_ptes
@ 2019-10-30 18:41 ` Alex Deucher
  0 siblings, 0 replies; 14+ messages in thread
From: Alex Deucher @ 2019-10-30 18:41 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

To better clarify what is happening in this function.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index c8ce42200059..3c0bd6472a46 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1419,6 +1419,9 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params,
 		uint64_t incr, entry_end, pe_start;
 		struct amdgpu_bo *pt;
 
+		/* make sure that the page tables covering the address range are
+		 * actually allocated
+		 */
 		r = amdgpu_vm_alloc_pts(params->adev, params->vm, &cursor,
 					params->direct);
 		if (r)
@@ -1492,7 +1495,12 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params,
 		} while (frag_start < entry_end);
 
 		if (amdgpu_vm_pt_descendant(adev, &cursor)) {
-			/* Free all child entries */
+			/* Free all child entries.
+			 * Update the tables with the flags and addresses and free up subsequent
+			 * tables in the case of huge pages or freed up areas.
+			 * This is the maximum you can free, because all other page tables are not
+			 * completely covered by the range and so potentially still in use.
+			 */
 			while (cursor.pfn < frag_start) {
 				amdgpu_vm_free_pts(adev, params->vm, &cursor);
 				amdgpu_vm_pt_next(adev, &cursor);
-- 
2.23.0

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

* [PATCH] drm/amdgpu: Improve RAS documentation
@ 2019-10-30 18:41     ` Alex Deucher
  0 siblings, 0 replies; 14+ messages in thread
From: Alex Deucher @ 2019-10-30 18:41 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Clarify some areas, clean up formatting, add section for
unrecoverable error handling.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 Documentation/gpu/amdgpu.rst            | 35 ++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 40 ++++++++++++++++++++-----
 2 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/Documentation/gpu/amdgpu.rst b/Documentation/gpu/amdgpu.rst
index 5b9eaf23558e..1c08d64970ee 100644
--- a/Documentation/gpu/amdgpu.rst
+++ b/Documentation/gpu/amdgpu.rst
@@ -82,12 +82,21 @@ AMDGPU XGMI Support
 AMDGPU RAS Support
 ==================
 
+The AMDGPU RAS interfaces are exposed via sysfs (for informational queries) and
+debugfs (for error injection).
+
 RAS debugfs/sysfs Control and Error Injection Interfaces
 --------------------------------------------------------
 
 .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
    :doc: AMDGPU RAS debugfs control interface
 
+RAS Reboot Behavior for Unrecoverable Errors
+--------------------------------------------------------
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+   :doc: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
+
 RAS Error Count sysfs Interface
 -------------------------------
 
@@ -109,6 +118,32 @@ RAS VRAM Bad Pages sysfs Interface
 .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
    :internal:
 
+Sample Code
+-----------
+Sample code for testing error injection can be found here:
+https://cgit.freedesktop.org/mesa/drm/tree/tests/amdgpu/ras_tests.c
+
+This is part of the libdrm amdgpu unit tests which cover several areas of the GPU.
+There are four sets of tests:
+
+RAS Basic Test
+
+The test verifies the RAS feature enabled status and makes sure the necessary sysfs and debugfs files
+are present.
+
+RAS Query Test
+
+This test will check the RAS availability and enablement status for each supported IP block as well as
+the error counts.
+
+RAS Inject Test
+
+This test injects errors for each IP.
+
+RAS Disable Test
+
+This tests disabling of RAS features for each IP block.
+
 
 GPU Power/Thermal Controls and Monitoring
 =========================================
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index dab90c280476..404483437bd3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -220,7 +220,7 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
  * As their names indicate, inject operation will write the
  * value to the address.
  *
- * Second member: struct ras_debug_if::op.
+ * The second member: struct ras_debug_if::op.
  * It has three kinds of operations.
  *
  * - 0: disable RAS on the block. Take ::head as its data.
@@ -228,14 +228,20 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
  * - 2: inject errors on the block. Take ::inject as its data.
  *
  * How to use the interface?
- * programs:
- * copy the struct ras_debug_if in your codes and initialize it.
- * write the struct to the control node.
+ *
+ * Programs
+ *
+ * Copy the struct ras_debug_if into your code and initialize it.
+ * Write the struct to the control node.
+ *
+ * Shells
  *
  * .. code-block:: bash
  *
  *	echo op block [error [sub_block address value]] > .../ras/ras_ctrl
  *
+ * Parameters:
+ *
  * op: disable, enable, inject
  *	disable: only block is needed
  *	enable: block and error are needed
@@ -265,8 +271,10 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
  * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
  *
  * .. note::
- *	Operation is only allowed on blocks which are supported.
+ *	Operations are only allowed on blocks which are supported.
  *	Please check ras mask at /sys/module/amdgpu/parameters/ras_mask
+ *	to see which blocks support RAS on a particular asic.
+ *
  */
 static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *buf,
 		size_t size, loff_t *pos)
@@ -322,7 +330,7 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
  * DOC: AMDGPU RAS debugfs EEPROM table reset interface
  *
  * Some boards contain an EEPROM which is used to persistently store a list of
- * bad pages containing ECC errors detected in vram.  This interface provides
+ * bad pages which experienced ECC errors in vram.  This interface provides
  * a way to reset the EEPROM, e.g., after testing error injection.
  *
  * Usage:
@@ -362,7 +370,7 @@ static const struct file_operations amdgpu_ras_debugfs_eeprom_ops = {
 /**
  * DOC: AMDGPU RAS sysfs Error Count Interface
  *
- * It allows user to read the error count for each IP block on the gpu through
+ * It allows the user to read the error count for each IP block on the gpu through
  * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
  *
  * It outputs the multiple lines which report the uncorrected (ue) and corrected
@@ -1027,6 +1035,24 @@ static int amdgpu_ras_sysfs_remove_all(struct amdgpu_device *adev)
 }
 /* sysfs end */
 
+/**
+ * DOC: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
+ *
+ * Normally when there is an uncorrectable error, the driver will reset
+ * the GPU to recover.  However, for an unrecoverable error, the driver
+ * provides an interface to reboot the system automatically when such
+ * an error occurs.
+ *
+ * The following file in debugfs provides that interface:
+ * /sys/kernel/debug/dri/[0/1/2...]/ras/auto_reboot
+ *
+ * Usage:
+ *
+ * .. code-block:: bash
+ *
+ *	echo true > .../ras/auto_reboot
+ *
+ */
 /* debugfs begin */
 static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)
 {
-- 
2.23.0
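
For readers who want to try the shell interface documented above, here is a
minimal sketch of how the "op block [error [sub_block address value]]" line
could be assembled and checked before it is written to the control node. The
helper names ras_ctrl_cmd and ras_auto_reboot_path are hypothetical and not
part of the patch. The argument counts follow the documented syntax; treating
inject as requiring all of error, sub_block, address and value is an
assumption drawn from that syntax line. The helpers only print text and do
not touch debugfs.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the patch) for the documented RAS
# debugfs interfaces. They only print strings; nothing is written to debugfs.

# Build the line for .../ras/ras_ctrl: "op block [error [sub_block address value]]".
ras_ctrl_cmd() {
	op=$1
	case "$op" in
	disable) want=2 ;;	# op block
	enable)  want=3 ;;	# op block error
	inject)  want=6 ;;	# op block error sub_block address value (assumed)
	*)
		echo "unknown op: $op" >&2
		return 1
		;;
	esac
	if [ "$#" -ne "$want" ]; then
		echo "$op expects $((want - 1)) argument(s) after the op" >&2
		return 1
	fi
	# Join all arguments with single spaces, as the echo example does.
	printf '%s\n' "$*"
}

# Print the auto_reboot path for DRM minor $1, using the path from the patch.
ras_auto_reboot_path() {
	printf '/sys/kernel/debug/dri/%s/ras/auto_reboot\n' "$1"
}
```

As a usage example, "ras_ctrl_cmd disable gfx" prints "disable gfx", and the
output can then be redirected to .../ras/ras_ctrl in the same way as the echo
command in the patch. A malformed request such as "ras_ctrl_cmd enable gfx"
returns non-zero without printing a command, so mistakes are caught before
anything reaches the driver.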

* Re: [PATCH] drm/amdgpu/gpuvm: add some additional comments in amdgpu_vm_update_ptes
@ 2019-10-31  7:24     ` Christian König
  0 siblings, 0 replies; 14+ messages in thread
From: Christian König @ 2019-10-31  7:24 UTC (permalink / raw)
  To: Alex Deucher, amd-gfx; +Cc: Alex Deucher

Am 30.10.19 um 19:41 schrieb Alex Deucher:
> To better clarify what is happening in this function.
>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 10 +++++++++-
>   1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index c8ce42200059..3c0bd6472a46 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -1419,6 +1419,9 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params,
>   		uint64_t incr, entry_end, pe_start;
>   		struct amdgpu_bo *pt;
>   
> +		/* make sure that the page tables covering the address range are
> +		 * actually allocated
> +		 */
>   		r = amdgpu_vm_alloc_pts(params->adev, params->vm, &cursor,
>   					params->direct);
>   		if (r)
> @@ -1492,7 +1495,12 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params,
>   		} while (frag_start < entry_end);
>   
>   		if (amdgpu_vm_pt_descendant(adev, &cursor)) {
> -			/* Free all child entries */
> +			/* Free all child entries.
> +			 * Update the tables with the flags and addresses and free up subsequent
> +			 * tables in the case of huge pages or freed up areas.
> +			 * This is the maximum you can free, because all other page tables are not
> +			 * completely covered by the range and so potentially still in use.
> +			 */
>   			while (cursor.pfn < frag_start) {
>   				amdgpu_vm_free_pts(adev, params->vm, &cursor);
>   				amdgpu_vm_pt_next(adev, &cursor);

* Re: [PATCH] drm/amdgpu: Improve RAS documentation
@ 2019-11-06 16:35         ` Alex Deucher
  0 siblings, 0 replies; 14+ messages in thread
From: Alex Deucher @ 2019-11-06 16:35 UTC (permalink / raw)
  To: amd-gfx list; +Cc: Alex Deucher

Ping?


On Wed, Oct 30, 2019 at 2:41 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> Clarify some areas, clean up formatting, add section for
> unrecoverable error handling.
>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> ---
>  Documentation/gpu/amdgpu.rst            | 35 ++++++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 40 ++++++++++++++++++++-----
>  2 files changed, 68 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/gpu/amdgpu.rst b/Documentation/gpu/amdgpu.rst
> index 5b9eaf23558e..1c08d64970ee 100644
> --- a/Documentation/gpu/amdgpu.rst
> +++ b/Documentation/gpu/amdgpu.rst
> @@ -82,12 +82,21 @@ AMDGPU XGMI Support
>  AMDGPU RAS Support
>  ==================
>
> +The AMDGPU RAS interfaces are exposed via sysfs (for informational queries) and
> +debugfs (for error injection).
> +
>  RAS debugfs/sysfs Control and Error Injection Interfaces
>  --------------------------------------------------------
>
>  .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>     :doc: AMDGPU RAS debugfs control interface
>
> +RAS Reboot Behavior for Unrecoverable Errors
> +--------------------------------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +   :doc: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
> +
>  RAS Error Count sysfs Interface
>  -------------------------------
>
> @@ -109,6 +118,32 @@ RAS VRAM Bad Pages sysfs Interface
>  .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>     :internal:
>
> +Sample Code
> +-----------
> +Sample code for testing error injection can be found here:
> +https://cgit.freedesktop.org/mesa/drm/tree/tests/amdgpu/ras_tests.c
> +
> +This is part of the libdrm amdgpu unit tests which cover several areas of the GPU.
> +There are four sets of tests:
> +
> +RAS Basic Test
> +
> +The test verifies the RAS feature enabled status and makes sure the necessary sysfs and debugfs files
> +are present.
> +
> +RAS Query Test
> +
> +This test will check the RAS availability and enablement status for each supported IP block as well as
> +the error counts.
> +
> +RAS Inject Test
> +
> +This test injects errors for each IP.
> +
> +RAS Disable Test
> +
> +This tests disabling of RAS features for each IP block.
> +
>
>  GPU Power/Thermal Controls and Monitoring
>  =========================================
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index dab90c280476..404483437bd3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -220,7 +220,7 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>   * As their names indicate, inject operation will write the
>   * value to the address.
>   *
> - * Second member: struct ras_debug_if::op.
> + * The second member: struct ras_debug_if::op.
>   * It has three kinds of operations.
>   *
>   * - 0: disable RAS on the block. Take ::head as its data.
> @@ -228,14 +228,20 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>   * - 2: inject errors on the block. Take ::inject as its data.
>   *
>   * How to use the interface?
> - * programs:
> - * copy the struct ras_debug_if in your codes and initialize it.
> - * write the struct to the control node.
> + *
> + * Programs
> + *
> + * Copy the struct ras_debug_if into your code and initialize it.
> + * Write the struct to the control node.
> + *
> + * Shells
>   *
>   * .. code-block:: bash
>   *
>   *     echo op block [error [sub_block address value]] > .../ras/ras_ctrl
>   *
> + * Parameters:
> + *
>   * op: disable, enable, inject
>   *     disable: only block is needed
>   *     enable: block and error are needed
> @@ -265,8 +271,10 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>   * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
>   *
>   * .. note::
> - *     Operation is only allowed on blocks which are supported.
> + *     Operations are only allowed on blocks which are supported.
>   *     Please check ras mask at /sys/module/amdgpu/parameters/ras_mask
> + *     to see which blocks support RAS on a particular asic.
> + *
>   */
>  static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *buf,
>                 size_t size, loff_t *pos)
> @@ -322,7 +330,7 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
>   * DOC: AMDGPU RAS debugfs EEPROM table reset interface
>   *
>   * Some boards contain an EEPROM which is used to persistently store a list of
> - * bad pages containing ECC errors detected in vram.  This interface provides
> + * bad pages which experienced ECC errors in vram.  This interface provides
>   * a way to reset the EEPROM, e.g., after testing error injection.
>   *
>   * Usage:
> @@ -362,7 +370,7 @@ static const struct file_operations amdgpu_ras_debugfs_eeprom_ops = {
>  /**
>   * DOC: AMDGPU RAS sysfs Error Count Interface
>   *
> - * It allows user to read the error count for each IP block on the gpu through
> + * It allows the user to read the error count for each IP block on the gpu through
>   * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
>   *
>   * It outputs the multiple lines which report the uncorrected (ue) and corrected
> @@ -1027,6 +1035,24 @@ static int amdgpu_ras_sysfs_remove_all(struct amdgpu_device *adev)
>  }
>  /* sysfs end */
>
> +/**
> + * DOC: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
> + *
> + * Normally when there is an uncorrectable error, the driver will reset
> + * the GPU to recover.  However, for an unrecoverable error, the driver
> + * provides an interface to reboot the system automatically when such
> + * an error occurs.
> + *
> + * The following file in debugfs provides that interface:
> + * /sys/kernel/debug/dri/[0/1/2...]/ras/auto_reboot
> + *
> + * Usage:
> + *
> + * .. code-block:: bash
> + *
> + *     echo true > .../ras/auto_reboot
> + *
> + */
>  /* debugfs begin */
>  static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)
>  {
> --
> 2.23.0
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] drm/amdgpu: Improve RAS documentation
@ 2019-11-06 16:35         ` Alex Deucher
  0 siblings, 0 replies; 14+ messages in thread
From: Alex Deucher @ 2019-11-06 16:35 UTC (permalink / raw)
  To: amd-gfx list; +Cc: Alex Deucher

Ping?


On Wed, Oct 30, 2019 at 2:41 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> Clarify some areas, clean up formatting, add section for
> unrecoverable error handling.
>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> ---
>  Documentation/gpu/amdgpu.rst            | 35 ++++++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 40 ++++++++++++++++++++-----
>  2 files changed, 68 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/gpu/amdgpu.rst b/Documentation/gpu/amdgpu.rst
> index 5b9eaf23558e..1c08d64970ee 100644
> --- a/Documentation/gpu/amdgpu.rst
> +++ b/Documentation/gpu/amdgpu.rst
> @@ -82,12 +82,21 @@ AMDGPU XGMI Support
>  AMDGPU RAS Support
>  ==================
>
> +The AMDGPU RAS interfaces are exposed via sysfs (for informational queries) and
> +debugfs (for error injection).
> +
>  RAS debugfs/sysfs Control and Error Injection Interfaces
>  --------------------------------------------------------
>
>  .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>     :doc: AMDGPU RAS debugfs control interface
>
> +RAS Reboot Behavior for Unrecoverable Errors
> +--------------------------------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +   :doc: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
> +
>  RAS Error Count sysfs Interface
>  -------------------------------
>
> @@ -109,6 +118,32 @@ RAS VRAM Bad Pages sysfs Interface
>  .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>     :internal:
>
> +Sample Code
> +-----------
> +Sample code for testing error injection can be found here:
> +https://cgit.freedesktop.org/mesa/drm/tree/tests/amdgpu/ras_tests.c
> +
> +This is part of the libdrm amdgpu unit tests which cover several areas of the GPU.
> +There are four sets of tests:
> +
> +RAS Basic Test
> +
> +The test verifies the RAS feature enabled status and makes sure the necessary sysfs and debugfs files
> +are present.
> +
> +RAS Query Test
> +
> +This test will check the RAS availability and enablement status for each supported IP block as well as
> +the error counts.
> +
> +RAS Inject Test
> +
> +This test injects errors for each IP.
> +
> +RAS Disable Test
> +
> +This tests disabling of RAS features for each IP block.
> +
>
>  GPU Power/Thermal Controls and Monitoring
>  =========================================
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index dab90c280476..404483437bd3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -220,7 +220,7 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>   * As their names indicate, inject operation will write the
>   * value to the address.
>   *
> - * Second member: struct ras_debug_if::op.
> + * The second member: struct ras_debug_if::op.
>   * It has three kinds of operations.
>   *
>   * - 0: disable RAS on the block. Take ::head as its data.
> @@ -228,14 +228,20 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>   * - 2: inject errors on the block. Take ::inject as its data.
>   *
>   * How to use the interface?
> - * programs:
> - * copy the struct ras_debug_if in your codes and initialize it.
> - * write the struct to the control node.
> + *
> + * Programs
> + *
> + * Copy the struct ras_debug_if in your codes and initialize it.
> + * Write the struct to the control node.
> + *
> + * Shells
>   *
>   * .. code-block:: bash
>   *
>   *     echo op block [error [sub_block address value]] > .../ras/ras_ctrl
>   *
> + * Parameters:
> + *
>   * op: disable, enable, inject
>   *     disable: only block is needed
>   *     enable: block and error are needed
> @@ -265,8 +271,10 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>   * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
>   *
>   * .. note::
> - *     Operation is only allowed on blocks which are supported.
> + *     Operations are only allowed on blocks which are supported.
>   *     Please check ras mask at /sys/module/amdgpu/parameters/ras_mask
> + *     to see which blocks support RAS on a particular asic.
> + *
>   */
>  static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *buf,
>                 size_t size, loff_t *pos)
> @@ -322,7 +330,7 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
>   * DOC: AMDGPU RAS debugfs EEPROM table reset interface
>   *
>   * Some boards contain an EEPROM which is used to persistently store a list of
> - * bad pages containing ECC errors detected in vram.  This interface provides
> + * bad pages which experiences ECC errors in vram.  This interface provides
>   * a way to reset the EEPROM, e.g., after testing error injection.
>   *
>   * Usage:
> @@ -362,7 +370,7 @@ static const struct file_operations amdgpu_ras_debugfs_eeprom_ops = {
>  /**
>   * DOC: AMDGPU RAS sysfs Error Count Interface
>   *
> - * It allows user to read the error count for each IP block on the gpu through
> + * It allows the user to read the error count for each IP block on the gpu through
>   * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
>   *
>   * It outputs the multiple lines which report the uncorrected (ue) and corrected
> @@ -1027,6 +1035,24 @@ static int amdgpu_ras_sysfs_remove_all(struct amdgpu_device *adev)
>  }
>  /* sysfs end */
>
> +/**
> + * DOC: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
> + *
> + * Normally when there is an uncorrectable error, the driver will reset
> + * the GPU to recover.  However, in the event of an unrecoverable error,
> + * the driver provides an interface to reboot the system automatically
> + * in that event.
> + *
> + * The following file in debugfs provides that interface:
> + * /sys/kernel/debug/dri/[0/1/2...]/ras/auto_reboot
> + *
> + * Usage:
> + *
> + * .. code-block:: bash
> + *
> + *     echo true > .../ras/auto_reboot
> + *
> + */
>  /* debugfs begin */
>  static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)
>  {
> --
> 2.23.0
>

^ permalink raw reply	[flat|nested] 14+ messages in thread
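
The shell usage quoted above (echo op block [error [sub_block address value]] > .../ras/ras_ctrl) can be sketched as a small helper that only builds and checks the command line. This is a minimal sketch and not part of the patch: the function name ras_ctrl_line is hypothetical, the argument counts for disable and enable follow the "Parameters" text in the quoted hunk, and the count for inject is an assumption read off the synopsis, because the hunk is cut off before inject is described. The block name gfx and the error name ue are borrowed from the err_count path and the (ue)/(ce) wording elsewhere in the quoted text; whether a given ASIC accepts them depends on ras_mask.

```shell
#!/bin/sh
# Hypothetical helper: print one ras_ctrl command line after checking the
# argument count for the given op. It never writes to debugfs itself.
ras_ctrl_line() {
    op=$1
    case "$op" in
        disable) want=2 ;;   # op block
        enable)  want=3 ;;   # op block error
        inject)  want=6 ;;   # op block error sub_block address value (assumed)
        *) echo "unknown op: $op" >&2; return 1 ;;
    esac
    if [ "$#" -ne "$want" ]; then
        echo "$op expects $((want - 1)) argument(s) after the op" >&2
        return 1
    fi
    # Join all arguments with single spaces, as the control node expects
    # a single text line.
    echo "$*"
}
```

A caller would redirect the output to the control node, for example ras_ctrl_line disable gfx > .../ras/ras_ctrl, with the path left elided as in the patch. Keeping the write outside the helper means the validation can be exercised on a machine without the hardware.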

* Re: [PATCH] drm/amdgpu: Improve RAS documentation
@ 2019-11-06 17:14         ` Zhao, Yong
  0 siblings, 0 replies; 14+ messages in thread
From: Zhao, Yong @ 2019-11-06 17:14 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

See two wording comments inline. With those fixed:

Reviewed-by: Yong Zhao <yong.zhao@amd.com>

On 2019-10-30 2:41 p.m., Alex Deucher wrote:
> Clarify some areas, clean up formatting, add section for
> unrecoverable error handling.
>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> ---
>   Documentation/gpu/amdgpu.rst            | 35 ++++++++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 40 ++++++++++++++++++++-----
>   2 files changed, 68 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/gpu/amdgpu.rst b/Documentation/gpu/amdgpu.rst
> index 5b9eaf23558e..1c08d64970ee 100644
> --- a/Documentation/gpu/amdgpu.rst
> +++ b/Documentation/gpu/amdgpu.rst
> @@ -82,12 +82,21 @@ AMDGPU XGMI Support
>   AMDGPU RAS Support
>   ==================
>   
> +The AMDGPU RAS interfaces are exposed via sysfs (for informational queries) and
> +debugfs (for error injection).
> +
>   RAS debugfs/sysfs Control and Error Injection Interfaces
>   --------------------------------------------------------
>   
>   .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>      :doc: AMDGPU RAS debugfs control interface
>   
> +RAS Reboot Behavior for Unrecoverable Errors
> +--------------------------------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +   :doc: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
> +
>   RAS Error Count sysfs Interface
>   -------------------------------
>   
> @@ -109,6 +118,32 @@ RAS VRAM Bad Pages sysfs Interface
>   .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>      :internal:
>   
> +Sample Code
> +-----------
> +Sample code for testing error injection can be found here:
> +https://cgit.freedesktop.org/mesa/drm/tree/tests/amdgpu/ras_tests.c
> +
> +This is part of the libdrm amdgpu unit tests which cover several areas of the GPU.
> +There are four sets of tests:
> +
> +RAS Basic Test
> +
> +The test verifies the RAS feature enabled status and makes sure the necessary sysfs and debugfs files
> +are present.
> +
> +RAS Query Test
> +
> +This test will check the RAS availability and enablement status for each supported IP block as well as
> +the error counts.

This test checks

> +
> +RAS Inject Test
> +
> +This test injects errors for each IP.
> +
> +RAS Disable Test
> +
> +This tests disabling of RAS features for each IP block.

This tests tests disabling

> +
>   
>   GPU Power/Thermal Controls and Monitoring
>   =========================================
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index dab90c280476..404483437bd3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -220,7 +220,7 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>    * As their names indicate, inject operation will write the
>    * value to the address.
>    *
> - * Second member: struct ras_debug_if::op.
> + * The second member: struct ras_debug_if::op.
>    * It has three kinds of operations.
>    *
>    * - 0: disable RAS on the block. Take ::head as its data.
> @@ -228,14 +228,20 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>    * - 2: inject errors on the block. Take ::inject as its data.
>    *
>    * How to use the interface?
> - * programs:
> - * copy the struct ras_debug_if in your codes and initialize it.
> - * write the struct to the control node.
> + *
> + * Programs
> + *
> + * Copy the struct ras_debug_if in your codes and initialize it.
> + * Write the struct to the control node.
> + *
> + * Shells
>    *
>    * .. code-block:: bash
>    *
>    *	echo op block [error [sub_block address value]] > .../ras/ras_ctrl
>    *
> + * Parameters:
> + *
>    * op: disable, enable, inject
>    *	disable: only block is needed
>    *	enable: block and error are needed
> @@ -265,8 +271,10 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>    * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
>    *
>    * .. note::
> - *	Operation is only allowed on blocks which are supported.
> + *	Operations are only allowed on blocks which are supported.
>    *	Please check ras mask at /sys/module/amdgpu/parameters/ras_mask
> + *	to see which blocks support RAS on a particular asic.
> + *
>    */
>   static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *buf,
>   		size_t size, loff_t *pos)
> @@ -322,7 +330,7 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
>    * DOC: AMDGPU RAS debugfs EEPROM table reset interface
>    *
>    * Some boards contain an EEPROM which is used to persistently store a list of
> - * bad pages containing ECC errors detected in vram.  This interface provides
> + * bad pages which experiences ECC errors in vram.  This interface provides
>    * a way to reset the EEPROM, e.g., after testing error injection.
>    *
>    * Usage:
> @@ -362,7 +370,7 @@ static const struct file_operations amdgpu_ras_debugfs_eeprom_ops = {
>   /**
>    * DOC: AMDGPU RAS sysfs Error Count Interface
>    *
> - * It allows user to read the error count for each IP block on the gpu through
> + * It allows the user to read the error count for each IP block on the gpu through
>    * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
>    *
>    * It outputs the multiple lines which report the uncorrected (ue) and corrected
> @@ -1027,6 +1035,24 @@ static int amdgpu_ras_sysfs_remove_all(struct amdgpu_device *adev)
>   }
>   /* sysfs end */
>   
> +/**
> + * DOC: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
> + *
> + * Normally when there is an uncorrectable error, the driver will reset
> + * the GPU to recover.  However, in the event of an unrecoverable error,
> + * the driver provides an interface to reboot the system automatically
> + * in that event.
> + *
> + * The following file in debugfs provides that interface:
> + * /sys/kernel/debug/dri/[0/1/2...]/ras/auto_reboot
> + *
> + * Usage:
> + *
> + * .. code-block:: bash
> + *
> + *	echo true > .../ras/auto_reboot
> + *
> + */
>   /* debugfs begin */
>   static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)
>   {

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH] drm/amdgpu: Improve RAS documentation
@ 2019-11-06 17:31             ` Russell, Kent
  0 siblings, 0 replies; 14+ messages in thread
From: Russell, Kent @ 2019-11-06 17:31 UTC (permalink / raw)
  To: Zhao, Yong, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

I think you meant "This test tests" instead of "This tests tests" for your 2nd comment, but agreed on the consistent verb tenses.

 Kent

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Zhao, Yong
Sent: Wednesday, November 6, 2019 12:15 PM
To: amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Improve RAS documentation

See two wording comments inline. With those fixed:

Reviewed-by: Yong Zhao <yong.zhao@amd.com>

On 2019-10-30 2:41 p.m., Alex Deucher wrote:
> Clarify some areas, clean up formatting, add section for unrecoverable 
> error handling.
>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> ---
>   Documentation/gpu/amdgpu.rst            | 35 ++++++++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 40 ++++++++++++++++++++-----
>   2 files changed, 68 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/gpu/amdgpu.rst b/Documentation/gpu/amdgpu.rst
> index 5b9eaf23558e..1c08d64970ee 100644
> --- a/Documentation/gpu/amdgpu.rst
> +++ b/Documentation/gpu/amdgpu.rst
> @@ -82,12 +82,21 @@ AMDGPU XGMI Support
>   AMDGPU RAS Support
>   ==================
>   
> +The AMDGPU RAS interfaces are exposed via sysfs (for informational 
> +queries) and debugfs (for error injection).
> +
>   RAS debugfs/sysfs Control and Error Injection Interfaces
>   --------------------------------------------------------
>   
>   .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>      :doc: AMDGPU RAS debugfs control interface
>   
> +RAS Reboot Behavior for Unrecoverable Errors
> +--------------------------------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +   :doc: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
> +
>   RAS Error Count sysfs Interface
>   -------------------------------
>   
> @@ -109,6 +118,32 @@ RAS VRAM Bad Pages sysfs Interface
>   .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>      :internal:
>   
> +Sample Code
> +-----------
> +Sample code for testing error injection can be found here:
> +https://cgit.freedesktop.org/mesa/drm/tree/tests/amdgpu/ras_tests.c
> +
> +This is part of the libdrm amdgpu unit tests which cover several areas of the GPU.
> +There are four sets of tests:
> +
> +RAS Basic Test
> +
> +The test verifies the RAS feature enabled status and makes sure the 
> +necessary sysfs and debugfs files are present.
> +
> +RAS Query Test
> +
> +This test will check the RAS availability and enablement status for 
> +each supported IP block as well as the error counts.

This test checks

> +
> +RAS Inject Test
> +
> +This test injects errors for each IP.
> +
> +RAS Disable Test
> +
> +This tests disabling of RAS features for each IP block.

This tests tests disabling

> +
>   
>   GPU Power/Thermal Controls and Monitoring
>   =========================================
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index dab90c280476..404483437bd3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -220,7 +220,7 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>    * As their names indicate, inject operation will write the
>    * value to the address.
>    *
> - * Second member: struct ras_debug_if::op.
> + * The second member: struct ras_debug_if::op.
>    * It has three kinds of operations.
>    *
>    * - 0: disable RAS on the block. Take ::head as its data.
> @@ -228,14 +228,20 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>    * - 2: inject errors on the block. Take ::inject as its data.
>    *
>    * How to use the interface?
> - * programs:
> - * copy the struct ras_debug_if in your codes and initialize it.
> - * write the struct to the control node.
> + *
> + * Programs
> + *
> + * Copy the struct ras_debug_if in your codes and initialize it.
> + * Write the struct to the control node.
> + *
> + * Shells
>    *
>    * .. code-block:: bash
>    *
>    *	echo op block [error [sub_block address value]] > .../ras/ras_ctrl
>    *
> + * Parameters:
> + *
>    * op: disable, enable, inject
>    *	disable: only block is needed
>    *	enable: block and error are needed
> @@ -265,8 +271,10 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>    * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
>    *
>    * .. note::
> - *	Operation is only allowed on blocks which are supported.
> + *	Operations are only allowed on blocks which are supported.
>    *	Please check ras mask at /sys/module/amdgpu/parameters/ras_mask
> + *	to see which blocks support RAS on a particular asic.
> + *
>    */
>   static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *buf,
>   		size_t size, loff_t *pos)
> @@ -322,7 +330,7 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
>    * DOC: AMDGPU RAS debugfs EEPROM table reset interface
>    *
>    * Some boards contain an EEPROM which is used to persistently store a list of
> - * bad pages containing ECC errors detected in vram.  This interface provides
> + * bad pages which experiences ECC errors in vram.  This interface provides
>    * a way to reset the EEPROM, e.g., after testing error injection.
>    *
>    * Usage:
> @@ -362,7 +370,7 @@ static const struct file_operations amdgpu_ras_debugfs_eeprom_ops = {
>   /**
>    * DOC: AMDGPU RAS sysfs Error Count Interface
>    *
> - * It allows user to read the error count for each IP block on the gpu through
> + * It allows the user to read the error count for each IP block on the gpu through
>    * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
>    *
>    * It outputs the multiple lines which report the uncorrected (ue) and corrected
> @@ -1027,6 +1035,24 @@ static int amdgpu_ras_sysfs_remove_all(struct amdgpu_device *adev)
>   }
>   /* sysfs end */
>   
> +/**
> + * DOC: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
> + *
> + * Normally when there is an uncorrectable error, the driver will reset
> + * the GPU to recover.  However, in the event of an unrecoverable error,
> + * the driver provides an interface to reboot the system automatically
> + * in that event.
> + *
> + * The following file in debugfs provides that interface:
> + * /sys/kernel/debug/dri/[0/1/2...]/ras/auto_reboot
> + *
> + * Usage:
> + *
> + * .. code-block:: bash
> + *
> + *	echo true > .../ras/auto_reboot
> + *
> + */
>   /* debugfs begin */
>   static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)
>   {

^ permalink raw reply	[flat|nested] 14+ messages in thread
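
The two file locations named in the quoted documentation, the debugfs auto_reboot node and the per-block sysfs error count, can be sketched as path builders. This is a minimal sketch under assumptions: the function names are hypothetical, the index passed in is assumed to be the DRI minor or card number shown as [0/1/2...] in the patch, and the block names are assumed to be the prefixes shown as [gfx/sdma/...].

```shell
#!/bin/sh
# Hypothetical helpers that only build the paths quoted in the patch text.

# $1: DRI minor number (0, 1, 2, ...)
ras_auto_reboot_path() {
    echo "/sys/kernel/debug/dri/$1/ras/auto_reboot"
}

# $1: card index, $2: block name such as gfx or sdma
ras_err_count_path() {
    echo "/sys/class/drm/card$1/device/ras/${2}_err_count"
}
```

With these, the usage from the patch becomes echo true > "$(ras_auto_reboot_path 0)" and the error counts for a block can be read with cat "$(ras_err_count_path 0 gfx)". Writing to debugfs normally requires root and a mounted debugfs, and neither file exists unless the driver exposes RAS on that device.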

* RE: [PATCH] drm/amdgpu: Improve RAS documentation
@ 2019-11-07  1:10             ` Chen, Guchun
  0 siblings, 0 replies; 14+ messages in thread
From: Chen, Guchun @ 2019-11-07  1:10 UTC (permalink / raw)
  To: Alex Deucher, amd-gfx list; +Cc: Deucher, Alexander

One comment.
With that fixed, this patch is: Reviewed-by: Guchun Chen <guchun.chen@amd.com>

Regards,
Guchun

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Alex Deucher
Sent: Thursday, November 7, 2019 12:35 AM
To: amd-gfx list <amd-gfx@lists.freedesktop.org>
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
Subject: Re: [PATCH] drm/amdgpu: Improve RAS documentation

Ping?


On Wed, Oct 30, 2019 at 2:41 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> Clarify some areas, clean up formatting, add section for unrecoverable 
> error handling.
>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> ---
>  Documentation/gpu/amdgpu.rst            | 35 ++++++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 40 ++++++++++++++++++++-----
>  2 files changed, 68 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/gpu/amdgpu.rst b/Documentation/gpu/amdgpu.rst
> index 5b9eaf23558e..1c08d64970ee 100644
> --- a/Documentation/gpu/amdgpu.rst
> +++ b/Documentation/gpu/amdgpu.rst
> @@ -82,12 +82,21 @@ AMDGPU XGMI Support
>  AMDGPU RAS Support
>  ==================
>
> +The AMDGPU RAS interfaces are exposed via sysfs (for informational 
> +queries) and debugfs (for error injection).
[Guchun] It would be better to also mention block enablement/disablement for the
debugfs interface here, since the RAS driver supports it.

Regards,
Guchun

> +
>  RAS debugfs/sysfs Control and Error Injection Interfaces
>  --------------------------------------------------------
>
>  .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>     :doc: AMDGPU RAS debugfs control interface
>
> +RAS Reboot Behavior for Unrecoverable Errors
> +--------------------------------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +   :doc: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
> +
>  RAS Error Count sysfs Interface
>  -------------------------------
>
> @@ -109,6 +118,32 @@ RAS VRAM Bad Pages sysfs Interface
>  .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>     :internal:
>
> +Sample Code
> +-----------
> +Sample code for testing error injection can be found here:
> +https://cgit.freedesktop.org/mesa/drm/tree/tests/amdgpu/ras_tests.c
> +
> +This is part of the libdrm amdgpu unit tests which cover several areas of the GPU.
> +There are four sets of tests:
> +
> +RAS Basic Test
> +
> +The test verifies the RAS feature enabled status and makes sure the 
> +necessary sysfs and debugfs files are present.
> +
> +RAS Query Test
> +
> +This test will check the RAS availability and enablement status for 
> +each supported IP block as well as the error counts.
> +
> +RAS Inject Test
> +
> +This test injects errors for each IP.
> +
> +RAS Disable Test
> +
> +This tests disabling of RAS features for each IP block.
> +
>
>  GPU Power/Thermal Controls and Monitoring
>  =========================================
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index dab90c280476..404483437bd3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -220,7 +220,7 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>   * As their names indicate, inject operation will write the
>   * value to the address.
>   *
> - * Second member: struct ras_debug_if::op.
> + * The second member: struct ras_debug_if::op.
>   * It has three kinds of operations.
>   *
>   * - 0: disable RAS on the block. Take ::head as its data.
> @@ -228,14 +228,20 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>   * - 2: inject errors on the block. Take ::inject as its data.
>   *
>   * How to use the interface?
> - * programs:
> - * copy the struct ras_debug_if in your codes and initialize it.
> - * write the struct to the control node.
> + *
> + * Programs
> + *
> + * Copy the struct ras_debug_if in your codes and initialize it.
> + * Write the struct to the control node.
> + *
> + * Shells
>   *
>   * .. code-block:: bash
>   *
>   *     echo op block [error [sub_block address value]] > .../ras/ras_ctrl
>   *
> + * Parameters:
> + *
>   * op: disable, enable, inject
>   *     disable: only block is needed
>   *     enable: block and error are needed
> @@ -265,8 +271,10 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>   * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
>   *
>   * .. note::
> - *     Operation is only allowed on blocks which are supported.
> + *     Operations are only allowed on blocks which are supported.
>   *     Please check ras mask at /sys/module/amdgpu/parameters/ras_mask
> + *     to see which blocks support RAS on a particular asic.
> + *
>   */
>  static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *buf,
>                 size_t size, loff_t *pos)
> @@ -322,7 +330,7 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
>   * DOC: AMDGPU RAS debugfs EEPROM table reset interface
>   *
>   * Some boards contain an EEPROM which is used to persistently store a list of
> - * bad pages containing ECC errors detected in vram.  This interface provides
> + * bad pages which experiences ECC errors in vram.  This interface provides
>   * a way to reset the EEPROM, e.g., after testing error injection.
>   *
>   * Usage:
> @@ -362,7 +370,7 @@ static const struct file_operations amdgpu_ras_debugfs_eeprom_ops = {
>  /**
>   * DOC: AMDGPU RAS sysfs Error Count Interface
>   *
> - * It allows user to read the error count for each IP block on the gpu through
> + * It allows the user to read the error count for each IP block on the gpu through
>   * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
>   *
>   * It outputs the multiple lines which report the uncorrected (ue) and corrected
> @@ -1027,6 +1035,24 @@ static int amdgpu_ras_sysfs_remove_all(struct amdgpu_device *adev)
>  }
>  /* sysfs end */
>
> +/**
> + * DOC: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
> + *
> + * Normally when there is an uncorrectable error, the driver will reset
> + * the GPU to recover.  However, in the event of an unrecoverable error,
> + * the driver provides an interface to reboot the system automatically
> + * in that event.
> + *
> + * The following file in debugfs provides that interface:
> + * /sys/kernel/debug/dri/[0/1/2...]/ras/auto_reboot
> + *
> + * Usage:
> + *
> + * .. code-block:: bash
> + *
> + *     echo true > .../ras/auto_reboot
> + *
> + */
>  /* debugfs begin */
>  static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)
>  {
> --
> 2.23.0
>
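
For reference, the shell usage documented in the quoted kernel-doc
(echo op block [error [sub_block address value]] > .../ras/ras_ctrl) could be
wrapped roughly as below. This is only a sketch: ras_ctrl_line is a
hypothetical helper, not part of the driver or libdrm. It builds one command
line and checks the argument count per op, and it never writes to debugfs
itself. The argument count for inject is an assumption read off the bracketed
grammar, since the quoted hunk ends before the inject line, and "ue" as an
error value is likewise assumed.

```bash
# Hypothetical helper: build one ras_ctrl command line following the grammar
#   op block [error [sub_block address value]]
# It only prints the line; the caller decides where to redirect it.
ras_ctrl_line() {
    op=$1
    case "$op" in
        disable) want=2 ;;  # op block
        enable)  want=3 ;;  # op block error
        inject)  want=6 ;;  # op block error sub_block address value (assumed)
        *) echo "unknown op: $op" >&2; return 1 ;;
    esac
    if [ "$#" -ne "$want" ]; then
        echo "$op expects $((want - 1)) argument(s) after the op" >&2
        return 1
    fi
    # Join all arguments with single spaces, as the control node expects one line.
    printf '%s\n' "$*"
}
```

The output would then be redirected to the control node on the system under
test, for example ras_ctrl_line disable gfx > "$RAS_CTRL", where RAS_CTRL is a
variable the caller sets to the .../ras/ras_ctrl path (left elided here, as in
the patch).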
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

end of thread, other threads:[~2019-11-07  1:10 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
2019-10-30 18:41 [PATCH] drm/amdgpu/gpuvm: add some additional comments in amdgpu_vm_update_ptes Alex Deucher
2019-10-30 18:41 ` Alex Deucher
     [not found] ` <20191030184134.250234-1-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
2019-10-30 18:41   ` [PATCH] drm/amdgpu: Improve RAS documentation Alex Deucher
2019-10-30 18:41     ` Alex Deucher
     [not found]     ` <20191030184134.250234-2-alexander.deucher-5C7GfCeVMHo@public.gmane.org>
2019-11-06 16:35       ` Alex Deucher
2019-11-06 16:35         ` Alex Deucher
     [not found]         ` <CADnq5_Mjr+U8sspvjm-KMqX4VTvdUHdDa6GgnNykQW+QvTMqXQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2019-11-07  1:10           ` Chen, Guchun
2019-11-07  1:10             ` Chen, Guchun
2019-11-06 17:14       ` Zhao, Yong
2019-11-06 17:14         ` Zhao, Yong
     [not found]         ` <4d4b67a3-25e0-a52d-67d7-06bb333c53b0-5C7GfCeVMHo@public.gmane.org>
2019-11-06 17:31           ` Russell, Kent
2019-11-06 17:31             ` Russell, Kent
2019-10-31  7:24   ` [PATCH] drm/amdgpu/gpuvm: add some additional comments in amdgpu_vm_update_ptes Christian König
2019-10-31  7:24     ` Christian König