* [PATCH] drm/amdgpu/gpuvm: add some additional comments in amdgpu_vm_update_ptes
From: Alex Deucher @ 2019-10-30 18:41 UTC
To: amd-gfx; +Cc: Alex Deucher

To better clarify what is happening in this function.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index c8ce42200059..3c0bd6472a46 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1419,6 +1419,9 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params,
 	uint64_t incr, entry_end, pe_start;
 	struct amdgpu_bo *pt;
 
+	/* make sure that the page tables covering the address range are
+	 * actually allocated
+	 */
 	r = amdgpu_vm_alloc_pts(params->adev, params->vm, &cursor,
 				params->direct);
 	if (r)
@@ -1492,7 +1495,12 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params,
 		} while (frag_start < entry_end);
 
 		if (amdgpu_vm_pt_descendant(adev, &cursor)) {
-			/* Free all child entries */
+			/* Free all child entries.
+			 * Update the tables with the flags and addresses and free up subsequent
+			 * tables in the case of huge pages or freed up areas.
+			 * This is the maximum you can free, because all other page tables are not
+			 * completely covered by the range and so potentially still in use.
+			 */
 			while (cursor.pfn < frag_start) {
 				amdgpu_vm_free_pts(adev, params->vm, &cursor);
 				amdgpu_vm_pt_next(adev, &cursor);
-- 
2.23.0

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
* [PATCH] drm/amdgpu: Improve RAS documentation
From: Alex Deucher @ 2019-10-30 18:41 UTC
To: amd-gfx; +Cc: Alex Deucher

Clarify some areas, clean up formatting, add section for
unrecoverable error handling.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 Documentation/gpu/amdgpu.rst            | 35 ++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 40 ++++++++++++++++++++-----
 2 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/Documentation/gpu/amdgpu.rst b/Documentation/gpu/amdgpu.rst
index 5b9eaf23558e..1c08d64970ee 100644
--- a/Documentation/gpu/amdgpu.rst
+++ b/Documentation/gpu/amdgpu.rst
@@ -82,12 +82,21 @@ AMDGPU XGMI Support
 AMDGPU RAS Support
 ==================
 
+The AMDGPU RAS interfaces are exposed via sysfs (for informational queries) and
+debugfs (for error injection).
+
 RAS debugfs/sysfs Control and Error Injection Interfaces
 --------------------------------------------------------
 
 .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
    :doc: AMDGPU RAS debugfs control interface
 
+RAS Reboot Behavior for Unrecoverable Errors
+--------------------------------------------------------
+
+.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+   :doc: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
+
 RAS Error Count sysfs Interface
 -------------------------------
 
@@ -109,6 +118,32 @@ RAS VRAM Bad Pages sysfs Interface
 .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
    :internal:
 
+Sample Code
+-----------
+Sample code for testing error injection can be found here:
+https://cgit.freedesktop.org/mesa/drm/tree/tests/amdgpu/ras_tests.c
+
+This is part of the libdrm amdgpu unit tests which cover several areas of the GPU.
+There are four sets of tests:
+
+RAS Basic Test
+
+The test verifies the RAS feature enabled status and makes sure the necessary sysfs and debugfs files
+are present.
+
+RAS Query Test
+
+This test will check the RAS availability and enablement status for each supported IP block as well as
+the error counts.
+
+RAS Inject Test
+
+This test injects errors for each IP.
+
+RAS Disable Test
+
+This tests disabling of RAS features for each IP block.
+
 
 GPU Power/Thermal Controls and Monitoring
 =========================================
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index dab90c280476..404483437bd3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -220,7 +220,7 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
  * As their names indicate, inject operation will write the
  * value to the address.
  *
- * Second member: struct ras_debug_if::op.
+ * The second member: struct ras_debug_if::op.
  * It has three kinds of operations.
  *
  * - 0: disable RAS on the block. Take ::head as its data.
@@ -228,14 +228,20 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
  * - 2: inject errors on the block. Take ::inject as its data.
  *
  * How to use the interface?
- * programs:
- * copy the struct ras_debug_if in your codes and initialize it.
- * write the struct to the control node.
+ *
+ * Programs
+ *
+ * Copy the struct ras_debug_if in your codes and initialize it.
+ * Write the struct to the control node.
+ *
+ * Shells
  *
  * .. code-block:: bash
  *
  *	echo op block [error [sub_block address value]] > .../ras/ras_ctrl
  *
+ * Parameters:
+ *
  * op: disable, enable, inject
  *	disable: only block is needed
  *	enable: block and error are needed
@@ -265,8 +271,10 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
  * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
  *
  * .. note::
- *	Operation is only allowed on blocks which are supported.
+ *	Operations are only allowed on blocks which are supported.
  *	Please check ras mask at /sys/module/amdgpu/parameters/ras_mask
+ *	to see which blocks support RAS on a particular asic.
+ *
  */
 static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *buf,
 		size_t size, loff_t *pos)
@@ -322,7 +330,7 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
  * DOC: AMDGPU RAS debugfs EEPROM table reset interface
  *
  * Some boards contain an EEPROM which is used to persistently store a list of
- * bad pages containing ECC errors detected in vram. This interface provides
+ * bad pages which experiences ECC errors in vram. This interface provides
  * a way to reset the EEPROM, e.g., after testing error injection.
  *
  * Usage:
@@ -362,7 +370,7 @@ static const struct file_operations amdgpu_ras_debugfs_eeprom_ops = {
 /**
  * DOC: AMDGPU RAS sysfs Error Count Interface
  *
- * It allows user to read the error count for each IP block on the gpu through
+ * It allows the user to read the error count for each IP block on the gpu through
  * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
  *
  * It outputs the multiple lines which report the uncorrected (ue) and corrected
@@ -1027,6 +1035,24 @@ static int amdgpu_ras_sysfs_remove_all(struct amdgpu_device *adev)
 }
 /* sysfs end */
 
+/**
+ * DOC: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
+ *
+ * Normally when there is an uncorrectable error, the driver will reset
+ * the GPU to recover. However, in the event of an unrecoverable error,
+ * the driver provides an interface to reboot the system automatically
+ * in that event.
+ *
+ * The following file in debugfs provides that interface:
+ * /sys/kernel/debug/dri/[0/1/2...]/ras/auto_reboot
+ *
+ * Usage:
+ *
+ * .. code-block:: bash
+ *
+ *	echo true > .../ras/auto_reboot
+ *
+ */
 /* debugfs begin */
 static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)
 {
-- 
2.23.0
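
Editorial sketch, not part of the patch: the shell usage documented in the patch above (`echo op block [error [sub_block address value]] > .../ras/ras_ctrl` and `echo true > .../ras/auto_reboot`) could be wrapped in small helpers along the following lines. The function names `ras_ctrl_write` and `ras_auto_reboot` are hypothetical. The argument counts for `disable` and `enable` follow the "Parameters" text in the patch; requiring all six fields for `inject` is an assumption based on the syntax line, since the quoted hunks do not spell it out. The sketch also assumes that `ras_ctrl` and `auto_reboot` live in the same `ras` directory.

```shell
#!/bin/sh
# Hypothetical helpers; the function names are illustrative only and are
# not provided by the amdgpu driver or by libdrm.

# ras_ctrl_write RAS_DIR OP BLOCK [ERROR [SUB_BLOCK ADDRESS VALUE]]
# Checks the argument count for each op, then writes the command string
# to RAS_DIR/ras_ctrl, like "echo op block ... > .../ras/ras_ctrl".
ras_ctrl_write() {
    [ "$#" -ge 2 ] || return 1
    ras_dir=$1
    shift
    case "$1" in
        disable) [ "$#" -eq 2 ] || return 1 ;;   # op block
        enable)  [ "$#" -eq 3 ] || return 1 ;;   # op block error
        inject)  [ "$#" -eq 6 ] || return 1 ;;   # assumed: all six fields
        *)       return 1 ;;
    esac
    printf '%s\n' "$*" > "$ras_dir/ras_ctrl"
}

# ras_auto_reboot RAS_DIR
# Equivalent to "echo true > .../ras/auto_reboot".
ras_auto_reboot() {
    [ "$#" -eq 1 ] || return 1
    printf 'true\n' > "$1/auto_reboot"
}
```

On a real system this would be invoked as, for example, `ras_ctrl_write /sys/kernel/debug/dri/0/ras disable umc`, run as root with debugfs mounted. The card index `0` and the block name `umc` are assumed values; the note in the patch points to /sys/module/amdgpu/parameters/ras_mask for checking which blocks support RAS on a given asic.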
* Re: [PATCH] drm/amdgpu: Improve RAS documentation
From: Alex Deucher @ 2019-11-06 16:35 UTC
To: amd-gfx list; +Cc: Alex Deucher

Ping?

On Wed, Oct 30, 2019 at 2:41 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> Clarify some areas, clean up formatting, add section for
> unrecoverable error handling.
>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> ---
>  Documentation/gpu/amdgpu.rst            | 35 ++++++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 40 ++++++++++++++++++++-----
>  2 files changed, 68 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/gpu/amdgpu.rst b/Documentation/gpu/amdgpu.rst
> index 5b9eaf23558e..1c08d64970ee 100644
> --- a/Documentation/gpu/amdgpu.rst
> +++ b/Documentation/gpu/amdgpu.rst
> @@ -82,12 +82,21 @@ AMDGPU XGMI Support
>  AMDGPU RAS Support
>  ==================
>
> +The AMDGPU RAS interfaces are exposed via sysfs (for informational queries) and
> +debugfs (for error injection).
> +
>  RAS debugfs/sysfs Control and Error Injection Interfaces
>  --------------------------------------------------------
>
>  .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>     :doc: AMDGPU RAS debugfs control interface
>
> +RAS Reboot Behavior for Unrecoverable Errors
> +--------------------------------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +   :doc: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
> +
>  RAS Error Count sysfs Interface
>  -------------------------------
>
> @@ -109,6 +118,32 @@ RAS VRAM Bad Pages sysfs Interface
>  .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>     :internal:
>
> +Sample Code
> +-----------
> +Sample code for testing error injection can be found here:
> +https://cgit.freedesktop.org/mesa/drm/tree/tests/amdgpu/ras_tests.c
> +
> +This is part of the libdrm amdgpu unit tests which cover several areas of the GPU.
> +There are four sets of tests:
> +
> +RAS Basic Test
> +
> +The test verifies the RAS feature enabled status and makes sure the necessary sysfs and debugfs files
> +are present.
> +
> +RAS Query Test
> +
> +This test will check the RAS availability and enablement status for each supported IP block as well as
> +the error counts.
> +
> +RAS Inject Test
> +
> +This test injects errors for each IP.
> +
> +RAS Disable Test
> +
> +This tests disabling of RAS features for each IP block.
> +
>
>  GPU Power/Thermal Controls and Monitoring
>  =========================================
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index dab90c280476..404483437bd3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -220,7 +220,7 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>   * As their names indicate, inject operation will write the
>   * value to the address.
>   *
> - * Second member: struct ras_debug_if::op.
> + * The second member: struct ras_debug_if::op.
>   * It has three kinds of operations.
>   *
>   * - 0: disable RAS on the block. Take ::head as its data.
> @@ -228,14 +228,20 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>   * - 2: inject errors on the block. Take ::inject as its data.
>   *
>   * How to use the interface?
> - * programs:
> - * copy the struct ras_debug_if in your codes and initialize it.
> - * write the struct to the control node.
> + *
> + * Programs
> + *
> + * Copy the struct ras_debug_if in your codes and initialize it.
> + * Write the struct to the control node.
> + *
> + * Shells
>   *
>   * .. code-block:: bash
>   *
>   *	echo op block [error [sub_block address value]] > .../ras/ras_ctrl
>   *
> + * Parameters:
> + *
>   * op: disable, enable, inject
>   *	disable: only block is needed
>   *	enable: block and error are needed
> @@ -265,8 +271,10 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>   * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
>   *
>   * .. note::
> - *	Operation is only allowed on blocks which are supported.
> + *	Operations are only allowed on blocks which are supported.
>   *	Please check ras mask at /sys/module/amdgpu/parameters/ras_mask
> + *	to see which blocks support RAS on a particular asic.
> + *
>   */
>  static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *buf,
>  		size_t size, loff_t *pos)
> @@ -322,7 +330,7 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
>   * DOC: AMDGPU RAS debugfs EEPROM table reset interface
>   *
>   * Some boards contain an EEPROM which is used to persistently store a list of
> - * bad pages containing ECC errors detected in vram. This interface provides
> + * bad pages which experiences ECC errors in vram. This interface provides
>   * a way to reset the EEPROM, e.g., after testing error injection.
>   *
>   * Usage:
> @@ -362,7 +370,7 @@ static const struct file_operations amdgpu_ras_debugfs_eeprom_ops = {
>  /**
>   * DOC: AMDGPU RAS sysfs Error Count Interface
>   *
> - * It allows user to read the error count for each IP block on the gpu through
> + * It allows the user to read the error count for each IP block on the gpu through
>   * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
>   *
>   * It outputs the multiple lines which report the uncorrected (ue) and corrected
> @@ -1027,6 +1035,24 @@ static int amdgpu_ras_sysfs_remove_all(struct amdgpu_device *adev)
>  }
>  /* sysfs end */
>
> +/**
> + * DOC: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
> + *
> + * Normally when there is an uncorrectable error, the driver will reset
> + * the GPU to recover. However, in the event of an unrecoverable error,
> + * the driver provides an interface to reboot the system automatically
> + * in that event.
> + *
> + * The following file in debugfs provides that interface:
> + * /sys/kernel/debug/dri/[0/1/2...]/ras/auto_reboot
> + *
> + * Usage:
> + *
> + * .. code-block:: bash
> + *
> + *	echo true > .../ras/auto_reboot
> + *
> + */
>  /* debugfs begin */
>  static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)
>  {
> --
> 2.23.0
>
* RE: [PATCH] drm/amdgpu: Improve RAS documentation @ 2019-11-07 1:10 ` Chen, Guchun 0 siblings, 0 replies; 14+ messages in thread From: Chen, Guchun @ 2019-11-07 1:10 UTC (permalink / raw) To: Alex Deucher, amd-gfx list; +Cc: Deucher, Alexander One comment. With that fixed, this patch is: Reviewed-by: Guchun Chen <guchun.chen@amd.com> Regards, Guchun -----Original Message----- From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Alex Deucher Sent: Thursday, November 7, 2019 12:35 AM To: amd-gfx list <amd-gfx@lists.freedesktop.org> Cc: Deucher, Alexander <Alexander.Deucher@amd.com> Subject: Re: [PATCH] drm/amdgpu: Improve RAS documentation Ping? On Wed, Oct 30, 2019 at 2:41 PM Alex Deucher <alexdeucher@gmail.com> wrote: > > Clarify some areas, clean up formatting, add section for unrecoverable > error handling. > > Signed-off-by: Alex Deucher <alexander.deucher@amd.com> > --- > Documentation/gpu/amdgpu.rst | 35 ++++++++++++++++++++++ > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 40 > ++++++++++++++++++++----- > 2 files changed, 68 insertions(+), 7 deletions(-) > > diff --git a/Documentation/gpu/amdgpu.rst > b/Documentation/gpu/amdgpu.rst index 5b9eaf23558e..1c08d64970ee 100644 > --- a/Documentation/gpu/amdgpu.rst > +++ b/Documentation/gpu/amdgpu.rst > @@ -82,12 +82,21 @@ AMDGPU XGMI Support AMDGPU RAS Support > ================== > > +The AMDGPU RAS interfaces are exposed via sysfs (for informational > +queries) and debugfs (for error injection). [Guchun]It’s better we add block enablement/disablement statement in debugfs interface. RAS driver supports this. Regards, Guchun > + > RAS debugfs/sysfs Control and Error Injection Interfaces > -------------------------------------------------------- > > .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > :doc: AMDGPU RAS debugfs control interface > > +RAS Reboot Behavior for Unrecoverable Errors > +-------------------------------------------------------- > + > +.. 
kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > + :doc: AMDGPU RAS Reboot Behavior for Unrecoverable Errors > + > RAS Error Count sysfs Interface > ------------------------------- > > @@ -109,6 +118,32 @@ RAS VRAM Bad Pages sysfs Interface .. > kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > :internal: > > +Sample Code > +----------- > +Sample code for testing error injection can be found here: > +https://cgit.freedesktop.org/mesa/drm/tree/tests/amdgpu/ras_tests.c > + > +This is part of the libdrm amdgpu unit tests which cover several areas of the GPU. > +There are four sets of tests: > + > +RAS Basic Test > + > +The test verifies the RAS feature enabled status and makes sure the > +necessary sysfs and debugfs files are present. > + > +RAS Query Test > + > +This test will check the RAS availability and enablement status for > +each supported IP block as well as the error counts. > + > +RAS Inject Test > + > +This test injects errors for each IP. > + > +RAS Disable Test > + > +This tests disabling of RAS features for each IP block. > + > > GPU Power/Thermal Controls and Monitoring > ========================================= > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > index dab90c280476..404483437bd3 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > @@ -220,7 +220,7 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev, > * As their names indicate, inject operation will write the > * value to the address. > * > - * Second member: struct ras_debug_if::op. > + * The second member: struct ras_debug_if::op. > * It has three kinds of operations. > * > * - 0: disable RAS on the block. Take ::head as its data. > @@ -228,14 +228,20 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev, > * - 2: inject errors on the block. Take ::inject as its data. > * > * How to use the interface? 
> - * programs: > - * copy the struct ras_debug_if in your codes and initialize it. > - * write the struct to the control node. > + * > + * Programs > + * > + * Copy the struct ras_debug_if in your codes and initialize it. > + * Write the struct to the control node. > + * > + * Shells > * > * .. code-block:: bash > * > * echo op block [error [sub_block address value]] > .../ras/ras_ctrl > * > + * Parameters: > + * > * op: disable, enable, inject > * disable: only block is needed > * enable: block and error are needed > @@ -265,8 +271,10 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev, > * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count > * > * .. note:: > - * Operation is only allowed on blocks which are supported. > + * Operations are only allowed on blocks which are supported. > * Please check ras mask at /sys/module/amdgpu/parameters/ras_mask > + * to see which blocks support RAS on a particular asic. > + * > */ > static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *buf, > size_t size, loff_t *pos) @@ -322,7 +330,7 @@ static > ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user * > * DOC: AMDGPU RAS debugfs EEPROM table reset interface > * > * Some boards contain an EEPROM which is used to persistently store > a list of > - * bad pages containing ECC errors detected in vram. This interface > provides > + * bad pages which experiences ECC errors in vram. This interface > + provides > * a way to reset the EEPROM, e.g., after testing error injection. 
> * > * Usage: > @@ -362,7 +370,7 @@ static const struct file_operations > amdgpu_ras_debugfs_eeprom_ops = { > /** > * DOC: AMDGPU RAS sysfs Error Count Interface > * > - * It allows user to read the error count for each IP block on the > gpu through > + * It allows the user to read the error count for each IP block on > + the gpu through > * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count > * > * It outputs the multiple lines which report the uncorrected (ue) > and corrected @@ -1027,6 +1035,24 @@ static int > amdgpu_ras_sysfs_remove_all(struct amdgpu_device *adev) } > /* sysfs end */ > > +/** > + * DOC: AMDGPU RAS Reboot Behavior for Unrecoverable Errors > + * > + * Normally when there is an uncorrectable error, the driver will > +reset > + * the GPU to recover. However, in the event of an unrecoverable > +error, > + * the driver provides an interface to reboot the system > +automatically > + * in that event. > + * > + * The following file in debugfs provides that interface: > + * /sys/kernel/debug/dri/[0/1/2...]/ras/auto_reboot > + * > + * Usage: > + * > + * .. code-block:: bash > + * > + * echo true > .../ras/auto_reboot > + * > + */ > /* debugfs begin */ > static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device > *adev) { > -- > 2.23.0 > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 14+ messages in thread
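For readers following the quoted kernel-doc text, the shell syntax it documents (op block [error [sub_block address value]]) can be sketched as a small helper that only builds and validates the command string. This is an illustration, not part of the patch: the function name ras_ctrl_cmd is hypothetical, the block name gfx and error type ue are taken from the sysfs examples quoted above, and the field counts assume that inject needs all of error, sub_block, address and value, as the bracketed syntax suggests.

```shell
#!/bin/sh
# Hypothetical helper (not part of the amdgpu driver or libdrm):
# build a ras_ctrl command string of the documented form
#   op block [error [sub_block address value]]
# and refuse to emit it when the argument count does not fit the op.
ras_ctrl_cmd() {
    op=$1
    case "$op" in
        disable)
            # disable: only block is needed
            [ $# -eq 2 ] || { echo "disable needs: block" >&2; return 1; }
            ;;
        enable)
            # enable: block and error are needed
            [ $# -eq 3 ] || { echo "enable needs: block error" >&2; return 1; }
            ;;
        inject)
            # Assumption: inject takes every field in the bracketed syntax.
            [ $# -eq 6 ] || { echo "inject needs: block error sub_block address value" >&2; return 1; }
            ;;
        *)
            echo "unknown op: $op" >&2
            return 1
            ;;
    esac
    # "$*" joins the arguments with single spaces (default IFS).
    printf '%s\n' "$*"
}
```

The output would then be redirected to the control node, for example `ras_ctrl_cmd enable gfx ue > "$RAS_CTRL"`, where RAS_CTRL is a variable the caller sets to the .../ras/ras_ctrl path on their system; the zeros in a call such as `ras_ctrl_cmd inject gfx ue 0 0 0` are placeholder values only.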
* Re: [PATCH] drm/amdgpu: Improve RAS documentation @ 2019-11-06 17:14 ` Zhao, Yong 0 siblings, 0 replies; 14+ messages in thread From: Zhao, Yong @ 2019-11-06 17:14 UTC (permalink / raw) To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW See two wording comments inline. With that Reviewed-by: Yong Zhao<yong.zhao@amd.com> On 2019-10-30 2:41 p.m., Alex Deucher wrote: > Clarify some areas, clean up formatting, add section for > unrecoverable error handling. > > Signed-off-by: Alex Deucher <alexander.deucher@amd.com> > --- > Documentation/gpu/amdgpu.rst | 35 ++++++++++++++++++++++ > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 40 ++++++++++++++++++++----- > 2 files changed, 68 insertions(+), 7 deletions(-) > > diff --git a/Documentation/gpu/amdgpu.rst b/Documentation/gpu/amdgpu.rst > index 5b9eaf23558e..1c08d64970ee 100644 > --- a/Documentation/gpu/amdgpu.rst > +++ b/Documentation/gpu/amdgpu.rst > @@ -82,12 +82,21 @@ AMDGPU XGMI Support > AMDGPU RAS Support > ================== > > +The AMDGPU RAS interfaces are exposed via sysfs (for informational queries) and > +debugfs (for error injection). > + > RAS debugfs/sysfs Control and Error Injection Interfaces > -------------------------------------------------------- > > .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > :doc: AMDGPU RAS debugfs control interface > > +RAS Reboot Behavior for Unrecoverable Errors > +-------------------------------------------------------- > + > +.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > + :doc: AMDGPU RAS Reboot Behavior for Unrecoverable Errors > + > RAS Error Count sysfs Interface > ------------------------------- > > @@ -109,6 +118,32 @@ RAS VRAM Bad Pages sysfs Interface > .. 
kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > :internal: > > +Sample Code > +----------- > +Sample code for testing error injection can be found here: > +https://cgit.freedesktop.org/mesa/drm/tree/tests/amdgpu/ras_tests.c > + > +This is part of the libdrm amdgpu unit tests which cover several areas of the GPU. > +There are four sets of tests: > + > +RAS Basic Test > + > +The test verifies the RAS feature enabled status and makes sure the necessary sysfs and debugfs files > +are present. > + > +RAS Query Test > + > +This test will check the RAS availability and enablement status for each supported IP block as well as > +the error counts. This test checks > + > +RAS Inject Test > + > +This test injects errors for each IP. > + > +RAS Disable Test > + > +This tests disabling of RAS features for each IP block. This tests tests disabling > + > > GPU Power/Thermal Controls and Monitoring > ========================================= > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > index dab90c280476..404483437bd3 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > @@ -220,7 +220,7 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev, > * As their names indicate, inject operation will write the > * value to the address. > * > - * Second member: struct ras_debug_if::op. > + * The second member: struct ras_debug_if::op. > * It has three kinds of operations. > * > * - 0: disable RAS on the block. Take ::head as its data. > @@ -228,14 +228,20 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev, > * - 2: inject errors on the block. Take ::inject as its data. > * > * How to use the interface? > - * programs: > - * copy the struct ras_debug_if in your codes and initialize it. > - * write the struct to the control node. > + * > + * Programs > + * > + * Copy the struct ras_debug_if in your codes and initialize it. 
> + * Write the struct to the control node. > + * > + * Shells > * > * .. code-block:: bash > * > * echo op block [error [sub_block address value]] > .../ras/ras_ctrl > * > + * Parameters: > + * > * op: disable, enable, inject > * disable: only block is needed > * enable: block and error are needed > @@ -265,8 +271,10 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev, > * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count > * > * .. note:: > - * Operation is only allowed on blocks which are supported. > + * Operations are only allowed on blocks which are supported. > * Please check ras mask at /sys/module/amdgpu/parameters/ras_mask > + * to see which blocks support RAS on a particular asic. > + * > */ > static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *buf, > size_t size, loff_t *pos) > @@ -322,7 +330,7 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user * > * DOC: AMDGPU RAS debugfs EEPROM table reset interface > * > * Some boards contain an EEPROM which is used to persistently store a list of > - * bad pages containing ECC errors detected in vram. This interface provides > + * bad pages which experiences ECC errors in vram. This interface provides > * a way to reset the EEPROM, e.g., after testing error injection. 
> * > * Usage: > @@ -362,7 +370,7 @@ static const struct file_operations amdgpu_ras_debugfs_eeprom_ops = { > /** > * DOC: AMDGPU RAS sysfs Error Count Interface > * > - * It allows user to read the error count for each IP block on the gpu through > + * It allows the user to read the error count for each IP block on the gpu through > * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count > * > * It outputs the multiple lines which report the uncorrected (ue) and corrected > @@ -1027,6 +1035,24 @@ static int amdgpu_ras_sysfs_remove_all(struct amdgpu_device *adev) > } > /* sysfs end */ > > +/** > + * DOC: AMDGPU RAS Reboot Behavior for Unrecoverable Errors > + * > + * Normally when there is an uncorrectable error, the driver will reset > + * the GPU to recover. However, in the event of an unrecoverable error, > + * the driver provides an interface to reboot the system automatically > + * in that event. > + * > + * The following file in debugfs provides that interface: > + * /sys/kernel/debug/dri/[0/1/2...]/ras/auto_reboot > + * > + * Usage: > + * > + * .. code-block:: bash > + * > + * echo true > .../ras/auto_reboot > + * > + */ > /* debugfs begin */ > static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev) > { _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: [PATCH] drm/amdgpu: Improve RAS documentation @ 2019-11-06 17:31 ` Russell, Kent 0 siblings, 0 replies; 14+ messages in thread From: Russell, Kent @ 2019-11-06 17:31 UTC (permalink / raw) To: Zhao, Yong, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW I think you meant "This test tests" instead of "This tests tests" for your 2nd comment but agreed on the consistent verb tenses. Kent -----Original Message----- From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Zhao, Yong Sent: Wednesday, November 6, 2019 12:15 PM To: amd-gfx@lists.freedesktop.org Subject: Re: [PATCH] drm/amdgpu: Improve RAS documentation See two wording comments inline. With that Reviewed-by: Yong Zhao<yong.zhao@amd.com> On 2019-10-30 2:41 p.m., Alex Deucher wrote: > Clarify some areas, clean up formatting, add section for unrecoverable > error handling. > > Signed-off-by: Alex Deucher <alexander.deucher@amd.com> > --- > Documentation/gpu/amdgpu.rst | 35 ++++++++++++++++++++++ > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 40 ++++++++++++++++++++----- > 2 files changed, 68 insertions(+), 7 deletions(-) > > diff --git a/Documentation/gpu/amdgpu.rst > b/Documentation/gpu/amdgpu.rst index 5b9eaf23558e..1c08d64970ee 100644 > --- a/Documentation/gpu/amdgpu.rst > +++ b/Documentation/gpu/amdgpu.rst > @@ -82,12 +82,21 @@ AMDGPU XGMI Support > AMDGPU RAS Support > ================== > > +The AMDGPU RAS interfaces are exposed via sysfs (for informational > +queries) and debugfs (for error injection). > + > RAS debugfs/sysfs Control and Error Injection Interfaces > -------------------------------------------------------- > > .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > :doc: AMDGPU RAS debugfs control interface > > +RAS Reboot Behavior for Unrecoverable Errors > +-------------------------------------------------------- > + > +.. 
kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > + :doc: AMDGPU RAS Reboot Behavior for Unrecoverable Errors > + > RAS Error Count sysfs Interface > ------------------------------- > > @@ -109,6 +118,32 @@ RAS VRAM Bad Pages sysfs Interface > .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > :internal: > > +Sample Code > +----------- > +Sample code for testing error injection can be found here: > +https://cgit.freedesktop.org/mesa/drm/tree/tests/amdgpu/ras_tests.c > + > +This is part of the libdrm amdgpu unit tests which cover several areas of the GPU. > +There are four sets of tests: > + > +RAS Basic Test > + > +The test verifies the RAS feature enabled status and makes sure the > +necessary sysfs and debugfs files are present. > + > +RAS Query Test > + > +This test will check the RAS availability and enablement status for > +each supported IP block as well as the error counts. This test checks > + > +RAS Inject Test > + > +This test injects errors for each IP. > + > +RAS Disable Test > + > +This tests disabling of RAS features for each IP block. This tests tests disabling > + > > GPU Power/Thermal Controls and Monitoring > ========================================= > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > index dab90c280476..404483437bd3 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > @@ -220,7 +220,7 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev, > * As their names indicate, inject operation will write the > * value to the address. > * > - * Second member: struct ras_debug_if::op. > + * The second member: struct ras_debug_if::op. > * It has three kinds of operations. > * > * - 0: disable RAS on the block. Take ::head as its data. > @@ -228,14 +228,20 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev, > * - 2: inject errors on the block. Take ::inject as its data. 
> * > * How to use the interface? > - * programs: > - * copy the struct ras_debug_if in your codes and initialize it. > - * write the struct to the control node. > + * > + * Programs > + * > + * Copy the struct ras_debug_if in your codes and initialize it. > + * Write the struct to the control node. > + * > + * Shells > * > * .. code-block:: bash > * > * echo op block [error [sub_block address value]] > .../ras/ras_ctrl > * > + * Parameters: > + * > * op: disable, enable, inject > * disable: only block is needed > * enable: block and error are needed > @@ -265,8 +271,10 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev, > * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count > * > * .. note:: > - * Operation is only allowed on blocks which are supported. > + * Operations are only allowed on blocks which are supported. > * Please check ras mask at /sys/module/amdgpu/parameters/ras_mask > + * to see which blocks support RAS on a particular asic. > + * > */ > static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *buf, > size_t size, loff_t *pos) > @@ -322,7 +330,7 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user * > * DOC: AMDGPU RAS debugfs EEPROM table reset interface > * > * Some boards contain an EEPROM which is used to persistently store > a list of > - * bad pages containing ECC errors detected in vram. This interface > provides > + * bad pages which experiences ECC errors in vram. This interface > + provides > * a way to reset the EEPROM, e.g., after testing error injection. 
> * > * Usage: > @@ -362,7 +370,7 @@ static const struct file_operations amdgpu_ras_debugfs_eeprom_ops = { > /** > * DOC: AMDGPU RAS sysfs Error Count Interface > * > - * It allows user to read the error count for each IP block on the > gpu through > + * It allows the user to read the error count for each IP block on > + the gpu through > * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count > * > * It outputs the multiple lines which report the uncorrected (ue) > and corrected @@ -1027,6 +1035,24 @@ static int amdgpu_ras_sysfs_remove_all(struct amdgpu_device *adev) > } > /* sysfs end */ > > +/** > + * DOC: AMDGPU RAS Reboot Behavior for Unrecoverable Errors > + * > + * Normally when there is an uncorrectable error, the driver will > +reset > + * the GPU to recover. However, in the event of an unrecoverable > +error, > + * the driver provides an interface to reboot the system > +automatically > + * in that event. > + * > + * The following file in debugfs provides that interface: > + * /sys/kernel/debug/dri/[0/1/2...]/ras/auto_reboot > + * > + * Usage: > + * > + * .. code-block:: bash > + * > + * echo true > .../ras/auto_reboot > + * > + */ > /* debugfs begin */ > static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev) > { _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: [PATCH] drm/amdgpu: Improve RAS documentation
@ 2019-11-06 17:31 ` Russell, Kent
  0 siblings, 0 replies; 14+ messages in thread
From: Russell, Kent @ 2019-11-06 17:31 UTC (permalink / raw)
  To: Zhao, Yong, amd-gfx

I think you meant "This test tests" instead of "This tests tests" for your 2nd comment but agreed on the consistent verb tenses.

 Kent

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Zhao, Yong
Sent: Wednesday, November 6, 2019 12:15 PM
To: amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Improve RAS documentation

See two wording comments inline. With that Reviewed-by: Yong Zhao<yong.zhao@amd.com>

On 2019-10-30 2:41 p.m., Alex Deucher wrote:
> Clarify some areas, clean up formatting, add section for unrecoverable
> error handling.
>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> ---
>   Documentation/gpu/amdgpu.rst            | 35 ++++++++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 40 ++++++++++++++++++++-----
>   2 files changed, 68 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/gpu/amdgpu.rst b/Documentation/gpu/amdgpu.rst
> index 5b9eaf23558e..1c08d64970ee 100644
> --- a/Documentation/gpu/amdgpu.rst
> +++ b/Documentation/gpu/amdgpu.rst
> @@ -82,12 +82,21 @@ AMDGPU XGMI Support
>   AMDGPU RAS Support
>   ==================
>
> +The AMDGPU RAS interfaces are exposed via sysfs (for informational
> +queries) and debugfs (for error injection).
> +
>   RAS debugfs/sysfs Control and Error Injection Interfaces
>   --------------------------------------------------------
>
>   .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>      :doc: AMDGPU RAS debugfs control interface
>
> +RAS Reboot Behavior for Unrecoverable Errors
> +--------------------------------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +   :doc: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
> +
>   RAS Error Count sysfs Interface
>   -------------------------------
>
> @@ -109,6 +118,32 @@ RAS VRAM Bad Pages sysfs Interface
>   .. kernel-doc:: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>      :internal:
>
> +Sample Code
> +-----------
> +Sample code for testing error injection can be found here:
> +https://cgit.freedesktop.org/mesa/drm/tree/tests/amdgpu/ras_tests.c
> +
> +This is part of the libdrm amdgpu unit tests which cover several areas of the GPU.
> +There are four sets of tests:
> +
> +RAS Basic Test
> +
> +The test verifies the RAS feature enabled status and makes sure the
> +necessary sysfs and debugfs files are present.
> +
> +RAS Query Test
> +
> +This test will check the RAS availability and enablement status for
> +each supported IP block as well as the error counts.
This test checks
> +
> +RAS Inject Test
> +
> +This test injects errors for each IP.
> +
> +RAS Disable Test
> +
> +This tests disabling of RAS features for each IP block.
This tests tests disabling
> +
>
>   GPU Power/Thermal Controls and Monitoring
>   =========================================
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index dab90c280476..404483437bd3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -220,7 +220,7 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>    * As their names indicate, inject operation will write the
>    * value to the address.
>    *
> - * Second member: struct ras_debug_if::op.
> + * The second member: struct ras_debug_if::op.
>    * It has three kinds of operations.
>    *
>    * - 0: disable RAS on the block. Take ::head as its data.
> @@ -228,14 +228,20 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>    * - 2: inject errors on the block. Take ::inject as its data.
>    *
>    * How to use the interface?
> - * programs:
> - * copy the struct ras_debug_if in your codes and initialize it.
> - * write the struct to the control node.
> + *
> + * Programs
> + *
> + * Copy the struct ras_debug_if in your codes and initialize it.
> + * Write the struct to the control node.
> + *
> + * Shells
>    *
>    * .. code-block:: bash
>    *
>    *   echo op block [error [sub_block address value]] > .../ras/ras_ctrl
>    *
> + * Parameters:
> + *
>    * op: disable, enable, inject
>    *   disable: only block is needed
>    *   enable: block and error are needed
> @@ -265,8 +271,10 @@ static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
>    * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
>    *
>    * .. note::
> - *   Operation is only allowed on blocks which are supported.
> + *   Operations are only allowed on blocks which are supported.
>    *   Please check ras mask at /sys/module/amdgpu/parameters/ras_mask
> + *   to see which blocks support RAS on a particular asic.
> + *
>    */
>   static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *buf,
>                   size_t size, loff_t *pos)
> @@ -322,7 +330,7 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
>    * DOC: AMDGPU RAS debugfs EEPROM table reset interface
>    *
>    * Some boards contain an EEPROM which is used to persistently store a list of
> - * bad pages containing ECC errors detected in vram. This interface provides
> + * bad pages which experiences ECC errors in vram. This interface provides
>    * a way to reset the EEPROM, e.g., after testing error injection.
>    *
>    * Usage:
> @@ -362,7 +370,7 @@ static const struct file_operations amdgpu_ras_debugfs_eeprom_ops = {
>   /**
>    * DOC: AMDGPU RAS sysfs Error Count Interface
>    *
> - * It allows user to read the error count for each IP block on the gpu through
> + * It allows the user to read the error count for each IP block on the gpu through
>    * /sys/class/drm/card[0/1/2...]/device/ras/[gfx/sdma/...]_err_count
>    *
>    * It outputs the multiple lines which report the uncorrected (ue) and corrected
> @@ -1027,6 +1035,24 @@ static int amdgpu_ras_sysfs_remove_all(struct amdgpu_device *adev)
>   }
>   /* sysfs end */
>
> +/**
> + * DOC: AMDGPU RAS Reboot Behavior for Unrecoverable Errors
> + *
> + * Normally when there is an uncorrectable error, the driver will reset
> + * the GPU to recover. However, in the event of an unrecoverable error,
> + * the driver provides an interface to reboot the system automatically
> + * in that event.
> + *
> + * The following file in debugfs provides that interface:
> + * /sys/kernel/debug/dri/[0/1/2...]/ras/auto_reboot
> + *
> + * Usage:
> + *
> + * .. code-block:: bash
> + *
> + *   echo true > .../ras/auto_reboot
> + *
> + */
>   /* debugfs begin */
>   static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)
>   {
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread
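The "Programs"/"Shells" usage quoted in the patch above can be illustrated with a minimal C sketch that assembles the `op block [error [sub_block address value]]` line and writes it to a control node. This is an assumption-laden sketch, not code from the driver or from ras_tests.c: `ras_format_cmd` and `ras_write_cmd` are hypothetical helper names, the control-node path is left to the caller because the patch only gives it as `.../ras/ras_ctrl`, and the numeric fields are passed through as strings because the quoted text does not say which number format the node expects.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of amdgpu or libdrm): format
 *   "op block [error [sub_block address value]]"
 * into buf. Pass error == NULL for the short form, or leave
 * sub_block/address/value all NULL to stop after the error type.
 * Returns the string length, or -1 on bad arguments or truncation.
 */
int ras_format_cmd(char *buf, size_t size, const char *op, const char *block,
                   const char *error, const char *sub_block,
                   const char *address, const char *value)
{
    int n;

    if (!buf || size == 0 || !op || !block)
        return -1;

    if (!error) {
        /* Optional fields cannot follow a missing error type. */
        if (sub_block || address || value)
            return -1;
        n = snprintf(buf, size, "%s %s", op, block);
    } else if (!sub_block && !address && !value) {
        n = snprintf(buf, size, "%s %s %s", op, block, error);
    } else if (sub_block && address && value) {
        n = snprintf(buf, size, "%s %s %s %s %s %s",
                     op, block, error, sub_block, address, value);
    } else {
        /* The trailing triple is all-or-nothing. */
        return -1;
    }

    if (n < 0 || (size_t)n >= size)
        return -1;
    return n;
}

/*
 * Hypothetical helper: write one command line to a caller-supplied
 * control-node path. Returns 0 on success, -1 on any I/O failure.
 */
int ras_write_cmd(const char *ctrl_path, const char *cmd)
{
    FILE *f = fopen(ctrl_path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fputs(cmd, f) >= 0 && fputc('\n', f) != EOF;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

For instance, `ras_format_cmd(buf, sizeof buf, "disable", "gfx", NULL, NULL, NULL, NULL)` produces `disable gfx`, matching the rule that disable needs only the block; `gfx` and `ue` are used here only as placeholder names taken from the sysfs text quoted in the patch.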
* Re: [PATCH] drm/amdgpu/gpuvm: add some additional comments in amdgpu_vm_update_ptes
@ 2019-10-31  7:24 ` Christian König
  0 siblings, 0 replies; 14+ messages in thread
From: Christian König @ 2019-10-31  7:24 UTC (permalink / raw)
  To: Alex Deucher, amd-gfx; +Cc: Alex Deucher

On 30.10.19 at 19:41, Alex Deucher wrote:
> To better clarify what is happening in this function.
>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 10 +++++++++-
>   1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index c8ce42200059..3c0bd6472a46 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -1419,6 +1419,9 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params,
>           uint64_t incr, entry_end, pe_start;
>           struct amdgpu_bo *pt;
>
> +         /* make sure that the page tables covering the address range are
> +          * actually allocated
> +          */
>           r = amdgpu_vm_alloc_pts(params->adev, params->vm, &cursor,
>                                   params->direct);
>           if (r)
> @@ -1492,7 +1495,12 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params,
>           } while (frag_start < entry_end);
>
>           if (amdgpu_vm_pt_descendant(adev, &cursor)) {
> -                 /* Free all child entries */
> +                 /* Free all child entries.
> +                  * Update the tables with the flags and addresses and free up subsequent
> +                  * tables in the case of huge pages or freed up areas.
> +                  * This is the maximum you can free, because all other page tables are not
> +                  * completely covered by the range and so potentially still in use.
> +                  */
>                   while (cursor.pfn < frag_start) {
>                           amdgpu_vm_free_pts(adev, params->vm, &cursor);
>                           amdgpu_vm_pt_next(adev, &cursor);
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread
Thread overview:
2019-10-30 18:41 [PATCH] drm/amdgpu/gpuvm: add some additional comments in amdgpu_vm_update_ptes Alex Deucher
2019-10-30 18:41 ` [PATCH] drm/amdgpu: Improve RAS documentation Alex Deucher
2019-11-06 16:35   ` Alex Deucher
2019-11-07  1:10     ` Chen, Guchun
2019-11-06 17:14   ` Zhao, Yong
2019-11-06 17:31     ` Russell, Kent
2019-10-31  7:24 ` [PATCH] drm/amdgpu/gpuvm: add some additional comments in amdgpu_vm_update_ptes Christian König