* [PATCH 1/3] drm/amdgpu: Don't query CE and UE errors
@ 2021-05-21 21:18 ` Luben Tuikov
  0 siblings, 0 replies; 12+ messages in thread
From: Luben Tuikov @ 2021-05-21 21:18 UTC (permalink / raw)
  To: amd-gfx; +Cc: Luben Tuikov, Alexander Deucher, stable, Christian König

On the QUERY2 IOCTL, don't query the counts of
correctable and uncorrectable errors: when RAS is
enabled and supported on Vega20 server boards,
the query takes an insurmountably long time,
O(n^3), which slows the system down to the point
of being unusable when a GUI is up.

Fixes: ae363a212b14 ("drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2")
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index fc83445fbc40..bb0cfe871aba 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -337,7 +337,6 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
 {
 	struct amdgpu_ctx *ctx;
 	struct amdgpu_ctx_mgr *mgr;
-	unsigned long ras_counter;
 
 	if (!fpriv)
 		return -EINVAL;
@@ -362,21 +361,6 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
 	if (atomic_read(&ctx->guilty))
 		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
 
-	/*query ue count*/
-	ras_counter = amdgpu_ras_query_error_count(adev, false);
-	/*ras counter is monotonic increasing*/
-	if (ras_counter != ctx->ras_counter_ue) {
-		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
-		ctx->ras_counter_ue = ras_counter;
-	}
-
-	/*query ce count*/
-	ras_counter = amdgpu_ras_query_error_count(adev, true);
-	if (ras_counter != ctx->ras_counter_ce) {
-		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
-		ctx->ras_counter_ce = ras_counter;
-	}
-
 	mutex_unlock(&mgr->lock);
 	return 0;
 }
-- 
2.31.1.527.g2d677e5b15



* [PATCH 2/3] drm/amdgpu: Fix RAS function interface
  2021-05-21 21:18 ` Luben Tuikov
  (?)
@ 2021-05-21 21:18 ` Luben Tuikov
  -1 siblings, 0 replies; 12+ messages in thread
From: Luben Tuikov @ 2021-05-21 21:18 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander Deucher, Luben Tuikov, John Clements, Hawking Zhang

Both the correctable and the uncorrectable error
counts are calculated at each invocation of
amdgpu_ras_query_error_count(). Therefore, it is
highly inefficient to return just one of them
based on a Boolean input. If the caller wants
both, twice the work would be done. (And this
work is O(n^3) on Vega20.)

Fix this "interface" to simply return what it has
calculated--both values. Let the caller choose
what it wants to record, inspect, or use.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 23 +++++++++++++++--------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  5 +++--
 2 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index e3a4c3a7635a..ed3c43e8b0b5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1043,29 +1043,36 @@ int amdgpu_ras_error_inject(struct amdgpu_device *adev,
 }
 
 /* get the total error counts on all IPs */
-unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
-		bool is_ce)
+void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
+				  unsigned long *ce_count,
+				  unsigned long *ue_count)
 {
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 	struct ras_manager *obj;
-	struct ras_err_data data = {0, 0};
+	unsigned long ce, ue;
 
 	if (!adev->ras_enabled || !con)
-		return 0;
+		return;
 
+	ce = 0;
+	ue = 0;
 	list_for_each_entry(obj, &con->head, node) {
 		struct ras_query_if info = {
 			.head = obj->head,
 		};
 
 		if (amdgpu_ras_query_error_status(adev, &info))
-			return 0;
+			return;
 
-		data.ce_count += info.ce_count;
-		data.ue_count += info.ue_count;
+		ce += info.ce_count;
+		ue += info.ue_count;
 	}
 
-	return is_ce ? data.ce_count : data.ue_count;
+	if (ce_count)
+		*ce_count = ce;
+
+	if (ue_count)
+		*ue_count = ue;
 }
 /* query/inject/cure end */
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index bfa40c8ecc94..10fca0393106 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -485,8 +485,9 @@ int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev,
 void amdgpu_ras_resume(struct amdgpu_device *adev);
 void amdgpu_ras_suspend(struct amdgpu_device *adev);
 
-unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
-		bool is_ce);
+void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
+				  unsigned long *ce_count,
+				  unsigned long *ue_count);
 
 /* error handling functions */
 int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
-- 
2.31.1.527.g2d677e5b15
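
A minimal userspace sketch of the out-parameter interface this patch
describes. The struct block_counts type and the query_error_count()
name are hypothetical stand-ins, not driver code; only the shape (one
pass over the blocks, both totals written through optional pointers)
follows the patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for the counters of one RAS block. */
struct block_counts {
	unsigned long ce_count;	/* correctable errors */
	unsigned long ue_count;	/* uncorrectable errors */
};

/*
 * Walk all blocks once and report both totals.
 * Either output pointer may be NULL if the caller does not need it.
 * When there is nothing to query the function returns without
 * writing to the outputs, mirroring the early returns in the patch.
 */
static void query_error_count(const struct block_counts *blocks, size_t n,
			      unsigned long *ce_count,
			      unsigned long *ue_count)
{
	unsigned long ce = 0, ue = 0;
	size_t i;

	if (!blocks || n == 0)
		return;

	for (i = 0; i < n; i++) {
		ce += blocks[i].ce_count;
		ue += blocks[i].ue_count;
	}

	if (ce_count)
		*ce_count = ce;
	if (ue_count)
		*ue_count = ue;
}
```

Because an early return leaves both outputs untouched, a caller of
this sketch should initialize its variables before the call if it
reads them unconditionally afterwards.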

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

* [PATCH 3/3] drm/amdgpu: Use delayed work to collect RAS error counters
  2021-05-21 21:18 ` Luben Tuikov
  (?)
  (?)
@ 2021-05-21 21:18 ` Luben Tuikov
  2021-05-25 22:03   ` Alex Deucher
  2021-05-26 11:00   ` Lazar, Lijo
  -1 siblings, 2 replies; 12+ messages in thread
From: Luben Tuikov @ 2021-05-21 21:18 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander Deucher, Luben Tuikov, John Clements,
	Christian König, Hawking Zhang

On Context Query2 IOCTL return the correctable and
uncorrectable errors in O(1) fashion, from cached
values, and schedule a delayed work function to
calculate and cache them for the next such IOCTL.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 32 +++++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 38 +++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  5 ++++
 3 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index bb0cfe871aba..4e95d255960b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -331,10 +331,13 @@ static int amdgpu_ctx_query(struct amdgpu_device *adev,
 	return 0;
 }
 
+#define AMDGPU_RAS_COUNTE_DELAY_MS 3000
+
 static int amdgpu_ctx_query2(struct amdgpu_device *adev,
-	struct amdgpu_fpriv *fpriv, uint32_t id,
-	union drm_amdgpu_ctx_out *out)
+			     struct amdgpu_fpriv *fpriv, uint32_t id,
+			     union drm_amdgpu_ctx_out *out)
 {
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 	struct amdgpu_ctx *ctx;
 	struct amdgpu_ctx_mgr *mgr;
 
@@ -361,6 +364,31 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
 	if (atomic_read(&ctx->guilty))
 		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
 
+	if (adev->ras_enabled && con) {
+		/* Return the cached values in O(1),
+		 * and schedule delayed work to cache
+		 * new values.
+		 */
+		int ce_count, ue_count;
+
+		ce_count = atomic_read(&con->ras_ce_count);
+		ue_count = atomic_read(&con->ras_ue_count);
+
+		if (ce_count != ctx->ras_counter_ce) {
+			ctx->ras_counter_ce = ce_count;
+			out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
+		}
+
+		if (ue_count != ctx->ras_counter_ue) {
+			ctx->ras_counter_ue = ue_count;
+			out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
+		}
+
+		if (!delayed_work_pending(&con->ras_counte_delay_work))
+			schedule_delayed_work(&con->ras_counte_delay_work,
+				  msecs_to_jiffies(AMDGPU_RAS_COUNTE_DELAY_MS));
+	}
+
 	mutex_unlock(&mgr->lock);
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index ed3c43e8b0b5..80f576098318 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -27,6 +27,7 @@
 #include <linux/uaccess.h>
 #include <linux/reboot.h>
 #include <linux/syscalls.h>
+#include <linux/pm_runtime.h>
 
 #include "amdgpu.h"
 #include "amdgpu_ras.h"
@@ -2116,6 +2117,30 @@ static void amdgpu_ras_check_supported(struct amdgpu_device *adev)
 		adev->ras_hw_enabled & amdgpu_ras_mask;
 }
 
+static void amdgpu_ras_counte_dw(struct work_struct *work)
+{
+	struct amdgpu_ras *con = container_of(work, struct amdgpu_ras,
+					      ras_counte_delay_work.work);
+	struct amdgpu_device *adev = con->adev;
+	struct drm_device *dev = &adev->ddev;
+	unsigned long ce_count, ue_count;
+	int res;
+
+	res = pm_runtime_get_sync(dev->dev);
+	if (res < 0)
+		goto Out;
+
+	/* Cache new values.
+	 */
+	amdgpu_ras_query_error_count(adev, &ce_count, &ue_count);
+	atomic_set(&con->ras_ce_count, ce_count);
+	atomic_set(&con->ras_ue_count, ue_count);
+
+	pm_runtime_mark_last_busy(dev->dev);
+Out:
+	pm_runtime_put_autosuspend(dev->dev);
+}
+
 int amdgpu_ras_init(struct amdgpu_device *adev)
 {
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
@@ -2130,6 +2155,11 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
 	if (!con)
 		return -ENOMEM;
 
+	con->adev = adev;
+	INIT_DELAYED_WORK(&con->ras_counte_delay_work, amdgpu_ras_counte_dw);
+	atomic_set(&con->ras_ce_count, 0);
+	atomic_set(&con->ras_ue_count, 0);
+
 	con->objs = (struct ras_manager *)(con + 1);
 
 	amdgpu_ras_set_context(adev, con);
@@ -2233,6 +2263,8 @@ int amdgpu_ras_late_init(struct amdgpu_device *adev,
 			 struct ras_fs_if *fs_info,
 			 struct ras_ih_if *ih_info)
 {
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	unsigned long ue_count, ce_count;
 	int r;
 
 	/* disable RAS feature per IP block if it is not supported */
@@ -2273,6 +2305,12 @@ int amdgpu_ras_late_init(struct amdgpu_device *adev,
 	if (r)
 		goto sysfs;
 
+	/* Those are the cached values at init.
+	 */
+	amdgpu_ras_query_error_count(adev, &ce_count, &ue_count);
+	atomic_set(&con->ras_ce_count, ce_count);
+	atomic_set(&con->ras_ue_count, ue_count);
+
 	return 0;
 cleanup:
 	amdgpu_ras_sysfs_remove(adev, ras_block);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 10fca0393106..256cea5d34f2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -340,6 +340,11 @@ struct amdgpu_ras {
 
 	/* disable ras error count harvest in recovery */
 	bool disable_ras_err_cnt_harvest;
+
+	/* RAS count errors delayed work */
+	struct delayed_work ras_counte_delay_work;
+	atomic_t ras_ue_count;
+	atomic_t ras_ce_count;
 };
 
 struct ras_fs_data {
-- 
2.31.1.527.g2d677e5b15
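
A simplified, single-threaded userspace sketch of the caching scheme in
this patch: the query path only reads cached totals and requests a
refresh, while the expensive count is done later by a worker. All names
here (struct ras_cache, ctx_query(), refresh_worker(), the FLAG_*
values) are hypothetical, and the refresh_pending flag merely stands in
for delayed_work_pending(); no real workqueue is involved.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define FLAG_RAS_CE 0x1u	/* hypothetical flag values */
#define FLAG_RAS_UE 0x2u

/* Hypothetical model of the device-wide cached counters. */
struct ras_cache {
	atomic_int ce_count;	/* last cached correctable total */
	atomic_int ue_count;	/* last cached uncorrectable total */
	bool refresh_pending;	/* stands in for delayed_work_pending() */
};

/* Hypothetical per-context record of the totals last reported. */
struct ctx_state {
	int seen_ce;
	int seen_ue;
};

/* O(1) query: compare cached totals with what this context last saw. */
static unsigned int ctx_query(struct ras_cache *cache, struct ctx_state *ctx)
{
	unsigned int flags = 0;
	int ce = atomic_load(&cache->ce_count);
	int ue = atomic_load(&cache->ue_count);

	if (ce != ctx->seen_ce) {
		ctx->seen_ce = ce;
		flags |= FLAG_RAS_CE;
	}
	if (ue != ctx->seen_ue) {
		ctx->seen_ue = ue;
		flags |= FLAG_RAS_UE;
	}

	/* Request at most one refresh at a time. */
	if (!cache->refresh_pending)
		cache->refresh_pending = true;

	return flags;
}

/* The slow part, run later; the caller supplies the fresh totals. */
static void refresh_worker(struct ras_cache *cache, int fresh_ce, int fresh_ue)
{
	atomic_store(&cache->ce_count, fresh_ce);
	atomic_store(&cache->ue_count, fresh_ue);
	cache->refresh_pending = false;
}
```

In this model a change in the counters becomes visible only on the
first query after the worker has run, so a query can report state that
is one refresh interval old. That is the trade-off made for the O(1)
query path.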

* Re: [PATCH 3/3] drm/amdgpu: Use delayed work to collect RAS error counters
  2021-05-21 21:18 ` [PATCH 3/3] drm/amdgpu: Use delayed work to collect RAS error counters Luben Tuikov
@ 2021-05-25 22:03   ` Alex Deucher
  2021-05-25 23:56     ` Luben Tuikov
  2021-05-26 11:00   ` Lazar, Lijo
  1 sibling, 1 reply; 12+ messages in thread
From: Alex Deucher @ 2021-05-25 22:03 UTC (permalink / raw)
  To: Luben Tuikov
  Cc: Alexander Deucher, John Clements, Christian König,
	amd-gfx list, Hawking Zhang

On Fri, May 21, 2021 at 5:19 PM Luben Tuikov <luben.tuikov@amd.com> wrote:
>
> On Context Query2 IOCTL return the correctable and
> uncorrectable errors in O(1) fashion, from cached
> values, and schedule a delayed work function to
> calculate and cache them for the next such IOCTL.

Patches 1, 2, are:
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

For patch 3, I think we need to cancel any outstanding delayed work in
ras_fini().  Other than that, it looks good to me.

Alex
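
The teardown concern can be illustrated with a small single-threaded
userspace model. Everything below (struct delayed_item, item_schedule(),
item_cancel(), item_tick(), struct ras_model) is hypothetical and only
mimics the idea; in the kernel the usual primitive for this is
cancel_delayed_work_sync(), and this thread does not show the
resubmitted patch, so the sketch is not the actual fix.

```c
#include <stdbool.h>

/* Hypothetical model of one delayed work item. */
struct delayed_item {
	bool pending;
	void (*fn)(void *data);
	void *data;
};

/* Queue the item unless it is already queued. */
static bool item_schedule(struct delayed_item *it)
{
	if (it->pending)
		return false;
	it->pending = true;
	return true;
}

/* Drop a queued item; returns true if one was actually queued. */
static bool item_cancel(struct delayed_item *it)
{
	bool was_pending = it->pending;

	it->pending = false;
	return was_pending;
}

/* Models the timer firing: run the item if it is still queued. */
static bool item_tick(struct delayed_item *it)
{
	if (!it->pending)
		return false;
	it->pending = false;
	it->fn(it->data);
	return true;
}

/* Hypothetical owner of the work item. */
struct ras_model {
	struct delayed_item counte_work;
	int refreshes;
};

static void ras_model_refresh(void *data)
{
	struct ras_model *m = data;

	m->refreshes++;
}

static void ras_model_init(struct ras_model *m)
{
	m->counte_work.pending = false;
	m->counte_work.fn = ras_model_refresh;
	m->counte_work.data = m;
	m->refreshes = 0;
}

/* Teardown: cancel first, so the callback cannot run on a dead owner. */
static void ras_model_fini(struct ras_model *m)
{
	item_cancel(&m->counte_work);
}
```

Without the cancel in ras_model_fini(), a later item_tick() would still
invoke the callback with a pointer to an owner that has been torn down.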

* Re: [PATCH 3/3] drm/amdgpu: Use delayed work to collect RAS error counters
  2021-05-25 22:03   ` Alex Deucher
@ 2021-05-25 23:56     ` Luben Tuikov
  0 siblings, 0 replies; 12+ messages in thread
From: Luben Tuikov @ 2021-05-25 23:56 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Alexander Deucher, John Clements, Christian König,
	amd-gfx list, Hawking Zhang

On 2021-05-25 6:03 p.m., Alex Deucher wrote:
> On Fri, May 21, 2021 at 5:19 PM Luben Tuikov <luben.tuikov@amd.com> wrote:
>> On Context Query2 IOCTL return the correctable and
>> uncorrectable errors in O(1) fashion, from cached
>> values, and schedule a delayed work function to
>> calculate and cache them for the next such IOCTL.
> Patches 1, 2, are:
> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
>
> For patch 3, I think we need to cancel any outstanding delayed work in
> ras_fini().  Other than that, it looks good to me.
Ah, yes, good point--I missed that. I'll add it and resubmit.

Regards,
Luben

* RE: [PATCH 3/3] drm/amdgpu: Use delayed work to collect RAS error counters
  2021-05-21 21:18 ` [PATCH 3/3] drm/amdgpu: Use delayed work to collect RAS error counters Luben Tuikov
  2021-05-25 22:03   ` Alex Deucher
@ 2021-05-26 11:00   ` Lazar, Lijo
  2021-05-26 15:12     ` Luben Tuikov
  1 sibling, 1 reply; 12+ messages in thread
From: Lazar, Lijo @ 2021-05-26 11:00 UTC (permalink / raw)
  To: Tuikov, Luben, amd-gfx
  Cc: Deucher, Alexander, Tuikov, Luben, Clements, John, Koenig,
	 Christian, Zhang, Hawking

[AMD Official Use Only]

Scheduling an error status query based only on the IOCTL doesn't seem like a sound approach. What if the driver needs to handle errors based on the counts, for example if the number of correctable errors exceeds a certain threshold?

IMO, I'm more aligned with Luben's original approach of having something waiting in the background. Instead of a periodic timer-based trigger, it could be an event-based trigger. The event may be an ioctl, error handler timer ticks, or something else.

Thanks,
Lijo

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Luben Tuikov
Sent: Saturday, May 22, 2021 2:49 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>; Clements, John <John.Clements@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
Subject: [PATCH 3/3] drm/amdgpu: Use delayed work to collect RAS error counters

On Context Query2 IOCTL return the correctable and uncorrectable errors in O(1) fashion, from cached values, and schedule a delayed work function to calculate and cache them for the next such IOCTL.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 32 +++++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 38 +++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  5 ++++
 3 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index bb0cfe871aba..4e95d255960b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -331,10 +331,13 @@ static int amdgpu_ctx_query(struct amdgpu_device *adev,
 	return 0;
 }
 
+#define AMDGPU_RAS_COUNTE_DELAY_MS 3000
+
 static int amdgpu_ctx_query2(struct amdgpu_device *adev,
-	struct amdgpu_fpriv *fpriv, uint32_t id,
-	union drm_amdgpu_ctx_out *out)
+			     struct amdgpu_fpriv *fpriv, uint32_t id,
+			     union drm_amdgpu_ctx_out *out)
 {
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 	struct amdgpu_ctx *ctx;
 	struct amdgpu_ctx_mgr *mgr;
 
@@ -361,6 +364,31 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
 	if (atomic_read(&ctx->guilty))
 		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
 
+	if (adev->ras_enabled && con) {
+		/* Return the cached values in O(1),
+		 * and schedule delayed work to cache
+		 * new values.
+		 */
+		int ce_count, ue_count;
+
+		ce_count = atomic_read(&con->ras_ce_count);
+		ue_count = atomic_read(&con->ras_ue_count);
+
+		if (ce_count != ctx->ras_counter_ce) {
+			ctx->ras_counter_ce = ce_count;
+			out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
+		}
+
+		if (ue_count != ctx->ras_counter_ue) {
+			ctx->ras_counter_ue = ue_count;
+			out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
+		}
+
+		if (!delayed_work_pending(&con->ras_counte_delay_work))
+			schedule_delayed_work(&con->ras_counte_delay_work,
+				  msecs_to_jiffies(AMDGPU_RAS_COUNTE_DELAY_MS));
+	}
+
 	mutex_unlock(&mgr->lock);
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index ed3c43e8b0b5..80f576098318 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -27,6 +27,7 @@
 #include <linux/uaccess.h>
 #include <linux/reboot.h>
 #include <linux/syscalls.h>
+#include <linux/pm_runtime.h>
 
 #include "amdgpu.h"
 #include "amdgpu_ras.h"
@@ -2116,6 +2117,30 @@ static void amdgpu_ras_check_supported(struct amdgpu_device *adev)
 		adev->ras_hw_enabled & amdgpu_ras_mask;
 }
 
+static void amdgpu_ras_counte_dw(struct work_struct *work)
+{
+	struct amdgpu_ras *con = container_of(work, struct amdgpu_ras,
+					      ras_counte_delay_work.work);
+	struct amdgpu_device *adev = con->adev;
+	struct drm_device *dev = &adev->ddev;
+	unsigned long ce_count, ue_count;
+	int res;
+
+	res = pm_runtime_get_sync(dev->dev);
+	if (res < 0)
+		goto Out;
+
+	/* Cache new values.
+	 */
+	amdgpu_ras_query_error_count(adev, &ce_count, &ue_count);
+	atomic_set(&con->ras_ce_count, ce_count);
+	atomic_set(&con->ras_ue_count, ue_count);
+
+	pm_runtime_mark_last_busy(dev->dev);
+Out:
+	pm_runtime_put_autosuspend(dev->dev);
+}
+
 int amdgpu_ras_init(struct amdgpu_device *adev)
 {
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
@@ -2130,6 +2155,11 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
 	if (!con)
 		return -ENOMEM;
 
+	con->adev = adev;
+	INIT_DELAYED_WORK(&con->ras_counte_delay_work, amdgpu_ras_counte_dw);
+	atomic_set(&con->ras_ce_count, 0);
+	atomic_set(&con->ras_ue_count, 0);
+
 	con->objs = (struct ras_manager *)(con + 1);
 
 	amdgpu_ras_set_context(adev, con);
@@ -2233,6 +2263,8 @@ int amdgpu_ras_late_init(struct amdgpu_device *adev,
 			 struct ras_fs_if *fs_info,
 			 struct ras_ih_if *ih_info)
 {
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	unsigned long ue_count, ce_count;
 	int r;
 
 	/* disable RAS feature per IP block if it is not supported */
@@ -2273,6 +2305,12 @@ int amdgpu_ras_late_init(struct amdgpu_device *adev,
 	if (r)
 		goto sysfs;
 
+	/* Those are the cached values at init.
+	 */
+	amdgpu_ras_query_error_count(adev, &ce_count, &ue_count);
+	atomic_set(&con->ras_ce_count, ce_count);
+	atomic_set(&con->ras_ue_count, ue_count);
+
 	return 0;
 cleanup:
 	amdgpu_ras_sysfs_remove(adev, ras_block);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 10fca0393106..256cea5d34f2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -340,6 +340,11 @@ struct amdgpu_ras {
 
 	/* disable ras error count harvest in recovery */
 	bool disable_ras_err_cnt_harvest;
+
+	/* RAS count errors delayed work */
+	struct delayed_work ras_counte_delay_work;
+	atomic_t ras_ue_count;
+	atomic_t ras_ce_count;
 };
 
 struct ras_fs_data {
--
2.31.1.527.g2d677e5b15

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: Use delayed work to collect RAS error counters
  2021-05-26 11:00   ` Lazar, Lijo
@ 2021-05-26 15:12     ` Luben Tuikov
  2021-05-26 16:05       ` Lazar, Lijo
  0 siblings, 1 reply; 12+ messages in thread
From: Luben Tuikov @ 2021-05-26 15:12 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx
  Cc: Deucher, Alexander, Clements, John, Koenig, Christian, Zhang, Hawking

On 2021-05-26 7:00 a.m., Lazar, Lijo wrote:
> [AMD Official Use Only]
>
> Scheduling an error status query just based on IOCTL doesn't sound like a sound approach. What if driver needs to handle errors based on that - for ex: if the number of correctable errors exceed a certain threshold?
That's exactly the trigger which invokes the error count procedure. The difference is that on that IOCTL
we return the cached values in O(1), and then trigger the counting procedure in a delayed work item,
since it takes O(n^3) and we cannot do it while the IOCTL is being processed, as that negatively
affects the user experience and the overall agility of the system.

This is the closest implementation to what we had before, when the count was triggered at IOCTL time
and the IOCTL blocked until the count was completed.

Acting on exceeding a certain threshold is fine, since what matters is that the count exceeds the
threshold, not by how much. That is, so long as it exceeds, do something; we don't really care
whether it exceeds by delta or 10*delta.

But what is important is the _time_ frequency of the delayed work, AMDGPU_RAS_COUNTE_DELAY_MS,
in my v2 of this patch. When set to 0, i.e. count as soon as possible, we get about a 22% duty cycle of
the CPU just doing that, as it seems this IOCTL is being called constantly, and thus
the counting takes place continuously. This isn't good for the system's performance
and power management.

When set to 3 seconds, we get normal (expected) system behaviour on Vega20 server boards.

> IMO, I'm more aligned with Luben's original approach of having something waiting in the background; instead of a periodic timer-based trigger, it could be an event-based trigger. The event may be an ioctl, error handler timer ticks, or something else.

Well, my original idea broke power management (PM), since it ran continuously
regardless of PM and of whether we actually need the count.

Now, when you say "Event may be an ioctl", this is exactly what

1) we had before, interlocked, and it made the system unusable to a GUI user, and
2) we have in this patch, but we process the count asynchronously, while
    we return the cached count in O(1), instantly.

The advantage of 2) over my original approach is that the count is triggered only
on an IOCTL call, albeit delayed so that we can return the cached value in O(1). Thus,
if no QUERY2 IOCTL is received, then we don't count the errors, as we don't schedule
the delayed work.
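
To make the flow concrete, here is a minimal userspace sketch of the scheme described above: the query returns flags from cached totals in O(1) and arms a refresh only if none is pending, and the expensive count runs later, once the delay has elapsed. This is a model, not the amdgpu code: `struct ras_cache`, `struct ctx_view`, `query2()`, `run_refresh()`, the flag values and the explicit millisecond clock are all assumptions made for illustration, standing in for the atomics and the delayed work item in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; none of these are amdgpu symbols. */
#define REFRESH_DELAY_MS 3000u
#define FLAG_RAS_CE 0x1u
#define FLAG_RAS_UE 0x2u

struct ras_cache {
	unsigned long ce_count;   /* cached correctable-error total */
	unsigned long ue_count;   /* cached uncorrectable-error total */
	bool refresh_pending;     /* models delayed_work_pending() */
	uint64_t refresh_due_ms;  /* earliest time the refresh may run */
	unsigned int slow_counts; /* number of expensive counts performed */
};

struct ctx_view {
	unsigned long seen_ce;    /* totals this context last observed */
	unsigned long seen_ue;
};

/* Fast path: O(1) comparison against the cache, then arm a refresh. */
static unsigned int query2(struct ras_cache *c, struct ctx_view *v,
			   uint64_t now_ms)
{
	unsigned int flags = 0;

	assert(c && v);

	if (c->ce_count != v->seen_ce) {
		v->seen_ce = c->ce_count;
		flags |= FLAG_RAS_CE;
	}
	if (c->ue_count != v->seen_ue) {
		v->seen_ue = c->ue_count;
		flags |= FLAG_RAS_UE;
	}

	/* Only one refresh is ever outstanding; repeated queries do not re-arm it. */
	if (!c->refresh_pending) {
		c->refresh_pending = true;
		c->refresh_due_ms = now_ms + REFRESH_DELAY_MS;
	}
	return flags;
}

/*
 * Slow path: hw_ce and hw_ue stand for the result of the expensive count.
 * Returns true if the cache was refreshed on this call.
 */
static bool run_refresh(struct ras_cache *c, uint64_t now_ms,
			unsigned long hw_ce, unsigned long hw_ue)
{
	assert(c);

	if (!c->refresh_pending || now_ms < c->refresh_due_ms)
		return false;

	c->ce_count = hw_ce;
	c->ue_count = hw_ue;
	c->slow_counts++;
	c->refresh_pending = false;
	return true;
}
```

In this model, any number of queries arriving inside the 3000 ms window share a single expensive count, and if no query arrives, no count is scheduled at all. A delay of 0 would make the refresh due as soon as each query arms it, which corresponds to the continuous counting described above.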

Regards,
Luben

> Thanks,
> Lijo
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Luben Tuikov
> Sent: Saturday, May 22, 2021 2:49 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>; Clements, John <John.Clements@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
> Subject: [PATCH 3/3] drm/amdgpu: Use delayed work to collect RAS error counters
>
> On Context Query2 IOCTL return the correctable and uncorrectable errors in O(1) fashion, from cached values, and schedule a delayed work function to calculate and cache them for the next such IOCTL.
>
> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: John Clements <john.clements@amd.com>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 32 +++++++++++++++++++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 38 +++++++++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  5 ++++
>  3 files changed, 73 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index bb0cfe871aba..4e95d255960b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -331,10 +331,13 @@ static int amdgpu_ctx_query(struct amdgpu_device *adev,
>  	return 0;
>  }
>  
> +#define AMDGPU_RAS_COUNTE_DELAY_MS 3000
> +
>  static int amdgpu_ctx_query2(struct amdgpu_device *adev,
> -	struct amdgpu_fpriv *fpriv, uint32_t id,
> -	union drm_amdgpu_ctx_out *out)
> +			     struct amdgpu_fpriv *fpriv, uint32_t id,
> +			     union drm_amdgpu_ctx_out *out)
>  {
> +	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>  	struct amdgpu_ctx *ctx;
>  	struct amdgpu_ctx_mgr *mgr;
>  
> @@ -361,6 +364,31 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>  	if (atomic_read(&ctx->guilty))
>  		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
>  
> +	if (adev->ras_enabled && con) {
> +		/* Return the cached values in O(1),
> +		 * and schedule delayed work to cache
> +		 * new values.
> +		 */
> +		int ce_count, ue_count;
> +
> +		ce_count = atomic_read(&con->ras_ce_count);
> +		ue_count = atomic_read(&con->ras_ue_count);
> +
> +		if (ce_count != ctx->ras_counter_ce) {
> +			ctx->ras_counter_ce = ce_count;
> +			out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
> +		}
> +
> +		if (ue_count != ctx->ras_counter_ue) {
> +			ctx->ras_counter_ue = ue_count;
> +			out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
> +		}
> +
> +		if (!delayed_work_pending(&con->ras_counte_delay_work))
> +			schedule_delayed_work(&con->ras_counte_delay_work,
> +				  msecs_to_jiffies(AMDGPU_RAS_COUNTE_DELAY_MS));
> +	}
> +
>  	mutex_unlock(&mgr->lock);
>  	return 0;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index ed3c43e8b0b5..80f576098318 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -27,6 +27,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/reboot.h>
>  #include <linux/syscalls.h>
> +#include <linux/pm_runtime.h>
>  
>  #include "amdgpu.h"
>  #include "amdgpu_ras.h"
> @@ -2116,6 +2117,30 @@ static void amdgpu_ras_check_supported(struct amdgpu_device *adev)
>  		adev->ras_hw_enabled & amdgpu_ras_mask;
>  }
>  
> +static void amdgpu_ras_counte_dw(struct work_struct *work)
> +{
> +	struct amdgpu_ras *con = container_of(work, struct amdgpu_ras,
> +					      ras_counte_delay_work.work);
> +	struct amdgpu_device *adev = con->adev;
> +	struct drm_device *dev = &adev->ddev;
> +	unsigned long ce_count, ue_count;
> +	int res;
> +
> +	res = pm_runtime_get_sync(dev->dev);
> +	if (res < 0)
> +		goto Out;
> +
> +	/* Cache new values.
> +	 */
> +	amdgpu_ras_query_error_count(adev, &ce_count, &ue_count);
> +	atomic_set(&con->ras_ce_count, ce_count);
> +	atomic_set(&con->ras_ue_count, ue_count);
> +
> +	pm_runtime_mark_last_busy(dev->dev);
> +Out:
> +	pm_runtime_put_autosuspend(dev->dev);
> +}
> +
>  int amdgpu_ras_init(struct amdgpu_device *adev)
>  {
>  	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> @@ -2130,6 +2155,11 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
>  	if (!con)
>  		return -ENOMEM;
>  
> +	con->adev = adev;
> +	INIT_DELAYED_WORK(&con->ras_counte_delay_work, amdgpu_ras_counte_dw);
> +	atomic_set(&con->ras_ce_count, 0);
> +	atomic_set(&con->ras_ue_count, 0);
> +
>  	con->objs = (struct ras_manager *)(con + 1);
>  
>  	amdgpu_ras_set_context(adev, con);
> @@ -2233,6 +2263,8 @@ int amdgpu_ras_late_init(struct amdgpu_device *adev,
>  			 struct ras_fs_if *fs_info,
>  			 struct ras_ih_if *ih_info)
>  {
> +	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +	unsigned long ue_count, ce_count;
>  	int r;
>  
>  	/* disable RAS feature per IP block if it is not supported */
> @@ -2273,6 +2305,12 @@ int amdgpu_ras_late_init(struct amdgpu_device *adev,
>  	if (r)
>  		goto sysfs;
>  
> +	/* Those are the cached values at init.
> +	 */
> +	amdgpu_ras_query_error_count(adev, &ce_count, &ue_count);
> +	atomic_set(&con->ras_ce_count, ce_count);
> +	atomic_set(&con->ras_ue_count, ue_count);
> +
>  	return 0;
>  cleanup:
>  	amdgpu_ras_sysfs_remove(adev, ras_block);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 10fca0393106..256cea5d34f2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -340,6 +340,11 @@ struct amdgpu_ras {
>  
>  	/* disable ras error count harvest in recovery */
>  	bool disable_ras_err_cnt_harvest;
> +
> +	/* RAS count errors delayed work */
> +	struct delayed_work ras_counte_delay_work;
> +	atomic_t ras_ue_count;
> +	atomic_t ras_ce_count;
>  };
>  
>  struct ras_fs_data {
> --
> 2.31.1.527.g2d677e5b15
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: Use delayed work to collect RAS error counters
  2021-05-26 15:12     ` Luben Tuikov
@ 2021-05-26 16:05       ` Lazar, Lijo
  2021-05-26 16:11         ` Alex Deucher
  0 siblings, 1 reply; 12+ messages in thread
From: Lazar, Lijo @ 2021-05-26 16:05 UTC (permalink / raw)
  To: Tuikov, Luben, amd-gfx
  Cc: Deucher, Alexander, Clements, John, Koenig, Christian, Zhang, Hawking


[-- Attachment #1.1: Type: text/plain, Size: 10907 bytes --]

[AMD Official Use Only]

Hi Luben,

What I meant by event-based is a thread waiting on a wait queue for events, not periodic polling as you had in the original patch. The IOCTL still fetches the cached data, but also triggers an event to poll for new errors. Similarly, a periodic error handler running to handle threshold errors could also trigger an event. Basically, the error data fetch is centralized in the thread.

It's just a different approach; I don't know if it will make things more complex.
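
A rough model of that control flow, written as a single-threaded userspace sketch so that it can be read and run in isolation: several sources post events, and one collector consumes whatever is pending and performs a single fetch. All names here (`struct ras_collector`, `ras_post_event()`, `ras_collector_step()`, the event bits) are hypothetical, and a real implementation would block on a kernel wait queue instead of being stepped by hand.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event sources; not amdgpu symbols. */
enum ras_event {
	RAS_EVT_IOCTL       = 1 << 0, /* a query IOCTL asked for fresh data */
	RAS_EVT_ERR_HANDLER = 1 << 1, /* a periodic error handler ticked */
};

struct ras_collector {
	unsigned int pending;   /* events posted since the last fetch */
	unsigned long ce_count; /* cached correctable-error total */
	unsigned long ue_count; /* cached uncorrectable-error total */
	unsigned int fetches;   /* number of expensive fetches performed */
};

/* Called by any event source; cheap, and never fetches by itself. */
static void ras_post_event(struct ras_collector *rc, enum ras_event evt)
{
	assert(rc);
	rc->pending |= (unsigned int)evt;
}

/*
 * One iteration of the collector's loop body. hw_ce and hw_ue stand for
 * the result of the expensive count. Returns false when nothing was
 * pending, which is where a real thread would go back to sleep.
 */
static bool ras_collector_step(struct ras_collector *rc,
			       unsigned long hw_ce, unsigned long hw_ue)
{
	assert(rc);

	if (!rc->pending)
		return false;

	/* All events posted so far are served by this one fetch. */
	rc->pending = 0;
	rc->ce_count = hw_ce;
	rc->ue_count = hw_ue;
	rc->fetches++;
	return true;
}
```

The point of the sketch is the centralization: however many sources post an event between two iterations, the collector does one fetch, and readers only ever look at the cached totals.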

Thanks,
Lijo
________________________________
From: Tuikov, Luben <Luben.Tuikov@amd.com>
Sent: Wednesday, May 26, 2021 8:42:29 PM
To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Clements, John <John.Clements@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
Subject: Re: [PATCH 3/3] drm/amdgpu: Use delayed work to collect RAS error counters

On 2021-05-26 7:00 a.m., Lazar, Lijo wrote:
> [AMD Official Use Only]
>
> Scheduling an error status query just based on IOCTL doesn't sound like a sound approach. What if driver needs to handle errors based on that - for ex: if the number of correctable errors exceed a certain threshold?
That's exactly the trigger which invokes the error count procedure. The difference is that on that IOCTL
we return the cached values in O(1), and then trigger the counting procedure in a delayed work item,
since it takes O(n^3) and we cannot do it while the IOCTL is being processed, as that negatively
affects the user experience and the overall agility of the system.

This is the closest implementation to what we had before, when the count was triggered at IOCTL time
and the IOCTL blocked until the count was completed.

Acting on exceeding a certain threshold is fine, since what matters is that the count exceeds the
threshold, not by how much. That is, so long as it exceeds, do something; we don't really care
whether it exceeds by delta or 10*delta.

But what is important is the _time_ frequency of the delayed work, AMDGPU_RAS_COUNTE_DELAY_MS,
in my v2 of this patch. When set to 0, i.e. count as soon as possible, we get about a 22% duty cycle of
the CPU just doing that, as it seems this IOCTL is being called constantly, and thus
the counting takes place continuously. This isn't good for the system's performance
and power management.

When set to 3 seconds, we get normal (expected) system behaviour on Vega20 server boards.

> IMO, I'm more aligned with Luben's original approach of having something waiting in the background; instead of a periodic timer-based trigger, it could be an event-based trigger. The event may be an ioctl, error handler timer ticks, or something else.

Well, my original idea broke power management (PM), since it ran continuously
regardless of PM and of whether we actually need the count.

Now, when you say "Event may be an ioctl", this is exactly what

1) we had before, interlocked, and it made the system unusable to a GUI user, and
2) we have in this patch, but we process the count asynchronously, while
    we return the cached count in O(1), instantly.

The advantage of 2) over my original approach is that the count is triggered only
on an IOCTL call, albeit delayed so that we can return the cached value in O(1). Thus,
if no QUERY2 IOCTL is received, then we don't count the errors, as we don't schedule
the delayed work.

Regards,
Luben

> Thanks,
> Lijo
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Luben Tuikov
> Sent: Saturday, May 22, 2021 2:49 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>; Clements, John <John.Clements@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
> Subject: [PATCH 3/3] drm/amdgpu: Use delayed work to collect RAS error counters
>
> On Context Query2 IOCTL return the correctable and uncorrectable errors in O(1) fashion, from cached values, and schedule a delayed work function to calculate and cache them for the next such IOCTL.
>
> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: John Clements <john.clements@amd.com>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 32 +++++++++++++++++++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 38 +++++++++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  5 ++++
>  3 files changed, 73 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index bb0cfe871aba..4e95d255960b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -331,10 +331,13 @@ static int amdgpu_ctx_query(struct amdgpu_device *adev,
>        return 0;
>  }
>
> +#define AMDGPU_RAS_COUNTE_DELAY_MS 3000
> +
>  static int amdgpu_ctx_query2(struct amdgpu_device *adev,
> -     struct amdgpu_fpriv *fpriv, uint32_t id,
> -     union drm_amdgpu_ctx_out *out)
> +                          struct amdgpu_fpriv *fpriv, uint32_t id,
> +                          union drm_amdgpu_ctx_out *out)
>  {
> +     struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>        struct amdgpu_ctx *ctx;
>        struct amdgpu_ctx_mgr *mgr;
>
> @@ -361,6 +364,31 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>        if (atomic_read(&ctx->guilty))
>                out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
>
> +     if (adev->ras_enabled && con) {
> +             /* Return the cached values in O(1),
> +              * and schedule delayed work to cache
> +              * new values.
> +              */
> +             int ce_count, ue_count;
> +
> +             ce_count = atomic_read(&con->ras_ce_count);
> +             ue_count = atomic_read(&con->ras_ue_count);
> +
> +             if (ce_count != ctx->ras_counter_ce) {
> +                     ctx->ras_counter_ce = ce_count;
> +                     out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
> +             }
> +
> +             if (ue_count != ctx->ras_counter_ue) {
> +                     ctx->ras_counter_ue = ue_count;
> +                     out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
> +             }
> +
> +             if (!delayed_work_pending(&con->ras_counte_delay_work))
> +                     schedule_delayed_work(&con->ras_counte_delay_work,
> +                               msecs_to_jiffies(AMDGPU_RAS_COUNTE_DELAY_MS));
> +     }
> +
>        mutex_unlock(&mgr->lock);
>        return 0;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index ed3c43e8b0b5..80f576098318 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -27,6 +27,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/reboot.h>
>  #include <linux/syscalls.h>
> +#include <linux/pm_runtime.h>
>
>  #include "amdgpu.h"
>  #include "amdgpu_ras.h"
> @@ -2116,6 +2117,30 @@ static void amdgpu_ras_check_supported(struct amdgpu_device *adev)
>                adev->ras_hw_enabled & amdgpu_ras_mask;
>  }
>
> +static void amdgpu_ras_counte_dw(struct work_struct *work)
> +{
> +     struct amdgpu_ras *con = container_of(work, struct amdgpu_ras,
> +                                           ras_counte_delay_work.work);
> +     struct amdgpu_device *adev = con->adev;
> +     struct drm_device *dev = &adev->ddev;
> +     unsigned long ce_count, ue_count;
> +     int res;
> +
> +     res = pm_runtime_get_sync(dev->dev);
> +     if (res < 0)
> +             goto Out;
> +
> +     /* Cache new values.
> +      */
> +     amdgpu_ras_query_error_count(adev, &ce_count, &ue_count);
> +     atomic_set(&con->ras_ce_count, ce_count);
> +     atomic_set(&con->ras_ue_count, ue_count);
> +
> +     pm_runtime_mark_last_busy(dev->dev);
> +Out:
> +     pm_runtime_put_autosuspend(dev->dev);
> +}
> +
>  int amdgpu_ras_init(struct amdgpu_device *adev)
>  {
>        struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> @@ -2130,6 +2155,11 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
>        if (!con)
>                return -ENOMEM;
>
> +     con->adev = adev;
> +     INIT_DELAYED_WORK(&con->ras_counte_delay_work, amdgpu_ras_counte_dw);
> +     atomic_set(&con->ras_ce_count, 0);
> +     atomic_set(&con->ras_ue_count, 0);
> +
>        con->objs = (struct ras_manager *)(con + 1);
>
>        amdgpu_ras_set_context(adev, con);
> @@ -2233,6 +2263,8 @@ int amdgpu_ras_late_init(struct amdgpu_device *adev,
>                         struct ras_fs_if *fs_info,
>                         struct ras_ih_if *ih_info)
>  {
> +     struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +     unsigned long ue_count, ce_count;
>        int r;
>
>        /* disable RAS feature per IP block if it is not supported */
> @@ -2273,6 +2305,12 @@ int amdgpu_ras_late_init(struct amdgpu_device *adev,
>        if (r)
>                goto sysfs;
>
> +     /* Those are the cached values at init.
> +      */
> +     amdgpu_ras_query_error_count(adev, &ce_count, &ue_count);
> +     atomic_set(&con->ras_ce_count, ce_count);
> +     atomic_set(&con->ras_ue_count, ue_count);
> +
>        return 0;
>  cleanup:
>        amdgpu_ras_sysfs_remove(adev, ras_block);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 10fca0393106..256cea5d34f2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -340,6 +340,11 @@ struct amdgpu_ras {
>
>        /* disable ras error count harvest in recovery */
>        bool disable_ras_err_cnt_harvest;
> +
> +     /* RAS count errors delayed work */
> +     struct delayed_work ras_counte_delay_work;
> +     atomic_t ras_ue_count;
> +     atomic_t ras_ce_count;
>  };
>
>  struct ras_fs_data {
> --
> 2.31.1.527.g2d677e5b15
>


[-- Attachment #1.2: Type: text/html, Size: 18218 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] drm/amdgpu: Use delayed work to collect RAS error counters
  2021-05-26 16:05       ` Lazar, Lijo
@ 2021-05-26 16:11         ` Alex Deucher
  0 siblings, 0 replies; 12+ messages in thread
From: Alex Deucher @ 2021-05-26 16:11 UTC (permalink / raw)
  To: Lazar, Lijo
  Cc: amd-gfx, Tuikov, Luben, Deucher, Alexander, Clements, John,
	Koenig, Christian, Zhang, Hawking

Isn't that kind of what this patch does?  When the app is running, these
IOCTL calls are happening at regular intervals, so if we schedule
work, the cache should be updated by the next time we get the IOCTL
call.  I'm not sure how we would trigger the events.  I don't think
interrupts on correctable errors make sense.

Alex

On Wed, May 26, 2021 at 12:05 PM Lazar, Lijo <Lijo.Lazar@amd.com> wrote:
>
> [AMD Official Use Only]
>
>
> Hi Luben,
>
> What I meant by event-based is a thread waiting on a wait queue for events, not periodic polling as you had in the original patch. The IOCTL still fetches the cached data, but also triggers an event to poll for new errors. Similarly, a periodic error handler running to handle threshold errors could also trigger an event. Basically, the error data fetch is centralized in the thread.
>
> It's just a different approach; I don't know if it will make things more complex.
>
> Thanks,
> Lijo
> ________________________________
> From: Tuikov, Luben <Luben.Tuikov@amd.com>
> Sent: Wednesday, May 26, 2021 8:42:29 PM
> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Clements, John <John.Clements@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
> Subject: Re: [PATCH 3/3] drm/amdgpu: Use delayed work to collect RAS error counters
>
> On 2021-05-26 7:00 a.m., Lazar, Lijo wrote:
> > [AMD Official Use Only]
> >
> > Scheduling an error status query just based on IOCTL doesn't sound like a sound approach. What if driver needs to handle errors based on that - for ex: if the number of correctable errors exceed a certain threshold?
> That's exactly the trigger which invokes the error count procedure. The difference is that on that IOCTL
> we return the cached values in O(1), and then trigger the counting procedure in a delayed work item,
> since it takes O(n^3) and we cannot do it while the IOCTL is being processed, as that negatively
> affects the user experience and the overall agility of the system.
>
> This is the closest implementation to what we had before, when the count was triggered at IOCTL time
> and the IOCTL blocked until the count was completed.
>
> Acting on exceeding a certain threshold is fine, since what matters is that the count exceeds the
> threshold, not by how much. That is, so long as it exceeds, do something; we don't really care
> whether it exceeds by delta or 10*delta.
>
> But what is important is the _time_ frequency of the delayed work, AMDGPU_RAS_COUNTE_DELAY_MS,
> in my v2 of this patch. When set to 0, i.e. count as soon as possible, we get about a 22% duty cycle of
> the CPU just doing that, as it seems this IOCTL is being called constantly, and thus
> the counting takes place continuously. This isn't good for the system's performance
> and power management.
>
> When set to 3 seconds, we get normal (expected) system behaviour on Vega20 server boards.
>
> > IMO, I'm more aligned with Luben's original approach of having something waiting in the background; instead of a periodic timer-based trigger, it could be an event-based trigger. The event may be an ioctl, error handler timer ticks, or something else.
>
> Well, my original idea broke power management (PM), since it ran continuously
> regardless of PM and of whether we actually need the count.
>
> Now, when you say "Event may be an ioctl", this is exactly what
>
> 1) we had before, interlocked, and it made the system unusable to a GUI user, and
> 2) we have in this patch, but we process the count asynchronously, while
>     we return the cached count in O(1), instantly.
>
> The advantage of 2) over my original approach is that the count is triggered only
> on an IOCTL call, albeit delayed so that we can return the cached value in O(1). Thus,
> if no QUERY2 IOCTL is received, then we don't count the errors, as we don't schedule
> the delayed work.
>
> Regards,
> Luben
>
> > Thanks,
> > Lijo
> >
> > -----Original Message-----
> > From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Luben Tuikov
> > Sent: Saturday, May 22, 2021 2:49 AM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>; Clements, John <John.Clements@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
> > Subject: [PATCH 3/3] drm/amdgpu: Use delayed work to collect RAS error counters
> >
> > On Context Query2 IOCTL return the correctable and uncorrectable errors in O(1) fashion, from cached values, and schedule a delayed work function to calculate and cache them for the next such IOCTL.
> >
> > Cc: Alexander Deucher <Alexander.Deucher@amd.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: John Clements <john.clements@amd.com>
> > Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> > Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 32 +++++++++++++++++++--
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 38 +++++++++++++++++++++++++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  5 ++++
> >  3 files changed, 73 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > index bb0cfe871aba..4e95d255960b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > @@ -331,10 +331,13 @@ static int amdgpu_ctx_query(struct amdgpu_device *adev,
> >        return 0;
> >  }
> >
> > +#define AMDGPU_RAS_COUNTE_DELAY_MS 3000
> > +
> >  static int amdgpu_ctx_query2(struct amdgpu_device *adev,
> > -     struct amdgpu_fpriv *fpriv, uint32_t id,
> > -     union drm_amdgpu_ctx_out *out)
> > +                          struct amdgpu_fpriv *fpriv, uint32_t id,
> > +                          union drm_amdgpu_ctx_out *out)
> >  {
> > +     struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >        struct amdgpu_ctx *ctx;
> >        struct amdgpu_ctx_mgr *mgr;
> >
> > @@ -361,6 +364,31 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
> >        if (atomic_read(&ctx->guilty))
> >                out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
> >
> > +     if (adev->ras_enabled && con) {
> > +             /* Return the cached values in O(1),
> > +              * and schedule delayed work to cache
> > +              * new values.
> > +              */
> > +             int ce_count, ue_count;
> > +
> > +             ce_count = atomic_read(&con->ras_ce_count);
> > +             ue_count = atomic_read(&con->ras_ue_count);
> > +
> > +             if (ce_count != ctx->ras_counter_ce) {
> > +                     ctx->ras_counter_ce = ce_count;
> > +                     out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
> > +             }
> > +
> > +             if (ue_count != ctx->ras_counter_ue) {
> > +                     ctx->ras_counter_ue = ue_count;
> > +                     out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
> > +             }
> > +
> > +             if (!delayed_work_pending(&con->ras_counte_delay_work))
> > +                     schedule_delayed_work(&con->ras_counte_delay_work,
> > +                               msecs_to_jiffies(AMDGPU_RAS_COUNTE_DELAY_MS));
> > +     }
> > +
> >        mutex_unlock(&mgr->lock);
> >        return 0;
> >  }
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index ed3c43e8b0b5..80f576098318 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -27,6 +27,7 @@
> >  #include <linux/uaccess.h>
> >  #include <linux/reboot.h>
> >  #include <linux/syscalls.h>
> > +#include <linux/pm_runtime.h>
> >
> >  #include "amdgpu.h"
> >  #include "amdgpu_ras.h"
> > @@ -2116,6 +2117,30 @@ static void amdgpu_ras_check_supported(struct amdgpu_device *adev)
> >                adev->ras_hw_enabled & amdgpu_ras_mask;
> >  }
> >
> > +static void amdgpu_ras_counte_dw(struct work_struct *work)
> > +{
> > +     struct amdgpu_ras *con = container_of(work, struct amdgpu_ras,
> > +                                           ras_counte_delay_work.work);
> > +     struct amdgpu_device *adev = con->adev;
> > +     struct drm_device *dev = &adev->ddev;
> > +     unsigned long ce_count, ue_count;
> > +     int res;
> > +
> > +     res = pm_runtime_get_sync(dev->dev);
> > +     if (res < 0)
> > +             goto Out;
> > +
> > +     /* Cache new values.
> > +      */
> > +     amdgpu_ras_query_error_count(adev, &ce_count, &ue_count);
> > +     atomic_set(&con->ras_ce_count, ce_count);
> > +     atomic_set(&con->ras_ue_count, ue_count);
> > +
> > +     pm_runtime_mark_last_busy(dev->dev);
> > +Out:
> > +     pm_runtime_put_autosuspend(dev->dev);
> > +}
> > +
> >  int amdgpu_ras_init(struct amdgpu_device *adev)
> >  {
> >        struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> > @@ -2130,6 +2155,11 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
> >        if (!con)
> >                return -ENOMEM;
> >
> > +     con->adev = adev;
> > +     INIT_DELAYED_WORK(&con->ras_counte_delay_work, amdgpu_ras_counte_dw);
> > +     atomic_set(&con->ras_ce_count, 0);
> > +     atomic_set(&con->ras_ue_count, 0);
> > +
> >        con->objs = (struct ras_manager *)(con + 1);
> >
> >        amdgpu_ras_set_context(adev, con);
> > @@ -2233,6 +2263,8 @@ int amdgpu_ras_late_init(struct amdgpu_device *adev,
> >                         struct ras_fs_if *fs_info,
> >                         struct ras_ih_if *ih_info)
> >  {
> > +     struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> > +     unsigned long ue_count, ce_count;
> >        int r;
> >
> >        /* disable RAS feature per IP block if it is not supported */
> > @@ -2273,6 +2305,12 @@ int amdgpu_ras_late_init(struct amdgpu_device *adev,
> >        if (r)
> >                goto sysfs;
> >
> > +     /* Those are the cached values at init.
> > +      */
> > +     amdgpu_ras_query_error_count(adev, &ce_count, &ue_count);
> > +     atomic_set(&con->ras_ce_count, ce_count);
> > +     atomic_set(&con->ras_ue_count, ue_count);
> > +
> >        return 0;
> >  cleanup:
> >        amdgpu_ras_sysfs_remove(adev, ras_block);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > index 10fca0393106..256cea5d34f2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > @@ -340,6 +340,11 @@ struct amdgpu_ras {
> >
> >        /* disable ras error count harvest in recovery */
> >        bool disable_ras_err_cnt_harvest;
> > +
> > +     /* RAS count errors delayed work */
> > +     struct delayed_work ras_counte_delay_work;
> > +     atomic_t ras_ue_count;
> > +     atomic_t ras_ce_count;
> >  };
> >
> >  struct ras_fs_data {
> > --
> > 2.31.1.527.g2d677e5b15
> >
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
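
The caching scheme in the quoted patch 3/3 can be illustrated outside the kernel. The following is a minimal user-space sketch, not the driver code: `struct ras_cache`, `struct ctx_state`, `ctx_query()`, `ras_refresh()` and the flag values are hypothetical names chosen for illustration, and a plain `refresh_pending` flag stands in for `delayed_work_pending()` / `schedule_delayed_work()`. It assumes the expensive count is produced elsewhere and handed to the refresh step.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag values, for illustration only. */
#define FLAG_RAS_CE 0x1u
#define FLAG_RAS_UE 0x2u

/* Device-wide cache of the last computed totals. */
struct ras_cache {
	atomic_ulong ce_count;
	atomic_ulong ue_count;
	bool refresh_pending;	/* stands in for delayed_work_pending() */
};

/* Per-context record of the totals last reported to that context. */
struct ctx_state {
	unsigned long seen_ce;
	unsigned long seen_ue;
};

static void ras_cache_init(struct ras_cache *c)
{
	assert(c);
	atomic_init(&c->ce_count, 0);
	atomic_init(&c->ue_count, 0);
	c->refresh_pending = false;
}

/* O(1) query: read the cached totals, flag any change, request a refresh. */
static unsigned int ctx_query(struct ras_cache *c, struct ctx_state *ctx)
{
	unsigned long ce = atomic_load(&c->ce_count);
	unsigned long ue = atomic_load(&c->ue_count);
	unsigned int flags = 0;

	if (ce != ctx->seen_ce) {
		ctx->seen_ce = ce;
		flags |= FLAG_RAS_CE;
	}
	if (ue != ctx->seen_ue) {
		ctx->seen_ue = ue;
		flags |= FLAG_RAS_UE;
	}

	/* Only mark the refresh as wanted; the slow count runs later. */
	if (!c->refresh_pending)
		c->refresh_pending = true;

	return flags;
}

/* Simulated delayed work: store freshly computed totals in the cache. */
static void ras_refresh(struct ras_cache *c, unsigned long ce, unsigned long ue)
{
	atomic_store(&c->ce_count, ce);
	atomic_store(&c->ue_count, ue);
	c->refresh_pending = false;
}
```

In this sketch a query never pays for the count itself, so a change in the totals becomes visible only on the first query after a refresh has run, which matches the "for the next such IOCTL" wording of the commit message.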

* [PATCH 2/3] drm/amdgpu: Fix RAS function interface
  2021-05-26 16:43 [PATCH 1/3] drm/amdgpu: Don't query CE and UE errors Luben Tuikov
@ 2021-05-26 16:43 ` Luben Tuikov
  0 siblings, 0 replies; 12+ messages in thread
From: Luben Tuikov @ 2021-05-26 16:43 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander Deucher, Luben Tuikov, John Clements, Hawking Zhang

The correctable and uncorrectable errors
are calculated at each invocation of this
function. Therefore, it is highly inefficient to
return just one of them based on a Boolean
input. If the caller wants both, twice the work
would be done. (And this work is O(n^3) on
Vega20.)

Fix this "interface" to simply return what it had
calculated--both values. Let the caller choose
what it wants to record, inspect, use.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 23 +++++++++++++++--------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  5 +++--
 2 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index e3a4c3a7635a..ed3c43e8b0b5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1043,29 +1043,36 @@ int amdgpu_ras_error_inject(struct amdgpu_device *adev,
 }
 
 /* get the total error counts on all IPs */
-unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
-		bool is_ce)
+void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
+				  unsigned long *ce_count,
+				  unsigned long *ue_count)
 {
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 	struct ras_manager *obj;
-	struct ras_err_data data = {0, 0};
+	unsigned long ce, ue;
 
 	if (!adev->ras_enabled || !con)
-		return 0;
+		return;
 
+	ce = 0;
+	ue = 0;
 	list_for_each_entry(obj, &con->head, node) {
 		struct ras_query_if info = {
 			.head = obj->head,
 		};
 
 		if (amdgpu_ras_query_error_status(adev, &info))
-			return 0;
+			return;
 
-		data.ce_count += info.ce_count;
-		data.ue_count += info.ue_count;
+		ce += info.ce_count;
+		ue += info.ue_count;
 	}
 
-	return is_ce ? data.ce_count : data.ue_count;
+	if (ce_count)
+		*ce_count = ce;
+
+	if (ue_count)
+		*ue_count = ue;
 }
 /* query/inject/cure end */
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index bfa40c8ecc94..10fca0393106 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -485,8 +485,9 @@ int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev,
 void amdgpu_ras_resume(struct amdgpu_device *adev);
 void amdgpu_ras_suspend(struct amdgpu_device *adev);
 
-unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
-		bool is_ce);
+void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
+				  unsigned long *ce_count,
+				  unsigned long *ue_count);
 
 /* error handling functions */
 int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
-- 
2.31.1.527.g2d677e5b15
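
The interface change in this patch can be shown in isolation. The following is a minimal user-space sketch under stated assumptions, not the driver function: `struct block_err` and `query_error_count()` are hypothetical stand-ins for the driver's list of RAS objects and for `amdgpu_ras_query_error_count()`. It shows one pass over the data filling both totals through optional out-pointers, so a caller that wants both values does not repeat the walk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-block error record, for illustration only. */
struct block_err {
	unsigned long ce_count;
	unsigned long ue_count;
};

/*
 * Sum both error kinds in a single pass. Either out-pointer may be
 * NULL when the caller does not need that total.
 */
static void query_error_count(const struct block_err *blocks, size_t n,
			      unsigned long *ce_count,
			      unsigned long *ue_count)
{
	unsigned long ce = 0, ue = 0;
	size_t i;

	assert(blocks != NULL || n == 0);

	for (i = 0; i < n; i++) {
		ce += blocks[i].ce_count;
		ue += blocks[i].ue_count;
	}

	if (ce_count)
		*ce_count = ce;
	if (ue_count)
		*ue_count = ue;
}
```

A caller interested in only one total passes NULL for the other pointer. Unlike the patch, this sketch has no early-return path, so the out-parameters are always written when they are non-NULL.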

* [PATCH 2/3] drm/amdgpu: Fix RAS function interface
  2021-05-26  0:40 [PATCH 1/3] drm/amdgpu: Don't query CE and UE errors Luben Tuikov
@ 2021-05-26  0:40 ` Luben Tuikov
  0 siblings, 0 replies; 12+ messages in thread
From: Luben Tuikov @ 2021-05-26  0:40 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander Deucher, Luben Tuikov, John Clements, Hawking Zhang

The correctable and uncorrectable errors
are calculated at each invocation of this
function. Therefore, it is highly inefficient to
return just one of them based on a Boolean
input. If the caller wants both, twice the work
would be done. (And this work is O(n^3) on
Vega20.)

Fix this "interface" to simply return what it had
calculated--both values. Let the caller choose
what it wants to record, inspect, use.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 23 +++++++++++++++--------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  5 +++--
 2 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index e3a4c3a7635a..ed3c43e8b0b5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1043,29 +1043,36 @@ int amdgpu_ras_error_inject(struct amdgpu_device *adev,
 }
 
 /* get the total error counts on all IPs */
-unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
-		bool is_ce)
+void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
+				  unsigned long *ce_count,
+				  unsigned long *ue_count)
 {
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 	struct ras_manager *obj;
-	struct ras_err_data data = {0, 0};
+	unsigned long ce, ue;
 
 	if (!adev->ras_enabled || !con)
-		return 0;
+		return;
 
+	ce = 0;
+	ue = 0;
 	list_for_each_entry(obj, &con->head, node) {
 		struct ras_query_if info = {
 			.head = obj->head,
 		};
 
 		if (amdgpu_ras_query_error_status(adev, &info))
-			return 0;
+			return;
 
-		data.ce_count += info.ce_count;
-		data.ue_count += info.ue_count;
+		ce += info.ce_count;
+		ue += info.ue_count;
 	}
 
-	return is_ce ? data.ce_count : data.ue_count;
+	if (ce_count)
+		*ce_count = ce;
+
+	if (ue_count)
+		*ue_count = ue;
 }
 /* query/inject/cure end */
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index bfa40c8ecc94..10fca0393106 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -485,8 +485,9 @@ int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev,
 void amdgpu_ras_resume(struct amdgpu_device *adev);
 void amdgpu_ras_suspend(struct amdgpu_device *adev);
 
-unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
-		bool is_ce);
+void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
+				  unsigned long *ce_count,
+				  unsigned long *ue_count);
 
 /* error handling functions */
 int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
-- 
2.31.1.527.g2d677e5b15

end of thread, other threads:[~2021-05-26 16:43 UTC | newest]

Thread overview: 12+ messages
2021-05-21 21:18 [PATCH 1/3] drm/amdgpu: Don't query CE and UE errors Luben Tuikov
2021-05-21 21:18 ` Luben Tuikov
2021-05-21 21:18 ` [PATCH 2/3] drm/amdgpu: Fix RAS function interface Luben Tuikov
2021-05-21 21:18 ` [PATCH 3/3] drm/amdgpu: Use delayed work to collect RAS error counters Luben Tuikov
2021-05-25 22:03   ` Alex Deucher
2021-05-25 23:56     ` Luben Tuikov
2021-05-26 11:00   ` Lazar, Lijo
2021-05-26 15:12     ` Luben Tuikov
2021-05-26 16:05       ` Lazar, Lijo
2021-05-26 16:11         ` Alex Deucher
2021-05-26  0:40 [PATCH 1/3] drm/amdgpu: Don't query CE and UE errors Luben Tuikov
2021-05-26  0:40 ` [PATCH 2/3] drm/amdgpu: Fix RAS function interface Luben Tuikov
2021-05-26 16:43 [PATCH 1/3] drm/amdgpu: Don't query CE and UE errors Luben Tuikov
2021-05-26 16:43 ` [PATCH 2/3] drm/amdgpu: Fix RAS function interface Luben Tuikov
