All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors
@ 2021-05-12 17:03 ` Luben Tuikov
  0 siblings, 0 replies; 20+ messages in thread
From: Luben Tuikov @ 2021-05-12 17:03 UTC (permalink / raw)
  To: amd-gfx; +Cc: Luben Tuikov, Alexander Deucher, stable

On QUERY2 IOCTL don't query counts of correctable
and uncorrectable errors, since when RAS is
enabled and supported on Vega20 server boards,
this takes insurmountably long time, in O(n^3),
which slows the system down to the point of it
being unusable when we have GUI up.

Fixes: ae363a212b14 ("drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2")
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 26 ++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 01fe60fedcbe..d481a33f4eaf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -363,19 +363,19 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
 		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
 
 	/*query ue count*/
-	ras_counter = amdgpu_ras_query_error_count(adev, false);
-	/*ras counter is monotonic increasing*/
-	if (ras_counter != ctx->ras_counter_ue) {
-		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
-		ctx->ras_counter_ue = ras_counter;
-	}
-
-	/*query ce count*/
-	ras_counter = amdgpu_ras_query_error_count(adev, true);
-	if (ras_counter != ctx->ras_counter_ce) {
-		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
-		ctx->ras_counter_ce = ras_counter;
-	}
+	/* ras_counter = amdgpu_ras_query_error_count(adev, false); */
+	/* /\*ras counter is monotonic increasing*\/ */
+	/* if (ras_counter != ctx->ras_counter_ue) { */
+	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE; */
+	/* 	ctx->ras_counter_ue = ras_counter; */
+	/* } */
+
+	/* /\*query ce count*\/ */
+	/* ras_counter = amdgpu_ras_query_error_count(adev, true); */
+	/* if (ras_counter != ctx->ras_counter_ce) { */
+	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE; */
+	/* 	ctx->ras_counter_ce = ras_counter; */
+	/* } */
 
 	mutex_unlock(&mgr->lock);
 	return 0;
-- 
2.31.1.527.g2d677e5b15


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors
@ 2021-05-12 17:03 ` Luben Tuikov
  0 siblings, 0 replies; 20+ messages in thread
From: Luben Tuikov @ 2021-05-12 17:03 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander Deucher, Luben Tuikov, stable

On QUERY2 IOCTL don't query counts of correctable
and uncorrectable errors, since when RAS is
enabled and supported on Vega20 server boards,
this takes insurmountably long time, in O(n^3),
which slows the system down to the point of it
being unusable when we have GUI up.

Fixes: ae363a212b14 ("drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2")
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 26 ++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 01fe60fedcbe..d481a33f4eaf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -363,19 +363,19 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
 		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
 
 	/*query ue count*/
-	ras_counter = amdgpu_ras_query_error_count(adev, false);
-	/*ras counter is monotonic increasing*/
-	if (ras_counter != ctx->ras_counter_ue) {
-		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
-		ctx->ras_counter_ue = ras_counter;
-	}
-
-	/*query ce count*/
-	ras_counter = amdgpu_ras_query_error_count(adev, true);
-	if (ras_counter != ctx->ras_counter_ce) {
-		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
-		ctx->ras_counter_ce = ras_counter;
-	}
+	/* ras_counter = amdgpu_ras_query_error_count(adev, false); */
+	/* /\*ras counter is monotonic increasing*\/ */
+	/* if (ras_counter != ctx->ras_counter_ue) { */
+	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE; */
+	/* 	ctx->ras_counter_ue = ras_counter; */
+	/* } */
+
+	/* /\*query ce count*\/ */
+	/* ras_counter = amdgpu_ras_query_error_count(adev, true); */
+	/* if (ras_counter != ctx->ras_counter_ce) { */
+	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE; */
+	/* 	ctx->ras_counter_ce = ras_counter; */
+	/* } */
 
 	mutex_unlock(&mgr->lock);
 	return 0;
-- 
2.31.1.527.g2d677e5b15

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/2] drm/amdgpu: Poll of RAS errors asynchronously
  2021-05-12 17:03 ` Luben Tuikov
  (?)
@ 2021-05-12 17:03 ` Luben Tuikov
  2021-05-13  8:00   ` Christian König
  -1 siblings, 1 reply; 20+ messages in thread
From: Luben Tuikov @ 2021-05-12 17:03 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander Deucher, Luben Tuikov, John Clements, Hawking Zhang

When using Vega 20 with RAS support and RAS is
enabled, the system interactivity is extremely
slow, to the point of being unusable. After
debugging, it was determined that this is due to
the polling loop performed for
AMDGPU_CTX_OP_QUERY_STATE2 under
amdgpu_ctx_ioctl(), which seems to be executed on
every ioctl from X/Mesa.

The latter seems to call amdgpu_ctx_query2() which
calls amdgpu_ras_query_error_count() twice, once
for each of ue_count and ce_count. This is
unnecessarily wasteful since
amdgpu_ras_query_error_count() calculates both,
but with the current interface it returns one or
the other, depending on its Boolean input, when it
can in fact return both, in a single invocation,
if it had a better interface.

Further down, the query_ras_error_count() callback
is called, which could be quite a large polling
loop, and very time consuming. For instance,
gfx_v9_0_query_ras_error_count() is at least
O(n^3). A similar situation is seen with
umc_v6_1_query_ras_error_count().

This commit implements asynchronous RAS error
count polling to that of the ioctl. A kernel
thread polls the RAS error counters once in a
while. The ioctl reads the values
asynchronously. The poll frequency is a module
parameter, with range [500, 10000] milliseconds,
default 3000.

v2: Enable setting the polling interval to 0,
    which disables the thread.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 36 +++++++-------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 10 ++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 66 ++++++++++++++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  8 +--
 5 files changed, 93 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 3147c1c935c8..a269f778194f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -197,6 +197,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
 extern int amdgpu_ras_enable;
 extern uint amdgpu_ras_mask;
 extern int amdgpu_bad_page_threshold;
+extern uint amdgpu_ras_thread_poll_ms;
 extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
 extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index d481a33f4eaf..558e887e99b6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -332,12 +332,12 @@ static int amdgpu_ctx_query(struct amdgpu_device *adev,
 }
 
 static int amdgpu_ctx_query2(struct amdgpu_device *adev,
-	struct amdgpu_fpriv *fpriv, uint32_t id,
-	union drm_amdgpu_ctx_out *out)
+			     struct amdgpu_fpriv *fpriv, uint32_t id,
+			     union drm_amdgpu_ctx_out *out)
 {
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 	struct amdgpu_ctx *ctx;
 	struct amdgpu_ctx_mgr *mgr;
-	unsigned long ras_counter;
 
 	if (!fpriv)
 		return -EINVAL;
@@ -362,20 +362,22 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
 	if (atomic_read(&ctx->guilty))
 		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
 
-	/*query ue count*/
-	/* ras_counter = amdgpu_ras_query_error_count(adev, false); */
-	/* /\*ras counter is monotonic increasing*\/ */
-	/* if (ras_counter != ctx->ras_counter_ue) { */
-	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE; */
-	/* 	ctx->ras_counter_ue = ras_counter; */
-	/* } */
-
-	/* /\*query ce count*\/ */
-	/* ras_counter = amdgpu_ras_query_error_count(adev, true); */
-	/* if (ras_counter != ctx->ras_counter_ce) { */
-	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE; */
-	/* 	ctx->ras_counter_ce = ras_counter; */
-	/* } */
+	if (con) {
+		int ce_count, ue_count;
+
+		ce_count = atomic_read(&con->ras_ce_count);
+		ue_count = atomic_read(&con->ras_ue_count);
+
+		if (ce_count != ctx->ras_counter_ce) {
+			out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
+			ctx->ras_counter_ce = ce_count;
+		}
+
+		if (ue_count != ctx->ras_counter_ue) {
+			out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
+			ctx->ras_counter_ue = ue_count;
+		}
+	}
 
 	mutex_unlock(&mgr->lock);
 	return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 877469d288f8..641c374b8525 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -183,6 +183,7 @@ struct amdgpu_mgpu_info mgpu_info = {
 int amdgpu_ras_enable = -1;
 uint amdgpu_ras_mask = 0xffffffff;
 int amdgpu_bad_page_threshold = -1;
+uint amdgpu_ras_thread_poll_ms = 3000;
 struct amdgpu_watchdog_timer amdgpu_watchdog_timer = {
 	.timeout_fatal_disable = false,
 	.period = 0x0, /* default to 0x0 (timeout disable) */
@@ -534,6 +535,15 @@ module_param_named(emu_mode, amdgpu_emu_mode, int, 0444);
 MODULE_PARM_DESC(ras_enable, "Enable RAS features on the GPU (0 = disable, 1 = enable, -1 = auto (default))");
 module_param_named(ras_enable, amdgpu_ras_enable, int, 0444);
 
+/**
+ * DOC: ras_thread_poll (uint)
+ * Number of milliseconds between RAS poll for errors.
+ * Valid range 0 and [500,10000]. Set to 0 to disable the thread.
+ * Default: 3000.
+ */
+MODULE_PARM_DESC(ras_thread_poll, "Number of milliseconds between RAS poll for errors. Valid range 0 and [500,10000]. Set to 0 to disable the thread. Default: 3000.");
+module_param_named(ras_thread_poll, amdgpu_ras_thread_poll_ms, uint, 0444);
+
 /**
  * DOC: ras_mask (uint)
  * Mask of RAS features to enable (default 0xffffffff), only valid when ras_enable == 1
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index b1c57a5b6e89..30bec289e9ad 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1043,15 +1043,17 @@ int amdgpu_ras_error_inject(struct amdgpu_device *adev,
 }
 
 /* get the total error counts on all IPs */
-unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
-		bool is_ce)
+static void amdgpu_ras_query_error_count(struct amdgpu_device *adev)
 {
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 	struct ras_manager *obj;
-	struct ras_err_data data = {0, 0};
+	int ce_count, ue_count;
 
 	if (!adev->ras_enabled || !con)
-		return 0;
+		return;
+
+	ce_count = 0;
+	ue_count = 0;
 
 	list_for_each_entry(obj, &con->head, node) {
 		struct ras_query_if info = {
@@ -1059,13 +1061,14 @@ unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
 		};
 
 		if (amdgpu_ras_query_error_status(adev, &info))
-			return 0;
+			return;
 
-		data.ce_count += info.ce_count;
-		data.ue_count += info.ue_count;
+		ce_count += info.ce_count;
+		ue_count += info.ue_count;
 	}
 
-	return is_ce ? data.ce_count : data.ue_count;
+	atomic_set(&con->ras_ce_count, ce_count);
+	atomic_set(&con->ras_ue_count, ue_count);
 }
 /* query/inject/cure end */
 
@@ -2109,6 +2112,49 @@ static void amdgpu_ras_check_supported(struct amdgpu_device *adev)
 		adev->ras_hw_enabled & amdgpu_ras_mask;
 }
 
+static int amdgpu_ras_thread(void *data)
+{
+	struct amdgpu_device *adev = data;
+	struct amdgpu_ras    *con  = amdgpu_ras_get_context(adev);
+
+	con->ras_thread_poll_ms = amdgpu_ras_thread_poll_ms;
+	if (con->ras_thread_poll_ms == 0) {
+		atomic_set(&con->ras_ce_count, 0);
+		atomic_set(&con->ras_ue_count, 0);
+		return 0;
+	} else if (con->ras_thread_poll_ms < 500 ||
+		   con->ras_thread_poll_ms > 10000) {
+		con->ras_thread_poll_ms = 3000;
+	}
+
+	while (1) {
+		if (kthread_should_stop())
+			break;
+		if (kthread_should_park())
+			kthread_parkme();
+
+		amdgpu_ras_query_error_count(adev);
+		msleep_interruptible(con->ras_thread_poll_ms);
+	}
+
+	return 0;
+}
+
+static int amdgpu_ras_thread_start(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct task_struct *kt;
+
+	kt = kthread_run(amdgpu_ras_thread, adev,
+			 "amdras:%s", pci_name(adev->pdev));
+	if (IS_ERR(kt))
+		return PTR_ERR(kt);
+
+	con->ras_thread = kt;
+
+	return 0;
+}
+
 int amdgpu_ras_init(struct amdgpu_device *adev)
 {
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
@@ -2182,6 +2228,10 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
 		goto release_con;
 	}
 
+	r = amdgpu_ras_thread_start(adev);
+	if (r)
+		goto release_con;
+
 	dev_info(adev->dev, "RAS INFO: ras initialized successfully, "
 		 "hardware ability[%x] ras_mask[%x]\n",
 		 adev->ras_hw_enabled, adev->ras_enabled);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 201fbdee1d09..fb9e4c7ab054 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -340,6 +340,11 @@ struct amdgpu_ras {
 
 	/* disable ras error count harvest in recovery */
 	bool disable_ras_err_cnt_harvest;
+
+	struct task_struct *ras_thread;
+	unsigned int        ras_thread_poll_ms;
+	atomic_t            ras_ue_count;
+	atomic_t            ras_ce_count;
 };
 
 struct ras_fs_data {
@@ -485,9 +490,6 @@ int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev,
 void amdgpu_ras_resume(struct amdgpu_device *adev);
 void amdgpu_ras_suspend(struct amdgpu_device *adev);
 
-unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
-		bool is_ce);
-
 /* error handling functions */
 int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
 		struct eeprom_table_record *bps, int pages);
-- 
2.31.1.527.g2d677e5b15

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* RE: [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors
  2021-05-12 17:03 ` Luben Tuikov
@ 2021-05-12 19:52   ` Deucher, Alexander
  -1 siblings, 0 replies; 20+ messages in thread
From: Deucher, Alexander @ 2021-05-12 19:52 UTC (permalink / raw)
  To: Tuikov, Luben, amd-gfx; +Cc: stable

[AMD Public Use]

> -----Original Message-----
> From: Tuikov, Luben <Luben.Tuikov@amd.com>
> Sent: Wednesday, May 12, 2021 1:03 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Tuikov, Luben <Luben.Tuikov@amd.com>; Deucher, Alexander
> <Alexander.Deucher@amd.com>; stable@vger.kernel.org
> Subject: [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors
> 
> On QUERY2 IOCTL don't query counts of correctable and uncorrectable
> errors, since when RAS is enabled and supported on Vega20 server boards,
> this takes insurmountably long time, in O(n^3), which slows the system down
> to the point of it being unusable when we have GUI up.
> 
> Fixes: ae363a212b14 ("drm/amdgpu: Add a new flag to
> AMDGPU_CTX_OP_QUERY_STATE2")
> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 26 ++++++++++++-----------
> --
>  1 file changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index 01fe60fedcbe..d481a33f4eaf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -363,19 +363,19 @@ static int amdgpu_ctx_query2(struct
> amdgpu_device *adev,
>  		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
> 
>  	/*query ue count*/
> -	ras_counter = amdgpu_ras_query_error_count(adev, false);
> -	/*ras counter is monotonic increasing*/
> -	if (ras_counter != ctx->ras_counter_ue) {
> -		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
> -		ctx->ras_counter_ue = ras_counter;
> -	}
> -
> -	/*query ce count*/
> -	ras_counter = amdgpu_ras_query_error_count(adev, true);
> -	if (ras_counter != ctx->ras_counter_ce) {
> -		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
> -		ctx->ras_counter_ce = ras_counter;
> -	}
> +	/* ras_counter = amdgpu_ras_query_error_count(adev, false); */
> +	/* /\*ras counter is monotonic increasing*\/ */
> +	/* if (ras_counter != ctx->ras_counter_ue) { */
> +	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
> */
> +	/* 	ctx->ras_counter_ue = ras_counter; */
> +	/* } */
> +
> +	/* /\*query ce count*\/ */
> +	/* ras_counter = amdgpu_ras_query_error_count(adev, true); */
> +	/* if (ras_counter != ctx->ras_counter_ce) { */
> +	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
> */
> +	/* 	ctx->ras_counter_ce = ras_counter; */
> +	/* } */
> 

Rather than commenting this out, just drop it in patch 1, and then re-add this in patch 2.

Alex

>  	mutex_unlock(&mgr->lock);
>  	return 0;
> --
> 2.31.1.527.g2d677e5b15

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors
@ 2021-05-12 19:52   ` Deucher, Alexander
  0 siblings, 0 replies; 20+ messages in thread
From: Deucher, Alexander @ 2021-05-12 19:52 UTC (permalink / raw)
  To: Tuikov, Luben, amd-gfx; +Cc: stable

[AMD Public Use]

> -----Original Message-----
> From: Tuikov, Luben <Luben.Tuikov@amd.com>
> Sent: Wednesday, May 12, 2021 1:03 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Tuikov, Luben <Luben.Tuikov@amd.com>; Deucher, Alexander
> <Alexander.Deucher@amd.com>; stable@vger.kernel.org
> Subject: [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors
> 
> On QUERY2 IOCTL don't query counts of correctable and uncorrectable
> errors, since when RAS is enabled and supported on Vega20 server boards,
> this takes insurmountably long time, in O(n^3), which slows the system down
> to the point of it being unusable when we have GUI up.
> 
> Fixes: ae363a212b14 ("drm/amdgpu: Add a new flag to
> AMDGPU_CTX_OP_QUERY_STATE2")
> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 26 ++++++++++++-----------
> --
>  1 file changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index 01fe60fedcbe..d481a33f4eaf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -363,19 +363,19 @@ static int amdgpu_ctx_query2(struct
> amdgpu_device *adev,
>  		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
> 
>  	/*query ue count*/
> -	ras_counter = amdgpu_ras_query_error_count(adev, false);
> -	/*ras counter is monotonic increasing*/
> -	if (ras_counter != ctx->ras_counter_ue) {
> -		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
> -		ctx->ras_counter_ue = ras_counter;
> -	}
> -
> -	/*query ce count*/
> -	ras_counter = amdgpu_ras_query_error_count(adev, true);
> -	if (ras_counter != ctx->ras_counter_ce) {
> -		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
> -		ctx->ras_counter_ce = ras_counter;
> -	}
> +	/* ras_counter = amdgpu_ras_query_error_count(adev, false); */
> +	/* /\*ras counter is monotonic increasing*\/ */
> +	/* if (ras_counter != ctx->ras_counter_ue) { */
> +	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
> */
> +	/* 	ctx->ras_counter_ue = ras_counter; */
> +	/* } */
> +
> +	/* /\*query ce count*\/ */
> +	/* ras_counter = amdgpu_ras_query_error_count(adev, true); */
> +	/* if (ras_counter != ctx->ras_counter_ce) { */
> +	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
> */
> +	/* 	ctx->ras_counter_ce = ras_counter; */
> +	/* } */
> 

Rather than commenting this out, just drop it in patch 1, and then re-add this in patch 2.

Alex

>  	mutex_unlock(&mgr->lock);
>  	return 0;
> --
> 2.31.1.527.g2d677e5b15
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors
  2021-05-12 19:52   ` Deucher, Alexander
@ 2021-05-13  5:32     ` Luben Tuikov
  -1 siblings, 0 replies; 20+ messages in thread
From: Luben Tuikov @ 2021-05-13  5:32 UTC (permalink / raw)
  To: amd-gfx; +Cc: Luben Tuikov, Alexander Deucher, stable

On QUERY2 IOCTL don't query counts of correctable
and uncorrectable errors, since when RAS is
enabled and supported on Vega20 server boards,
this takes insurmountably long time, in O(n^3),
which slows the system down to the point of it
being unusable when we have GUI up.

Fixes: ae363a212b14 ("drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2")
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 01fe60fedcbe..e1557020c49d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -337,7 +337,6 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
 {
 	struct amdgpu_ctx *ctx;
 	struct amdgpu_ctx_mgr *mgr;
-	unsigned long ras_counter;
 
 	if (!fpriv)
 		return -EINVAL;
@@ -362,21 +361,6 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
 	if (atomic_read(&ctx->guilty))
 		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
 
-	/*query ue count*/
-	ras_counter = amdgpu_ras_query_error_count(adev, false);
-	/*ras counter is monotonic increasing*/
-	if (ras_counter != ctx->ras_counter_ue) {
-		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
-		ctx->ras_counter_ue = ras_counter;
-	}
-
-	/*query ce count*/
-	ras_counter = amdgpu_ras_query_error_count(adev, true);
-	if (ras_counter != ctx->ras_counter_ce) {
-		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
-		ctx->ras_counter_ce = ras_counter;
-	}
-
 	mutex_unlock(&mgr->lock);
 	return 0;
 }
-- 
2.31.1.527.g2d677e5b15


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors
@ 2021-05-13  5:32     ` Luben Tuikov
  0 siblings, 0 replies; 20+ messages in thread
From: Luben Tuikov @ 2021-05-13  5:32 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander Deucher, Luben Tuikov, stable

On QUERY2 IOCTL don't query counts of correctable
and uncorrectable errors, since when RAS is
enabled and supported on Vega20 server boards,
this takes insurmountably long time, in O(n^3),
which slows the system down to the point of it
being unusable when we have GUI up.

Fixes: ae363a212b14 ("drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2")
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 01fe60fedcbe..e1557020c49d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -337,7 +337,6 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
 {
 	struct amdgpu_ctx *ctx;
 	struct amdgpu_ctx_mgr *mgr;
-	unsigned long ras_counter;
 
 	if (!fpriv)
 		return -EINVAL;
@@ -362,21 +361,6 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
 	if (atomic_read(&ctx->guilty))
 		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
 
-	/*query ue count*/
-	ras_counter = amdgpu_ras_query_error_count(adev, false);
-	/*ras counter is monotonic increasing*/
-	if (ras_counter != ctx->ras_counter_ue) {
-		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
-		ctx->ras_counter_ue = ras_counter;
-	}
-
-	/*query ce count*/
-	ras_counter = amdgpu_ras_query_error_count(adev, true);
-	if (ras_counter != ctx->ras_counter_ce) {
-		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
-		ctx->ras_counter_ce = ras_counter;
-	}
-
 	mutex_unlock(&mgr->lock);
 	return 0;
 }
-- 
2.31.1.527.g2d677e5b15

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/2] drm/amdgpu: Poll of RAS errors asynchronously
  2021-05-13  5:32     ` Luben Tuikov
  (?)
@ 2021-05-13  5:32     ` Luben Tuikov
  -1 siblings, 0 replies; 20+ messages in thread
From: Luben Tuikov @ 2021-05-13  5:32 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander Deucher, Luben Tuikov, John Clements, Hawking Zhang

When using Vega 20 with RAS support and RAS is
enabled, the system interactivity is extremely
slow, to the point of being unusable. After
debugging, it was determined that this is due to
the polling loop performed for
AMDGPU_CTX_OP_QUERY_STATE2 under
amdgpu_ctx_ioctl(), which seems to be executed on
every ioctl from X/Mesa.

The latter seems to call amdgpu_ctx_query2() which
calls amdgpu_ras_query_error_count() twice, once
for each of ue_count and ce_count. This is
unnecessarily wasteful since
amdgpu_ras_query_error_count() calculates both,
but with the current interface it returns one or
the other, depending on its Boolean input, when it
can in fact return both, in a single invocation,
if it had a better interface.

Further down, the query_ras_error_count() callback
is called, which could be quite a large polling
loop, and very time consuming. For instance,
gfx_v9_0_query_ras_error_count() is at least
O(n^3). A similar situation is seen with
umc_v6_1_query_ras_error_count().

This commit implements asynchronous RAS error
count polling to that of the ioctl. A kernel
thread polls the RAS error counters once in a
while. The ioctl reads the values
asynchronously. The poll frequency is a module
parameter, with range [500, 10000] milliseconds,
default 3000.

v2: Enable setting the polling interval to 0,
    which disables the thread.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 22 ++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 10 ++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 66 ++++++++++++++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  8 +--
 5 files changed, 94 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 3147c1c935c8..a269f778194f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -197,6 +197,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
 extern int amdgpu_ras_enable;
 extern uint amdgpu_ras_mask;
 extern int amdgpu_bad_page_threshold;
+extern uint amdgpu_ras_thread_poll_ms;
 extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
 extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index e1557020c49d..558e887e99b6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -332,9 +332,10 @@ static int amdgpu_ctx_query(struct amdgpu_device *adev,
 }
 
 static int amdgpu_ctx_query2(struct amdgpu_device *adev,
-	struct amdgpu_fpriv *fpriv, uint32_t id,
-	union drm_amdgpu_ctx_out *out)
+			     struct amdgpu_fpriv *fpriv, uint32_t id,
+			     union drm_amdgpu_ctx_out *out)
 {
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 	struct amdgpu_ctx *ctx;
 	struct amdgpu_ctx_mgr *mgr;
 
@@ -361,6 +362,23 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
 	if (atomic_read(&ctx->guilty))
 		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
 
+	if (con) {
+		int ce_count, ue_count;
+
+		ce_count = atomic_read(&con->ras_ce_count);
+		ue_count = atomic_read(&con->ras_ue_count);
+
+		if (ce_count != ctx->ras_counter_ce) {
+			out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
+			ctx->ras_counter_ce = ce_count;
+		}
+
+		if (ue_count != ctx->ras_counter_ue) {
+			out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
+			ctx->ras_counter_ue = ue_count;
+		}
+	}
+
 	mutex_unlock(&mgr->lock);
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 877469d288f8..641c374b8525 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -183,6 +183,7 @@ struct amdgpu_mgpu_info mgpu_info = {
 int amdgpu_ras_enable = -1;
 uint amdgpu_ras_mask = 0xffffffff;
 int amdgpu_bad_page_threshold = -1;
+uint amdgpu_ras_thread_poll_ms = 3000;
 struct amdgpu_watchdog_timer amdgpu_watchdog_timer = {
 	.timeout_fatal_disable = false,
 	.period = 0x0, /* default to 0x0 (timeout disable) */
@@ -534,6 +535,15 @@ module_param_named(emu_mode, amdgpu_emu_mode, int, 0444);
 MODULE_PARM_DESC(ras_enable, "Enable RAS features on the GPU (0 = disable, 1 = enable, -1 = auto (default))");
 module_param_named(ras_enable, amdgpu_ras_enable, int, 0444);
 
+/**
+ * DOC: ras_thread_poll (uint)
+ * Number of milliseconds between RAS poll for errors.
+ * Valid range 0 and [500,10000]. Set to 0 to disable the thread.
+ * Default: 3000.
+ */
+MODULE_PARM_DESC(ras_thread_poll, "Number of milliseconds between RAS poll for errors. Valid range 0 and [500,10000]. Set to 0 to disable the thread. Default: 3000.");
+module_param_named(ras_thread_poll, amdgpu_ras_thread_poll_ms, uint, 0444);
+
 /**
  * DOC: ras_mask (uint)
  * Mask of RAS features to enable (default 0xffffffff), only valid when ras_enable == 1
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index b1c57a5b6e89..30bec289e9ad 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1043,15 +1043,17 @@ int amdgpu_ras_error_inject(struct amdgpu_device *adev,
 }
 
 /* get the total error counts on all IPs */
-unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
-		bool is_ce)
+static void amdgpu_ras_query_error_count(struct amdgpu_device *adev)
 {
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 	struct ras_manager *obj;
-	struct ras_err_data data = {0, 0};
+	int ce_count, ue_count;
 
 	if (!adev->ras_enabled || !con)
-		return 0;
+		return;
+
+	ce_count = 0;
+	ue_count = 0;
 
 	list_for_each_entry(obj, &con->head, node) {
 		struct ras_query_if info = {
@@ -1059,13 +1061,14 @@ unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
 		};
 
 		if (amdgpu_ras_query_error_status(adev, &info))
-			return 0;
+			return;
 
-		data.ce_count += info.ce_count;
-		data.ue_count += info.ue_count;
+		ce_count += info.ce_count;
+		ue_count += info.ue_count;
 	}
 
-	return is_ce ? data.ce_count : data.ue_count;
+	atomic_set(&con->ras_ce_count, ce_count);
+	atomic_set(&con->ras_ue_count, ue_count);
 }
 /* query/inject/cure end */
 
@@ -2109,6 +2112,49 @@ static void amdgpu_ras_check_supported(struct amdgpu_device *adev)
 		adev->ras_hw_enabled & amdgpu_ras_mask;
 }
 
+static int amdgpu_ras_thread(void *data)
+{
+	struct amdgpu_device *adev = data;
+	struct amdgpu_ras    *con  = amdgpu_ras_get_context(adev);
+
+	con->ras_thread_poll_ms = amdgpu_ras_thread_poll_ms;
+	if (con->ras_thread_poll_ms == 0) {
+		atomic_set(&con->ras_ce_count, 0);
+		atomic_set(&con->ras_ue_count, 0);
+		return 0;
+	} else if (con->ras_thread_poll_ms < 500 ||
+		   con->ras_thread_poll_ms > 10000) {
+		con->ras_thread_poll_ms = 3000;
+	}
+
+	while (1) {
+		if (kthread_should_stop())
+			break;
+		if (kthread_should_park())
+			kthread_parkme();
+
+		amdgpu_ras_query_error_count(adev);
+		msleep_interruptible(con->ras_thread_poll_ms);
+	}
+
+	return 0;
+}
+
+static int amdgpu_ras_thread_start(struct amdgpu_device *adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct task_struct *kt;
+
+	kt = kthread_run(amdgpu_ras_thread, adev,
+			 "amdras:%s", pci_name(adev->pdev));
+	if (IS_ERR(kt))
+		return PTR_ERR(kt);
+
+	con->ras_thread = kt;
+
+	return 0;
+}
+
 int amdgpu_ras_init(struct amdgpu_device *adev)
 {
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
@@ -2182,6 +2228,10 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
 		goto release_con;
 	}
 
+	r = amdgpu_ras_thread_start(adev);
+	if (r)
+		goto release_con;
+
 	dev_info(adev->dev, "RAS INFO: ras initialized successfully, "
 		 "hardware ability[%x] ras_mask[%x]\n",
 		 adev->ras_hw_enabled, adev->ras_enabled);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 201fbdee1d09..fb9e4c7ab054 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -340,6 +340,11 @@ struct amdgpu_ras {
 
 	/* disable ras error count harvest in recovery */
 	bool disable_ras_err_cnt_harvest;
+
+	struct task_struct *ras_thread;
+	unsigned int        ras_thread_poll_ms;
+	atomic_t            ras_ue_count;
+	atomic_t            ras_ce_count;
 };
 
 struct ras_fs_data {
@@ -485,9 +490,6 @@ int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev,
 void amdgpu_ras_resume(struct amdgpu_device *adev);
 void amdgpu_ras_suspend(struct amdgpu_device *adev);
 
-unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
-		bool is_ce);
-
 /* error handling functions */
 int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
 		struct eeprom_table_record *bps, int pages);
-- 
2.31.1.527.g2d677e5b15

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors
  2021-05-12 17:03 ` Luben Tuikov
@ 2021-05-13  7:56   ` Christian König
  -1 siblings, 0 replies; 20+ messages in thread
From: Christian König @ 2021-05-13  7:56 UTC (permalink / raw)
  To: Luben Tuikov, amd-gfx; +Cc: Alexander Deucher, stable



Am 12.05.21 um 19:03 schrieb Luben Tuikov:
> On QUERY2 IOCTL don't query counts of correctable
> and uncorrectable errors, since when RAS is
> enabled and supported on Vega20 server boards,
> this takes insurmountably long time, in O(n^3),
> which slows the system down to the point of it
> being unusable when we have GUI up.
>
> Fixes: ae363a212b14 ("drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2")
> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 26 ++++++++++++-------------
>   1 file changed, 13 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index 01fe60fedcbe..d481a33f4eaf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -363,19 +363,19 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>   		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
>   
>   	/*query ue count*/
> -	ras_counter = amdgpu_ras_query_error_count(adev, false);
> -	/*ras counter is monotonic increasing*/
> -	if (ras_counter != ctx->ras_counter_ue) {
> -		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
> -		ctx->ras_counter_ue = ras_counter;
> -	}
> -
> -	/*query ce count*/
> -	ras_counter = amdgpu_ras_query_error_count(adev, true);
> -	if (ras_counter != ctx->ras_counter_ce) {
> -		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
> -		ctx->ras_counter_ce = ras_counter;
> -	}
> +	/* ras_counter = amdgpu_ras_query_error_count(adev, false); */
> +	/* /\*ras counter is monotonic increasing*\/ */
> +	/* if (ras_counter != ctx->ras_counter_ue) { */
> +	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE; */
> +	/* 	ctx->ras_counter_ue = ras_counter; */
> +	/* } */
> +
> +	/* /\*query ce count*\/ */
> +	/* ras_counter = amdgpu_ras_query_error_count(adev, true); */
> +	/* if (ras_counter != ctx->ras_counter_ce) { */
> +	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE; */
> +	/* 	ctx->ras_counter_ce = ras_counter; */
> +	/* } */

Please completely drop the code. We usually don't keep commented out 
code in the driver.

With that done the patch is Reviewed-by: Christian König 
<christian.koenig@amd.com>

Christian.

>   
>   	mutex_unlock(&mgr->lock);
>   	return 0;


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors
@ 2021-05-13  7:56   ` Christian König
  0 siblings, 0 replies; 20+ messages in thread
From: Christian König @ 2021-05-13  7:56 UTC (permalink / raw)
  To: Luben Tuikov, amd-gfx; +Cc: Alexander Deucher, stable



Am 12.05.21 um 19:03 schrieb Luben Tuikov:
> On QUERY2 IOCTL don't query counts of correctable
> and uncorrectable errors, since when RAS is
> enabled and supported on Vega20 server boards,
> this takes insurmountably long time, in O(n^3),
> which slows the system down to the point of it
> being unusable when we have GUI up.
>
> Fixes: ae363a212b14 ("drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2")
> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 26 ++++++++++++-------------
>   1 file changed, 13 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index 01fe60fedcbe..d481a33f4eaf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -363,19 +363,19 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>   		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
>   
>   	/*query ue count*/
> -	ras_counter = amdgpu_ras_query_error_count(adev, false);
> -	/*ras counter is monotonic increasing*/
> -	if (ras_counter != ctx->ras_counter_ue) {
> -		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
> -		ctx->ras_counter_ue = ras_counter;
> -	}
> -
> -	/*query ce count*/
> -	ras_counter = amdgpu_ras_query_error_count(adev, true);
> -	if (ras_counter != ctx->ras_counter_ce) {
> -		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
> -		ctx->ras_counter_ce = ras_counter;
> -	}
> +	/* ras_counter = amdgpu_ras_query_error_count(adev, false); */
> +	/* /\*ras counter is monotonic increasing*\/ */
> +	/* if (ras_counter != ctx->ras_counter_ue) { */
> +	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE; */
> +	/* 	ctx->ras_counter_ue = ras_counter; */
> +	/* } */
> +
> +	/* /\*query ce count*\/ */
> +	/* ras_counter = amdgpu_ras_query_error_count(adev, true); */
> +	/* if (ras_counter != ctx->ras_counter_ce) { */
> +	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE; */
> +	/* 	ctx->ras_counter_ce = ras_counter; */
> +	/* } */

Please completely drop the code. We usually don't keep commented out 
code in the driver.

With that done the patch is Reviewed-by: Christian König 
<christian.koenig@amd.com>

Christian.

>   
>   	mutex_unlock(&mgr->lock);
>   	return 0;

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: Poll of RAS errors asynchronously
  2021-05-12 17:03 ` [PATCH 2/2] drm/amdgpu: Poll of RAS errors asynchronously Luben Tuikov
@ 2021-05-13  8:00   ` Christian König
  2021-05-13 19:34     ` Luben Tuikov
  0 siblings, 1 reply; 20+ messages in thread
From: Christian König @ 2021-05-13  8:00 UTC (permalink / raw)
  To: Luben Tuikov, amd-gfx; +Cc: Alexander Deucher, John Clements, Hawking Zhang

Am 12.05.21 um 19:03 schrieb Luben Tuikov:
> When using Vega 20 with RAS support and RAS is
> enabled, the system interactivity is extremely
> slow, to the point of being unusable. After
> debugging, it was determined that this is due to
> the polling loop performed for
> AMDGPU_CTX_OP_QUERY_STATE2 under
> amdgpu_ctx_ioctl(), which seems to be executed on
> every ioctl from X/Mesa.
>
> The latter seems to call amdgpu_ctx_query2() which
> calls amdgpu_ras_query_error_count() twice, once
> for each of ue_count and ce_count. This is
> unnecessarily wasteful since
> amdgpu_ras_query_error_count() calculates both,
> but with the current interface it returns one or
> the other, depending on its Boolean input, when it
> can in fact return both, in a single invocation,
> if it had a better interface.
>
> Further down, the query_ras_error_count() callback
> is called, which could be quite a large polling
> loop, and very time consuming. For instance,
> gfx_v9_0_query_ras_error_count() is at least
> O(n^3). A similar situation is seen with
> umc_v6_1_query_ras_error_count().
>
> This commit implements asynchronous RAS error
> count polling to that of the ioctl. A kernel
> thread polls the RAS error counters once in a
> while. The ioctl reads the values
> asynchronously. The poll frequency is a module
> parameter, with range [500, 10000] milliseconds,
> default 3000.
>
> v2: Enable setting the polling interval to 0,
>      which disables the thread.

Please drop the module parameter, we already have way to many module 
parameters and that doesn't add much value.

Then I would really prefer to implement this as a delayed work item instead.

If you call schedule_delayed_work() from the amdgpu_ctx_query2() the 
work item will only be started every X jiffies when the function is 
actually used.

Regards,
Christian.

>
> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
> Cc: John Clements <john.clements@amd.com>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
> Reviewed-by: John Clements <john.clements@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 36 +++++++-------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 10 ++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 66 ++++++++++++++++++++++---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  8 +--
>   5 files changed, 93 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 3147c1c935c8..a269f778194f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -197,6 +197,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
>   extern int amdgpu_ras_enable;
>   extern uint amdgpu_ras_mask;
>   extern int amdgpu_bad_page_threshold;
> +extern uint amdgpu_ras_thread_poll_ms;
>   extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
>   extern int amdgpu_async_gfx_ring;
>   extern int amdgpu_mcbp;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index d481a33f4eaf..558e887e99b6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -332,12 +332,12 @@ static int amdgpu_ctx_query(struct amdgpu_device *adev,
>   }
>   
>   static int amdgpu_ctx_query2(struct amdgpu_device *adev,
> -	struct amdgpu_fpriv *fpriv, uint32_t id,
> -	union drm_amdgpu_ctx_out *out)
> +			     struct amdgpu_fpriv *fpriv, uint32_t id,
> +			     union drm_amdgpu_ctx_out *out)
>   {
> +	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>   	struct amdgpu_ctx *ctx;
>   	struct amdgpu_ctx_mgr *mgr;
> -	unsigned long ras_counter;
>   
>   	if (!fpriv)
>   		return -EINVAL;
> @@ -362,20 +362,22 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>   	if (atomic_read(&ctx->guilty))
>   		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
>   
> -	/*query ue count*/
> -	/* ras_counter = amdgpu_ras_query_error_count(adev, false); */
> -	/* /\*ras counter is monotonic increasing*\/ */
> -	/* if (ras_counter != ctx->ras_counter_ue) { */
> -	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE; */
> -	/* 	ctx->ras_counter_ue = ras_counter; */
> -	/* } */
> -
> -	/* /\*query ce count*\/ */
> -	/* ras_counter = amdgpu_ras_query_error_count(adev, true); */
> -	/* if (ras_counter != ctx->ras_counter_ce) { */
> -	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE; */
> -	/* 	ctx->ras_counter_ce = ras_counter; */
> -	/* } */
> +	if (con) {
> +		int ce_count, ue_count;
> +
> +		ce_count = atomic_read(&con->ras_ce_count);
> +		ue_count = atomic_read(&con->ras_ue_count);
> +
> +		if (ce_count != ctx->ras_counter_ce) {
> +			out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
> +			ctx->ras_counter_ce = ce_count;
> +		}
> +
> +		if (ue_count != ctx->ras_counter_ue) {
> +			out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
> +			ctx->ras_counter_ue = ue_count;
> +		}
> +	}
>   
>   	mutex_unlock(&mgr->lock);
>   	return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 877469d288f8..641c374b8525 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -183,6 +183,7 @@ struct amdgpu_mgpu_info mgpu_info = {
>   int amdgpu_ras_enable = -1;
>   uint amdgpu_ras_mask = 0xffffffff;
>   int amdgpu_bad_page_threshold = -1;
> +uint amdgpu_ras_thread_poll_ms = 3000;
>   struct amdgpu_watchdog_timer amdgpu_watchdog_timer = {
>   	.timeout_fatal_disable = false,
>   	.period = 0x0, /* default to 0x0 (timeout disable) */
> @@ -534,6 +535,15 @@ module_param_named(emu_mode, amdgpu_emu_mode, int, 0444);
>   MODULE_PARM_DESC(ras_enable, "Enable RAS features on the GPU (0 = disable, 1 = enable, -1 = auto (default))");
>   module_param_named(ras_enable, amdgpu_ras_enable, int, 0444);
>   
> +/**
> + * DOC: ras_thread_poll (uint)
> + * Number of milliseconds between RAS poll for errors.
> + * Valid range 0 and [500,10000]. Set to 0 to disable the thread.
> + * Default: 3000.
> + */
> +MODULE_PARM_DESC(ras_thread_poll, "Number of milliseconds between RAS poll for errors. Valid range 0 and [500,10000]. Set to 0 to disable the thread. Default: 3000.");
> +module_param_named(ras_thread_poll, amdgpu_ras_thread_poll_ms, uint, 0444);
> +
>   /**
>    * DOC: ras_mask (uint)
>    * Mask of RAS features to enable (default 0xffffffff), only valid when ras_enable == 1
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index b1c57a5b6e89..30bec289e9ad 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1043,15 +1043,17 @@ int amdgpu_ras_error_inject(struct amdgpu_device *adev,
>   }
>   
>   /* get the total error counts on all IPs */
> -unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
> -		bool is_ce)
> +static void amdgpu_ras_query_error_count(struct amdgpu_device *adev)
>   {
>   	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>   	struct ras_manager *obj;
> -	struct ras_err_data data = {0, 0};
> +	int ce_count, ue_count;
>   
>   	if (!adev->ras_enabled || !con)
> -		return 0;
> +		return;
> +
> +	ce_count = 0;
> +	ue_count = 0;
>   
>   	list_for_each_entry(obj, &con->head, node) {
>   		struct ras_query_if info = {
> @@ -1059,13 +1061,14 @@ unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
>   		};
>   
>   		if (amdgpu_ras_query_error_status(adev, &info))
> -			return 0;
> +			return;
>   
> -		data.ce_count += info.ce_count;
> -		data.ue_count += info.ue_count;
> +		ce_count += info.ce_count;
> +		ue_count += info.ue_count;
>   	}
>   
> -	return is_ce ? data.ce_count : data.ue_count;
> +	atomic_set(&con->ras_ce_count, ce_count);
> +	atomic_set(&con->ras_ue_count, ue_count);
>   }
>   /* query/inject/cure end */
>   
> @@ -2109,6 +2112,49 @@ static void amdgpu_ras_check_supported(struct amdgpu_device *adev)
>   		adev->ras_hw_enabled & amdgpu_ras_mask;
>   }
>   
> +static int amdgpu_ras_thread(void *data)
> +{
> +	struct amdgpu_device *adev = data;
> +	struct amdgpu_ras    *con  = amdgpu_ras_get_context(adev);
> +
> +	con->ras_thread_poll_ms = amdgpu_ras_thread_poll_ms;
> +	if (con->ras_thread_poll_ms == 0) {
> +		atomic_set(&con->ras_ce_count, 0);
> +		atomic_set(&con->ras_ue_count, 0);
> +		return 0;
> +	} else if (con->ras_thread_poll_ms < 500 ||
> +		   con->ras_thread_poll_ms > 10000) {
> +		con->ras_thread_poll_ms = 3000;
> +	}
> +
> +	while (1) {
> +		if (kthread_should_stop())
> +			break;
> +		if (kthread_should_park())
> +			kthread_parkme();
> +
> +		amdgpu_ras_query_error_count(adev);
> +		msleep_interruptible(con->ras_thread_poll_ms);
> +	}
> +
> +	return 0;
> +}
> +
> +static int amdgpu_ras_thread_start(struct amdgpu_device *adev)
> +{
> +	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +	struct task_struct *kt;
> +
> +	kt = kthread_run(amdgpu_ras_thread, adev,
> +			 "amdras:%s", pci_name(adev->pdev));
> +	if (IS_ERR(kt))
> +		return PTR_ERR(kt);
> +
> +	con->ras_thread = kt;
> +
> +	return 0;
> +}
> +
>   int amdgpu_ras_init(struct amdgpu_device *adev)
>   {
>   	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> @@ -2182,6 +2228,10 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
>   		goto release_con;
>   	}
>   
> +	r = amdgpu_ras_thread_start(adev);
> +	if (r)
> +		goto release_con;
> +
>   	dev_info(adev->dev, "RAS INFO: ras initialized successfully, "
>   		 "hardware ability[%x] ras_mask[%x]\n",
>   		 adev->ras_hw_enabled, adev->ras_enabled);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 201fbdee1d09..fb9e4c7ab054 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -340,6 +340,11 @@ struct amdgpu_ras {
>   
>   	/* disable ras error count harvest in recovery */
>   	bool disable_ras_err_cnt_harvest;
> +
> +	struct task_struct *ras_thread;
> +	unsigned int        ras_thread_poll_ms;
> +	atomic_t            ras_ue_count;
> +	atomic_t            ras_ce_count;
>   };
>   
>   struct ras_fs_data {
> @@ -485,9 +490,6 @@ int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev,
>   void amdgpu_ras_resume(struct amdgpu_device *adev);
>   void amdgpu_ras_suspend(struct amdgpu_device *adev);
>   
> -unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
> -		bool is_ce);
> -
>   /* error handling functions */
>   int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>   		struct eeprom_table_record *bps, int pages);

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: Poll of RAS errors asynchronously
  2021-05-13  8:00   ` Christian König
@ 2021-05-13 19:34     ` Luben Tuikov
  2021-05-13 19:45       ` Alex Deucher
  2021-05-14  7:40       ` Christian König
  0 siblings, 2 replies; 20+ messages in thread
From: Luben Tuikov @ 2021-05-13 19:34 UTC (permalink / raw)
  To: Christian König, amd-gfx
  Cc: Alexander Deucher, John Clements, Hawking Zhang

On 2021-05-13 4:00 a.m., Christian König wrote:
> Am 12.05.21 um 19:03 schrieb Luben Tuikov:
>> When using Vega 20 with RAS support and RAS is
>> enabled, the system interactivity is extremely
>> slow, to the point of being unusable. After
>> debugging, it was determined that this is due to
>> the polling loop performed for
>> AMDGPU_CTX_OP_QUERY_STATE2 under
>> amdgpu_ctx_ioctl(), which seems to be executed on
>> every ioctl from X/Mesa.
>>
>> The latter seems to call amdgpu_ctx_query2() which
>> calls amdgpu_ras_query_error_count() twice, once
>> for each of ue_count and ce_count. This is
>> unnecessarily wasteful since
>> amdgpu_ras_query_error_count() calculates both,
>> but with the current interface it returns one or
>> the other, depending on its Boolean input, when it
>> can in fact return both, in a single invocation,
>> if it had a better interface.
>>
>> Further down, the query_ras_error_count() callback
>> is called, which could be quite a large polling
>> loop, and very time consuming. For instance,
>> gfx_v9_0_query_ras_error_count() is at least
>> O(n^3). A similar situation is seen with
>> umc_v6_1_query_ras_error_count().
>>
>> This commit implements asynchronous RAS error
>> count polling to that of the ioctl. A kernel
>> thread polls the RAS error counters once in a
>> while. The ioctl reads the values
>> asynchronously. The poll frequency is a module
>> parameter, with range [500, 10000] milliseconds,
>> default 3000.
>>
>> v2: Enable setting the polling interval to 0,
>>      which disables the thread.
> Please drop the module parameter, we already have way to many module 
> parameters and that doesn't add much value.

No problem.

So three seconds then?

>
> Then I would really prefer to implement this as a delayed work item instead.
>
> If you call schedule_delayed_work() from the amdgpu_ctx_query2() the 
> work item will only be started every X jiffies when the function is 
> actually used.

Yes, you mentioned this in our meeting and I did reply to this then:

I'd rather keep it a thread, because it shows intention, it shows that
we've committed that this is work which we need done, periodically
and that it is something we want to do--using a thread makes it
visible, known, intentional.

I don't mind using "delayed work", for "work" items which are one-off,
or something intermittent, non-regular type of work, non-periodic type of work.

Regards,
Luben

>
> Regards,
> Christian.
>
>> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
>> Cc: John Clements <john.clements@amd.com>
>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
>> Reviewed-by: John Clements <john.clements@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 36 +++++++-------
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 10 ++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 66 ++++++++++++++++++++++---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  8 +--
>>   5 files changed, 93 insertions(+), 28 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 3147c1c935c8..a269f778194f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -197,6 +197,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
>>   extern int amdgpu_ras_enable;
>>   extern uint amdgpu_ras_mask;
>>   extern int amdgpu_bad_page_threshold;
>> +extern uint amdgpu_ras_thread_poll_ms;
>>   extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
>>   extern int amdgpu_async_gfx_ring;
>>   extern int amdgpu_mcbp;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> index d481a33f4eaf..558e887e99b6 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> @@ -332,12 +332,12 @@ static int amdgpu_ctx_query(struct amdgpu_device *adev,
>>   }
>>   
>>   static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>> -	struct amdgpu_fpriv *fpriv, uint32_t id,
>> -	union drm_amdgpu_ctx_out *out)
>> +			     struct amdgpu_fpriv *fpriv, uint32_t id,
>> +			     union drm_amdgpu_ctx_out *out)
>>   {
>> +	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>   	struct amdgpu_ctx *ctx;
>>   	struct amdgpu_ctx_mgr *mgr;
>> -	unsigned long ras_counter;
>>   
>>   	if (!fpriv)
>>   		return -EINVAL;
>> @@ -362,20 +362,22 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>>   	if (atomic_read(&ctx->guilty))
>>   		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
>>   
>> -	/*query ue count*/
>> -	/* ras_counter = amdgpu_ras_query_error_count(adev, false); */
>> -	/* /\*ras counter is monotonic increasing*\/ */
>> -	/* if (ras_counter != ctx->ras_counter_ue) { */
>> -	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE; */
>> -	/* 	ctx->ras_counter_ue = ras_counter; */
>> -	/* } */
>> -
>> -	/* /\*query ce count*\/ */
>> -	/* ras_counter = amdgpu_ras_query_error_count(adev, true); */
>> -	/* if (ras_counter != ctx->ras_counter_ce) { */
>> -	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE; */
>> -	/* 	ctx->ras_counter_ce = ras_counter; */
>> -	/* } */
>> +	if (con) {
>> +		int ce_count, ue_count;
>> +
>> +		ce_count = atomic_read(&con->ras_ce_count);
>> +		ue_count = atomic_read(&con->ras_ue_count);
>> +
>> +		if (ce_count != ctx->ras_counter_ce) {
>> +			out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
>> +			ctx->ras_counter_ce = ce_count;
>> +		}
>> +
>> +		if (ue_count != ctx->ras_counter_ue) {
>> +			out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
>> +			ctx->ras_counter_ue = ue_count;
>> +		}
>> +	}
>>   
>>   	mutex_unlock(&mgr->lock);
>>   	return 0;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 877469d288f8..641c374b8525 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -183,6 +183,7 @@ struct amdgpu_mgpu_info mgpu_info = {
>>   int amdgpu_ras_enable = -1;
>>   uint amdgpu_ras_mask = 0xffffffff;
>>   int amdgpu_bad_page_threshold = -1;
>> +uint amdgpu_ras_thread_poll_ms = 3000;
>>   struct amdgpu_watchdog_timer amdgpu_watchdog_timer = {
>>   	.timeout_fatal_disable = false,
>>   	.period = 0x0, /* default to 0x0 (timeout disable) */
>> @@ -534,6 +535,15 @@ module_param_named(emu_mode, amdgpu_emu_mode, int, 0444);
>>   MODULE_PARM_DESC(ras_enable, "Enable RAS features on the GPU (0 = disable, 1 = enable, -1 = auto (default))");
>>   module_param_named(ras_enable, amdgpu_ras_enable, int, 0444);
>>   
>> +/**
>> + * DOC: ras_thread_poll (uint)
>> + * Number of milliseconds between RAS poll for errors.
>> + * Valid range 0 and [500,10000]. Set to 0 to disable the thread.
>> + * Default: 3000.
>> + */
>> +MODULE_PARM_DESC(ras_thread_poll, "Number of milliseconds between RAS poll for errors. Valid range 0 and [500,10000]. Set to 0 to disable the thread. Default: 3000.");
>> +module_param_named(ras_thread_poll, amdgpu_ras_thread_poll_ms, uint, 0444);
>> +
>>   /**
>>    * DOC: ras_mask (uint)
>>    * Mask of RAS features to enable (default 0xffffffff), only valid when ras_enable == 1
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> index b1c57a5b6e89..30bec289e9ad 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> @@ -1043,15 +1043,17 @@ int amdgpu_ras_error_inject(struct amdgpu_device *adev,
>>   }
>>   
>>   /* get the total error counts on all IPs */
>> -unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
>> -		bool is_ce)
>> +static void amdgpu_ras_query_error_count(struct amdgpu_device *adev)
>>   {
>>   	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>   	struct ras_manager *obj;
>> -	struct ras_err_data data = {0, 0};
>> +	int ce_count, ue_count;
>>   
>>   	if (!adev->ras_enabled || !con)
>> -		return 0;
>> +		return;
>> +
>> +	ce_count = 0;
>> +	ue_count = 0;
>>   
>>   	list_for_each_entry(obj, &con->head, node) {
>>   		struct ras_query_if info = {
>> @@ -1059,13 +1061,14 @@ unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
>>   		};
>>   
>>   		if (amdgpu_ras_query_error_status(adev, &info))
>> -			return 0;
>> +			return;
>>   
>> -		data.ce_count += info.ce_count;
>> -		data.ue_count += info.ue_count;
>> +		ce_count += info.ce_count;
>> +		ue_count += info.ue_count;
>>   	}
>>   
>> -	return is_ce ? data.ce_count : data.ue_count;
>> +	atomic_set(&con->ras_ce_count, ce_count);
>> +	atomic_set(&con->ras_ue_count, ue_count);
>>   }
>>   /* query/inject/cure end */
>>   
>> @@ -2109,6 +2112,49 @@ static void amdgpu_ras_check_supported(struct amdgpu_device *adev)
>>   		adev->ras_hw_enabled & amdgpu_ras_mask;
>>   }
>>   
>> +static int amdgpu_ras_thread(void *data)
>> +{
>> +	struct amdgpu_device *adev = data;
>> +	struct amdgpu_ras    *con  = amdgpu_ras_get_context(adev);
>> +
>> +	con->ras_thread_poll_ms = amdgpu_ras_thread_poll_ms;
>> +	if (con->ras_thread_poll_ms == 0) {
>> +		atomic_set(&con->ras_ce_count, 0);
>> +		atomic_set(&con->ras_ue_count, 0);
>> +		return 0;
>> +	} else if (con->ras_thread_poll_ms < 500 ||
>> +		   con->ras_thread_poll_ms > 10000) {
>> +		con->ras_thread_poll_ms = 3000;
>> +	}
>> +
>> +	while (1) {
>> +		if (kthread_should_stop())
>> +			break;
>> +		if (kthread_should_park())
>> +			kthread_parkme();
>> +
>> +		amdgpu_ras_query_error_count(adev);
>> +		msleep_interruptible(con->ras_thread_poll_ms);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int amdgpu_ras_thread_start(struct amdgpu_device *adev)
>> +{
>> +	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>> +	struct task_struct *kt;
>> +
>> +	kt = kthread_run(amdgpu_ras_thread, adev,
>> +			 "amdras:%s", pci_name(adev->pdev));
>> +	if (IS_ERR(kt))
>> +		return PTR_ERR(kt);
>> +
>> +	con->ras_thread = kt;
>> +
>> +	return 0;
>> +}
>> +
>>   int amdgpu_ras_init(struct amdgpu_device *adev)
>>   {
>>   	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>> @@ -2182,6 +2228,10 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
>>   		goto release_con;
>>   	}
>>   
>> +	r = amdgpu_ras_thread_start(adev);
>> +	if (r)
>> +		goto release_con;
>> +
>>   	dev_info(adev->dev, "RAS INFO: ras initialized successfully, "
>>   		 "hardware ability[%x] ras_mask[%x]\n",
>>   		 adev->ras_hw_enabled, adev->ras_enabled);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> index 201fbdee1d09..fb9e4c7ab054 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> @@ -340,6 +340,11 @@ struct amdgpu_ras {
>>   
>>   	/* disable ras error count harvest in recovery */
>>   	bool disable_ras_err_cnt_harvest;
>> +
>> +	struct task_struct *ras_thread;
>> +	unsigned int        ras_thread_poll_ms;
>> +	atomic_t            ras_ue_count;
>> +	atomic_t            ras_ce_count;
>>   };
>>   
>>   struct ras_fs_data {
>> @@ -485,9 +490,6 @@ int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev,
>>   void amdgpu_ras_resume(struct amdgpu_device *adev);
>>   void amdgpu_ras_suspend(struct amdgpu_device *adev);
>>   
>> -unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
>> -		bool is_ce);
>> -
>>   /* error handling functions */
>>   int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>>   		struct eeprom_table_record *bps, int pages);

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors
  2021-05-13  7:56   ` Christian König
@ 2021-05-13 19:37     ` Luben Tuikov
  -1 siblings, 0 replies; 20+ messages in thread
From: Luben Tuikov @ 2021-05-13 19:37 UTC (permalink / raw)
  To: Christian König, amd-gfx; +Cc: Alexander Deucher, stable

On 2021-05-13 3:56 a.m., Christian König wrote:
>
> Am 12.05.21 um 19:03 schrieb Luben Tuikov:
>> On QUERY2 IOCTL don't query counts of correctable
>> and uncorrectable errors, since when RAS is
>> enabled and supported on Vega20 server boards,
>> this takes insurmountably long time, in O(n^3),
>> which slows the system down to the point of it
>> being unusable when we have GUI up.
>>
>> Fixes: ae363a212b14 ("drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2")
>> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 26 ++++++++++++-------------
>>   1 file changed, 13 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> index 01fe60fedcbe..d481a33f4eaf 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> @@ -363,19 +363,19 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>>   		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
>>   
>>   	/*query ue count*/
>> -	ras_counter = amdgpu_ras_query_error_count(adev, false);
>> -	/*ras counter is monotonic increasing*/
>> -	if (ras_counter != ctx->ras_counter_ue) {
>> -		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
>> -		ctx->ras_counter_ue = ras_counter;
>> -	}
>> -
>> -	/*query ce count*/
>> -	ras_counter = amdgpu_ras_query_error_count(adev, true);
>> -	if (ras_counter != ctx->ras_counter_ce) {
>> -		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
>> -		ctx->ras_counter_ce = ras_counter;
>> -	}
>> +	/* ras_counter = amdgpu_ras_query_error_count(adev, false); */
>> +	/* /\*ras counter is monotonic increasing*\/ */
>> +	/* if (ras_counter != ctx->ras_counter_ue) { */
>> +	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE; */
>> +	/* 	ctx->ras_counter_ue = ras_counter; */
>> +	/* } */
>> +
>> +	/* /\*query ce count*\/ */
>> +	/* ras_counter = amdgpu_ras_query_error_count(adev, true); */
>> +	/* if (ras_counter != ctx->ras_counter_ce) { */
>> +	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE; */
>> +	/* 	ctx->ras_counter_ce = ras_counter; */
>> +	/* } */
> Please completely drop the code. We usually don't keep commented out 
> code in the driver.

1. Alex suggested this when we chatted--this is why it is commented.
2. He suggested the same thing last night and 2.5 hours before your email,
    I posted a patch in which the code is commented out--did you not see it?
    It's threaded, it appears above, 2.5 hours before your email.

Regards,
Luben

>
> With that done the patch is Reviewed-by: Christian König 
> <christian.koenig@amd.com>
>
> Christian.
>
>>   
>>   	mutex_unlock(&mgr->lock);
>>   	return 0;


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors
@ 2021-05-13 19:37     ` Luben Tuikov
  0 siblings, 0 replies; 20+ messages in thread
From: Luben Tuikov @ 2021-05-13 19:37 UTC (permalink / raw)
  To: Christian König, amd-gfx; +Cc: Alexander Deucher, stable

On 2021-05-13 3:56 a.m., Christian König wrote:
>
> Am 12.05.21 um 19:03 schrieb Luben Tuikov:
>> On QUERY2 IOCTL don't query counts of correctable
>> and uncorrectable errors, since when RAS is
>> enabled and supported on Vega20 server boards,
>> this takes insurmountably long time, in O(n^3),
>> which slows the system down to the point of it
>> being unusable when we have GUI up.
>>
>> Fixes: ae363a212b14 ("drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2")
>> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 26 ++++++++++++-------------
>>   1 file changed, 13 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> index 01fe60fedcbe..d481a33f4eaf 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> @@ -363,19 +363,19 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>>   		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
>>   
>>   	/*query ue count*/
>> -	ras_counter = amdgpu_ras_query_error_count(adev, false);
>> -	/*ras counter is monotonic increasing*/
>> -	if (ras_counter != ctx->ras_counter_ue) {
>> -		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
>> -		ctx->ras_counter_ue = ras_counter;
>> -	}
>> -
>> -	/*query ce count*/
>> -	ras_counter = amdgpu_ras_query_error_count(adev, true);
>> -	if (ras_counter != ctx->ras_counter_ce) {
>> -		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
>> -		ctx->ras_counter_ce = ras_counter;
>> -	}
>> +	/* ras_counter = amdgpu_ras_query_error_count(adev, false); */
>> +	/* /\*ras counter is monotonic increasing*\/ */
>> +	/* if (ras_counter != ctx->ras_counter_ue) { */
>> +	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE; */
>> +	/* 	ctx->ras_counter_ue = ras_counter; */
>> +	/* } */
>> +
>> +	/* /\*query ce count*\/ */
>> +	/* ras_counter = amdgpu_ras_query_error_count(adev, true); */
>> +	/* if (ras_counter != ctx->ras_counter_ce) { */
>> +	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE; */
>> +	/* 	ctx->ras_counter_ce = ras_counter; */
>> +	/* } */
> Please completely drop the code. We usually don't keep commented out 
> code in the driver.

1. Alex suggested this when we chatted--this is why it is commented.
2. He suggested the same thing last night and 2.5 hours before your email,
    I posted a patch in which the code is commented out--did you not see it?
    It's threaded, it appears above, 2.5 hours before your email.

Regards,
Luben

>
> With that done the patch is Reviewed-by: Christian König 
> <christian.koenig@amd.com>
>
> Christian.
>
>>   
>>   	mutex_unlock(&mgr->lock);
>>   	return 0;

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors
  2021-05-13  5:32     ` Luben Tuikov
@ 2021-05-13 19:41       ` Alex Deucher
  -1 siblings, 0 replies; 20+ messages in thread
From: Alex Deucher @ 2021-05-13 19:41 UTC (permalink / raw)
  To: Luben Tuikov; +Cc: amd-gfx list, Alexander Deucher, for 3.8

On Thu, May 13, 2021 at 1:32 AM Luben Tuikov <luben.tuikov@amd.com> wrote:
>
> On QUERY2 IOCTL don't query counts of correctable
> and uncorrectable errors, since when RAS is
> enabled and supported on Vega20 server boards,
> this takes insurmountably long time, in O(n^3),
> which slows the system down to the point of it
> being unusable when we have GUI up.
>
> Fixes: ae363a212b14 ("drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2")
> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 16 ----------------
>  1 file changed, 16 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index 01fe60fedcbe..e1557020c49d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -337,7 +337,6 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>  {
>         struct amdgpu_ctx *ctx;
>         struct amdgpu_ctx_mgr *mgr;
> -       unsigned long ras_counter;
>
>         if (!fpriv)
>                 return -EINVAL;
> @@ -362,21 +361,6 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>         if (atomic_read(&ctx->guilty))
>                 out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
>
> -       /*query ue count*/
> -       ras_counter = amdgpu_ras_query_error_count(adev, false);
> -       /*ras counter is monotonic increasing*/
> -       if (ras_counter != ctx->ras_counter_ue) {
> -               out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
> -               ctx->ras_counter_ue = ras_counter;
> -       }
> -
> -       /*query ce count*/
> -       ras_counter = amdgpu_ras_query_error_count(adev, true);
> -       if (ras_counter != ctx->ras_counter_ce) {
> -               out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
> -               ctx->ras_counter_ce = ras_counter;
> -       }
> -
>         mutex_unlock(&mgr->lock);
>         return 0;
>  }
> --
> 2.31.1.527.g2d677e5b15
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors
@ 2021-05-13 19:41       ` Alex Deucher
  0 siblings, 0 replies; 20+ messages in thread
From: Alex Deucher @ 2021-05-13 19:41 UTC (permalink / raw)
  To: Luben Tuikov; +Cc: Alexander Deucher, for 3.8, amd-gfx list

On Thu, May 13, 2021 at 1:32 AM Luben Tuikov <luben.tuikov@amd.com> wrote:
>
> On QUERY2 IOCTL don't query counts of correctable
> and uncorrectable errors, since when RAS is
> enabled and supported on Vega20 server boards,
> this takes insurmountably long time, in O(n^3),
> which slows the system down to the point of it
> being unusable when we have GUI up.
>
> Fixes: ae363a212b14 ("drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2")
> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 16 ----------------
>  1 file changed, 16 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index 01fe60fedcbe..e1557020c49d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -337,7 +337,6 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>  {
>         struct amdgpu_ctx *ctx;
>         struct amdgpu_ctx_mgr *mgr;
> -       unsigned long ras_counter;
>
>         if (!fpriv)
>                 return -EINVAL;
> @@ -362,21 +361,6 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>         if (atomic_read(&ctx->guilty))
>                 out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
>
> -       /*query ue count*/
> -       ras_counter = amdgpu_ras_query_error_count(adev, false);
> -       /*ras counter is monotonic increasing*/
> -       if (ras_counter != ctx->ras_counter_ue) {
> -               out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
> -               ctx->ras_counter_ue = ras_counter;
> -       }
> -
> -       /*query ce count*/
> -       ras_counter = amdgpu_ras_query_error_count(adev, true);
> -       if (ras_counter != ctx->ras_counter_ce) {
> -               out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
> -               ctx->ras_counter_ce = ras_counter;
> -       }
> -
>         mutex_unlock(&mgr->lock);
>         return 0;
>  }
> --
> 2.31.1.527.g2d677e5b15
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: Poll of RAS errors asynchronously
  2021-05-13 19:34     ` Luben Tuikov
@ 2021-05-13 19:45       ` Alex Deucher
  2021-05-14  7:40       ` Christian König
  1 sibling, 0 replies; 20+ messages in thread
From: Alex Deucher @ 2021-05-13 19:45 UTC (permalink / raw)
  To: Luben Tuikov
  Cc: Alexander Deucher, Christian König, John Clements,
	amd-gfx list, Hawking Zhang

On Thu, May 13, 2021 at 3:35 PM Luben Tuikov <luben.tuikov@amd.com> wrote:
>
> On 2021-05-13 4:00 a.m., Christian König wrote:
> > Am 12.05.21 um 19:03 schrieb Luben Tuikov:
> >> When using Vega 20 with RAS support and RAS is
> >> enabled, the system interactivity is extremely
> >> slow, to the point of being unusable. After
> >> debugging, it was determined that this is due to
> >> the polling loop performed for
> >> AMDGPU_CTX_OP_QUERY_STATE2 under
> >> amdgpu_ctx_ioctl(), which seems to be executed on
> >> every ioctl from X/Mesa.
> >>
> >> The latter seems to call amdgpu_ctx_query2() which
> >> calls amdgpu_ras_query_error_count() twice, once
> >> for each of ue_count and ce_count. This is
> >> unnecessarily wasteful since
> >> amdgpu_ras_query_error_count() calculates both,
> >> but with the current interface it returns one or
> >> the other, depending on its Boolean input, when it
> >> can in fact return both, in a single invocation,
> >> if it had a better interface.
> >>
> >> Further down, the query_ras_error_count() callback
> >> is called, which could be quite a large polling
> >> loop, and very time consuming. For instance,
> >> gfx_v9_0_query_ras_error_count() is at least
> >> O(n^3). A similar situation is seen with
> >> umc_v6_1_query_ras_error_count().
> >>
> >> This commit implements asynchronous RAS error
> >> count polling to that of the ioctl. A kernel
> >> thread polls the RAS error counters once in a
> >> while. The ioctl reads the values
> >> asynchronously. The poll frequency is a module
> >> parameter, with range [500, 10000] milliseconds,
> >> default 3000.
> >>
> >> v2: Enable setting the polling interval to 0,
> >>      which disables the thread.
> > Please drop the module parameter, we already have way to many module
> > parameters and that doesn't add much value.
>
> No problem.
>
> So three seconds then?
>
> >
> > Then I would really prefer to implement this as a delayed work item instead.
> >
> > If you call schedule_delayed_work() from the amdgpu_ctx_query2() the
> > work item will only be started every X jiffies when the function is
> > actually used.
>
> Yes, you mentioned this in our meeting and I did reply to this then:
>
> I'd rather keep it a thread, because it shows intention, it shows that
> we've committed that this is work which we need done, periodically
> and that it is something we want to do--using a thread makes it
> visible, known, intentional.
>
> I don't mind using "delayed work", for "work" items which are one-off,
> or something intermittent, non-regular type of work, non-periodic type of work.
>

Well the IOCTL is technically optional.  It depends on whether the UMD
that is using the driver wants this functionality or not.  radeonsi
just happens to use it since it supports the OGL robustness APIs.

Alex


> Regards,
> Luben
>
> >
> > Regards,
> > Christian.
> >
> >> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
> >> Cc: John Clements <john.clements@amd.com>
> >> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> >> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
> >> Reviewed-by: John Clements <john.clements@amd.com>
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 36 +++++++-------
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 10 ++++
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 66 ++++++++++++++++++++++---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  8 +--
> >>   5 files changed, 93 insertions(+), 28 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >> index 3147c1c935c8..a269f778194f 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >> @@ -197,6 +197,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
> >>   extern int amdgpu_ras_enable;
> >>   extern uint amdgpu_ras_mask;
> >>   extern int amdgpu_bad_page_threshold;
> >> +extern uint amdgpu_ras_thread_poll_ms;
> >>   extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
> >>   extern int amdgpu_async_gfx_ring;
> >>   extern int amdgpu_mcbp;
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> >> index d481a33f4eaf..558e887e99b6 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> >> @@ -332,12 +332,12 @@ static int amdgpu_ctx_query(struct amdgpu_device *adev,
> >>   }
> >>
> >>   static int amdgpu_ctx_query2(struct amdgpu_device *adev,
> >> -    struct amdgpu_fpriv *fpriv, uint32_t id,
> >> -    union drm_amdgpu_ctx_out *out)
> >> +                         struct amdgpu_fpriv *fpriv, uint32_t id,
> >> +                         union drm_amdgpu_ctx_out *out)
> >>   {
> >> +    struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >>      struct amdgpu_ctx *ctx;
> >>      struct amdgpu_ctx_mgr *mgr;
> >> -    unsigned long ras_counter;
> >>
> >>      if (!fpriv)
> >>              return -EINVAL;
> >> @@ -362,20 +362,22 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
> >>      if (atomic_read(&ctx->guilty))
> >>              out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
> >>
> >> -    /*query ue count*/
> >> -    /* ras_counter = amdgpu_ras_query_error_count(adev, false); */
> >> -    /* /\*ras counter is monotonic increasing*\/ */
> >> -    /* if (ras_counter != ctx->ras_counter_ue) { */
> >> -    /*      out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE; */
> >> -    /*      ctx->ras_counter_ue = ras_counter; */
> >> -    /* } */
> >> -
> >> -    /* /\*query ce count*\/ */
> >> -    /* ras_counter = amdgpu_ras_query_error_count(adev, true); */
> >> -    /* if (ras_counter != ctx->ras_counter_ce) { */
> >> -    /*      out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE; */
> >> -    /*      ctx->ras_counter_ce = ras_counter; */
> >> -    /* } */
> >> +    if (con) {
> >> +            int ce_count, ue_count;
> >> +
> >> +            ce_count = atomic_read(&con->ras_ce_count);
> >> +            ue_count = atomic_read(&con->ras_ue_count);
> >> +
> >> +            if (ce_count != ctx->ras_counter_ce) {
> >> +                    out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
> >> +                    ctx->ras_counter_ce = ce_count;
> >> +            }
> >> +
> >> +            if (ue_count != ctx->ras_counter_ue) {
> >> +                    out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
> >> +                    ctx->ras_counter_ue = ue_count;
> >> +            }
> >> +    }
> >>
> >>      mutex_unlock(&mgr->lock);
> >>      return 0;
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> index 877469d288f8..641c374b8525 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> @@ -183,6 +183,7 @@ struct amdgpu_mgpu_info mgpu_info = {
> >>   int amdgpu_ras_enable = -1;
> >>   uint amdgpu_ras_mask = 0xffffffff;
> >>   int amdgpu_bad_page_threshold = -1;
> >> +uint amdgpu_ras_thread_poll_ms = 3000;
> >>   struct amdgpu_watchdog_timer amdgpu_watchdog_timer = {
> >>      .timeout_fatal_disable = false,
> >>      .period = 0x0, /* default to 0x0 (timeout disable) */
> >> @@ -534,6 +535,15 @@ module_param_named(emu_mode, amdgpu_emu_mode, int, 0444);
> >>   MODULE_PARM_DESC(ras_enable, "Enable RAS features on the GPU (0 = disable, 1 = enable, -1 = auto (default))");
> >>   module_param_named(ras_enable, amdgpu_ras_enable, int, 0444);
> >>
> >> +/**
> >> + * DOC: ras_thread_poll (uint)
> >> + * Number of milliseconds between RAS poll for errors.
> >> + * Valid range 0 and [500,10000]. Set to 0 to disable the thread.
> >> + * Default: 3000.
> >> + */
> >> +MODULE_PARM_DESC(ras_thread_poll, "Number of milliseconds between RAS poll for errors. Valid range 0 and [500,10000]. Set to 0 to disable the thread. Default: 3000.");
> >> +module_param_named(ras_thread_poll, amdgpu_ras_thread_poll_ms, uint, 0444);
> >> +
> >>   /**
> >>    * DOC: ras_mask (uint)
> >>    * Mask of RAS features to enable (default 0xffffffff), only valid when ras_enable == 1
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> index b1c57a5b6e89..30bec289e9ad 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> @@ -1043,15 +1043,17 @@ int amdgpu_ras_error_inject(struct amdgpu_device *adev,
> >>   }
> >>
> >>   /* get the total error counts on all IPs */
> >> -unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
> >> -            bool is_ce)
> >> +static void amdgpu_ras_query_error_count(struct amdgpu_device *adev)
> >>   {
> >>      struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >>      struct ras_manager *obj;
> >> -    struct ras_err_data data = {0, 0};
> >> +    int ce_count, ue_count;
> >>
> >>      if (!adev->ras_enabled || !con)
> >> -            return 0;
> >> +            return;
> >> +
> >> +    ce_count = 0;
> >> +    ue_count = 0;
> >>
> >>      list_for_each_entry(obj, &con->head, node) {
> >>              struct ras_query_if info = {
> >> @@ -1059,13 +1061,14 @@ unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
> >>              };
> >>
> >>              if (amdgpu_ras_query_error_status(adev, &info))
> >> -                    return 0;
> >> +                    return;
> >>
> >> -            data.ce_count += info.ce_count;
> >> -            data.ue_count += info.ue_count;
> >> +            ce_count += info.ce_count;
> >> +            ue_count += info.ue_count;
> >>      }
> >>
> >> -    return is_ce ? data.ce_count : data.ue_count;
> >> +    atomic_set(&con->ras_ce_count, ce_count);
> >> +    atomic_set(&con->ras_ue_count, ue_count);
> >>   }
> >>   /* query/inject/cure end */
> >>
> >> @@ -2109,6 +2112,49 @@ static void amdgpu_ras_check_supported(struct amdgpu_device *adev)
> >>              adev->ras_hw_enabled & amdgpu_ras_mask;
> >>   }
> >>
> >> +static int amdgpu_ras_thread(void *data)
> >> +{
> >> +    struct amdgpu_device *adev = data;
> >> +    struct amdgpu_ras    *con  = amdgpu_ras_get_context(adev);
> >> +
> >> +    con->ras_thread_poll_ms = amdgpu_ras_thread_poll_ms;
> >> +    if (con->ras_thread_poll_ms == 0) {
> >> +            atomic_set(&con->ras_ce_count, 0);
> >> +            atomic_set(&con->ras_ue_count, 0);
> >> +            return 0;
> >> +    } else if (con->ras_thread_poll_ms < 500 ||
> >> +               con->ras_thread_poll_ms > 10000) {
> >> +            con->ras_thread_poll_ms = 3000;
> >> +    }
> >> +
> >> +    while (1) {
> >> +            if (kthread_should_stop())
> >> +                    break;
> >> +            if (kthread_should_park())
> >> +                    kthread_parkme();
> >> +
> >> +            amdgpu_ras_query_error_count(adev);
> >> +            msleep_interruptible(con->ras_thread_poll_ms);
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static int amdgpu_ras_thread_start(struct amdgpu_device *adev)
> >> +{
> >> +    struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >> +    struct task_struct *kt;
> >> +
> >> +    kt = kthread_run(amdgpu_ras_thread, adev,
> >> +                     "amdras:%s", pci_name(adev->pdev));
> >> +    if (IS_ERR(kt))
> >> +            return PTR_ERR(kt);
> >> +
> >> +    con->ras_thread = kt;
> >> +
> >> +    return 0;
> >> +}
> >> +
> >>   int amdgpu_ras_init(struct amdgpu_device *adev)
> >>   {
> >>      struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >> @@ -2182,6 +2228,10 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
> >>              goto release_con;
> >>      }
> >>
> >> +    r = amdgpu_ras_thread_start(adev);
> >> +    if (r)
> >> +            goto release_con;
> >> +
> >>      dev_info(adev->dev, "RAS INFO: ras initialized successfully, "
> >>               "hardware ability[%x] ras_mask[%x]\n",
> >>               adev->ras_hw_enabled, adev->ras_enabled);
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> index 201fbdee1d09..fb9e4c7ab054 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> >> @@ -340,6 +340,11 @@ struct amdgpu_ras {
> >>
> >>      /* disable ras error count harvest in recovery */
> >>      bool disable_ras_err_cnt_harvest;
> >> +
> >> +    struct task_struct *ras_thread;
> >> +    unsigned int        ras_thread_poll_ms;
> >> +    atomic_t            ras_ue_count;
> >> +    atomic_t            ras_ce_count;
> >>   };
> >>
> >>   struct ras_fs_data {
> >> @@ -485,9 +490,6 @@ int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev,
> >>   void amdgpu_ras_resume(struct amdgpu_device *adev);
> >>   void amdgpu_ras_suspend(struct amdgpu_device *adev);
> >>
> >> -unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
> >> -            bool is_ce);
> >> -
> >>   /* error handling functions */
> >>   int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
> >>              struct eeprom_table_record *bps, int pages);
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors
  2021-05-13 19:37     ` Luben Tuikov
@ 2021-05-14  7:31       ` Christian König
  -1 siblings, 0 replies; 20+ messages in thread
From: Christian König @ 2021-05-14  7:31 UTC (permalink / raw)
  To: Luben Tuikov, amd-gfx; +Cc: Alexander Deucher, stable



Am 13.05.21 um 21:37 schrieb Luben Tuikov:
> On 2021-05-13 3:56 a.m., Christian König wrote:
>> Am 12.05.21 um 19:03 schrieb Luben Tuikov:
>>> On QUERY2 IOCTL don't query counts of correctable
>>> and uncorrectable errors, since when RAS is
>>> enabled and supported on Vega20 server boards,
>>> this takes insurmountably long time, in O(n^3),
>>> which slows the system down to the point of it
>>> being unusable when we have GUI up.
>>>
>>> Fixes: ae363a212b14 ("drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2")
>>> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 26 ++++++++++++-------------
>>>    1 file changed, 13 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> index 01fe60fedcbe..d481a33f4eaf 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> @@ -363,19 +363,19 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>>>    		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
>>>    
>>>    	/*query ue count*/
>>> -	ras_counter = amdgpu_ras_query_error_count(adev, false);
>>> -	/*ras counter is monotonic increasing*/
>>> -	if (ras_counter != ctx->ras_counter_ue) {
>>> -		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
>>> -		ctx->ras_counter_ue = ras_counter;
>>> -	}
>>> -
>>> -	/*query ce count*/
>>> -	ras_counter = amdgpu_ras_query_error_count(adev, true);
>>> -	if (ras_counter != ctx->ras_counter_ce) {
>>> -		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
>>> -		ctx->ras_counter_ce = ras_counter;
>>> -	}
>>> +	/* ras_counter = amdgpu_ras_query_error_count(adev, false); */
>>> +	/* /\*ras counter is monotonic increasing*\/ */
>>> +	/* if (ras_counter != ctx->ras_counter_ue) { */
>>> +	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE; */
>>> +	/* 	ctx->ras_counter_ue = ras_counter; */
>>> +	/* } */
>>> +
>>> +	/* /\*query ce count*\/ */
>>> +	/* ras_counter = amdgpu_ras_query_error_count(adev, true); */
>>> +	/* if (ras_counter != ctx->ras_counter_ce) { */
>>> +	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE; */
>>> +	/* 	ctx->ras_counter_ce = ras_counter; */
>>> +	/* } */
>> Please completely drop the code. We usually don't keep commented out
>> code in the driver.
> 1. Alex suggested this when we chatted--this is why it is commented.

Sounds like a misunderstanding to me, usually it is Alex who insists on 
dropping the code.

> 2. He suggested the same thing last night and 2.5 hours before your email,
>      I posted a patch in which the code is commented out--did you not see it?
>      It's threaded, it appears above, 2.5 hours before your email.

Sorry for the redundancy, didn't had seen that in my inbox yet when I 
wrote the reply.

Regards,
Christian.

>
> Regards,
> Luben
>
>> With that done the patch is Reviewed-by: Christian König
>> <christian.koenig@amd.com>
>>
>> Christian.
>>
>>>    
>>>    	mutex_unlock(&mgr->lock);
>>>    	return 0;


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors
@ 2021-05-14  7:31       ` Christian König
  0 siblings, 0 replies; 20+ messages in thread
From: Christian König @ 2021-05-14  7:31 UTC (permalink / raw)
  To: Luben Tuikov, amd-gfx; +Cc: Alexander Deucher, stable



Am 13.05.21 um 21:37 schrieb Luben Tuikov:
> On 2021-05-13 3:56 a.m., Christian König wrote:
>> Am 12.05.21 um 19:03 schrieb Luben Tuikov:
>>> On QUERY2 IOCTL don't query counts of correctable
>>> and uncorrectable errors, since when RAS is
>>> enabled and supported on Vega20 server boards,
>>> this takes insurmountably long time, in O(n^3),
>>> which slows the system down to the point of it
>>> being unusable when we have GUI up.
>>>
>>> Fixes: ae363a212b14 ("drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2")
>>> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 26 ++++++++++++-------------
>>>    1 file changed, 13 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> index 01fe60fedcbe..d481a33f4eaf 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> @@ -363,19 +363,19 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>>>    		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
>>>    
>>>    	/*query ue count*/
>>> -	ras_counter = amdgpu_ras_query_error_count(adev, false);
>>> -	/*ras counter is monotonic increasing*/
>>> -	if (ras_counter != ctx->ras_counter_ue) {
>>> -		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
>>> -		ctx->ras_counter_ue = ras_counter;
>>> -	}
>>> -
>>> -	/*query ce count*/
>>> -	ras_counter = amdgpu_ras_query_error_count(adev, true);
>>> -	if (ras_counter != ctx->ras_counter_ce) {
>>> -		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
>>> -		ctx->ras_counter_ce = ras_counter;
>>> -	}
>>> +	/* ras_counter = amdgpu_ras_query_error_count(adev, false); */
>>> +	/* /\*ras counter is monotonic increasing*\/ */
>>> +	/* if (ras_counter != ctx->ras_counter_ue) { */
>>> +	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE; */
>>> +	/* 	ctx->ras_counter_ue = ras_counter; */
>>> +	/* } */
>>> +
>>> +	/* /\*query ce count*\/ */
>>> +	/* ras_counter = amdgpu_ras_query_error_count(adev, true); */
>>> +	/* if (ras_counter != ctx->ras_counter_ce) { */
>>> +	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE; */
>>> +	/* 	ctx->ras_counter_ce = ras_counter; */
>>> +	/* } */
>> Please completely drop the code. We usually don't keep commented out
>> code in the driver.
> 1. Alex suggested this when we chatted--this is why it is commented.

Sounds like a misunderstanding to me, usually it is Alex who insists on 
dropping the code.

> 2. He suggested the same thing last night and 2.5 hours before your email,
>      I posted a patch in which the code is commented out--did you not see it?
>      It's threaded, it appears above, 2.5 hours before your email.

Sorry for the redundancy, didn't had seen that in my inbox yet when I 
wrote the reply.

Regards,
Christian.

>
> Regards,
> Luben
>
>> With that done the patch is Reviewed-by: Christian König
>> <christian.koenig@amd.com>
>>
>> Christian.
>>
>>>    
>>>    	mutex_unlock(&mgr->lock);
>>>    	return 0;

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: Poll of RAS errors asynchronously
  2021-05-13 19:34     ` Luben Tuikov
  2021-05-13 19:45       ` Alex Deucher
@ 2021-05-14  7:40       ` Christian König
  1 sibling, 0 replies; 20+ messages in thread
From: Christian König @ 2021-05-14  7:40 UTC (permalink / raw)
  To: Luben Tuikov, amd-gfx; +Cc: Alexander Deucher, John Clements, Hawking Zhang

Am 13.05.21 um 21:34 schrieb Luben Tuikov:
> On 2021-05-13 4:00 a.m., Christian König wrote:
>> Am 12.05.21 um 19:03 schrieb Luben Tuikov:
>>> When using Vega 20 with RAS support and RAS is
>>> enabled, the system interactivity is extremely
>>> slow, to the point of being unusable. After
>>> debugging, it was determined that this is due to
>>> the polling loop performed for
>>> AMDGPU_CTX_OP_QUERY_STATE2 under
>>> amdgpu_ctx_ioctl(), which seems to be executed on
>>> every ioctl from X/Mesa.
>>>
>>> The latter seems to call amdgpu_ctx_query2() which
>>> calls amdgpu_ras_query_error_count() twice, once
>>> for each of ue_count and ce_count. This is
>>> unnecessarily wasteful since
>>> amdgpu_ras_query_error_count() calculates both,
>>> but with the current interface it returns one or
>>> the other, depending on its Boolean input, when it
>>> can in fact return both, in a single invocation,
>>> if it had a better interface.
>>>
>>> Further down, the query_ras_error_count() callback
>>> is called, which could be quite a large polling
>>> loop, and very time consuming. For instance,
>>> gfx_v9_0_query_ras_error_count() is at least
>>> O(n^3). A similar situation is seen with
>>> umc_v6_1_query_ras_error_count().
>>>
>>> This commit implements asynchronous RAS error
>>> count polling to that of the ioctl. A kernel
>>> thread polls the RAS error counters once in a
>>> while. The ioctl reads the values
>>> asynchronously. The poll frequency is a module
>>> parameter, with range [500, 10000] milliseconds,
>>> default 3000.
>>>
>>> v2: Enable setting the polling interval to 0,
>>>       which disables the thread.
>> Please drop the module parameter, we already have way to many module
>> parameters and that doesn't add much value.
> No problem.
>
> So three seconds then?

Yes, just make it a define somewhere.

>
>> Then I would really prefer to implement this as a delayed work item instead.
>>
>> If you call schedule_delayed_work() from the amdgpu_ctx_query2() the
>> work item will only be started every X jiffies when the function is
>> actually used.
> Yes, you mentioned this in our meeting and I did reply to this then:
>
> I'd rather keep it a thread, because it shows intention, it shows that
> we've committed that this is work which we need done, periodically
> and that it is something we want to do--using a thread makes it
> visible, known, intentional.

A kernel thread is always justified if you have some distinct background 
work you want to be able to measure the overhead for or if you have 
strong dependencies on cache hotness for your thread.

For example like kswapd or the scheduler threads, but I don't see that 
for the RAS query feature.

> I don't mind using "delayed work", for "work" items which are one-off,
> or something intermittent, non-regular type of work, non-periodic type of work.

Well that is exactly what I'm suggesting here.

When you make it a delayed work item it becomes a non-regular operation 
only run when userspace actually queries the value.

In other words by doing that you just make sure that the different user 
queries are aggregated and never cause a hardware access more often than 
the delay.

See the main use case for the hardware is datacenter and the dedicated 
thread would just be overhead in this scenario which always runs in the 
background and computes values which nobody uses.

Regards,
Christian.

>
> Regards,
> Luben
>
>> Regards,
>> Christian.
>>
>>> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
>>> Cc: John Clements <john.clements@amd.com>
>>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>>> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
>>> Reviewed-by: John Clements <john.clements@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 36 +++++++-------
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 10 ++++
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 66 ++++++++++++++++++++++---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  8 +--
>>>    5 files changed, 93 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index 3147c1c935c8..a269f778194f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -197,6 +197,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
>>>    extern int amdgpu_ras_enable;
>>>    extern uint amdgpu_ras_mask;
>>>    extern int amdgpu_bad_page_threshold;
>>> +extern uint amdgpu_ras_thread_poll_ms;
>>>    extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
>>>    extern int amdgpu_async_gfx_ring;
>>>    extern int amdgpu_mcbp;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> index d481a33f4eaf..558e887e99b6 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> @@ -332,12 +332,12 @@ static int amdgpu_ctx_query(struct amdgpu_device *adev,
>>>    }
>>>    
>>>    static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>>> -	struct amdgpu_fpriv *fpriv, uint32_t id,
>>> -	union drm_amdgpu_ctx_out *out)
>>> +			     struct amdgpu_fpriv *fpriv, uint32_t id,
>>> +			     union drm_amdgpu_ctx_out *out)
>>>    {
>>> +	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>>    	struct amdgpu_ctx *ctx;
>>>    	struct amdgpu_ctx_mgr *mgr;
>>> -	unsigned long ras_counter;
>>>    
>>>    	if (!fpriv)
>>>    		return -EINVAL;
>>> @@ -362,20 +362,22 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev,
>>>    	if (atomic_read(&ctx->guilty))
>>>    		out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;
>>>    
>>> -	/*query ue count*/
>>> -	/* ras_counter = amdgpu_ras_query_error_count(adev, false); */
>>> -	/* /\*ras counter is monotonic increasing*\/ */
>>> -	/* if (ras_counter != ctx->ras_counter_ue) { */
>>> -	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE; */
>>> -	/* 	ctx->ras_counter_ue = ras_counter; */
>>> -	/* } */
>>> -
>>> -	/* /\*query ce count*\/ */
>>> -	/* ras_counter = amdgpu_ras_query_error_count(adev, true); */
>>> -	/* if (ras_counter != ctx->ras_counter_ce) { */
>>> -	/* 	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE; */
>>> -	/* 	ctx->ras_counter_ce = ras_counter; */
>>> -	/* } */
>>> +	if (con) {
>>> +		int ce_count, ue_count;
>>> +
>>> +		ce_count = atomic_read(&con->ras_ce_count);
>>> +		ue_count = atomic_read(&con->ras_ue_count);
>>> +
>>> +		if (ce_count != ctx->ras_counter_ce) {
>>> +			out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE;
>>> +			ctx->ras_counter_ce = ce_count;
>>> +		}
>>> +
>>> +		if (ue_count != ctx->ras_counter_ue) {
>>> +			out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE;
>>> +			ctx->ras_counter_ue = ue_count;
>>> +		}
>>> +	}
>>>    
>>>    	mutex_unlock(&mgr->lock);
>>>    	return 0;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> index 877469d288f8..641c374b8525 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> @@ -183,6 +183,7 @@ struct amdgpu_mgpu_info mgpu_info = {
>>>    int amdgpu_ras_enable = -1;
>>>    uint amdgpu_ras_mask = 0xffffffff;
>>>    int amdgpu_bad_page_threshold = -1;
>>> +uint amdgpu_ras_thread_poll_ms = 3000;
>>>    struct amdgpu_watchdog_timer amdgpu_watchdog_timer = {
>>>    	.timeout_fatal_disable = false,
>>>    	.period = 0x0, /* default to 0x0 (timeout disable) */
>>> @@ -534,6 +535,15 @@ module_param_named(emu_mode, amdgpu_emu_mode, int, 0444);
>>>    MODULE_PARM_DESC(ras_enable, "Enable RAS features on the GPU (0 = disable, 1 = enable, -1 = auto (default))");
>>>    module_param_named(ras_enable, amdgpu_ras_enable, int, 0444);
>>>    
>>> +/**
>>> + * DOC: ras_thread_poll (uint)
>>> + * Number of milliseconds between RAS poll for errors.
>>> + * Valid range 0 and [500,10000]. Set to 0 to disable the thread.
>>> + * Default: 3000.
>>> + */
>>> +MODULE_PARM_DESC(ras_thread_poll, "Number of milliseconds between RAS poll for errors. Valid range 0 and [500,10000]. Set to 0 to disable the thread. Default: 3000.");
>>> +module_param_named(ras_thread_poll, amdgpu_ras_thread_poll_ms, uint, 0444);
>>> +
>>>    /**
>>>     * DOC: ras_mask (uint)
>>>     * Mask of RAS features to enable (default 0xffffffff), only valid when ras_enable == 1
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> index b1c57a5b6e89..30bec289e9ad 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> @@ -1043,15 +1043,17 @@ int amdgpu_ras_error_inject(struct amdgpu_device *adev,
>>>    }
>>>    
>>>    /* get the total error counts on all IPs */
>>> -unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
>>> -		bool is_ce)
>>> +static void amdgpu_ras_query_error_count(struct amdgpu_device *adev)
>>>    {
>>>    	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>>    	struct ras_manager *obj;
>>> -	struct ras_err_data data = {0, 0};
>>> +	int ce_count, ue_count;
>>>    
>>>    	if (!adev->ras_enabled || !con)
>>> -		return 0;
>>> +		return;
>>> +
>>> +	ce_count = 0;
>>> +	ue_count = 0;
>>>    
>>>    	list_for_each_entry(obj, &con->head, node) {
>>>    		struct ras_query_if info = {
>>> @@ -1059,13 +1061,14 @@ unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
>>>    		};
>>>    
>>>    		if (amdgpu_ras_query_error_status(adev, &info))
>>> -			return 0;
>>> +			return;
>>>    
>>> -		data.ce_count += info.ce_count;
>>> -		data.ue_count += info.ue_count;
>>> +		ce_count += info.ce_count;
>>> +		ue_count += info.ue_count;
>>>    	}
>>>    
>>> -	return is_ce ? data.ce_count : data.ue_count;
>>> +	atomic_set(&con->ras_ce_count, ce_count);
>>> +	atomic_set(&con->ras_ue_count, ue_count);
>>>    }
>>>    /* query/inject/cure end */
>>>    
>>> @@ -2109,6 +2112,49 @@ static void amdgpu_ras_check_supported(struct amdgpu_device *adev)
>>>    		adev->ras_hw_enabled & amdgpu_ras_mask;
>>>    }
>>>    
>>> +static int amdgpu_ras_thread(void *data)
>>> +{
>>> +	struct amdgpu_device *adev = data;
>>> +	struct amdgpu_ras    *con  = amdgpu_ras_get_context(adev);
>>> +
>>> +	con->ras_thread_poll_ms = amdgpu_ras_thread_poll_ms;
>>> +	if (con->ras_thread_poll_ms == 0) {
>>> +		atomic_set(&con->ras_ce_count, 0);
>>> +		atomic_set(&con->ras_ue_count, 0);
>>> +		return 0;
>>> +	} else if (con->ras_thread_poll_ms < 500 ||
>>> +		   con->ras_thread_poll_ms > 10000) {
>>> +		con->ras_thread_poll_ms = 3000;
>>> +	}
>>> +
>>> +	while (1) {
>>> +		if (kthread_should_stop())
>>> +			break;
>>> +		if (kthread_should_park())
>>> +			kthread_parkme();
>>> +
>>> +		amdgpu_ras_query_error_count(adev);
>>> +		msleep_interruptible(con->ras_thread_poll_ms);
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int amdgpu_ras_thread_start(struct amdgpu_device *adev)
>>> +{
>>> +	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>> +	struct task_struct *kt;
>>> +
>>> +	kt = kthread_run(amdgpu_ras_thread, adev,
>>> +			 "amdras:%s", pci_name(adev->pdev));
>>> +	if (IS_ERR(kt))
>>> +		return PTR_ERR(kt);
>>> +
>>> +	con->ras_thread = kt;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>>    int amdgpu_ras_init(struct amdgpu_device *adev)
>>>    {
>>>    	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>> @@ -2182,6 +2228,10 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
>>>    		goto release_con;
>>>    	}
>>>    
>>> +	r = amdgpu_ras_thread_start(adev);
>>> +	if (r)
>>> +		goto release_con;
>>> +
>>>    	dev_info(adev->dev, "RAS INFO: ras initialized successfully, "
>>>    		 "hardware ability[%x] ras_mask[%x]\n",
>>>    		 adev->ras_hw_enabled, adev->ras_enabled);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> index 201fbdee1d09..fb9e4c7ab054 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>>> @@ -340,6 +340,11 @@ struct amdgpu_ras {
>>>    
>>>    	/* disable ras error count harvest in recovery */
>>>    	bool disable_ras_err_cnt_harvest;
>>> +
>>> +	struct task_struct *ras_thread;
>>> +	unsigned int        ras_thread_poll_ms;
>>> +	atomic_t            ras_ue_count;
>>> +	atomic_t            ras_ce_count;
>>>    };
>>>    
>>>    struct ras_fs_data {
>>> @@ -485,9 +490,6 @@ int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev,
>>>    void amdgpu_ras_resume(struct amdgpu_device *adev);
>>>    void amdgpu_ras_suspend(struct amdgpu_device *adev);
>>>    
>>> -unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
>>> -		bool is_ce);
>>> -
>>>    /* error handling functions */
>>>    int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>>>    		struct eeprom_table_record *bps, int pages);

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-05-14  7:40 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-12 17:03 [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors Luben Tuikov
2021-05-12 17:03 ` Luben Tuikov
2021-05-12 17:03 ` [PATCH 2/2] drm/amdgpu: Poll of RAS errors asynchronously Luben Tuikov
2021-05-13  8:00   ` Christian König
2021-05-13 19:34     ` Luben Tuikov
2021-05-13 19:45       ` Alex Deucher
2021-05-14  7:40       ` Christian König
2021-05-12 19:52 ` [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors Deucher, Alexander
2021-05-12 19:52   ` Deucher, Alexander
2021-05-13  5:32   ` Luben Tuikov
2021-05-13  5:32     ` Luben Tuikov
2021-05-13  5:32     ` [PATCH 2/2] drm/amdgpu: Poll of RAS errors asynchronously Luben Tuikov
2021-05-13 19:41     ` [PATCH 1/2] drm/amdgpu: Don't query CE and UE errors Alex Deucher
2021-05-13 19:41       ` Alex Deucher
2021-05-13  7:56 ` Christian König
2021-05-13  7:56   ` Christian König
2021-05-13 19:37   ` Luben Tuikov
2021-05-13 19:37     ` Luben Tuikov
2021-05-14  7:31     ` Christian König
2021-05-14  7:31       ` Christian König

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.