linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] drm/amdgpu: initialize amdgpu_ras_query_error_count() error count parameters
@ 2021-07-02 19:52 trix
  2021-07-02 22:51 ` Luben Tuikov
  0 siblings, 1 reply; 5+ messages in thread
From: trix @ 2021-07-02 19:52 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, daniel,
	Hawking.Zhang, Dennis.Li, guchun.chen, luben.tuikov,
	Stanley.Yang, nirmoy.das
  Cc: amd-gfx, dri-devel, linux-kernel, Tom Rix

From: Tom Rix <trix@redhat.com>

Static analysis reports this problem
amdgpu_ras.c:2324:2: warning: 2nd function call argument is an
  uninitialized value
        atomic_set(&con->ras_ce_count, ce_count);
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ce_count is normally set by the earlier call to
amdgpu_ras_query_error_count().  But amdgpu_ras_query_error_count()
can return early without setting, leaving its error count parameters
in a garbage state.

Initialize the error count parameters earlier.

Fixes: a46751fbcde5 ("drm/amdgpu: Fix RAS function interface")
Signed-off-by: Tom Rix <trix@redhat.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 875874ea745ec..c80fa545aa2b8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1056,6 +1056,12 @@ void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
 	struct ras_manager *obj;
 	unsigned long ce, ue;
 
+	if (ce_count)
+		*ce_count = 0;
+
+	if (ue_count)
+		*ue_count = 0;
+
 	if (!adev->ras_enabled || !con)
 		return;
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/amdgpu: initialize amdgpu_ras_query_error_count() error count parameters
  2021-07-02 19:52 [PATCH] drm/amdgpu: initialize amdgpu_ras_query_error_count() error count parameters trix
@ 2021-07-02 22:51 ` Luben Tuikov
  2021-07-02 23:04   ` [PATCH] drm/amdgpu: Return error if no RAS Luben Tuikov
  0 siblings, 1 reply; 5+ messages in thread
From: Luben Tuikov @ 2021-07-02 22:51 UTC (permalink / raw)
  To: trix, alexander.deucher, christian.koenig, Xinhui.Pan, airlied,
	daniel, Hawking.Zhang, Dennis.Li, guchun.chen, Stanley.Yang,
	nirmoy.das
  Cc: amd-gfx, dri-devel, linux-kernel, Tuikov, Luben

That's a good find, but I'd rather functions have no side effects. I'll follow up with a patch which correctly fixes this.

Regards,
Luben

On 2021-07-02 3:52 p.m., trix@redhat.com wrote:
> From: Tom Rix <trix@redhat.com>
>
> Static analysis reports this problem
> amdgpu_ras.c:2324:2: warning: 2nd function call argument is an
>   uninitialized value
>         atomic_set(&con->ras_ce_count, ce_count);
>         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> ce_count is normally set by the earlier call to
> amdgpu_ras_query_error_count().  But amdgpu_ras_query_error_count()
> can return early without setting, leaving its error count parameters
> in a garbage state.
>
> Initialize the error count parameters earlier.
>
> Fixes: a46751fbcde5 ("drm/amdgpu: Fix RAS function interface")
> Signed-off-by: Tom Rix <trix@redhat.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 875874ea745ec..c80fa545aa2b8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1056,6 +1056,12 @@ void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
>  	struct ras_manager *obj;
>  	unsigned long ce, ue;
>  
> +	if (ce_count)
> +		*ce_count = 0;
> +
> +	if (ue_count)
> +		*ue_count = 0;
> +
>  	if (!adev->ras_enabled || !con)
>  		return;
>  


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] drm/amdgpu: Return error if no RAS
  2021-07-02 22:51 ` Luben Tuikov
@ 2021-07-02 23:04   ` Luben Tuikov
  2021-07-06 15:25     ` Alex Deucher
  0 siblings, 1 reply; 5+ messages in thread
From: Luben Tuikov @ 2021-07-02 23:04 UTC (permalink / raw)
  To: amd-gfx
  Cc: Christian König, Xinhui.Pan, airlied, Daniel Vetter,
	Hawking Zhang, Dennis.Li, guchun.chen, Stanley.Yang, nirmoy.das,
	dri-devel, Linux Kernel Mailing List, Luben Tuikov,
	Alexander Deucher, John Clements, Tom Rix

In amdgpu_ras_query_error_count() return an error
if the device doesn't support RAS. This prevents
that function from having to always set the values
of the integer pointers (if set), and thus
prevents function side effects--always to have to
set values of integers if integer pointers set,
regardless of whether RAS is supported or
not--with this change this side effect is
mitigated.

Also, if no pointers are set, don't count, since
we've no way of reporting the counts.

Also, give this function a kernel-doc.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Reported-by: Tom Rix <trix@redhat.com>
Fixes: a46751fbcde505 ("drm/amdgpu: Fix RAS function interface")
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 49 ++++++++++++++++++-------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  6 +--
 2 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index c6ae63893dbdb2..ed698b2be79023 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -813,7 +813,7 @@ static int amdgpu_ras_enable_all_features(struct amdgpu_device *adev,
 
 /* query/inject/cure begin */
 int amdgpu_ras_query_error_status(struct amdgpu_device *adev,
-	struct ras_query_if *info)
+				  struct ras_query_if *info)
 {
 	struct ras_manager *obj = amdgpu_ras_find_obj(adev, &info->head);
 	struct ras_err_data err_data = {0, 0, 0, NULL};
@@ -1047,17 +1047,32 @@ int amdgpu_ras_error_inject(struct amdgpu_device *adev,
 	return ret;
 }
 
-/* get the total error counts on all IPs */
-void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
-				  unsigned long *ce_count,
-				  unsigned long *ue_count)
+/**
+ * amdgpu_ras_query_error_count -- Get error counts of all IPs
+ * adev: pointer to AMD GPU device
+ * ce_count: pointer to an integer to be set to the count of correctible errors.
+ * ue_count: pointer to an integer to be set to the count of uncorrectible
+ * errors.
+ *
+ * If set, @ce_count or @ue_count, count and return the corresponding
+ * error counts in those integer pointers. Return 0 if the device
+ * supports RAS. Return -EINVAL if the device doesn't support RAS.
+ */
+int amdgpu_ras_query_error_count(struct amdgpu_device *adev,
+				 unsigned long *ce_count,
+				 unsigned long *ue_count)
 {
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 	struct ras_manager *obj;
 	unsigned long ce, ue;
 
 	if (!adev->ras_enabled || !con)
-		return;
+		return -EINVAL;
+
+	/* Don't count since no reporting.
+	 */
+	if (!ce_count && !ue_count)
+		return 0;
 
 	ce = 0;
 	ue = 0;
@@ -1065,9 +1080,11 @@ void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
 		struct ras_query_if info = {
 			.head = obj->head,
 		};
+		int res;
 
-		if (amdgpu_ras_query_error_status(adev, &info))
-			return;
+		res = amdgpu_ras_query_error_status(adev, &info);
+		if (res)
+			return res;
 
 		ce += info.ce_count;
 		ue += info.ue_count;
@@ -1078,6 +1095,8 @@ void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
 
 	if (ue_count)
 		*ue_count = ue;
+
+	return 0;
 }
 /* query/inject/cure end */
 
@@ -2145,9 +2164,10 @@ static void amdgpu_ras_counte_dw(struct work_struct *work)
 
 	/* Cache new values.
 	 */
-	amdgpu_ras_query_error_count(adev, &ce_count, &ue_count);
-	atomic_set(&con->ras_ce_count, ce_count);
-	atomic_set(&con->ras_ue_count, ue_count);
+	if (amdgpu_ras_query_error_count(adev, &ce_count, &ue_count) == 0) {
+		atomic_set(&con->ras_ce_count, ce_count);
+		atomic_set(&con->ras_ue_count, ue_count);
+	}
 
 	pm_runtime_mark_last_busy(dev->dev);
 Out:
@@ -2320,9 +2340,10 @@ int amdgpu_ras_late_init(struct amdgpu_device *adev,
 
 	/* Those are the cached values at init.
 	 */
-	amdgpu_ras_query_error_count(adev, &ce_count, &ue_count);
-	atomic_set(&con->ras_ce_count, ce_count);
-	atomic_set(&con->ras_ue_count, ue_count);
+	if (amdgpu_ras_query_error_count(adev, &ce_count, &ue_count) == 0) {
+		atomic_set(&con->ras_ce_count, ce_count);
+		atomic_set(&con->ras_ue_count, ue_count);
+	}
 
 	return 0;
 cleanup:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 283afd791db107..4d9c63f2f37718 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -491,9 +491,9 @@ int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev,
 void amdgpu_ras_resume(struct amdgpu_device *adev);
 void amdgpu_ras_suspend(struct amdgpu_device *adev);
 
-void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
-				  unsigned long *ce_count,
-				  unsigned long *ue_count);
+int amdgpu_ras_query_error_count(struct amdgpu_device *adev,
+				 unsigned long *ce_count,
+				 unsigned long *ue_count);
 
 /* error handling functions */
 int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/amdgpu: Return error if no RAS
  2021-07-02 23:04   ` [PATCH] drm/amdgpu: Return error if no RAS Luben Tuikov
@ 2021-07-06 15:25     ` Alex Deucher
  2021-07-06 15:32       ` Luben Tuikov
  0 siblings, 1 reply; 5+ messages in thread
From: Alex Deucher @ 2021-07-06 15:25 UTC (permalink / raw)
  To: Luben Tuikov
  Cc: amd-gfx list, Chen, Guchun, Dave Airlie, Daniel Vetter,
	xinhui pan, Linux Kernel Mailing List,
	Maling list - DRI developers, Nirmoy Das, Stanley.Yang, Tom Rix,
	Alexander Deucher, John Clements, Christian König,
	Dennis Li, Hawking Zhang

On Fri, Jul 2, 2021 at 7:05 PM Luben Tuikov <luben.tuikov@amd.com> wrote:
>
> In amdgpu_ras_query_error_count() return an error
> if the device doesn't support RAS. This prevents
> that function from having to always set the values
> of the integer pointers (if set), and thus
> prevents function side effects--always to have to
> set values of integers if integer pointers set,
> regardless of whether RAS is supported or
> not--with this change this side effect is
> mitigated.
>
> Also, if no pointers are set, don't count, since
> we've no way of reporting the counts.
>
> Also, give this function a kernel-doc.
>
> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
> Cc: John Clements <john.clements@amd.com>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Reported-by: Tom Rix <trix@redhat.com>
> Fixes: a46751fbcde505 ("drm/amdgpu: Fix RAS function interface")
> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 49 ++++++++++++++++++-------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  6 +--
>  2 files changed, 38 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index c6ae63893dbdb2..ed698b2be79023 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -813,7 +813,7 @@ static int amdgpu_ras_enable_all_features(struct amdgpu_device *adev,
>
>  /* query/inject/cure begin */
>  int amdgpu_ras_query_error_status(struct amdgpu_device *adev,
> -       struct ras_query_if *info)
> +                                 struct ras_query_if *info)
>  {
>         struct ras_manager *obj = amdgpu_ras_find_obj(adev, &info->head);
>         struct ras_err_data err_data = {0, 0, 0, NULL};
> @@ -1047,17 +1047,32 @@ int amdgpu_ras_error_inject(struct amdgpu_device *adev,
>         return ret;
>  }
>
> -/* get the total error counts on all IPs */
> -void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
> -                                 unsigned long *ce_count,
> -                                 unsigned long *ue_count)
> +/**
> + * amdgpu_ras_query_error_count -- Get error counts of all IPs
> + * adev: pointer to AMD GPU device
> + * ce_count: pointer to an integer to be set to the count of correctible errors.
> + * ue_count: pointer to an integer to be set to the count of uncorrectible
> + * errors.
> + *
> + * If set, @ce_count or @ue_count, count and return the corresponding
> + * error counts in those integer pointers. Return 0 if the device
> + * supports RAS. Return -EINVAL if the device doesn't support RAS.
> + */
> +int amdgpu_ras_query_error_count(struct amdgpu_device *adev,
> +                                unsigned long *ce_count,
> +                                unsigned long *ue_count)
>  {
>         struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>         struct ras_manager *obj;
>         unsigned long ce, ue;
>
>         if (!adev->ras_enabled || !con)
> -               return;
> +               return -EINVAL;

Maybe return -ENOTSUPP or -ENODEV here or something like that since
the device doesn't support RAS in that case?  Other than that, looks
good to me.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

Alex


> +
> +       /* Don't count since no reporting.
> +        */
> +       if (!ce_count && !ue_count)
> +               return 0;
>
>         ce = 0;
>         ue = 0;
> @@ -1065,9 +1080,11 @@ void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
>                 struct ras_query_if info = {
>                         .head = obj->head,
>                 };
> +               int res;
>
> -               if (amdgpu_ras_query_error_status(adev, &info))
> -                       return;
> +               res = amdgpu_ras_query_error_status(adev, &info);
> +               if (res)
> +                       return res;
>
>                 ce += info.ce_count;
>                 ue += info.ue_count;
> @@ -1078,6 +1095,8 @@ void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
>
>         if (ue_count)
>                 *ue_count = ue;
> +
> +       return 0;
>  }
>  /* query/inject/cure end */
>
> @@ -2145,9 +2164,10 @@ static void amdgpu_ras_counte_dw(struct work_struct *work)
>
>         /* Cache new values.
>          */
> -       amdgpu_ras_query_error_count(adev, &ce_count, &ue_count);
> -       atomic_set(&con->ras_ce_count, ce_count);
> -       atomic_set(&con->ras_ue_count, ue_count);
> +       if (amdgpu_ras_query_error_count(adev, &ce_count, &ue_count) == 0) {
> +               atomic_set(&con->ras_ce_count, ce_count);
> +               atomic_set(&con->ras_ue_count, ue_count);
> +       }
>
>         pm_runtime_mark_last_busy(dev->dev);
>  Out:
> @@ -2320,9 +2340,10 @@ int amdgpu_ras_late_init(struct amdgpu_device *adev,
>
>         /* Those are the cached values at init.
>          */
> -       amdgpu_ras_query_error_count(adev, &ce_count, &ue_count);
> -       atomic_set(&con->ras_ce_count, ce_count);
> -       atomic_set(&con->ras_ue_count, ue_count);
> +       if (amdgpu_ras_query_error_count(adev, &ce_count, &ue_count) == 0) {
> +               atomic_set(&con->ras_ce_count, ce_count);
> +               atomic_set(&con->ras_ue_count, ue_count);
> +       }
>
>         return 0;
>  cleanup:
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 283afd791db107..4d9c63f2f37718 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -491,9 +491,9 @@ int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev,
>  void amdgpu_ras_resume(struct amdgpu_device *adev);
>  void amdgpu_ras_suspend(struct amdgpu_device *adev);
>
> -void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
> -                                 unsigned long *ce_count,
> -                                 unsigned long *ue_count);
> +int amdgpu_ras_query_error_count(struct amdgpu_device *adev,
> +                                unsigned long *ce_count,
> +                                unsigned long *ue_count);
>
>  /* error handling functions */
>  int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
> --
> 2.32.0
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/amdgpu: Return error if no RAS
  2021-07-06 15:25     ` Alex Deucher
@ 2021-07-06 15:32       ` Luben Tuikov
  0 siblings, 0 replies; 5+ messages in thread
From: Luben Tuikov @ 2021-07-06 15:32 UTC (permalink / raw)
  To: Alex Deucher
  Cc: amd-gfx list, Chen, Guchun, Dave Airlie, Daniel Vetter,
	xinhui pan, Linux Kernel Mailing List,
	Maling list - DRI developers, Nirmoy Das, Stanley.Yang, Tom Rix,
	Alexander Deucher, John Clements, Christian König,
	Dennis Li, Hawking Zhang

On 2021-07-06 11:25 a.m., Alex Deucher wrote:
> On Fri, Jul 2, 2021 at 7:05 PM Luben Tuikov <luben.tuikov@amd.com> wrote:
>> In amdgpu_ras_query_error_count() return an error
>> if the device doesn't support RAS. This prevents
>> that function from having to always set the values
>> of the integer pointers (if set), and thus
>> prevents function side effects--always to have to
>> set values of integers if integer pointers set,
>> regardless of whether RAS is supported or
>> not--with this change this side effect is
>> mitigated.
>>
>> Also, if no pointers are set, don't count, since
>> we've no way of reporting the counts.
>>
>> Also, give this function a kernel-doc.
>>
>> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
>> Cc: John Clements <john.clements@amd.com>
>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>> Reported-by: Tom Rix <trix@redhat.com>
>> Fixes: a46751fbcde505 ("drm/amdgpu: Fix RAS function interface")
>> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 49 ++++++++++++++++++-------
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  6 +--
>>  2 files changed, 38 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> index c6ae63893dbdb2..ed698b2be79023 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> @@ -813,7 +813,7 @@ static int amdgpu_ras_enable_all_features(struct amdgpu_device *adev,
>>
>>  /* query/inject/cure begin */
>>  int amdgpu_ras_query_error_status(struct amdgpu_device *adev,
>> -       struct ras_query_if *info)
>> +                                 struct ras_query_if *info)
>>  {
>>         struct ras_manager *obj = amdgpu_ras_find_obj(adev, &info->head);
>>         struct ras_err_data err_data = {0, 0, 0, NULL};
>> @@ -1047,17 +1047,32 @@ int amdgpu_ras_error_inject(struct amdgpu_device *adev,
>>         return ret;
>>  }
>>
>> -/* get the total error counts on all IPs */
>> -void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
>> -                                 unsigned long *ce_count,
>> -                                 unsigned long *ue_count)
>> +/**
>> + * amdgpu_ras_query_error_count -- Get error counts of all IPs
>> + * adev: pointer to AMD GPU device
>> + * ce_count: pointer to an integer to be set to the count of correctible errors.
>> + * ue_count: pointer to an integer to be set to the count of uncorrectible
>> + * errors.
>> + *
>> + * If set, @ce_count or @ue_count, count and return the corresponding
>> + * error counts in those integer pointers. Return 0 if the device
>> + * supports RAS. Return -EINVAL if the device doesn't support RAS.
>> + */
>> +int amdgpu_ras_query_error_count(struct amdgpu_device *adev,
>> +                                unsigned long *ce_count,
>> +                                unsigned long *ue_count)
>>  {
>>         struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>         struct ras_manager *obj;
>>         unsigned long ce, ue;
>>
>>         if (!adev->ras_enabled || !con)
>> -               return;
>> +               return -EINVAL;
> Maybe return -ENOTSUPP or -ENODEV here or something like that since
> the device doesn't support RAS in that case?  Other than that, looks
> good to me.
> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

Okay, great. I thought about -ENODEV but didn't want to confuse the IOCTL/userspace that the device isn't there, so I'll then return -EOPNOTSUPP, add your R-B and push it.

Regards,
Luben

>
> Alex
>
>
>> +
>> +       /* Don't count since no reporting.
>> +        */
>> +       if (!ce_count && !ue_count)
>> +               return 0;
>>
>>         ce = 0;
>>         ue = 0;
>> @@ -1065,9 +1080,11 @@ void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
>>                 struct ras_query_if info = {
>>                         .head = obj->head,
>>                 };
>> +               int res;
>>
>> -               if (amdgpu_ras_query_error_status(adev, &info))
>> -                       return;
>> +               res = amdgpu_ras_query_error_status(adev, &info);
>> +               if (res)
>> +                       return res;
>>
>>                 ce += info.ce_count;
>>                 ue += info.ue_count;
>> @@ -1078,6 +1095,8 @@ void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
>>
>>         if (ue_count)
>>                 *ue_count = ue;
>> +
>> +       return 0;
>>  }
>>  /* query/inject/cure end */
>>
>> @@ -2145,9 +2164,10 @@ static void amdgpu_ras_counte_dw(struct work_struct *work)
>>
>>         /* Cache new values.
>>          */
>> -       amdgpu_ras_query_error_count(adev, &ce_count, &ue_count);
>> -       atomic_set(&con->ras_ce_count, ce_count);
>> -       atomic_set(&con->ras_ue_count, ue_count);
>> +       if (amdgpu_ras_query_error_count(adev, &ce_count, &ue_count) == 0) {
>> +               atomic_set(&con->ras_ce_count, ce_count);
>> +               atomic_set(&con->ras_ue_count, ue_count);
>> +       }
>>
>>         pm_runtime_mark_last_busy(dev->dev);
>>  Out:
>> @@ -2320,9 +2340,10 @@ int amdgpu_ras_late_init(struct amdgpu_device *adev,
>>
>>         /* Those are the cached values at init.
>>          */
>> -       amdgpu_ras_query_error_count(adev, &ce_count, &ue_count);
>> -       atomic_set(&con->ras_ce_count, ce_count);
>> -       atomic_set(&con->ras_ue_count, ue_count);
>> +       if (amdgpu_ras_query_error_count(adev, &ce_count, &ue_count) == 0) {
>> +               atomic_set(&con->ras_ce_count, ce_count);
>> +               atomic_set(&con->ras_ue_count, ue_count);
>> +       }
>>
>>         return 0;
>>  cleanup:
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> index 283afd791db107..4d9c63f2f37718 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> @@ -491,9 +491,9 @@ int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev,
>>  void amdgpu_ras_resume(struct amdgpu_device *adev);
>>  void amdgpu_ras_suspend(struct amdgpu_device *adev);
>>
>> -void amdgpu_ras_query_error_count(struct amdgpu_device *adev,
>> -                                 unsigned long *ce_count,
>> -                                 unsigned long *ue_count);
>> +int amdgpu_ras_query_error_count(struct amdgpu_device *adev,
>> +                                unsigned long *ce_count,
>> +                                unsigned long *ue_count);
>>
>>  /* error handling functions */
>>  int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>> --
>> 2.32.0
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=04%7C01%7Cluben.tuikov%40amd.com%7C9ad19b88bf0d4f471fed08d940926367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637611819760889211%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=ldf4psDK2bA9Bvln2yLQ0ycmrJk2KP82v5qfwhjovNo%3D&amp;reserved=0


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-07-06 15:32 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-02 19:52 [PATCH] drm/amdgpu: initialize amdgpu_ras_query_error_count() error count parameters trix
2021-07-02 22:51 ` Luben Tuikov
2021-07-02 23:04   ` [PATCH] drm/amdgpu: Return error if no RAS Luben Tuikov
2021-07-06 15:25     ` Alex Deucher
2021-07-06 15:32       ` Luben Tuikov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).