All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amdgpu: add support to SMU debug option
@ 2021-12-01  9:24 Lang Yu
  2021-12-01  9:29 ` Christian König
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Lang Yu @ 2021-12-01  9:24 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher, Lang Yu, Lijo Lazar, Huang Rui, Christian Koenig

To maintain system error state when SMU errors occurred,
which will aid in debugging SMU firmware issues, add SMU
debug option support.

It can be enabled or disabled via amdgpu_smu_debug
debugfs file. When enabled, it makes SMU errors fatal.
It is disabled by default.

== Command Guide ==

1, enable SMU debug option

 # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

2, disable SMU debug option

 # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

v3:
 - Use debugfs_create_bool().(Christian)
 - Put variable into smu_context struct.
 - Don't resend command when timeout.

v2:
 - Resend command when timeout.(Lijo)
 - Use debugfs file instead of module parameter.

Signed-off-by: Lang Yu <lang.yu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
 drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
 4 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 164d6a9e9fbb..86cd888c7822 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device *adev)
 	if (!debugfs_initialized())
 		return 0;
 
+	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
+				  &adev->smu.smu_debug_mode);
+
 	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
 				  &fops_ib_preempt);
 	if (IS_ERR(ent)) {
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index f738f7dc20c9..50dbf5594a9d 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -569,6 +569,11 @@ struct smu_context
 	struct smu_user_dpm_profile user_dpm_profile;
 
 	struct stb_context stb_context;
+	/*
+	 * When enabled, it makes SMU errors fatal.
+	 * (0 = disabled (default), 1 = enabled)
+	 */
+	bool smu_debug_mode;
 };
 
 struct i2c_adapter;
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index 6e781cee8bb6..d3797a2d6451 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
@@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct smu_context *smu)
 out:
 	mutex_unlock(&smu->message_lock);
 
+	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
+
 	return ret;
 }
 
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index 048ca1673863..9be005eb4241 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
 		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
 		goto Out;
 	}
+
 	__smu_cmn_send_msg(smu, (uint16_t) index, param);
 	reg = __smu_cmn_poll_stat(smu);
 	res = __smu_cmn_reg2errno(smu, reg);
-	if (res != 0)
+	if (res != 0) {
 		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
+		goto Out;
+	}
 	if (read_arg)
 		smu_cmn_read_arg(smu, read_arg);
 Out:
 	mutex_unlock(&smu->message_lock);
+
+	BUG_ON(unlikely(smu->smu_debug_mode) && res);
+
 	return res;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] drm/amdgpu: add support to SMU debug option
  2021-12-01  9:24 [PATCH] drm/amdgpu: add support to SMU debug option Lang Yu
@ 2021-12-01  9:29 ` Christian König
  2021-12-01 10:44   ` Yu, Lang
  2021-12-01  9:44 ` Huang Rui
  2021-12-01  9:47 ` Lazar, Lijo
  2 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2021-12-01  9:29 UTC (permalink / raw)
  To: Lang Yu, amd-gfx; +Cc: Alex Deucher, Lijo Lazar, Huang Rui

Am 01.12.21 um 10:24 schrieb Lang Yu:
> To maintain system error state when SMU errors occurred,
> which will aid in debugging SMU firmware issues, add SMU
> debug option support.
>
> It can be enabled or disabled via amdgpu_smu_debug
> debugfs file. When enabled, it makes SMU errors fatal.
> It is disabled by default.
>
> == Command Guide ==
>
> 1, enable SMU debug option
>
>   # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>
> 2, disable SMU debug option
>
>   # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>
> v3:
>   - Use debugfs_create_bool().(Christian)
>   - Put variable into smu_context struct.
>   - Don't resend command when timeout.
>
> v2:
>   - Resend command when timeout.(Lijo)
>   - Use debugfs file instead of module parameter.
>
> Signed-off-by: Lang Yu <lang.yu@amd.com>

Well the debugfs part looks really nice and clean now, but one more 
comment below.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
>   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
>   drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
>   4 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index 164d6a9e9fbb..86cd888c7822 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device *adev)
>   	if (!debugfs_initialized())
>   		return 0;
>   
> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> +				  &adev->smu.smu_debug_mode);
> +
>   	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
>   				  &fops_ib_preempt);
>   	if (IS_ERR(ent)) {
> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> index f738f7dc20c9..50dbf5594a9d 100644
> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> @@ -569,6 +569,11 @@ struct smu_context
>   	struct smu_user_dpm_profile user_dpm_profile;
>   
>   	struct stb_context stb_context;
> +	/*
> +	 * When enabled, it makes SMU errors fatal.
> +	 * (0 = disabled (default), 1 = enabled)
> +	 */
> +	bool smu_debug_mode;
>   };
>   
>   struct i2c_adapter;
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> index 6e781cee8bb6..d3797a2d6451 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct smu_context *smu)
>   out:
>   	mutex_unlock(&smu->message_lock);
>   
> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
> +
>   	return ret;
>   }
>   
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> index 048ca1673863..9be005eb4241 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> @@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>   		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>   		goto Out;
>   	}
> +
>   	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>   	reg = __smu_cmn_poll_stat(smu);
>   	res = __smu_cmn_reg2errno(smu, reg);
> -	if (res != 0)
> +	if (res != 0) {
>   		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
> +		goto Out;
> +	}
>   	if (read_arg)
>   		smu_cmn_read_arg(smu, read_arg);
>   Out:
>   	mutex_unlock(&smu->message_lock);
> +
> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);

BUG_ON() really crashes the kernel and is only allowed if we prevent 
further data corruption with that.

Most of the time WARN_ON() is more appropriate, but I can't fully judge 
here since I don't know the SMU code well enough.

Christian.

> +
>   	return res;
>   }
>   


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] drm/amdgpu: add support to SMU debug option
  2021-12-01  9:24 [PATCH] drm/amdgpu: add support to SMU debug option Lang Yu
  2021-12-01  9:29 ` Christian König
@ 2021-12-01  9:44 ` Huang Rui
  2021-12-01  9:47 ` Lazar, Lijo
  2 siblings, 0 replies; 18+ messages in thread
From: Huang Rui @ 2021-12-01  9:44 UTC (permalink / raw)
  To: Yu, Lang; +Cc: Deucher, Alexander, Lazar, Lijo, Koenig, Christian, amd-gfx

On Wed, Dec 01, 2021 at 05:24:58PM +0800, Yu, Lang wrote:
> To maintain system error state when SMU errors occurred,
> which will aid in debugging SMU firmware issues, add SMU
> debug option support.
> 
> It can be enabled or disabled via amdgpu_smu_debug
> debugfs file. When enabled, it makes SMU errors fatal.
> It is disabled by default.
> 
> == Command Guide ==
> 
> 1, enable SMU debug option
> 
>  # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> 
> 2, disable SMU debug option
> 
>  # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> 
> v3:
>  - Use debugfs_create_bool().(Christian)
>  - Put variable into smu_context struct.
>  - Don't resend command when timeout.
> 
> v2:
>  - Resend command when timeout.(Lijo)
>  - Use debugfs file instead of module parameter.
> 
> Signed-off-by: Lang Yu <lang.yu@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
>  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
>  drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
>  4 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index 164d6a9e9fbb..86cd888c7822 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device *adev)
>  	if (!debugfs_initialized())
>  		return 0;
>  
> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> +				  &adev->smu.smu_debug_mode);
> +
>  	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
>  				  &fops_ib_preempt);
>  	if (IS_ERR(ent)) {
> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> index f738f7dc20c9..50dbf5594a9d 100644
> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> @@ -569,6 +569,11 @@ struct smu_context
>  	struct smu_user_dpm_profile user_dpm_profile;
>  
>  	struct stb_context stb_context;
> +	/*
> +	 * When enabled, it makes SMU errors fatal.
> +	 * (0 = disabled (default), 1 = enabled)
> +	 */
> +	bool smu_debug_mode;
>  };
>  
>  struct i2c_adapter;
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> index 6e781cee8bb6..d3797a2d6451 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct smu_context *smu)
>  out:
>  	mutex_unlock(&smu->message_lock);
>  
> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
> +
>  	return ret;
>  }
>  
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> index 048ca1673863..9be005eb4241 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> @@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>  		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>  		goto Out;
>  	}
> +
>  	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>  	reg = __smu_cmn_poll_stat(smu);
>  	res = __smu_cmn_reg2errno(smu, reg);
> -	if (res != 0)
> +	if (res != 0) {
>  		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
> +		goto Out;
> +	}
>  	if (read_arg)
>  		smu_cmn_read_arg(smu, read_arg);
>  Out:
>  	mutex_unlock(&smu->message_lock);
> +
> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
> +

Do we need to add BUG_ON on smu_cmn_send_msg_without_waiting() as well?

Thanks,
Ray

>  	return res;
>  }
>  
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] drm/amdgpu: add support to SMU debug option
  2021-12-01  9:24 [PATCH] drm/amdgpu: add support to SMU debug option Lang Yu
  2021-12-01  9:29 ` Christian König
  2021-12-01  9:44 ` Huang Rui
@ 2021-12-01  9:47 ` Lazar, Lijo
  2021-12-01 10:38   ` Yu, Lang
  2 siblings, 1 reply; 18+ messages in thread
From: Lazar, Lijo @ 2021-12-01  9:47 UTC (permalink / raw)
  To: Lang Yu, amd-gfx; +Cc: Alex Deucher, Huang Rui, Christian Koenig



On 12/1/2021 2:54 PM, Lang Yu wrote:
> To maintain system error state when SMU errors occurred,
> which will aid in debugging SMU firmware issues, add SMU
> debug option support.
> 
> It can be enabled or disabled via amdgpu_smu_debug
> debugfs file. When enabled, it makes SMU errors fatal.
> It is disabled by default.
> 
> == Command Guide ==
> 
> 1, enable SMU debug option
> 
>   # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> 
> 2, disable SMU debug option
> 
>   # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> 
> v3:
>   - Use debugfs_create_bool().(Christian)
>   - Put variable into smu_context struct.
>   - Don't resend command when timeout.
> 
> v2:
>   - Resend command when timeout.(Lijo)
>   - Use debugfs file instead of module parameter.
> 
> Signed-off-by: Lang Yu <lang.yu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
>   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
>   drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
>   4 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index 164d6a9e9fbb..86cd888c7822 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device *adev)
>   	if (!debugfs_initialized())
>   		return 0;
>   
> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> +				  &adev->smu.smu_debug_mode);
> +
>   	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
>   				  &fops_ib_preempt);
>   	if (IS_ERR(ent)) {
> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> index f738f7dc20c9..50dbf5594a9d 100644
> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> @@ -569,6 +569,11 @@ struct smu_context
>   	struct smu_user_dpm_profile user_dpm_profile;
>   
>   	struct stb_context stb_context;
> +	/*
> +	 * When enabled, it makes SMU errors fatal.
> +	 * (0 = disabled (default), 1 = enabled)
> +	 */
> +	bool smu_debug_mode;
>   };
>   
>   struct i2c_adapter;
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> index 6e781cee8bb6..d3797a2d6451 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct smu_context *smu)
>   out:
>   	mutex_unlock(&smu->message_lock);
>   
> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
> +
This hunk can be skipped while submitting. If this fails, GPU reset will 
fail and amdgpu won't continue.

>   	return ret;
>   }
>   
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> index 048ca1673863..9be005eb4241 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> @@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>   		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>   		goto Out;
>   	}
> +
>   	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>   	reg = __smu_cmn_poll_stat(smu);
>   	res = __smu_cmn_reg2errno(smu, reg);
> -	if (res != 0)
> +	if (res != 0) {
>   		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
> +		goto Out;

Next step is reading smu parameter register which is harmless as reading 
response register and it's not clear on read. This goto also may be skipped.

Thanks,
Lijo

> +	}
>   	if (read_arg)
>   		smu_cmn_read_arg(smu, read_arg);
>   Out:
>   	mutex_unlock(&smu->message_lock);
> +
> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
> +
>   	return res;
>   }
>   
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] drm/amdgpu: add support to SMU debug option
  2021-12-01  9:47 ` Lazar, Lijo
@ 2021-12-01 10:38   ` Yu, Lang
  2021-12-01 10:46     ` Lazar, Lijo
  0 siblings, 1 reply; 18+ messages in thread
From: Yu, Lang @ 2021-12-01 10:38 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx; +Cc: Deucher, Alexander, Huang, Ray, Koenig, Christian

[AMD Official Use Only]



>-----Original Message-----
>From: Lazar, Lijo <Lijo.Lazar@amd.com>
>Sent: Wednesday, December 1, 2021 5:47 PM
>To: Yu, Lang <Lang.Yu@amd.com>; amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Huang, Ray
><Ray.Huang@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
>Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>
>
>
>On 12/1/2021 2:54 PM, Lang Yu wrote:
>> To maintain system error state when SMU errors occurred, which will
>> aid in debugging SMU firmware issues, add SMU debug option support.
>>
>> It can be enabled or disabled via amdgpu_smu_debug debugfs file. When
>> enabled, it makes SMU errors fatal.
>> It is disabled by default.
>>
>> == Command Guide ==
>>
>> 1, enable SMU debug option
>>
>>   # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>
>> 2, disable SMU debug option
>>
>>   # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>
>> v3:
>>   - Use debugfs_create_bool().(Christian)
>>   - Put variable into smu_context struct.
>>   - Don't resend command when timeout.
>>
>> v2:
>>   - Resend command when timeout.(Lijo)
>>   - Use debugfs file instead of module parameter.
>>
>> Signed-off-by: Lang Yu <lang.yu@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
>>   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
>>   drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>>   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
>>   4 files changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> index 164d6a9e9fbb..86cd888c7822 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
>*adev)
>>   	if (!debugfs_initialized())
>>   		return 0;
>>
>> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>> +				  &adev->smu.smu_debug_mode);
>> +
>>   	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
>>   				  &fops_ib_preempt);
>>   	if (IS_ERR(ent)) {
>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> index f738f7dc20c9..50dbf5594a9d 100644
>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> @@ -569,6 +569,11 @@ struct smu_context
>>   	struct smu_user_dpm_profile user_dpm_profile;
>>
>>   	struct stb_context stb_context;
>> +	/*
>> +	 * When enabled, it makes SMU errors fatal.
>> +	 * (0 = disabled (default), 1 = enabled)
>> +	 */
>> +	bool smu_debug_mode;
>>   };
>>
>>   struct i2c_adapter;
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> index 6e781cee8bb6..d3797a2d6451 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>smu_context *smu)
>>   out:
>>   	mutex_unlock(&smu->message_lock);
>>
>> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>> +
>This hunk can be skipped while submitting. If this fails, GPU reset will fail and
>amdgpu won't continue.

Ok, we don't handle such cases.

>
>>   	return ret;
>>   }
>>
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> index 048ca1673863..9be005eb4241 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> @@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct
>smu_context *smu,
>>   		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>   		goto Out;
>>   	}
>> +
>>   	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>>   	reg = __smu_cmn_poll_stat(smu);
>>   	res = __smu_cmn_reg2errno(smu, reg);
>> -	if (res != 0)
>> +	if (res != 0) {
>>   		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>> +		goto Out;
>
>Next step is reading smu parameter register which is harmless as reading
>response register and it's not clear on read. This goto also may be skipped.

I just think that does some extra work. We don’t want to read response register.
This goto makes error handling more clear.

Regards,
Lang

>Thanks,
>Lijo
>
>> +	}
>>   	if (read_arg)
>>   		smu_cmn_read_arg(smu, read_arg);
>>   Out:
>>   	mutex_unlock(&smu->message_lock);
>> +
>> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
>> +
>>   	return res;
>>   }
>>
>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] drm/amdgpu: add support to SMU debug option
  2021-12-01  9:29 ` Christian König
@ 2021-12-01 10:44   ` Yu, Lang
  2021-12-01 10:48     ` Christian König
  0 siblings, 1 reply; 18+ messages in thread
From: Yu, Lang @ 2021-12-01 10:44 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx; +Cc: Deucher, Alexander, Lazar, Lijo, Huang, Ray

[AMD Official Use Only]



>-----Original Message-----
>From: Koenig, Christian <Christian.Koenig@amd.com>
>Sent: Wednesday, December 1, 2021 5:30 PM
>To: Yu, Lang <Lang.Yu@amd.com>; amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
><Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>
>Am 01.12.21 um 10:24 schrieb Lang Yu:
>> To maintain system error state when SMU errors occurred, which will
>> aid in debugging SMU firmware issues, add SMU debug option support.
>>
>> It can be enabled or disabled via amdgpu_smu_debug debugfs file. When
>> enabled, it makes SMU errors fatal.
>> It is disabled by default.
>>
>> == Command Guide ==
>>
>> 1, enable SMU debug option
>>
>>   # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>
>> 2, disable SMU debug option
>>
>>   # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>
>> v3:
>>   - Use debugfs_create_bool().(Christian)
>>   - Put variable into smu_context struct.
>>   - Don't resend command when timeout.
>>
>> v2:
>>   - Resend command when timeout.(Lijo)
>>   - Use debugfs file instead of module parameter.
>>
>> Signed-off-by: Lang Yu <lang.yu@amd.com>
>
>Well the debugfs part looks really nice and clean now, but one more comment
>below.
>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
>>   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
>>   drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>>   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
>>   4 files changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> index 164d6a9e9fbb..86cd888c7822 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
>*adev)
>>   	if (!debugfs_initialized())
>>   		return 0;
>>
>> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>> +				  &adev->smu.smu_debug_mode);
>> +
>>   	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
>>   				  &fops_ib_preempt);
>>   	if (IS_ERR(ent)) {
>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> index f738f7dc20c9..50dbf5594a9d 100644
>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> @@ -569,6 +569,11 @@ struct smu_context
>>   	struct smu_user_dpm_profile user_dpm_profile;
>>
>>   	struct stb_context stb_context;
>> +	/*
>> +	 * When enabled, it makes SMU errors fatal.
>> +	 * (0 = disabled (default), 1 = enabled)
>> +	 */
>> +	bool smu_debug_mode;
>>   };
>>
>>   struct i2c_adapter;
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> index 6e781cee8bb6..d3797a2d6451 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>smu_context *smu)
>>   out:
>>   	mutex_unlock(&smu->message_lock);
>>
>> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>> +
>>   	return ret;
>>   }
>>
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> index 048ca1673863..9be005eb4241 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> @@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct
>smu_context *smu,
>>   		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>   		goto Out;
>>   	}
>> +
>>   	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>>   	reg = __smu_cmn_poll_stat(smu);
>>   	res = __smu_cmn_reg2errno(smu, reg);
>> -	if (res != 0)
>> +	if (res != 0) {
>>   		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>> +		goto Out;
>> +	}
>>   	if (read_arg)
>>   		smu_cmn_read_arg(smu, read_arg);
>>   Out:
>>   	mutex_unlock(&smu->message_lock);
>> +
>> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
>
>BUG_ON() really crashes the kernel and is only allowed if we prevent further data
>corruption with that.
>
>Most of the time WARN_ON() is more appropriate, but I can't fully judge here
>since I don't know the SMU code well enough.

This is what SMU FW guys want. They want "user-visible (potentially fatal) errors", then a hang.
They want to keep system state since the error occurred.

Regards,
Lang

>Christian.
>
>> +
>>   	return res;
>>   }
>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] drm/amdgpu: add support to SMU debug option
  2021-12-01 10:38   ` Yu, Lang
@ 2021-12-01 10:46     ` Lazar, Lijo
  2021-12-01 10:56       ` Yu, Lang
  0 siblings, 1 reply; 18+ messages in thread
From: Lazar, Lijo @ 2021-12-01 10:46 UTC (permalink / raw)
  To: Yu, Lang, amd-gfx; +Cc: Deucher, Alexander, Huang, Ray, Koenig, Christian



On 12/1/2021 4:08 PM, Yu, Lang wrote:
> [AMD Official Use Only]
> 
> 
> 
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>> Sent: Wednesday, December 1, 2021 5:47 PM
>> To: Yu, Lang <Lang.Yu@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Huang, Ray
>> <Ray.Huang@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>
>>
>>
>> On 12/1/2021 2:54 PM, Lang Yu wrote:
>>> To maintain system error state when SMU errors occurred, which will
>>> aid in debugging SMU firmware issues, add SMU debug option support.
>>>
>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file. When
>>> enabled, it makes SMU errors fatal.
>>> It is disabled by default.
>>>
>>> == Command Guide ==
>>>
>>> 1, enable SMU debug option
>>>
>>>    # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>
>>> 2, disable SMU debug option
>>>
>>>    # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>
>>> v3:
>>>    - Use debugfs_create_bool().(Christian)
>>>    - Put variable into smu_context struct.
>>>    - Don't resend command when timeout.
>>>
>>> v2:
>>>    - Resend command when timeout.(Lijo)
>>>    - Use debugfs file instead of module parameter.
>>>
>>> Signed-off-by: Lang Yu <lang.yu@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
>>>    drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
>>>    drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>>>    drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
>>>    4 files changed, 17 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>> index 164d6a9e9fbb..86cd888c7822 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
>> *adev)
>>>    	if (!debugfs_initialized())
>>>    		return 0;
>>>
>>> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>>> +				  &adev->smu.smu_debug_mode);
>>> +
>>>    	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
>>>    				  &fops_ib_preempt);
>>>    	if (IS_ERR(ent)) {
>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>> index f738f7dc20c9..50dbf5594a9d 100644
>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>> @@ -569,6 +569,11 @@ struct smu_context
>>>    	struct smu_user_dpm_profile user_dpm_profile;
>>>
>>>    	struct stb_context stb_context;
>>> +	/*
>>> +	 * When enabled, it makes SMU errors fatal.
>>> +	 * (0 = disabled (default), 1 = enabled)
>>> +	 */
>>> +	bool smu_debug_mode;
>>>    };
>>>
>>>    struct i2c_adapter;
>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>> index 6e781cee8bb6..d3797a2d6451 100644
>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>> smu_context *smu)
>>>    out:
>>>    	mutex_unlock(&smu->message_lock);
>>>
>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>>> +
>> This hunk can be skipped while submitting. If this fails, GPU reset will fail and
>> amdgpu won't continue.
> 
> Ok, we don't handle such cases.
> 
>>
>>>    	return ret;
>>>    }
>>>
>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>> index 048ca1673863..9be005eb4241 100644
>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>> @@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct
>> smu_context *smu,
>>>    		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>>    		goto Out;
>>>    	}
>>> +
>>>    	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>>>    	reg = __smu_cmn_poll_stat(smu);
>>>    	res = __smu_cmn_reg2errno(smu, reg);
>>> -	if (res != 0)
>>> +	if (res != 0) {
>>>    		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>> +		goto Out;
>>
>> Next step is reading smu parameter register which is harmless as reading
>> response register and it's not clear on read. This goto also may be skipped.
> 
> I just think that does some extra work. We don’t want to read response register.
> This goto makes error handling more clear.
> 

This change affects non-debug mode also. If things are normal, error 
handling is supposed to be done by the caller based on the FW response 
and/or return parameter value, if there is any. smu_debug_mode shouldn't 
change that.

Thanks,
Lijo

> Regards,
> Lang
> 
>> Thanks,
>> Lijo
>>
>>> +	}
>>>    	if (read_arg)
>>>    		smu_cmn_read_arg(smu, read_arg);
>>>    Out:
>>>    	mutex_unlock(&smu->message_lock);
>>> +
>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
>>> +
>>>    	return res;
>>>    }
>>>
>>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] drm/amdgpu: add support to SMU debug option
  2021-12-01 10:44   ` Yu, Lang
@ 2021-12-01 10:48     ` Christian König
  2021-12-01 11:20       ` Yu, Lang
  0 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2021-12-01 10:48 UTC (permalink / raw)
  To: Yu, Lang, Koenig, Christian, amd-gfx
  Cc: Deucher, Alexander, Lazar, Lijo, Huang, Ray

Am 01.12.21 um 11:44 schrieb Yu, Lang:
> [AMD Official Use Only]
>
>
>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> Sent: Wednesday, December 1, 2021 5:30 PM
>> To: Yu, Lang <Lang.Yu@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
>> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>
>> Am 01.12.21 um 10:24 schrieb Lang Yu:
>>> To maintain system error state when SMU errors occurred, which will
>>> aid in debugging SMU firmware issues, add SMU debug option support.
>>>
>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file. When
>>> enabled, it makes SMU errors fatal.
>>> It is disabled by default.
>>>
>>> == Command Guide ==
>>>
>>> 1, enable SMU debug option
>>>
>>>    # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>
>>> 2, disable SMU debug option
>>>
>>>    # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>
>>> v3:
>>>    - Use debugfs_create_bool().(Christian)
>>>    - Put variable into smu_context struct.
>>>    - Don't resend command when timeout.
>>>
>>> v2:
>>>    - Resend command when timeout.(Lijo)
>>>    - Use debugfs file instead of module parameter.
>>>
>>> Signed-off-by: Lang Yu <lang.yu@amd.com>
>> Well the debugfs part looks really nice and clean now, but one more comment
>> below.
>>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
>>>    drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
>>>    drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>>>    drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
>>>    4 files changed, 17 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>> index 164d6a9e9fbb..86cd888c7822 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
>> *adev)
>>>    	if (!debugfs_initialized())
>>>    		return 0;
>>>
>>> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>>> +				  &adev->smu.smu_debug_mode);
>>> +
>>>    	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
>>>    				  &fops_ib_preempt);
>>>    	if (IS_ERR(ent)) {
>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>> index f738f7dc20c9..50dbf5594a9d 100644
>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>> @@ -569,6 +569,11 @@ struct smu_context
>>>    	struct smu_user_dpm_profile user_dpm_profile;
>>>
>>>    	struct stb_context stb_context;
>>> +	/*
>>> +	 * When enabled, it makes SMU errors fatal.
>>> +	 * (0 = disabled (default), 1 = enabled)
>>> +	 */
>>> +	bool smu_debug_mode;
>>>    };
>>>
>>>    struct i2c_adapter;
>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>> index 6e781cee8bb6..d3797a2d6451 100644
>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>> smu_context *smu)
>>>    out:
>>>    	mutex_unlock(&smu->message_lock);
>>>
>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>>> +
>>>    	return ret;
>>>    }
>>>
>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>> index 048ca1673863..9be005eb4241 100644
>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>> @@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct
>> smu_context *smu,
>>>    		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>>    		goto Out;
>>>    	}
>>> +
>>>    	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>>>    	reg = __smu_cmn_poll_stat(smu);
>>>    	res = __smu_cmn_reg2errno(smu, reg);
>>> -	if (res != 0)
>>> +	if (res != 0) {
>>>    		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>> +		goto Out;
>>> +	}
>>>    	if (read_arg)
>>>    		smu_cmn_read_arg(smu, read_arg);
>>>    Out:
>>>    	mutex_unlock(&smu->message_lock);
>>> +
>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
>> BUG_ON() really crashes the kernel and is only allowed if we prevent further data
>> corruption with that.
>>
>> Most of the time WARN_ON() is more appropriate, but I can't fully judge here
>> since I don't know the SMU code well enough.
> This is what SMU FW guys want. They want "user-visible (potentially fatal) errors", then a hang.
> They want to keep system state since the error occurred.

Well that is rather problematic.

First of all we need to really justify that, crashing the kernel is not 
something easily done.

Then this isn't really effective here. What happens is that you crash 
the kernel thread of the currently executing process, but it is 
perfectly possible that another thread still tries to send messages to 
the SMU. You need to have the BUG_ON() before dropping the lock to make 
sure that this really gets the driver stuck in the current state.

Regards,
Christian.

>
> Regards,
> Lang
>
>> Christian.
>>
>>> +
>>>    	return res;
>>>    }
>>>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] drm/amdgpu: add support to SMU debug option
  2021-12-01 10:46     ` Lazar, Lijo
@ 2021-12-01 10:56       ` Yu, Lang
  0 siblings, 0 replies; 18+ messages in thread
From: Yu, Lang @ 2021-12-01 10:56 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx; +Cc: Deucher, Alexander, Huang, Ray, Koenig, Christian

[AMD Official Use Only]



>-----Original Message-----
>From: Lazar, Lijo <Lijo.Lazar@amd.com>
>Sent: Wednesday, December 1, 2021 6:46 PM
>To: Yu, Lang <Lang.Yu@amd.com>; amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Huang, Ray
><Ray.Huang@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
>Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>
>
>
>On 12/1/2021 4:08 PM, Yu, Lang wrote:
>> [AMD Official Use Only]
>>
>>
>>
>>> -----Original Message-----
>>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>>> Sent: Wednesday, December 1, 2021 5:47 PM
>>> To: Yu, Lang <Lang.Yu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Huang, Ray
>>> <Ray.Huang@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>
>>>
>>>
>>> On 12/1/2021 2:54 PM, Lang Yu wrote:
>>>> To maintain system error state when SMU errors occurred, which will
>>>> aid in debugging SMU firmware issues, add SMU debug option support.
>>>>
>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
>>>> When enabled, it makes SMU errors fatal.
>>>> It is disabled by default.
>>>>
>>>> == Command Guide ==
>>>>
>>>> 1, enable SMU debug option
>>>>
>>>>    # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>
>>>> 2, disable SMU debug option
>>>>
>>>>    # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>
>>>> v3:
>>>>    - Use debugfs_create_bool().(Christian)
>>>>    - Put variable into smu_context struct.
>>>>    - Don't resend command when timeout.
>>>>
>>>> v2:
>>>>    - Resend command when timeout.(Lijo)
>>>>    - Use debugfs file instead of module parameter.
>>>>
>>>> Signed-off-by: Lang Yu <lang.yu@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
>>>>    drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
>>>>    drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>>>>    drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
>>>>    4 files changed, 17 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>> index 164d6a9e9fbb..86cd888c7822 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
>>> *adev)
>>>>    	if (!debugfs_initialized())
>>>>    		return 0;
>>>>
>>>> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>>>> +				  &adev->smu.smu_debug_mode);
>>>> +
>>>>    	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
>>>>    				  &fops_ib_preempt);
>>>>    	if (IS_ERR(ent)) {
>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>> index f738f7dc20c9..50dbf5594a9d 100644
>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>> @@ -569,6 +569,11 @@ struct smu_context
>>>>    	struct smu_user_dpm_profile user_dpm_profile;
>>>>
>>>>    	struct stb_context stb_context;
>>>> +	/*
>>>> +	 * When enabled, it makes SMU errors fatal.
>>>> +	 * (0 = disabled (default), 1 = enabled)
>>>> +	 */
>>>> +	bool smu_debug_mode;
>>>>    };
>>>>
>>>>    struct i2c_adapter;
>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>> index 6e781cee8bb6..d3797a2d6451 100644
>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>>> smu_context *smu)
>>>>    out:
>>>>    	mutex_unlock(&smu->message_lock);
>>>>
>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>>>> +
>>> This hunk can be skipped while submitting. If this fails, GPU reset
>>> will fail and amdgpu won't continue.
>>
>> Ok, we don't handle such cases.
>>
>>>
>>>>    	return ret;
>>>>    }
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>> index 048ca1673863..9be005eb4241 100644
>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>> @@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct
>>> smu_context *smu,
>>>>    		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>>>    		goto Out;
>>>>    	}
>>>> +
>>>>    	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>>>>    	reg = __smu_cmn_poll_stat(smu);
>>>>    	res = __smu_cmn_reg2errno(smu, reg);
>>>> -	if (res != 0)
>>>> +	if (res != 0) {
>>>>    		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>>> +		goto Out;
>>>
>>> Next step is reading smu parameter register which is harmless as
>>> reading response register and it's not clear on read. This goto also may be
>skipped.
>>
>> I just think that does some extra work. We don’t want to read response register.
>> This goto makes error handling more clear.
>>
>
>This change affects non-debug mode also. If things are normal, error handling is
>supposed to be done by the caller based on the FW response and/or return
>parameter value, if there is any. smu_debug_mode shouldn't change that.

Thanks, I got it.

>Thanks,
>Lijo
>
>> Regards,
>> Lang
>>
>>> Thanks,
>>> Lijo
>>>
>>>> +	}
>>>>    	if (read_arg)
>>>>    		smu_cmn_read_arg(smu, read_arg);
>>>>    Out:
>>>>    	mutex_unlock(&smu->message_lock);
>>>> +
>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
>>>> +
>>>>    	return res;
>>>>    }
>>>>
>>>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] drm/amdgpu: add support to SMU debug option
  2021-12-01 10:48     ` Christian König
@ 2021-12-01 11:20       ` Yu, Lang
  2021-12-01 11:29         ` Christian König
  0 siblings, 1 reply; 18+ messages in thread
From: Yu, Lang @ 2021-12-01 11:20 UTC (permalink / raw)
  To: Christian König, Koenig, Christian, amd-gfx
  Cc: Deucher, Alexander, Lazar, Lijo, Huang, Ray

[AMD Official Use Only]



>-----Original Message-----
>From: Christian König <ckoenig.leichtzumerken@gmail.com>
>Sent: Wednesday, December 1, 2021 6:49 PM
>To: Yu, Lang <Lang.Yu@amd.com>; Koenig, Christian
><Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
><Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>
>Am 01.12.21 um 11:44 schrieb Yu, Lang:
>> [AMD Official Use Only]
>>
>>
>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>> Sent: Wednesday, December 1, 2021 5:30 PM
>>> To: Yu, Lang <Lang.Yu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
>>> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>
>>> Am 01.12.21 um 10:24 schrieb Lang Yu:
>>>> To maintain system error state when SMU errors occurred, which will
>>>> aid in debugging SMU firmware issues, add SMU debug option support.
>>>>
>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
>>>> When enabled, it makes SMU errors fatal.
>>>> It is disabled by default.
>>>>
>>>> == Command Guide ==
>>>>
>>>> 1, enable SMU debug option
>>>>
>>>>    # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>
>>>> 2, disable SMU debug option
>>>>
>>>>    # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>
>>>> v3:
>>>>    - Use debugfs_create_bool().(Christian)
>>>>    - Put variable into smu_context struct.
>>>>    - Don't resend command when timeout.
>>>>
>>>> v2:
>>>>    - Resend command when timeout.(Lijo)
>>>>    - Use debugfs file instead of module parameter.
>>>>
>>>> Signed-off-by: Lang Yu <lang.yu@amd.com>
>>> Well the debugfs part looks really nice and clean now, but one more
>>> comment below.
>>>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
>>>>    drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
>>>>    drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>>>>    drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
>>>>    4 files changed, 17 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>> index 164d6a9e9fbb..86cd888c7822 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
>>> *adev)
>>>>    	if (!debugfs_initialized())
>>>>    		return 0;
>>>>
>>>> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>>>> +				  &adev->smu.smu_debug_mode);
>>>> +
>>>>    	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
>>>>    				  &fops_ib_preempt);
>>>>    	if (IS_ERR(ent)) {
>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>> index f738f7dc20c9..50dbf5594a9d 100644
>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>> @@ -569,6 +569,11 @@ struct smu_context
>>>>    	struct smu_user_dpm_profile user_dpm_profile;
>>>>
>>>>    	struct stb_context stb_context;
>>>> +	/*
>>>> +	 * When enabled, it makes SMU errors fatal.
>>>> +	 * (0 = disabled (default), 1 = enabled)
>>>> +	 */
>>>> +	bool smu_debug_mode;
>>>>    };
>>>>
>>>>    struct i2c_adapter;
>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>> index 6e781cee8bb6..d3797a2d6451 100644
>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>>> smu_context *smu)
>>>>    out:
>>>>    	mutex_unlock(&smu->message_lock);
>>>>
>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>>>> +
>>>>    	return ret;
>>>>    }
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>> index 048ca1673863..9be005eb4241 100644
>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>> @@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct
>>> smu_context *smu,
>>>>    		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>>>    		goto Out;
>>>>    	}
>>>> +
>>>>    	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>>>>    	reg = __smu_cmn_poll_stat(smu);
>>>>    	res = __smu_cmn_reg2errno(smu, reg);
>>>> -	if (res != 0)
>>>> +	if (res != 0) {
>>>>    		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>>> +		goto Out;
>>>> +	}
>>>>    	if (read_arg)
>>>>    		smu_cmn_read_arg(smu, read_arg);
>>>>    Out:
>>>>    	mutex_unlock(&smu->message_lock);
>>>> +
>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
>>> BUG_ON() really crashes the kernel and is only allowed if we prevent
>>> further data corruption with that.
>>>
>>> Most of the time WARN_ON() is more appropriate, but I can't fully
>>> judge here since I don't know the SMU code well enough.
>> This is what SMU FW guys want. They want "user-visible (potentially fatal)
>errors", then a hang.
>> They want to keep system state since the error occurred.
>
>Well that is rather problematic.
>
>First of all we need to really justify that, crashing the kernel is not something
>easily done.
>
>Then this isn't really effective here. What happens is that you crash the kernel
>thread of the currently executing process, but it is perfectly possible that another
>thread still tries to send messages to the SMU. You need to have the BUG_ON()
>before dropping the lock to make sure that this really gets the driver stuck in the
>current state.

Thanks. I got it. I just thought it is a kenel panic.
Could we use a panic() here? 

Regards,
Lang

>Regards,
>Christian.
>
>>
>> Regards,
>> Lang
>>
>>> Christian.
>>>
>>>> +
>>>>    	return res;
>>>>    }
>>>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] drm/amdgpu: add support to SMU debug option
  2021-12-01 11:20       ` Yu, Lang
@ 2021-12-01 11:29         ` Christian König
  2021-12-01 11:37           ` Yu, Lang
  0 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2021-12-01 11:29 UTC (permalink / raw)
  To: Yu, Lang, Christian König, amd-gfx
  Cc: Deucher, Alexander, Lazar, Lijo, Huang, Ray

Am 01.12.21 um 12:20 schrieb Yu, Lang:
> [AMD Official Use Only]
>
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>> Sent: Wednesday, December 1, 2021 6:49 PM
>> To: Yu, Lang <Lang.Yu@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
>> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>
>> Am 01.12.21 um 11:44 schrieb Yu, Lang:
>>> [AMD Official Use Only]
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>> Sent: Wednesday, December 1, 2021 5:30 PM
>>>> To: Yu, Lang <Lang.Yu@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
>>>> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>>
>>>> Am 01.12.21 um 10:24 schrieb Lang Yu:
>>>>> To maintain system error state when SMU errors occurred, which will
>>>>> aid in debugging SMU firmware issues, add SMU debug option support.
>>>>>
>>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
>>>>> When enabled, it makes SMU errors fatal.
>>>>> It is disabled by default.
>>>>>
>>>>> == Command Guide ==
>>>>>
>>>>> 1, enable SMU debug option
>>>>>
>>>>>     # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>
>>>>> 2, disable SMU debug option
>>>>>
>>>>>     # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>
>>>>> v3:
>>>>>     - Use debugfs_create_bool().(Christian)
>>>>>     - Put variable into smu_context struct.
>>>>>     - Don't resend command when timeout.
>>>>>
>>>>> v2:
>>>>>     - Resend command when timeout.(Lijo)
>>>>>     - Use debugfs file instead of module parameter.
>>>>>
>>>>> Signed-off-by: Lang Yu <lang.yu@amd.com>
>>>> Well the debugfs part looks really nice and clean now, but one more
>>>> comment below.
>>>>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
>>>>>     drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
>>>>>     drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>>>>>     drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
>>>>>     4 files changed, 17 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>> index 164d6a9e9fbb..86cd888c7822 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
>>>> *adev)
>>>>>     	if (!debugfs_initialized())
>>>>>     		return 0;
>>>>>
>>>>> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>>>>> +				  &adev->smu.smu_debug_mode);
>>>>> +
>>>>>     	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
>>>>>     				  &fops_ib_preempt);
>>>>>     	if (IS_ERR(ent)) {
>>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>> index f738f7dc20c9..50dbf5594a9d 100644
>>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>> @@ -569,6 +569,11 @@ struct smu_context
>>>>>     	struct smu_user_dpm_profile user_dpm_profile;
>>>>>
>>>>>     	struct stb_context stb_context;
>>>>> +	/*
>>>>> +	 * When enabled, it makes SMU errors fatal.
>>>>> +	 * (0 = disabled (default), 1 = enabled)
>>>>> +	 */
>>>>> +	bool smu_debug_mode;
>>>>>     };
>>>>>
>>>>>     struct i2c_adapter;
>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>> index 6e781cee8bb6..d3797a2d6451 100644
>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>>>> smu_context *smu)
>>>>>     out:
>>>>>     	mutex_unlock(&smu->message_lock);
>>>>>
>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>>>>> +
>>>>>     	return ret;
>>>>>     }
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>> index 048ca1673863..9be005eb4241 100644
>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>> @@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct
>>>> smu_context *smu,
>>>>>     		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>>>>     		goto Out;
>>>>>     	}
>>>>> +
>>>>>     	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>>>>>     	reg = __smu_cmn_poll_stat(smu);
>>>>>     	res = __smu_cmn_reg2errno(smu, reg);
>>>>> -	if (res != 0)
>>>>> +	if (res != 0) {
>>>>>     		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>>>> +		goto Out;
>>>>> +	}
>>>>>     	if (read_arg)
>>>>>     		smu_cmn_read_arg(smu, read_arg);
>>>>>     Out:
>>>>>     	mutex_unlock(&smu->message_lock);
>>>>> +
>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
>>>> BUG_ON() really crashes the kernel and is only allowed if we prevent
>>>> further data corruption with that.
>>>>
>>>> Most of the time WARN_ON() is more appropriate, but I can't fully
>>>> judge here since I don't know the SMU code well enough.
>>> This is what SMU FW guys want. They want "user-visible (potentially fatal)
>> errors", then a hang.
>>> They want to keep system state since the error occurred.
>> Well that is rather problematic.
>>
>> First of all we need to really justify that, crashing the kernel is not something
>> easily done.
>>
>> Then this isn't really effective here. What happens is that you crash the kernel
>> thread of the currently executing process, but it is perfectly possible that another
>> thread still tries to send messages to the SMU. You need to have the BUG_ON()
>> before dropping the lock to make sure that this really gets the driver stuck in the
>> current state.
> Thanks. I got it. I just thought it is a kenel panic.
> Could we use a panic() here?

Potentially, but that might reboot the system automatically which is 
probably not what you want either.

How does the SMU firmware team gather the necessary information when a 
problem occurs?

When they connect external hardware a BUG_ON() while holding the SMU 
lock might work, but if they want to SSH into the system you should 
probably rather set a flag that the hardware shouldn't be touched any 
more by the driver and continue.

Otherwise the box is most likely really unstable after this.

Regards,
Christian.

>
> Regards,
> Lang
>
>> Regards,
>> Christian.
>>
>>> Regards,
>>> Lang
>>>
>>>> Christian.
>>>>
>>>>> +
>>>>>     	return res;
>>>>>     }
>>>>>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] drm/amdgpu: add support to SMU debug option
  2021-12-01 11:29         ` Christian König
@ 2021-12-01 11:37           ` Yu, Lang
  2021-12-01 13:11             ` Christian König
  2021-12-02  2:47             ` Quan, Evan
  0 siblings, 2 replies; 18+ messages in thread
From: Yu, Lang @ 2021-12-01 11:37 UTC (permalink / raw)
  To: Koenig, Christian, Christian König, amd-gfx
  Cc: Deucher, Alexander, Lazar, Lijo, Huang, Ray

[AMD Official Use Only]



>-----Original Message-----
>From: Koenig, Christian <Christian.Koenig@amd.com>
>Sent: Wednesday, December 1, 2021 7:29 PM
>To: Yu, Lang <Lang.Yu@amd.com>; Christian König
><ckoenig.leichtzumerken@gmail.com>; amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
><Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>
>Am 01.12.21 um 12:20 schrieb Yu, Lang:
>> [AMD Official Use Only]
>>
>>> -----Original Message-----
>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>> Sent: Wednesday, December 1, 2021 6:49 PM
>>> To: Yu, Lang <Lang.Yu@amd.com>; Koenig, Christian
>>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
>>> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>
>>> Am 01.12.21 um 11:44 schrieb Yu, Lang:
>>>> [AMD Official Use Only]
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>> Sent: Wednesday, December 1, 2021 5:30 PM
>>>>> To: Yu, Lang <Lang.Yu@amd.com>; amd-gfx@lists.freedesktop.org
>>>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
>>>>> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>>>
>>>>> Am 01.12.21 um 10:24 schrieb Lang Yu:
>>>>>> To maintain system error state when SMU errors occurred, which
>>>>>> will aid in debugging SMU firmware issues, add SMU debug option support.
>>>>>>
>>>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
>>>>>> When enabled, it makes SMU errors fatal.
>>>>>> It is disabled by default.
>>>>>>
>>>>>> == Command Guide ==
>>>>>>
>>>>>> 1, enable SMU debug option
>>>>>>
>>>>>>     # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>>
>>>>>> 2, disable SMU debug option
>>>>>>
>>>>>>     # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>>
>>>>>> v3:
>>>>>>     - Use debugfs_create_bool().(Christian)
>>>>>>     - Put variable into smu_context struct.
>>>>>>     - Don't resend command when timeout.
>>>>>>
>>>>>> v2:
>>>>>>     - Resend command when timeout.(Lijo)
>>>>>>     - Use debugfs file instead of module parameter.
>>>>>>
>>>>>> Signed-off-by: Lang Yu <lang.yu@amd.com>
>>>>> Well the debugfs part looks really nice and clean now, but one more
>>>>> comment below.
>>>>>
>>>>>> ---
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
>>>>>>     drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
>>>>>>     drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>>>>>>     drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
>>>>>>     4 files changed, 17 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> index 164d6a9e9fbb..86cd888c7822 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
>>>>> *adev)
>>>>>>     	if (!debugfs_initialized())
>>>>>>     		return 0;
>>>>>>
>>>>>> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>>>>>> +				  &adev->smu.smu_debug_mode);
>>>>>> +
>>>>>>     	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root,
>adev,
>>>>>>     				  &fops_ib_preempt);
>>>>>>     	if (IS_ERR(ent)) {
>>>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>> index f738f7dc20c9..50dbf5594a9d 100644
>>>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>> @@ -569,6 +569,11 @@ struct smu_context
>>>>>>     	struct smu_user_dpm_profile user_dpm_profile;
>>>>>>
>>>>>>     	struct stb_context stb_context;
>>>>>> +	/*
>>>>>> +	 * When enabled, it makes SMU errors fatal.
>>>>>> +	 * (0 = disabled (default), 1 = enabled)
>>>>>> +	 */
>>>>>> +	bool smu_debug_mode;
>>>>>>     };
>>>>>>
>>>>>>     struct i2c_adapter;
>>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>> index 6e781cee8bb6..d3797a2d6451 100644
>>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>>>>> smu_context *smu)
>>>>>>     out:
>>>>>>     	mutex_unlock(&smu->message_lock);
>>>>>>
>>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>>>>>> +
>>>>>>     	return ret;
>>>>>>     }
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>> index 048ca1673863..9be005eb4241 100644
>>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>> @@ -349,15 +349,21 @@ int
>smu_cmn_send_smc_msg_with_param(struct
>>>>> smu_context *smu,
>>>>>>     		__smu_cmn_reg_print_error(smu, reg, index, param,
>msg);
>>>>>>     		goto Out;
>>>>>>     	}
>>>>>> +
>>>>>>     	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>>>>>>     	reg = __smu_cmn_poll_stat(smu);
>>>>>>     	res = __smu_cmn_reg2errno(smu, reg);
>>>>>> -	if (res != 0)
>>>>>> +	if (res != 0) {
>>>>>>     		__smu_cmn_reg_print_error(smu, reg, index, param,
>msg);
>>>>>> +		goto Out;
>>>>>> +	}
>>>>>>     	if (read_arg)
>>>>>>     		smu_cmn_read_arg(smu, read_arg);
>>>>>>     Out:
>>>>>>     	mutex_unlock(&smu->message_lock);
>>>>>> +
>>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
>>>>> BUG_ON() really crashes the kernel and is only allowed if we
>>>>> prevent further data corruption with that.
>>>>>
>>>>> Most of the time WARN_ON() is more appropriate, but I can't fully
>>>>> judge here since I don't know the SMU code well enough.
>>>> This is what SMU FW guys want. They want "user-visible (potentially
>>>> fatal)
>>> errors", then a hang.
>>>> They want to keep system state since the error occurred.
>>> Well that is rather problematic.
>>>
>>> First of all we need to really justify that, crashing the kernel is
>>> not something easily done.
>>>
>>> Then this isn't really effective here. What happens is that you crash
>>> the kernel thread of the currently executing process, but it is
>>> perfectly possible that another thread still tries to send messages
>>> to the SMU. You need to have the BUG_ON() before dropping the lock to
>>> make sure that this really gets the driver stuck in the current state.
>> Thanks. I got it. I just thought it is a kenel panic.
>> Could we use a panic() here?
>
>Potentially, but that might reboot the system automatically which is probably not
>what you want either.
>
>How does the SMU firmware team gather the necessary information when a
>problem occurs?

As far as I know, they usually use a HDT to collect information.
And they request a hang when error occurred in ticket.
"Suggested error responses include pop-up windows (by x86 driver, if this is possible) or simply hanging after logging the error. "

Regards,
Lang

>
>When they connect external hardware a BUG_ON() while holding the SMU lock
>might work, but if they want to SSH into the system you should probably rather
>set a flag that the hardware shouldn't be touched any more by the driver and
>continue.
>
>Otherwise the box is most likely really unstable after this.
>
>Regards,
>Christian.
>
>>
>> Regards,
>> Lang
>>
>>> Regards,
>>> Christian.
>>>
>>>> Regards,
>>>> Lang
>>>>
>>>>> Christian.
>>>>>
>>>>>> +
>>>>>>     	return res;
>>>>>>     }
>>>>>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] drm/amdgpu: add support to SMU debug option
  2021-12-01 11:37           ` Yu, Lang
@ 2021-12-01 13:11             ` Christian König
  2021-12-01 16:01               ` Andrey Grodzovsky
  2021-12-02  2:47             ` Quan, Evan
  1 sibling, 1 reply; 18+ messages in thread
From: Christian König @ 2021-12-01 13:11 UTC (permalink / raw)
  To: Yu, Lang, Koenig, Christian, amd-gfx, Andrey Grodzovsky
  Cc: Deucher, Alexander, Lazar, Lijo, Huang, Ray

Adding Andrey as well.

Am 01.12.21 um 12:37 schrieb Yu, Lang:
> [SNIP]
>>>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
>>>>>> BUG_ON() really crashes the kernel and is only allowed if we
>>>>>> prevent further data corruption with that.
>>>>>>
>>>>>> Most of the time WARN_ON() is more appropriate, but I can't fully
>>>>>> judge here since I don't know the SMU code well enough.
>>>>> This is what SMU FW guys want. They want "user-visible (potentially
>>>>> fatal)
>>>> errors", then a hang.
>>>>> They want to keep system state since the error occurred.
>>>> Well that is rather problematic.
>>>>
>>>> First of all we need to really justify that, crashing the kernel is
>>>> not something easily done.
>>>>
>>>> Then this isn't really effective here. What happens is that you crash
>>>> the kernel thread of the currently executing process, but it is
>>>> perfectly possible that another thread still tries to send messages
>>>> to the SMU. You need to have the BUG_ON() before dropping the lock to
>>>> make sure that this really gets the driver stuck in the current state.
>>> Thanks. I got it. I just thought it is a kenel panic.
>>> Could we use a panic() here?
>> Potentially, but that might reboot the system automatically which is probably not
>> what you want either.
>>
>> How does the SMU firmware team gather the necessary information when a
>> problem occurs?
> As far as I know, they usually use a HDT to collect information.
> And they request a hang when error occurred in ticket.
> "Suggested error responses include pop-up windows (by x86 driver, if this is possible) or simply hanging after logging the error."

In that case I suggest to set the "don't_touch_the_hardware_any_more" 
procedure we also use in case of PCIe hotplug.

Andrey has the details but essentially it stops the driver from touching 
the hardware any more, signals all fences and unblocks everything.

It should then be trivial to inspect the hardware state and see what's 
going on, but the system will keep stable at least for SSH access.

Might be a good idea to have that mode for other fault cases like page 
faults and hardware crashes.

Regards,
Christian.

>
> Regards,
> Lang
>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] drm/amdgpu: add support to SMU debug option
  2021-12-01 13:11             ` Christian König
@ 2021-12-01 16:01               ` Andrey Grodzovsky
  2021-12-02  9:31                 ` Yu, Lang
  0 siblings, 1 reply; 18+ messages in thread
From: Andrey Grodzovsky @ 2021-12-01 16:01 UTC (permalink / raw)
  To: Christian König, Yu, Lang, Koenig, Christian, amd-gfx
  Cc: Deucher, Alexander, Lazar, Lijo, Huang, Ray


On 2021-12-01 8:11 a.m., Christian König wrote:
> Adding Andrey as well.
>
> Am 01.12.21 um 12:37 schrieb Yu, Lang:
>> [SNIP]
>>>>>>>> + BUG_ON(unlikely(smu->smu_debug_mode) && res);
>>>>>>> BUG_ON() really crashes the kernel and is only allowed if we
>>>>>>> prevent further data corruption with that.
>>>>>>>
>>>>>>> Most of the time WARN_ON() is more appropriate, but I can't fully
>>>>>>> judge here since I don't know the SMU code well enough.
>>>>>> This is what SMU FW guys want. They want "user-visible (potentially
>>>>>> fatal)
>>>>> errors", then a hang.
>>>>>> They want to keep system state since the error occurred.
>>>>> Well that is rather problematic.
>>>>>
>>>>> First of all we need to really justify that, crashing the kernel is
>>>>> not something easily done.
>>>>>
>>>>> Then this isn't really effective here. What happens is that you crash
>>>>> the kernel thread of the currently executing process, but it is
>>>>> perfectly possible that another thread still tries to send messages
>>>>> to the SMU. You need to have the BUG_ON() before dropping the lock to
>>>>> make sure that this really gets the driver stuck in the current 
>>>>> state.
>>>> Thanks. I got it. I just thought it is a kenel panic.
>>>> Could we use a panic() here?
>>> Potentially, but that might reboot the system automatically which is 
>>> probably not
>>> what you want either.
>>>
>>> How does the SMU firmware team gather the necessary information when a
>>> problem occurs?
>> As far as I know, they usually use a HDT to collect information.
>> And they request a hang when error occurred in ticket.
>> "Suggested error responses include pop-up windows (by x86 driver, if 
>> this is possible) or simply hanging after logging the error."
>
> In that case I suggest to set the "don't_touch_the_hardware_any_more" 
> procedure we also use in case of PCIe hotplug.
>
> Andrey has the details but essentially it stops the driver from 
> touching the hardware any more, signals all fences and unblocks 
> everything.
>
> It should then be trivial to inspect the hardware state and see what's 
> going on, but the system will keep stable at least for SSH access.
>
> Might be a good idea to have that mode for other fault cases like page 
> faults and hardware crashes.
>
> Regards,
> Christian.


There is no one specific function that does all of that, what I think 
can be done is to bring the device to kind of halt state where no one 
touches it - as following -

1) Follow amdpgu_pci_remove -

     drm_dev_unplug to make device inaccessible to user space (IOCTLs 
e.t.c.) and clears MMIO mappings to device and disallows remappings 
through page faults

     No need to call all of amdgpu_driver_unload_kms but, within it call 
amdgpu_irq_disable_all and amdgpu_fence_driver_hw_fini toi disable 
interrupts and force signall all HW fences.

     pci_disable_device and pci_wait_for_pending_transaction to flush 
any in flight DMA operations from device

2) set adev->no_hw_access so that most of places we access HW (all 
subsequent registers reads/writes and SMU/PSP message sending is 
skipped, but some race will be with those already in progress so maybe 
adding some wait)

Andrey


>
>>
>> Regards,
>> Lang
>>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] drm/amdgpu: add support to SMU debug option
  2021-12-01 11:37           ` Yu, Lang
  2021-12-01 13:11             ` Christian König
@ 2021-12-02  2:47             ` Quan, Evan
  2021-12-02  3:12               ` Yu, Lang
  1 sibling, 1 reply; 18+ messages in thread
From: Quan, Evan @ 2021-12-02  2:47 UTC (permalink / raw)
  To: Yu, Lang, Koenig, Christian, Christian König, amd-gfx
  Cc: Deucher, Alexander, Lazar, Lijo, Huang, Ray

[AMD Official Use Only]



> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Yu,
> Lang
> Sent: Wednesday, December 1, 2021 7:37 PM
> To: Koenig, Christian <Christian.Koenig@amd.com>; Christian König
> <ckoenig.leichtzumerken@gmail.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
> Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option
> 
> [AMD Official Use Only]
> 
> 
> 
> >-----Original Message-----
> >From: Koenig, Christian <Christian.Koenig@amd.com>
> >Sent: Wednesday, December 1, 2021 7:29 PM
> >To: Yu, Lang <Lang.Yu@amd.com>; Christian König
> ><ckoenig.leichtzumerken@gmail.com>; amd-gfx@lists.freedesktop.org
> >Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
> ><Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
> >Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
> >
> >Am 01.12.21 um 12:20 schrieb Yu, Lang:
> >> [AMD Official Use Only]
> >>
> >>> -----Original Message-----
> >>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> >>> Sent: Wednesday, December 1, 2021 6:49 PM
> >>> To: Yu, Lang <Lang.Yu@amd.com>; Koenig, Christian
> >>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> >>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
> >>> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
> >>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
> >>>
> >>> Am 01.12.21 um 11:44 schrieb Yu, Lang:
> >>>> [AMD Official Use Only]
> >>>>
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
> >>>>> Sent: Wednesday, December 1, 2021 5:30 PM
> >>>>> To: Yu, Lang <Lang.Yu@amd.com>; amd-gfx@lists.freedesktop.org
> >>>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
> >>>>> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
> >>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
> >>>>>
> >>>>> Am 01.12.21 um 10:24 schrieb Lang Yu:
> >>>>>> To maintain system error state when SMU errors occurred, which
> >>>>>> will aid in debugging SMU firmware issues, add SMU debug option
> support.
> >>>>>>
> >>>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
> >>>>>> When enabled, it makes SMU errors fatal.
> >>>>>> It is disabled by default.
> >>>>>>
> >>>>>> == Command Guide ==
> >>>>>>
> >>>>>> 1, enable SMU debug option
> >>>>>>
> >>>>>>     # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> >>>>>>
> >>>>>> 2, disable SMU debug option
> >>>>>>
> >>>>>>     # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> >>>>>>
> >>>>>> v3:
> >>>>>>     - Use debugfs_create_bool().(Christian)
> >>>>>>     - Put variable into smu_context struct.
> >>>>>>     - Don't resend command when timeout.
> >>>>>>
> >>>>>> v2:
> >>>>>>     - Resend command when timeout.(Lijo)
> >>>>>>     - Use debugfs file instead of module parameter.
> >>>>>>
> >>>>>> Signed-off-by: Lang Yu <lang.yu@amd.com>
> >>>>> Well the debugfs part looks really nice and clean now, but one
> >>>>> more comment below.
> >>>>>
> >>>>>> ---
> >>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
> >>>>>>     drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
> >>>>>>     drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
> >>>>>>     drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
> >>>>>>     4 files changed, 17 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> >>>>>> index 164d6a9e9fbb..86cd888c7822 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> >>>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct
> >>>>>> amdgpu_device
> >>>>> *adev)
> >>>>>>     	if (!debugfs_initialized())
> >>>>>>     		return 0;
> >>>>>>
> >>>>>> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> >>>>>> +				  &adev->smu.smu_debug_mode);
> >>>>>> +
> >>>>>>     	ent = debugfs_create_file("amdgpu_preempt_ib", 0600,
> root,
> >adev,
> >>>>>>     				  &fops_ib_preempt);
> >>>>>>     	if (IS_ERR(ent)) {
> >>>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>>>>> index f738f7dc20c9..50dbf5594a9d 100644
> >>>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>>>>> @@ -569,6 +569,11 @@ struct smu_context
> >>>>>>     	struct smu_user_dpm_profile user_dpm_profile;
> >>>>>>
> >>>>>>     	struct stb_context stb_context;
> >>>>>> +	/*
> >>>>>> +	 * When enabled, it makes SMU errors fatal.
> >>>>>> +	 * (0 = disabled (default), 1 = enabled)
> >>>>>> +	 */
> >>>>>> +	bool smu_debug_mode;
> >>>>>>     };
> >>>>>>
> >>>>>>     struct i2c_adapter;
> >>>>>> diff --git
> a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>>>>> index 6e781cee8bb6..d3797a2d6451 100644
> >>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>>>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
> >>>>> smu_context *smu)
> >>>>>>     out:
> >>>>>>     	mutex_unlock(&smu->message_lock);
> >>>>>>
> >>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
> >>>>>> +
> >>>>>>     	return ret;
> >>>>>>     }
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> >>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> >>>>>> index 048ca1673863..9be005eb4241 100644
> >>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> >>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> >>>>>> @@ -349,15 +349,21 @@ int
> >smu_cmn_send_smc_msg_with_param(struct
> >>>>> smu_context *smu,
> >>>>>>     		__smu_cmn_reg_print_error(smu, reg, index, param,
> >msg);
> >>>>>>     		goto Out;
> >>>>>>     	}
> >>>>>> +
> >>>>>>     	__smu_cmn_send_msg(smu, (uint16_t) index, param);
> >>>>>>     	reg = __smu_cmn_poll_stat(smu);
> >>>>>>     	res = __smu_cmn_reg2errno(smu, reg);
> >>>>>> -	if (res != 0)
> >>>>>> +	if (res != 0) {
> >>>>>>     		__smu_cmn_reg_print_error(smu, reg, index, param,
> >msg);
> >>>>>> +		goto Out;
> >>>>>> +	}
> >>>>>>     	if (read_arg)
> >>>>>>     		smu_cmn_read_arg(smu, read_arg);
> >>>>>>     Out:
> >>>>>>     	mutex_unlock(&smu->message_lock);
> >>>>>> +
> >>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
> >>>>> BUG_ON() really crashes the kernel and is only allowed if we
> >>>>> prevent further data corruption with that.
> >>>>>
> >>>>> Most of the time WARN_ON() is more appropriate, but I can't fully
> >>>>> judge here since I don't know the SMU code well enough.
> >>>> This is what SMU FW guys want. They want "user-visible (potentially
> >>>> fatal)
> >>> errors", then a hang.
> >>>> They want to keep system state since the error occurred.
> >>> Well that is rather problematic.
> >>>
> >>> First of all we need to really justify that, crashing the kernel is
> >>> not something easily done.
> >>>
> >>> Then this isn't really effective here. What happens is that you
> >>> crash the kernel thread of the currently executing process, but it
> >>> is perfectly possible that another thread still tries to send
> >>> messages to the SMU. You need to have the BUG_ON() before dropping
> >>> the lock to make sure that this really gets the driver stuck in the current
> state.
> >> Thanks. I got it. I just thought it is a kenel panic.
> >> Could we use a panic() here?
> >
> >Potentially, but that might reboot the system automatically which is
> >probably not what you want either.
> >
> >How does the SMU firmware team gather the necessary information when
> a
> >problem occurs?
> 
> As far as I know, they usually use a HDT to collect information.
> And they request a hang when error occurred in ticket.
> "Suggested error responses include pop-up windows (by x86 driver, if this is
> possible) or simply hanging after logging the error. "
[Quan, Evan] Maybe what they want is just a stable SMU state(like no more message issuing after failure).
If that's true, I think you can just bail out on __smu_cmn_poll_stat() failure(in smu_cmn_send_smc_msg_with_param() and smu_cmn_send_msg_without_waiting()).
That will prevent further message issuing to SMU.

BR
Evan
> 
> Regards,
> Lang
> 
> >
> >When they connect external hardware a BUG_ON() while holding the SMU
> >lock might work, but if they want to SSH into the system you should
> >probably rather set a flag that the hardware shouldn't be touched any
> >more by the driver and continue.
> >
> >Otherwise the box is most likely really unstable after this.
> >
> >Regards,
> >Christian.
> >
> >>
> >> Regards,
> >> Lang
> >>
> >>> Regards,
> >>> Christian.
> >>>
> >>>> Regards,
> >>>> Lang
> >>>>
> >>>>> Christian.
> >>>>>
> >>>>>> +
> >>>>>>     	return res;
> >>>>>>     }
> >>>>>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] drm/amdgpu: add support to SMU debug option
  2021-12-02  2:47             ` Quan, Evan
@ 2021-12-02  3:12               ` Yu, Lang
  2021-12-02  7:18                 ` Lazar, Lijo
  0 siblings, 1 reply; 18+ messages in thread
From: Yu, Lang @ 2021-12-02  3:12 UTC (permalink / raw)
  To: Quan, Evan, Koenig, Christian, Christian König, amd-gfx
  Cc: Deucher, Alexander, Lazar, Lijo, Huang, Ray

[AMD Official Use Only]



>-----Original Message-----
>From: Quan, Evan <Evan.Quan@amd.com>
>Sent: Thursday, December 2, 2021 10:48 AM
>To: Yu, Lang <Lang.Yu@amd.com>; Koenig, Christian
><Christian.Koenig@amd.com>; Christian König
><ckoenig.leichtzumerken@gmail.com>; amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
><Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option
>
>[AMD Official Use Only]
>
>
>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Yu,
>> Lang
>> Sent: Wednesday, December 1, 2021 7:37 PM
>> To: Koenig, Christian <Christian.Koenig@amd.com>; Christian König
>> <ckoenig.leichtzumerken@gmail.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
>> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>> Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option
>>
>> [AMD Official Use Only]
>>
>>
>>
>> >-----Original Message-----
>> >From: Koenig, Christian <Christian.Koenig@amd.com>
>> >Sent: Wednesday, December 1, 2021 7:29 PM
>> >To: Yu, Lang <Lang.Yu@amd.com>; Christian König
>> ><ckoenig.leichtzumerken@gmail.com>; amd-gfx@lists.freedesktop.org
>> >Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
>> ><Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>> >Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>> >
>> >Am 01.12.21 um 12:20 schrieb Yu, Lang:
>> >> [AMD Official Use Only]
>> >>
>> >>> -----Original Message-----
>> >>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>> >>> Sent: Wednesday, December 1, 2021 6:49 PM
>> >>> To: Yu, Lang <Lang.Yu@amd.com>; Koenig, Christian
>> >>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>> >>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
>> >>> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>> >>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>> >>>
>> >>> Am 01.12.21 um 11:44 schrieb Yu, Lang:
>> >>>> [AMD Official Use Only]
>> >>>>
>> >>>>
>> >>>>
>> >>>>> -----Original Message-----
>> >>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> >>>>> Sent: Wednesday, December 1, 2021 5:30 PM
>> >>>>> To: Yu, Lang <Lang.Yu@amd.com>; amd-gfx@lists.freedesktop.org
>> >>>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
>> >>>>> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>> >>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>> >>>>>
>> >>>>> Am 01.12.21 um 10:24 schrieb Lang Yu:
>> >>>>>> To maintain system error state when SMU errors occurred, which
>> >>>>>> will aid in debugging SMU firmware issues, add SMU debug option
>> support.
>> >>>>>>
>> >>>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
>> >>>>>> When enabled, it makes SMU errors fatal.
>> >>>>>> It is disabled by default.
>> >>>>>>
>> >>>>>> == Command Guide ==
>> >>>>>>
>> >>>>>> 1, enable SMU debug option
>> >>>>>>
>> >>>>>>     # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>> >>>>>>
>> >>>>>> 2, disable SMU debug option
>> >>>>>>
>> >>>>>>     # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>> >>>>>>
>> >>>>>> v3:
>> >>>>>>     - Use debugfs_create_bool().(Christian)
>> >>>>>>     - Put variable into smu_context struct.
>> >>>>>>     - Don't resend command when timeout.
>> >>>>>>
>> >>>>>> v2:
>> >>>>>>     - Resend command when timeout.(Lijo)
>> >>>>>>     - Use debugfs file instead of module parameter.
>> >>>>>>
>> >>>>>> Signed-off-by: Lang Yu <lang.yu@amd.com>
>> >>>>> Well the debugfs part looks really nice and clean now, but one
>> >>>>> more comment below.
>> >>>>>
>> >>>>>> ---
>> >>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
>> >>>>>>     drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
>> >>>>>>     drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>> >>>>>>     drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
>> >>>>>>     4 files changed, 17 insertions(+), 1 deletion(-)
>> >>>>>>
>> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> >>>>>> index 164d6a9e9fbb..86cd888c7822 100644
>> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> >>>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct
>> >>>>>> amdgpu_device
>> >>>>> *adev)
>> >>>>>>     	if (!debugfs_initialized())
>> >>>>>>     		return 0;
>> >>>>>>
>> >>>>>> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>> >>>>>> +				  &adev->smu.smu_debug_mode);
>> >>>>>> +
>> >>>>>>     	ent = debugfs_create_file("amdgpu_preempt_ib", 0600,
>> root,
>> >adev,
>> >>>>>>     				  &fops_ib_preempt);
>> >>>>>>     	if (IS_ERR(ent)) {
>> >>>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> >>>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> >>>>>> index f738f7dc20c9..50dbf5594a9d 100644
>> >>>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> >>>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> >>>>>> @@ -569,6 +569,11 @@ struct smu_context
>> >>>>>>     	struct smu_user_dpm_profile user_dpm_profile;
>> >>>>>>
>> >>>>>>     	struct stb_context stb_context;
>> >>>>>> +	/*
>> >>>>>> +	 * When enabled, it makes SMU errors fatal.
>> >>>>>> +	 * (0 = disabled (default), 1 = enabled)
>> >>>>>> +	 */
>> >>>>>> +	bool smu_debug_mode;
>> >>>>>>     };
>> >>>>>>
>> >>>>>>     struct i2c_adapter;
>> >>>>>> diff --git
>> a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> >>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> >>>>>> index 6e781cee8bb6..d3797a2d6451 100644
>> >>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> >>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> >>>>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>> >>>>> smu_context *smu)
>> >>>>>>     out:
>> >>>>>>     	mutex_unlock(&smu->message_lock);
>> >>>>>>
>> >>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>> >>>>>> +
>> >>>>>>     	return ret;
>> >>>>>>     }
>> >>>>>>
>> >>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> >>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> >>>>>> index 048ca1673863..9be005eb4241 100644
>> >>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> >>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> >>>>>> @@ -349,15 +349,21 @@ int
>> >smu_cmn_send_smc_msg_with_param(struct
>> >>>>> smu_context *smu,
>> >>>>>>     		__smu_cmn_reg_print_error(smu, reg, index, param,
>> >msg);
>> >>>>>>     		goto Out;
>> >>>>>>     	}
>> >>>>>> +
>> >>>>>>     	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>> >>>>>>     	reg = __smu_cmn_poll_stat(smu);
>> >>>>>>     	res = __smu_cmn_reg2errno(smu, reg);
>> >>>>>> -	if (res != 0)
>> >>>>>> +	if (res != 0) {
>> >>>>>>     		__smu_cmn_reg_print_error(smu, reg, index, param,
>> >msg);
>> >>>>>> +		goto Out;
>> >>>>>> +	}
>> >>>>>>     	if (read_arg)
>> >>>>>>     		smu_cmn_read_arg(smu, read_arg);
>> >>>>>>     Out:
>> >>>>>>     	mutex_unlock(&smu->message_lock);
>> >>>>>> +
>> >>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
>> >>>>> BUG_ON() really crashes the kernel and is only allowed if we
>> >>>>> prevent further data corruption with that.
>> >>>>>
>> >>>>> Most of the time WARN_ON() is more appropriate, but I can't
>> >>>>> fully judge here since I don't know the SMU code well enough.
>> >>>> This is what SMU FW guys want. They want "user-visible
>> >>>> (potentially
>> >>>> fatal)
>> >>> errors", then a hang.
>> >>>> They want to keep system state since the error occurred.
>> >>> Well that is rather problematic.
>> >>>
>> >>> First of all we need to really justify that, crashing the kernel
>> >>> is not something easily done.
>> >>>
>> >>> Then this isn't really effective here. What happens is that you
>> >>> crash the kernel thread of the currently executing process, but it
>> >>> is perfectly possible that another thread still tries to send
>> >>> messages to the SMU. You need to have the BUG_ON() before dropping
>> >>> the lock to make sure that this really gets the driver stuck in
>> >>> the current
>> state.
>> >> Thanks. I got it. I just thought it is a kenel panic.
>> >> Could we use a panic() here?
>> >
>> >Potentially, but that might reboot the system automatically which is
>> >probably not what you want either.
>> >
>> >How does the SMU firmware team gather the necessary information when
>> a
>> >problem occurs?
>>
>> As far as I know, they usually use a HDT to collect information.
>> And they request a hang when error occurred in ticket.
>> "Suggested error responses include pop-up windows (by x86 driver, if
>> this is
>> possible) or simply hanging after logging the error. "
>[Quan, Evan] Maybe what they want is just a stable SMU state(like no more
>message issuing after failure).
>If that's true, I think you can just bail out on __smu_cmn_poll_stat() failure(in
>smu_cmn_send_smc_msg_with_param() and
>smu_cmn_send_msg_without_waiting()).
>That will prevent further message issuing to SMU.

But it's difficult to distinguish normal timeout cases(see aldebaran_mode2_reset()).
Sometimes it's reasonable to return a timeout. The user wants to use a longer timeout.

Regards,
Lang

>
>BR
>Evan
>>
>> Regards,
>> Lang
>>
>> >
>> >When they connect external hardware a BUG_ON() while holding the SMU
>> >lock might work, but if they want to SSH into the system you should
>> >probably rather set a flag that the hardware shouldn't be touched any
>> >more by the driver and continue.
>> >
>> >Otherwise the box is most likely really unstable after this.
>> >
>> >Regards,
>> >Christian.
>> >
>> >>
>> >> Regards,
>> >> Lang
>> >>
>> >>> Regards,
>> >>> Christian.
>> >>>
>> >>>> Regards,
>> >>>> Lang
>> >>>>
>> >>>>> Christian.
>> >>>>>
>> >>>>>> +
>> >>>>>>     	return res;
>> >>>>>>     }
>> >>>>>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] drm/amdgpu: add support to SMU debug option
  2021-12-02  3:12               ` Yu, Lang
@ 2021-12-02  7:18                 ` Lazar, Lijo
  0 siblings, 0 replies; 18+ messages in thread
From: Lazar, Lijo @ 2021-12-02  7:18 UTC (permalink / raw)
  To: Yu, Lang, Quan, Evan, Koenig, Christian, Christian König, amd-gfx
  Cc: Deucher, Alexander, Huang, Ray



On 12/2/2021 8:42 AM, Yu, Lang wrote:
> [AMD Official Use Only]
> 
> 
> 
>> -----Original Message-----
>> From: Quan, Evan <Evan.Quan@amd.com>
>> Sent: Thursday, December 2, 2021 10:48 AM
>> To: Yu, Lang <Lang.Yu@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>; Christian König
>> <ckoenig.leichtzumerken@gmail.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
>> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>> Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option
>>
>> [AMD Official Use Only]
>>
>>
>>
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Yu,
>>> Lang
>>> Sent: Wednesday, December 1, 2021 7:37 PM
>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Christian König
>>> <ckoenig.leichtzumerken@gmail.com>; amd-gfx@lists.freedesktop.org
>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
>>> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>>> Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option
>>>
>>> [AMD Official Use Only]
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>> Sent: Wednesday, December 1, 2021 7:29 PM
>>>> To: Yu, Lang <Lang.Yu@amd.com>; Christian König
>>>> <ckoenig.leichtzumerken@gmail.com>; amd-gfx@lists.freedesktop.org
>>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
>>>> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>>
>>>> Am 01.12.21 um 12:20 schrieb Yu, Lang:
>>>>> [AMD Official Use Only]
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>>>>> Sent: Wednesday, December 1, 2021 6:49 PM
>>>>>> To: Yu, Lang <Lang.Yu@amd.com>; Koenig, Christian
>>>>>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>>>>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
>>>>>> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>>>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>>>>
>>>>>> Am 01.12.21 um 11:44 schrieb Yu, Lang:
>>>>>>> [AMD Official Use Only]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>> Sent: Wednesday, December 1, 2021 5:30 PM
>>>>>>>> To: Yu, Lang <Lang.Yu@amd.com>; amd-gfx@lists.freedesktop.org
>>>>>>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
>>>>>>>> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>>>>>>
>>>>>>>> Am 01.12.21 um 10:24 schrieb Lang Yu:
>>>>>>>>> To maintain system error state when SMU errors occurred, which
>>>>>>>>> will aid in debugging SMU firmware issues, add SMU debug option
>>> support.
>>>>>>>>>
>>>>>>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
>>>>>>>>> When enabled, it makes SMU errors fatal.
>>>>>>>>> It is disabled by default.
>>>>>>>>>
>>>>>>>>> == Command Guide ==
>>>>>>>>>
>>>>>>>>> 1, enable SMU debug option
>>>>>>>>>
>>>>>>>>>      # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>>>>>
>>>>>>>>> 2, disable SMU debug option
>>>>>>>>>
>>>>>>>>>      # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>>>>>
>>>>>>>>> v3:
>>>>>>>>>      - Use debugfs_create_bool().(Christian)
>>>>>>>>>      - Put variable into smu_context struct.
>>>>>>>>>      - Don't resend command when timeout.
>>>>>>>>>
>>>>>>>>> v2:
>>>>>>>>>      - Resend command when timeout.(Lijo)
>>>>>>>>>      - Use debugfs file instead of module parameter.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Lang Yu <lang.yu@amd.com>
>>>>>>>> Well the debugfs part looks really nice and clean now, but one
>>>>>>>> more comment below.
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
>>>>>>>>>      drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
>>>>>>>>>      drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>>>>>>>>>      drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
>>>>>>>>>      4 files changed, 17 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>>> index 164d6a9e9fbb..86cd888c7822 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct
>>>>>>>>> amdgpu_device
>>>>>>>> *adev)
>>>>>>>>>      	if (!debugfs_initialized())
>>>>>>>>>      		return 0;
>>>>>>>>>
>>>>>>>>> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>>>>>>>>> +				  &adev->smu.smu_debug_mode);
>>>>>>>>> +
>>>>>>>>>      	ent = debugfs_create_file("amdgpu_preempt_ib", 0600,
>>> root,
>>>> adev,
>>>>>>>>>      				  &fops_ib_preempt);
>>>>>>>>>      	if (IS_ERR(ent)) {
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>>>>> index f738f7dc20c9..50dbf5594a9d 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>>>>> @@ -569,6 +569,11 @@ struct smu_context
>>>>>>>>>      	struct smu_user_dpm_profile user_dpm_profile;
>>>>>>>>>
>>>>>>>>>      	struct stb_context stb_context;
>>>>>>>>> +	/*
>>>>>>>>> +	 * When enabled, it makes SMU errors fatal.
>>>>>>>>> +	 * (0 = disabled (default), 1 = enabled)
>>>>>>>>> +	 */
>>>>>>>>> +	bool smu_debug_mode;
>>>>>>>>>      };
>>>>>>>>>
>>>>>>>>>      struct i2c_adapter;
>>>>>>>>> diff --git
>>> a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>>>>> index 6e781cee8bb6..d3797a2d6451 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>>>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>>>>>>>> smu_context *smu)
>>>>>>>>>      out:
>>>>>>>>>      	mutex_unlock(&smu->message_lock);
>>>>>>>>>
>>>>>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>>>>>>>>> +
>>>>>>>>>      	return ret;
>>>>>>>>>      }
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>>>>> index 048ca1673863..9be005eb4241 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>>>>> @@ -349,15 +349,21 @@ int
>>>> smu_cmn_send_smc_msg_with_param(struct
>>>>>>>> smu_context *smu,
>>>>>>>>>      		__smu_cmn_reg_print_error(smu, reg, index, param,
>>>> msg);
>>>>>>>>>      		goto Out;
>>>>>>>>>      	}
>>>>>>>>> +
>>>>>>>>>      	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>>>>>>>>>      	reg = __smu_cmn_poll_stat(smu);
>>>>>>>>>      	res = __smu_cmn_reg2errno(smu, reg);
>>>>>>>>> -	if (res != 0)
>>>>>>>>> +	if (res != 0) {
>>>>>>>>>      		__smu_cmn_reg_print_error(smu, reg, index, param,
>>>> msg);
>>>>>>>>> +		goto Out;
>>>>>>>>> +	}
>>>>>>>>>      	if (read_arg)
>>>>>>>>>      		smu_cmn_read_arg(smu, read_arg);
>>>>>>>>>      Out:
>>>>>>>>>      	mutex_unlock(&smu->message_lock);
>>>>>>>>> +
>>>>>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
>>>>>>>> BUG_ON() really crashes the kernel and is only allowed if we
>>>>>>>> prevent further data corruption with that.
>>>>>>>>
>>>>>>>> Most of the time WARN_ON() is more appropriate, but I can't
>>>>>>>> fully judge here since I don't know the SMU code well enough.
>>>>>>> This is what SMU FW guys want. They want "user-visible
>>>>>>> (potentially
>>>>>>> fatal)
>>>>>> errors", then a hang.
>>>>>>> They want to keep system state since the error occurred.
>>>>>> Well that is rather problematic.
>>>>>>
>>>>>> First of all we need to really justify that, crashing the kernel
>>>>>> is not something easily done.
>>>>>>
>>>>>> Then this isn't really effective here. What happens is that you
>>>>>> crash the kernel thread of the currently executing process, but it
>>>>>> is perfectly possible that another thread still tries to send
>>>>>> messages to the SMU. You need to have the BUG_ON() before dropping
>>>>>> the lock to make sure that this really gets the driver stuck in
>>>>>> the current
>>> state.
>>>>> Thanks. I got it. I just thought it is a kenel panic.
>>>>> Could we use a panic() here?
>>>>
>>>> Potentially, but that might reboot the system automatically which is
>>>> probably not what you want either.
>>>>
>>>> How does the SMU firmware team gather the necessary information when
>>> a
>>>> problem occurs?
>>>
>>> As far as I know, they usually use a HDT to collect information.
>>> And they request a hang when error occurred in ticket.
>>> "Suggested error responses include pop-up windows (by x86 driver, if
>>> this is
>>> possible) or simply hanging after logging the error. "
>> [Quan, Evan] Maybe what they want is just a stable SMU state(like no more
>> message issuing after failure).
>> If that's true, I think you can just bail out on __smu_cmn_poll_stat() failure(in
>> smu_cmn_send_smc_msg_with_param() and
>> smu_cmn_send_msg_without_waiting()).
>> That will prevent further message issuing to SMU.
> 
> But it's difficult to distinguish normal timeout cases(see aldebaran_mode2_reset()).
> Sometimes it's reasonable to return a timeout. The user wants to use a longer timeout.
> 

Other than driver, other FWs also will interact with PMFW and this 
itself may not take care of that. The ideal thing is to stop driver jobs 
and further execution to preserve the fail state. BUG() may be the 
easiest if the expectation is to use external hardware to check the fail 
state.

Thanks,
Lijo

> Regards,
> Lang
> 
>>
>> BR
>> Evan
>>>
>>> Regards,
>>> Lang
>>>
>>>>
>>>> When they connect external hardware a BUG_ON() while holding the SMU
>>>> lock might work, but if they want to SSH into the system you should
>>>> probably rather set a flag that the hardware shouldn't be touched any
>>>> more by the driver and continue.
>>>>
>>>> Otherwise the box is most likely really unstable after this.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Regards,
>>>>> Lang
>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> Regards,
>>>>>>> Lang
>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>> +
>>>>>>>>>      	return res;
>>>>>>>>>      }
>>>>>>>>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] drm/amdgpu: add support to SMU debug option
  2021-12-01 16:01               ` Andrey Grodzovsky
@ 2021-12-02  9:31                 ` Yu, Lang
  0 siblings, 0 replies; 18+ messages in thread
From: Yu, Lang @ 2021-12-02  9:31 UTC (permalink / raw)
  To: Grodzovsky, Andrey, Christian König, Koenig, Christian, amd-gfx
  Cc: Deucher, Alexander, Lazar, Lijo, Huang, Ray

[AMD Official Use Only]



>-----Original Message-----
>From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>Sent: Thursday, December 2, 2021 12:01 AM
>To: Christian König <ckoenig.leichtzumerken@gmail.com>; Yu, Lang
><Lang.Yu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-
>gfx@lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo
><Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>
>Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>
>
>On 2021-12-01 8:11 a.m., Christian König wrote:
>> Adding Andrey as well.
>>
>> Am 01.12.21 um 12:37 schrieb Yu, Lang:
>>> [SNIP]
>>>>>>>>> + BUG_ON(unlikely(smu->smu_debug_mode) && res);
>>>>>>>> BUG_ON() really crashes the kernel and is only allowed if we
>>>>>>>> prevent further data corruption with that.
>>>>>>>>
>>>>>>>> Most of the time WARN_ON() is more appropriate, but I can't
>>>>>>>> fully judge here since I don't know the SMU code well enough.
>>>>>>> This is what SMU FW guys want. They want "user-visible
>>>>>>> (potentially
>>>>>>> fatal)
>>>>>> errors", then a hang.
>>>>>>> They want to keep system state since the error occurred.
>>>>>> Well that is rather problematic.
>>>>>>
>>>>>> First of all we need to really justify that, crashing the kernel
>>>>>> is not something easily done.
>>>>>>
>>>>>> Then this isn't really effective here. What happens is that you
>>>>>> crash the kernel thread of the currently executing process, but it
>>>>>> is perfectly possible that another thread still tries to send
>>>>>> messages to the SMU. You need to have the BUG_ON() before dropping
>>>>>> the lock to make sure that this really gets the driver stuck in
>>>>>> the current state.
>>>>> Thanks. I got it. I just thought it is a kenel panic.
>>>>> Could we use a panic() here?
>>>> Potentially, but that might reboot the system automatically which is
>>>> probably not what you want either.
>>>>
>>>> How does the SMU firmware team gather the necessary information when
>>>> a problem occurs?
>>> As far as I know, they usually use a HDT to collect information.
>>> And they request a hang when error occurred in ticket.
>>> "Suggested error responses include pop-up windows (by x86 driver, if
>>> this is possible) or simply hanging after logging the error."
>>
>> In that case I suggest to set the "don't_touch_the_hardware_any_more"
>> procedure we also use in case of PCIe hotplug.
>>
>> Andrey has the details but essentially it stops the driver from
>> touching the hardware any more, signals all fences and unblocks
>> everything.
>>
>> It should then be trivial to inspect the hardware state and see what's
>> going on, but the system will keep stable at least for SSH access.
>>
>> Might be a good idea to have that mode for other fault cases like page
>> faults and hardware crashes.
>>
>> Regards,
>> Christian.
>
>
>There is no one specific function that does all of that, what I think can be done is
>to bring the device to kind of halt state where no one touches it - as following -
>
>1) Follow amdpgu_pci_remove -
>
>     drm_dev_unplug to make device inaccessible to user space (IOCTLs
>e.t.c.) and clears MMIO mappings to device and disallows remappings through
>page faults
>
>     No need to call all of amdgpu_driver_unload_kms but, within it call
>amdgpu_irq_disable_all and amdgpu_fence_driver_hw_fini toi disable interrupts
>and force signall all HW fences.
>
>     pci_disable_device and pci_wait_for_pending_transaction to flush any in flight
>DMA operations from device
>
>2) set adev->no_hw_access so that most of places we access HW (all subsequent
>registers reads/writes and SMU/PSP message sending is skipped, but some race
>will be with those already in progress so maybe adding some wait)
>
>Andrey

Thanks for Christian's advice and Andrey's clarifications about that.
It seems that we should also handle kfd related stuff.

Regards,
Lang

>
>
>>
>>>
>>> Regards,
>>> Lang
>>>
>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-12-02  9:46 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-01  9:24 [PATCH] drm/amdgpu: add support to SMU debug option Lang Yu
2021-12-01  9:29 ` Christian König
2021-12-01 10:44   ` Yu, Lang
2021-12-01 10:48     ` Christian König
2021-12-01 11:20       ` Yu, Lang
2021-12-01 11:29         ` Christian König
2021-12-01 11:37           ` Yu, Lang
2021-12-01 13:11             ` Christian König
2021-12-01 16:01               ` Andrey Grodzovsky
2021-12-02  9:31                 ` Yu, Lang
2021-12-02  2:47             ` Quan, Evan
2021-12-02  3:12               ` Yu, Lang
2021-12-02  7:18                 ` Lazar, Lijo
2021-12-01  9:44 ` Huang Rui
2021-12-01  9:47 ` Lazar, Lijo
2021-12-01 10:38   ` Yu, Lang
2021-12-01 10:46     ` Lazar, Lijo
2021-12-01 10:56       ` Yu, Lang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.