All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device
@ 2021-12-09  8:49 Lang Yu
  2021-12-09  8:49 ` [PATCH 2/2] drm/amdgpu: add support for SMU debug option Lang Yu
  2021-12-09  9:00 ` [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device Christian König
  0 siblings, 2 replies; 13+ messages in thread
From: Lang Yu @ 2021-12-09  8:49 UTC (permalink / raw)
  To: amd-gfx
  Cc: Andrey Grodzovsky, Lijo Lazar, Huang Rui, Alex Deucher, Lang Yu,
	Christian Koenig

It is useful to maintain error context when debugging
SW/FW issues. We introduce amdgpu_device_halt() for this
purpose. It will bring hardware to a kind of halt state,
so that no one can touch it any more.

Compare to a simple hang, the system will keep stable
at least for SSH access. Then it should be trivial to
inspect the hardware state and see what's going on.

Suggested-by: Christian Koenig <christian.koenig@amd.com>
Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Lang Yu <lang.yu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39 ++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c5cfe2926ca1..3f5f8f62aa5c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct amdgpu_device *adev,
 void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
 		struct amdgpu_ring *ring);
 
+void amdgpu_device_halt(struct amdgpu_device *adev);
+
 /* atpx handler */
 #if defined(CONFIG_VGA_SWITCHEROO)
 void amdgpu_register_atpx_handler(void);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a1c14466f23d..62216627cc83 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
 
 	amdgpu_asic_invalidate_hdp(adev, ring);
 }
+
+/**
+ * amdgpu_device_halt() - bring hardware to some kind of halt state
+ *
+ * @adev: amdgpu_device pointer
+ *
+ * Bring hardware to some kind of halt state so that no one can touch it
+ * any more. It will help to maintain error context when error occurred.
+ * Compare to a simple hang, the system will keep stable at least for SSH
+ * access. Then it should be trivial to inspect the hardware state and
+ * see what's going on. Implemented as following:
+ *
+ * 1. drm_dev_unplug() makes device inaccessible to user space(IOCTLs, etc),
+ *    clears all CPU mappings to device, disallows remappings through page faults
+ * 2. amdgpu_irq_disable_all() disables all interrupts
+ * 3. amdgpu_fence_driver_hw_fini() signals all HW fences
+ * 4. amdgpu_device_unmap_mmio() clears all MMIO mappings
+ * 5. pci_disable_device() and pci_wait_for_pending_transaction()
+ *    flush any in flight DMA operations
+ * 6. set adev->no_hw_access to true
+ */
+void amdgpu_device_halt(struct amdgpu_device *adev)
+{
+	struct pci_dev *pdev = adev->pdev;
+	struct drm_device *ddev = &adev->ddev;
+
+	drm_dev_unplug(ddev);
+
+	amdgpu_irq_disable_all(adev);
+
+	amdgpu_fence_driver_hw_fini(adev);
+
+	amdgpu_device_unmap_mmio(adev);
+
+	pci_disable_device(pdev);
+	pci_wait_for_pending_transaction(pdev);
+
+	adev->no_hw_access = true;
+}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] drm/amdgpu: add support for SMU debug option
  2021-12-09  8:49 [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device Lang Yu
@ 2021-12-09  8:49 ` Lang Yu
  2021-12-10  2:07   ` Quan, Evan
  2021-12-09  9:00 ` [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device Christian König
  1 sibling, 1 reply; 13+ messages in thread
From: Lang Yu @ 2021-12-09  8:49 UTC (permalink / raw)
  To: amd-gfx
  Cc: Andrey Grodzovsky, Lijo Lazar, Huang Rui, Alex Deucher, Lang Yu,
	Christian Koenig

SMU firmware guys expect the driver maintains error context
and doesn't interact with SMU any more when SMU errors occurred.
That will aid in debugging SMU firmware issues.

Add SMU debug option support for this request, it can be
enabled or disabled via amdgpu_smu_debug debugfs file.
When enabled, it brings hardware to a kind of halt state
so that no one can touch it any more in the envent of SMU
errors.

Currently, dirver interacts with SMU via sending messages.
And threre are three ways to sending messages to SMU.
Handle them respectively as following:

1, smu_cmn_send_smc_msg_with_param() for normal timeout cases

  Halt on any error.

2, smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response()
for longer timeout cases

  Halt on errors apart from ETIME. Otherwise this way won't work.

3, smu_cmn_send_msg_without_waiting() for no waiting cases

  Halt on errors apart from ETIME. Otherwise second way won't work.

After halting, use BUG() to explicitly notify users.

== Command Guide ==

1, enable SMU debug option

 # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

2, disable SMU debug option

 # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug

v4:
 - Set to halt state instead of a simple hang.(Christian)

v3:
 - Use debugfs_create_bool().(Christian)
 - Put variable into smu_context struct.
 - Don't resend command when timeout.

v2:
 - Resend command when timeout.(Lijo)
 - Use debugfs file instead of module parameter.

Signed-off-by: Lang Yu <lang.yu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  3 +++
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h     |  5 +++++
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c      | 20 +++++++++++++++++++-
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 164d6a9e9fbb..86cd888c7822 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device *adev)
 	if (!debugfs_initialized())
 		return 0;
 
+	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
+				  &adev->smu.smu_debug_mode);
+
 	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
 				  &fops_ib_preempt);
 	if (IS_ERR(ent)) {
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index f738f7dc20c9..50dbf5594a9d 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -569,6 +569,11 @@ struct smu_context
 	struct smu_user_dpm_profile user_dpm_profile;
 
 	struct stb_context stb_context;
+	/*
+	 * When enabled, it makes SMU errors fatal.
+	 * (0 = disabled (default), 1 = enabled)
+	 */
+	bool smu_debug_mode;
 };
 
 struct i2c_adapter;
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index 048ca1673863..84016d22c075 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -272,6 +272,11 @@ int smu_cmn_send_msg_without_waiting(struct smu_context *smu,
 	__smu_cmn_send_msg(smu, msg_index, param);
 	res = 0;
 Out:
+	if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
+		amdgpu_device_halt(smu->adev);
+		BUG();
+	}
+
 	return res;
 }
 
@@ -288,9 +293,17 @@ int smu_cmn_send_msg_without_waiting(struct smu_context *smu,
 int smu_cmn_wait_for_response(struct smu_context *smu)
 {
 	u32 reg;
+	int res;
 
 	reg = __smu_cmn_poll_stat(smu);
-	return __smu_cmn_reg2errno(smu, reg);
+	res = __smu_cmn_reg2errno(smu, reg);
+
+	if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
+		amdgpu_device_halt(smu->adev);
+		BUG();
+	}
+
+	return res;
 }
 
 /**
@@ -357,6 +370,11 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
 	if (read_arg)
 		smu_cmn_read_arg(smu, read_arg);
 Out:
+	if (unlikely(smu->smu_debug_mode) && res) {
+		amdgpu_device_halt(smu->adev);
+		BUG();
+	}
+
 	mutex_unlock(&smu->message_lock);
 	return res;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device
  2021-12-09  8:49 [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device Lang Yu
  2021-12-09  8:49 ` [PATCH 2/2] drm/amdgpu: add support for SMU debug option Lang Yu
@ 2021-12-09  9:00 ` Christian König
  2021-12-09 15:38   ` Andrey Grodzovsky
  1 sibling, 1 reply; 13+ messages in thread
From: Christian König @ 2021-12-09  9:00 UTC (permalink / raw)
  To: Lang Yu, amd-gfx; +Cc: Alex Deucher, Andrey Grodzovsky, Lijo Lazar, Huang Rui



Am 09.12.21 um 09:49 schrieb Lang Yu:
> It is useful to maintain error context when debugging
> SW/FW issues. We introduce amdgpu_device_halt() for this
> purpose. It will bring hardware to a kind of halt state,
> so that no one can touch it any more.
>
> Compare to a simple hang, the system will keep stable
> at least for SSH access. Then it should be trivial to
> inspect the hardware state and see what's going on.
>
> Suggested-by: Christian Koenig <christian.koenig@amd.com>
> Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Signed-off-by: Lang Yu <lang.yu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39 ++++++++++++++++++++++
>   2 files changed, 41 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index c5cfe2926ca1..3f5f8f62aa5c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct amdgpu_device *adev,
>   void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
>   		struct amdgpu_ring *ring);
>   
> +void amdgpu_device_halt(struct amdgpu_device *adev);
> +
>   /* atpx handler */
>   #if defined(CONFIG_VGA_SWITCHEROO)
>   void amdgpu_register_atpx_handler(void);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index a1c14466f23d..62216627cc83 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
>   
>   	amdgpu_asic_invalidate_hdp(adev, ring);
>   }
> +
> +/**
> + * amdgpu_device_halt() - bring hardware to some kind of halt state
> + *
> + * @adev: amdgpu_device pointer
> + *
> + * Bring hardware to some kind of halt state so that no one can touch it
> + * any more. It will help to maintain error context when error occurred.
> + * Compare to a simple hang, the system will keep stable at least for SSH
> + * access. Then it should be trivial to inspect the hardware state and
> + * see what's going on. Implemented as following:
> + *
> + * 1. drm_dev_unplug() makes device inaccessible to user space(IOCTLs, etc),
> + *    clears all CPU mappings to device, disallows remappings through page faults
> + * 2. amdgpu_irq_disable_all() disables all interrupts
> + * 3. amdgpu_fence_driver_hw_fini() signals all HW fences
> + * 4. amdgpu_device_unmap_mmio() clears all MMIO mappings
> + * 5. pci_disable_device() and pci_wait_for_pending_transaction()
> + *    flush any in flight DMA operations
> + * 6. set adev->no_hw_access to true
> + */
> +void amdgpu_device_halt(struct amdgpu_device *adev)
> +{
> +	struct pci_dev *pdev = adev->pdev;
> +	struct drm_device *ddev = &adev->ddev;
> +
> +	drm_dev_unplug(ddev);
> +
> +	amdgpu_irq_disable_all(adev);
> +
> +	amdgpu_fence_driver_hw_fini(adev);
> +
> +	amdgpu_device_unmap_mmio(adev);
> +
> +	pci_disable_device(pdev);
> +	pci_wait_for_pending_transaction(pdev);
> +
> +	adev->no_hw_access = true;

I think we need to reorder this, e.g. set adev->no_hw_access much 
earlier for example. Andrey what do you think?

Apart from that sounds like the right idea to me.

Regards,
Christian.

> +}


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device
  2021-12-09  9:00 ` [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device Christian König
@ 2021-12-09 15:38   ` Andrey Grodzovsky
  2021-12-09 15:42     ` Christian König
  0 siblings, 1 reply; 13+ messages in thread
From: Andrey Grodzovsky @ 2021-12-09 15:38 UTC (permalink / raw)
  To: Christian König, Lang Yu, amd-gfx
  Cc: Alex Deucher, Lijo Lazar, Huang Rui


On 2021-12-09 4:00 a.m., Christian König wrote:
>
>
> Am 09.12.21 um 09:49 schrieb Lang Yu:
>> It is useful to maintain error context when debugging
>> SW/FW issues. We introduce amdgpu_device_halt() for this
>> purpose. It will bring hardware to a kind of halt state,
>> so that no one can touch it any more.
>>
>> Compare to a simple hang, the system will keep stable
>> at least for SSH access. Then it should be trivial to
>> inspect the hardware state and see what's going on.
>>
>> Suggested-by: Christian Koenig <christian.koenig@amd.com>
>> Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> Signed-off-by: Lang Yu <lang.yu@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39 ++++++++++++++++++++++
>>   2 files changed, 41 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index c5cfe2926ca1..3f5f8f62aa5c 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct 
>> amdgpu_device *adev,
>>   void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
>>           struct amdgpu_ring *ring);
>>   +void amdgpu_device_halt(struct amdgpu_device *adev);
>> +
>>   /* atpx handler */
>>   #if defined(CONFIG_VGA_SWITCHEROO)
>>   void amdgpu_register_atpx_handler(void);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index a1c14466f23d..62216627cc83 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct 
>> amdgpu_device *adev,
>>         amdgpu_asic_invalidate_hdp(adev, ring);
>>   }
>> +
>> +/**
>> + * amdgpu_device_halt() - bring hardware to some kind of halt state
>> + *
>> + * @adev: amdgpu_device pointer
>> + *
>> + * Bring hardware to some kind of halt state so that no one can 
>> touch it
>> + * any more. It will help to maintain error context when error 
>> occurred.
>> + * Compare to a simple hang, the system will keep stable at least 
>> for SSH
>> + * access. Then it should be trivial to inspect the hardware state and
>> + * see what's going on. Implemented as following:
>> + *
>> + * 1. drm_dev_unplug() makes device inaccessible to user 
>> space(IOCTLs, etc),
>> + *    clears all CPU mappings to device, disallows remappings 
>> through page faults
>> + * 2. amdgpu_irq_disable_all() disables all interrupts
>> + * 3. amdgpu_fence_driver_hw_fini() signals all HW fences
>> + * 4. amdgpu_device_unmap_mmio() clears all MMIO mappings
>> + * 5. pci_disable_device() and pci_wait_for_pending_transaction()
>> + *    flush any in flight DMA operations
>> + * 6. set adev->no_hw_access to true
>> + */
>> +void amdgpu_device_halt(struct amdgpu_device *adev)
>> +{
>> +    struct pci_dev *pdev = adev->pdev;
>> +    struct drm_device *ddev = &adev->ddev;
>> +
>> +    drm_dev_unplug(ddev);
>> +
>> +    amdgpu_irq_disable_all(adev);
>> +
>> +    amdgpu_fence_driver_hw_fini(adev);
>> +
>> +    amdgpu_device_unmap_mmio(adev);


Note that this one will cause page fault on any subsequent MMIO access 
(trough registers or by direct VRAM access)


>>
>> +
>> +    pci_disable_device(pdev);
>> +    pci_wait_for_pending_transaction(pdev);
>> +
>> +    adev->no_hw_access = true;
>
> I think we need to reorder this, e.g. set adev->no_hw_access much 
> earlier for example. Andrey what do you think?


Earlier can be ok but at least after the last HW configuration we 
actaully want to do like disabling IRQs.

Andrey

>
> Apart from that sounds like the right idea to me.
>
> Regards,
> Christian.
>
>> +}
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device
  2021-12-09 15:38   ` Andrey Grodzovsky
@ 2021-12-09 15:42     ` Christian König
  2021-12-10  3:47       ` Lang Yu
  0 siblings, 1 reply; 13+ messages in thread
From: Christian König @ 2021-12-09 15:42 UTC (permalink / raw)
  To: Andrey Grodzovsky, Lang Yu, amd-gfx; +Cc: Alex Deucher, Lijo Lazar, Huang Rui

Am 09.12.21 um 16:38 schrieb Andrey Grodzovsky:
>
> On 2021-12-09 4:00 a.m., Christian König wrote:
>>
>>
>> Am 09.12.21 um 09:49 schrieb Lang Yu:
>>> It is useful to maintain error context when debugging
>>> SW/FW issues. We introduce amdgpu_device_halt() for this
>>> purpose. It will bring hardware to a kind of halt state,
>>> so that no one can touch it any more.
>>>
>>> Compare to a simple hang, the system will keep stable
>>> at least for SSH access. Then it should be trivial to
>>> inspect the hardware state and see what's going on.
>>>
>>> Suggested-by: Christian Koenig <christian.koenig@amd.com>
>>> Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> Signed-off-by: Lang Yu <lang.yu@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39 
>>> ++++++++++++++++++++++
>>>   2 files changed, 41 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index c5cfe2926ca1..3f5f8f62aa5c 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct 
>>> amdgpu_device *adev,
>>>   void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
>>>           struct amdgpu_ring *ring);
>>>   +void amdgpu_device_halt(struct amdgpu_device *adev);
>>> +
>>>   /* atpx handler */
>>>   #if defined(CONFIG_VGA_SWITCHEROO)
>>>   void amdgpu_register_atpx_handler(void);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index a1c14466f23d..62216627cc83 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct 
>>> amdgpu_device *adev,
>>>         amdgpu_asic_invalidate_hdp(adev, ring);
>>>   }
>>> +
>>> +/**
>>> + * amdgpu_device_halt() - bring hardware to some kind of halt state
>>> + *
>>> + * @adev: amdgpu_device pointer
>>> + *
>>> + * Bring hardware to some kind of halt state so that no one can 
>>> touch it
>>> + * any more. It will help to maintain error context when error 
>>> occurred.
>>> + * Compare to a simple hang, the system will keep stable at least 
>>> for SSH
>>> + * access. Then it should be trivial to inspect the hardware state and
>>> + * see what's going on. Implemented as following:
>>> + *
>>> + * 1. drm_dev_unplug() makes device inaccessible to user 
>>> space(IOCTLs, etc),
>>> + *    clears all CPU mappings to device, disallows remappings 
>>> through page faults
>>> + * 2. amdgpu_irq_disable_all() disables all interrupts
>>> + * 3. amdgpu_fence_driver_hw_fini() signals all HW fences
>>> + * 4. amdgpu_device_unmap_mmio() clears all MMIO mappings
>>> + * 5. pci_disable_device() and pci_wait_for_pending_transaction()
>>> + *    flush any in flight DMA operations
>>> + * 6. set adev->no_hw_access to true
>>> + */
>>> +void amdgpu_device_halt(struct amdgpu_device *adev)
>>> +{
>>> +    struct pci_dev *pdev = adev->pdev;
>>> +    struct drm_device *ddev = &adev->ddev;
>>> +
>>> +    drm_dev_unplug(ddev);
>>> +
>>> +    amdgpu_irq_disable_all(adev);
>>> +
>>> +    amdgpu_fence_driver_hw_fini(adev);
>>> +
>>> +    amdgpu_device_unmap_mmio(adev);
>
>
> Note that this one will cause page fault on any subsequent MMIO access 
> (trough registers or by direct VRAM access)
>
>
>>>
>>> +
>>> +    pci_disable_device(pdev);
>>> +    pci_wait_for_pending_transaction(pdev);
>>> +
>>> +    adev->no_hw_access = true;
>>
>> I think we need to reorder this, e.g. set adev->no_hw_access much 
>> earlier for example. Andrey what do you think?
>
>
> Earlier can be ok but at least after the last HW configuration we 
> actaully want to do like disabling IRQs.

My thinking was to at least do this before we unmap the MMIO to avoid 
the crash.

Additionally to that we maybe don't even want to do this for this case.

Christian.

>
>
> Andrey
>
>>
>> Apart from that sounds like the right idea to me.
>>
>> Regards,
>> Christian.
>>
>>> +}
>>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
  2021-12-09  8:49 ` [PATCH 2/2] drm/amdgpu: add support for SMU debug option Lang Yu
@ 2021-12-10  2:07   ` Quan, Evan
  2021-12-10  2:33     ` Lang Yu
  0 siblings, 1 reply; 13+ messages in thread
From: Quan, Evan @ 2021-12-10  2:07 UTC (permalink / raw)
  To: Yu, Lang, amd-gfx
  Cc: Grodzovsky, Andrey, Lazar, Lijo, Huang, Ray, Deucher, Alexander,
	Yu, Lang, Koenig, Christian

[AMD Official Use Only]



> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Lang
> Yu
> Sent: Thursday, December 9, 2021 4:49 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Lazar, Lijo
> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>; Deucher,
> Alexander <Alexander.Deucher@amd.com>; Yu, Lang <Lang.Yu@amd.com>;
> Koenig, Christian <Christian.Koenig@amd.com>
> Subject: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> 
> SMU firmware guys expect the driver maintains error context
> and doesn't interact with SMU any more when SMU errors occurred.
> That will aid in debugging SMU firmware issues.
> 
> Add SMU debug option support for this request, it can be
> enabled or disabled via amdgpu_smu_debug debugfs file.
> When enabled, it brings hardware to a kind of halt state
> so that no one can touch it any more in the envent of SMU
> errors.
> 
> Currently, dirver interacts with SMU via sending messages.
> And threre are three ways to sending messages to SMU.
> Handle them respectively as following:
> 
> 1, smu_cmn_send_smc_msg_with_param() for normal timeout cases
> 
>   Halt on any error.
> 
> 2, smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response()
> for longer timeout cases
> 
>   Halt on errors apart from ETIME. Otherwise this way won't work.
> 
> 3, smu_cmn_send_msg_without_waiting() for no waiting cases
> 
>   Halt on errors apart from ETIME. Otherwise second way won't work.
> 
> After halting, use BUG() to explicitly notify users.
> 
> == Command Guide ==
> 
> 1, enable SMU debug option
> 
>  # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> 
> 2, disable SMU debug option
> 
>  # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> 
> v4:
>  - Set to halt state instead of a simple hang.(Christian)
> 
> v3:
>  - Use debugfs_create_bool().(Christian)
>  - Put variable into smu_context struct.
>  - Don't resend command when timeout.
> 
> v2:
>  - Resend command when timeout.(Lijo)
>  - Use debugfs file instead of module parameter.
> 
> Signed-off-by: Lang Yu <lang.yu@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  3 +++
>  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h     |  5 +++++
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c      | 20
> +++++++++++++++++++-
>  3 files changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index 164d6a9e9fbb..86cd888c7822 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
> *adev)
>  	if (!debugfs_initialized())
>  		return 0;
> 
> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> +				  &adev->smu.smu_debug_mode);
> +
>  	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
>  				  &fops_ib_preempt);
>  	if (IS_ERR(ent)) {
> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> index f738f7dc20c9..50dbf5594a9d 100644
> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> @@ -569,6 +569,11 @@ struct smu_context
>  	struct smu_user_dpm_profile user_dpm_profile;
> 
>  	struct stb_context stb_context;
> +	/*
> +	 * When enabled, it makes SMU errors fatal.
> +	 * (0 = disabled (default), 1 = enabled)
> +	 */
> +	bool smu_debug_mode;
[Quan, Evan] Can you expand this to bit mask(as ppfeaturemask)? So that in future we can add support for other debug features.
>  };
> 
>  struct i2c_adapter;
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> index 048ca1673863..84016d22c075 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> @@ -272,6 +272,11 @@ int smu_cmn_send_msg_without_waiting(struct
> smu_context *smu,
>  	__smu_cmn_send_msg(smu, msg_index, param);
>  	res = 0;
>  Out:
> +	if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
> +		amdgpu_device_halt(smu->adev);
> +		BUG();
[Quan, Evan] I agree amdgpu_device_halt() is a good idea. Christian and Andrey can share you more insights about that.
Do we still need the "BUG()" then? 

BR
Evan
> +	}
> +
>  	return res;
>  }
> 
> @@ -288,9 +293,17 @@ int smu_cmn_send_msg_without_waiting(struct
> smu_context *smu,
>  int smu_cmn_wait_for_response(struct smu_context *smu)
>  {
>  	u32 reg;
> +	int res;
> 
>  	reg = __smu_cmn_poll_stat(smu);
> -	return __smu_cmn_reg2errno(smu, reg);
> +	res = __smu_cmn_reg2errno(smu, reg);
> +
> +	if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
> +		amdgpu_device_halt(smu->adev);
> +		BUG();
> +	}
> +
> +	return res;
>  }
> 
>  /**
> @@ -357,6 +370,11 @@ int smu_cmn_send_smc_msg_with_param(struct
> smu_context *smu,
>  	if (read_arg)
>  		smu_cmn_read_arg(smu, read_arg);
>  Out:
> +	if (unlikely(smu->smu_debug_mode) && res) {
> +		amdgpu_device_halt(smu->adev);
> +		BUG();
> +	}
> +
>  	mutex_unlock(&smu->message_lock);
>  	return res;
>  }
> --
> 2.25.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
  2021-12-10  2:07   ` Quan, Evan
@ 2021-12-10  2:33     ` Lang Yu
  2021-12-10  2:52       ` Quan, Evan
  0 siblings, 1 reply; 13+ messages in thread
From: Lang Yu @ 2021-12-10  2:33 UTC (permalink / raw)
  To: Quan, Evan
  Cc: Grodzovsky, Andrey, Lazar, Lijo, amd-gfx, Huang, Ray, Deucher,
	Alexander, Koenig, Christian

On 12/10/ , Quan, Evan wrote:
> [AMD Official Use Only]
> 
> 
> 
> > -----Original Message-----
> > From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Lang
> > Yu
> > Sent: Thursday, December 9, 2021 4:49 PM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Lazar, Lijo
> > <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>; Deucher,
> > Alexander <Alexander.Deucher@amd.com>; Yu, Lang <Lang.Yu@amd.com>;
> > Koenig, Christian <Christian.Koenig@amd.com>
> > Subject: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> > 
> > SMU firmware guys expect the driver maintains error context
> > and doesn't interact with SMU any more when SMU errors occurred.
> > That will aid in debugging SMU firmware issues.
> > 
> > Add SMU debug option support for this request, it can be
> > enabled or disabled via amdgpu_smu_debug debugfs file.
> > When enabled, it brings hardware to a kind of halt state
> > so that no one can touch it any more in the envent of SMU
> > errors.
> > 
> > Currently, dirver interacts with SMU via sending messages.
> > And threre are three ways to sending messages to SMU.
> > Handle them respectively as following:
> > 
> > 1, smu_cmn_send_smc_msg_with_param() for normal timeout cases
> > 
> >   Halt on any error.
> > 
> > 2, smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response()
> > for longer timeout cases
> > 
> >   Halt on errors apart from ETIME. Otherwise this way won't work.
> > 
> > 3, smu_cmn_send_msg_without_waiting() for no waiting cases
> > 
> >   Halt on errors apart from ETIME. Otherwise second way won't work.
> > 
> > After halting, use BUG() to explicitly notify users.
> > 
> > == Command Guide ==
> > 
> > 1, enable SMU debug option
> > 
> >  # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> > 
> > 2, disable SMU debug option
> > 
> >  # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> > 
> > v4:
> >  - Set to halt state instead of a simple hang.(Christian)
> > 
> > v3:
> >  - Use debugfs_create_bool().(Christian)
> >  - Put variable into smu_context struct.
> >  - Don't resend command when timeout.
> > 
> > v2:
> >  - Resend command when timeout.(Lijo)
> >  - Use debugfs file instead of module parameter.
> > 
> > Signed-off-by: Lang Yu <lang.yu@amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  3 +++
> >  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h     |  5 +++++
> >  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c      | 20
> > +++++++++++++++++++-
> >  3 files changed, 27 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > index 164d6a9e9fbb..86cd888c7822 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
> > *adev)
> >  	if (!debugfs_initialized())
> >  		return 0;
> > 
> > +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> > +				  &adev->smu.smu_debug_mode);
> > +
> >  	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
> >  				  &fops_ib_preempt);
> >  	if (IS_ERR(ent)) {
> > diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > index f738f7dc20c9..50dbf5594a9d 100644
> > --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > @@ -569,6 +569,11 @@ struct smu_context
> >  	struct smu_user_dpm_profile user_dpm_profile;
> > 
> >  	struct stb_context stb_context;
> > +	/*
> > +	 * When enabled, it makes SMU errors fatal.
> > +	 * (0 = disabled (default), 1 = enabled)
> > +	 */
> > +	bool smu_debug_mode;
> [Quan, Evan] Can you expand this to bit mask(as ppfeaturemask)? So that in future we can add support for other debug features.
> >  };

OK.

> > 
> >  struct i2c_adapter;
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > index 048ca1673863..84016d22c075 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > @@ -272,6 +272,11 @@ int smu_cmn_send_msg_without_waiting(struct
> > smu_context *smu,
> >  	__smu_cmn_send_msg(smu, msg_index, param);
> >  	res = 0;
> >  Out:
> > +	if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
> > +		amdgpu_device_halt(smu->adev);
> > +		BUG();
> [Quan, Evan] I agree amdgpu_device_halt() is a good idea. Christian and Andrey can share you more insights about that.
> Do we still need the "BUG()" then? 

The BUG() is used to explicitly notify users something went 
wrong. Otherwise userspace may not know immediately. 
FW guys request this in ticket.

Regards,
Lang

> BR
> Evan
> > +	}
> > +
> >  	return res;
> >  }
> > 
> > @@ -288,9 +293,17 @@ int smu_cmn_send_msg_without_waiting(struct
> > smu_context *smu,
> >  int smu_cmn_wait_for_response(struct smu_context *smu)
> >  {
> >  	u32 reg;
> > +	int res;
> > 
> >  	reg = __smu_cmn_poll_stat(smu);
> > -	return __smu_cmn_reg2errno(smu, reg);
> > +	res = __smu_cmn_reg2errno(smu, reg);
> > +
> > +	if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
> > +		amdgpu_device_halt(smu->adev);
> > +		BUG();
> > +	}
> > +
> > +	return res;
> >  }
> > 
> >  /**
> > @@ -357,6 +370,11 @@ int smu_cmn_send_smc_msg_with_param(struct
> > smu_context *smu,
> >  	if (read_arg)
> >  		smu_cmn_read_arg(smu, read_arg);
> >  Out:
> > +	if (unlikely(smu->smu_debug_mode) && res) {
> > +		amdgpu_device_halt(smu->adev);
> > +		BUG();
> > +	}
> > +
> >  	mutex_unlock(&smu->message_lock);
> >  	return res;
> >  }
> > --
> > 2.25.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
  2021-12-10  2:33     ` Lang Yu
@ 2021-12-10  2:52       ` Quan, Evan
  2021-12-10  3:21         ` Lang Yu
  0 siblings, 1 reply; 13+ messages in thread
From: Quan, Evan @ 2021-12-10  2:52 UTC (permalink / raw)
  To: Yu, Lang
  Cc: Grodzovsky, Andrey, Lazar, Lijo, amd-gfx, Huang, Ray, Deucher,
	Alexander, Koenig, Christian

[AMD Official Use Only]



> -----Original Message-----
> From: Yu, Lang <Lang.Yu@amd.com>
> Sent: Friday, December 10, 2021 10:34 AM
> To: Quan, Evan <Evan.Quan@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey
> <Andrey.Grodzovsky@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Huang,
> Ray <Ray.Huang@amd.com>; Deucher, Alexander
> <Alexander.Deucher@amd.com>; Koenig, Christian
> <Christian.Koenig@amd.com>
> Subject: Re: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> 
> On 12/10/ , Quan, Evan wrote:
> > [AMD Official Use Only]
> >
> >
> >
> > > -----Original Message-----
> > > From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
> > > Lang Yu
> > > Sent: Thursday, December 9, 2021 4:49 PM
> > > To: amd-gfx@lists.freedesktop.org
> > > Cc: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Lazar, Lijo
> > > <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>; Deucher,
> > > Alexander <Alexander.Deucher@amd.com>; Yu, Lang
> <Lang.Yu@amd.com>;
> > > Koenig, Christian <Christian.Koenig@amd.com>
> > > Subject: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> > >
> > > SMU firmware guys expect the driver maintains error context and
> > > doesn't interact with SMU any more when SMU errors occurred.
> > > That will aid in debugging SMU firmware issues.
> > >
> > > Add SMU debug option support for this request, it can be enabled or
> > > disabled via amdgpu_smu_debug debugfs file.
> > > When enabled, it brings hardware to a kind of halt state so that no
> > > one can touch it any more in the envent of SMU errors.
> > >
> > > Currently, dirver interacts with SMU via sending messages.
> > > And threre are three ways to sending messages to SMU.
> > > Handle them respectively as following:
> > >
> > > 1, smu_cmn_send_smc_msg_with_param() for normal timeout cases
> > >
> > >   Halt on any error.
> > >
> > > 2,
> smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response()
> > > for longer timeout cases
> > >
> > >   Halt on errors apart from ETIME. Otherwise this way won't work.
> > >
> > > 3, smu_cmn_send_msg_without_waiting() for no waiting cases
> > >
> > >   Halt on errors apart from ETIME. Otherwise second way won't work.
> > >
> > > After halting, use BUG() to explicitly notify users.
> > >
> > > == Command Guide ==
> > >
> > > 1, enable SMU debug option
> > >
> > >  # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> > >
> > > 2, disable SMU debug option
> > >
> > >  # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> > >
> > > v4:
> > >  - Set to halt state instead of a simple hang.(Christian)
> > >
> > > v3:
> > >  - Use debugfs_create_bool().(Christian)
> > >  - Put variable into smu_context struct.
> > >  - Don't resend command when timeout.
> > >
> > > v2:
> > >  - Resend command when timeout.(Lijo)
> > >  - Use debugfs file instead of module parameter.
> > >
> > > Signed-off-by: Lang Yu <lang.yu@amd.com>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  3 +++
> > >  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h     |  5 +++++
> > >  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c      | 20
> > > +++++++++++++++++++-
> > >  3 files changed, 27 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > index 164d6a9e9fbb..86cd888c7822 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct
> amdgpu_device
> > > *adev)
> > >  	if (!debugfs_initialized())
> > >  		return 0;
> > >
> > > +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> > > +				  &adev->smu.smu_debug_mode);
> > > +
> > >  	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
> > >  				  &fops_ib_preempt);
> > >  	if (IS_ERR(ent)) {
> > > diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > index f738f7dc20c9..50dbf5594a9d 100644
> > > --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > @@ -569,6 +569,11 @@ struct smu_context
> > >  	struct smu_user_dpm_profile user_dpm_profile;
> > >
> > >  	struct stb_context stb_context;
> > > +	/*
> > > +	 * When enabled, it makes SMU errors fatal.
> > > +	 * (0 = disabled (default), 1 = enabled)
> > > +	 */
> > > +	bool smu_debug_mode;
> > [Quan, Evan] Can you expand this to bit mask(as ppfeaturemask)? So that
> in future we can add support for other debug features.
> > >  };
> 
> OK.
> 
> > >
> > >  struct i2c_adapter;
> > > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > index 048ca1673863..84016d22c075 100644
> > > --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > @@ -272,6 +272,11 @@ int smu_cmn_send_msg_without_waiting(struct
> > > smu_context *smu,
> > >  	__smu_cmn_send_msg(smu, msg_index, param);
> > >  	res = 0;
> > >  Out:
> > > +	if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
> > > +		amdgpu_device_halt(smu->adev);
> > > +		BUG();
> > [Quan, Evan] I agree amdgpu_device_halt() is a good idea. Christian and
> Andrey can share you more insights about that.
> > Do we still need the "BUG()" then?
> 
> The BUG() is used to explicitly notify users something went wrong.
> Otherwise userspace may not know immediately.
> FW guys request this in ticket.
[Quan, Evan] Won't drm_dev_unplug() and pci_disable_device() used in amdgpu_device_halt throw some errors(on user's further attempt to communicate with our driver)?
Also if the purpose is to raise user's concern, WARN() may be a more gentle way?

BR
Evan
> 
> Regards,
> Lang
> 
> > BR
> > Evan
> > > +	}
> > > +
> > >  	return res;
> > >  }
> > >
> > > @@ -288,9 +293,17 @@ int smu_cmn_send_msg_without_waiting(struct
> > > smu_context *smu,
> > >  int smu_cmn_wait_for_response(struct smu_context *smu)  {
> > >  	u32 reg;
> > > +	int res;
> > >
> > >  	reg = __smu_cmn_poll_stat(smu);
> > > -	return __smu_cmn_reg2errno(smu, reg);
> > > +	res = __smu_cmn_reg2errno(smu, reg);
> > > +
> > > +	if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
> > > +		amdgpu_device_halt(smu->adev);
> > > +		BUG();
> > > +	}
> > > +
> > > +	return res;
> > >  }
> > >
> > >  /**
> > > @@ -357,6 +370,11 @@ int
> smu_cmn_send_smc_msg_with_param(struct
> > > smu_context *smu,
> > >  	if (read_arg)
> > >  		smu_cmn_read_arg(smu, read_arg);
> > >  Out:
> > > +	if (unlikely(smu->smu_debug_mode) && res) {
> > > +		amdgpu_device_halt(smu->adev);
> > > +		BUG();
> > > +	}
> > > +
> > >  	mutex_unlock(&smu->message_lock);
> > >  	return res;
> > >  }
> > > --
> > > 2.25.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
  2021-12-10  2:52       ` Quan, Evan
@ 2021-12-10  3:21         ` Lang Yu
  2021-12-10  7:04           ` Christian König
  0 siblings, 1 reply; 13+ messages in thread
From: Lang Yu @ 2021-12-10  3:21 UTC (permalink / raw)
  To: Quan, Evan
  Cc: Grodzovsky, Andrey, Lazar, Lijo, amd-gfx, Huang, Ray, Deucher,
	Alexander, Koenig, Christian

On 12/10/ , Quan, Evan wrote:
> [AMD Official Use Only]
> 
> 
> 
> > -----Original Message-----
> > From: Yu, Lang <Lang.Yu@amd.com>
> > Sent: Friday, December 10, 2021 10:34 AM
> > To: Quan, Evan <Evan.Quan@amd.com>
> > Cc: amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey
> > <Andrey.Grodzovsky@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Huang,
> > Ray <Ray.Huang@amd.com>; Deucher, Alexander
> > <Alexander.Deucher@amd.com>; Koenig, Christian
> > <Christian.Koenig@amd.com>
> > Subject: Re: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> > 
> > On 12/10/ , Quan, Evan wrote:
> > > [AMD Official Use Only]
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
> > > > Lang Yu
> > > > Sent: Thursday, December 9, 2021 4:49 PM
> > > > To: amd-gfx@lists.freedesktop.org
> > > > Cc: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Lazar, Lijo
> > > > <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>; Deucher,
> > > > Alexander <Alexander.Deucher@amd.com>; Yu, Lang
> > <Lang.Yu@amd.com>;
> > > > Koenig, Christian <Christian.Koenig@amd.com>
> > > > Subject: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> > > >
> > > > SMU firmware guys expect the driver maintains error context and
> > > > doesn't interact with SMU any more when SMU errors occurred.
> > > > That will aid in debugging SMU firmware issues.
> > > >
> > > > Add SMU debug option support for this request, it can be enabled or
> > > > disabled via amdgpu_smu_debug debugfs file.
> > > > When enabled, it brings hardware to a kind of halt state so that no
> > > > one can touch it any more in the envent of SMU errors.
> > > >
> > > > Currently, dirver interacts with SMU via sending messages.
> > > > And threre are three ways to sending messages to SMU.
> > > > Handle them respectively as following:
> > > >
> > > > 1, smu_cmn_send_smc_msg_with_param() for normal timeout cases
> > > >
> > > >   Halt on any error.
> > > >
> > > > 2,
> > smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response()
> > > > for longer timeout cases
> > > >
> > > >   Halt on errors apart from ETIME. Otherwise this way won't work.
> > > >
> > > > 3, smu_cmn_send_msg_without_waiting() for no waiting cases
> > > >
> > > >   Halt on errors apart from ETIME. Otherwise second way won't work.
> > > >
> > > > After halting, use BUG() to explicitly notify users.
> > > >
> > > > == Command Guide ==
> > > >
> > > > 1, enable SMU debug option
> > > >
> > > >  # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> > > >
> > > > 2, disable SMU debug option
> > > >
> > > >  # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> > > >
> > > > v4:
> > > >  - Set to halt state instead of a simple hang.(Christian)
> > > >
> > > > v3:
> > > >  - Use debugfs_create_bool().(Christian)
> > > >  - Put variable into smu_context struct.
> > > >  - Don't resend command when timeout.
> > > >
> > > > v2:
> > > >  - Resend command when timeout.(Lijo)
> > > >  - Use debugfs file instead of module parameter.
> > > >
> > > > Signed-off-by: Lang Yu <lang.yu@amd.com>
> > > > ---
> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  3 +++
> > > >  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h     |  5 +++++
> > > >  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c      | 20
> > > > +++++++++++++++++++-
> > > >  3 files changed, 27 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > index 164d6a9e9fbb..86cd888c7822 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct
> > amdgpu_device
> > > > *adev)
> > > >  	if (!debugfs_initialized())
> > > >  		return 0;
> > > >
> > > > +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> > > > +				  &adev->smu.smu_debug_mode);
> > > > +
> > > >  	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
> > > >  				  &fops_ib_preempt);
> > > >  	if (IS_ERR(ent)) {
> > > > diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > index f738f7dc20c9..50dbf5594a9d 100644
> > > > --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > @@ -569,6 +569,11 @@ struct smu_context
> > > >  	struct smu_user_dpm_profile user_dpm_profile;
> > > >
> > > >  	struct stb_context stb_context;
> > > > +	/*
> > > > +	 * When enabled, it makes SMU errors fatal.
> > > > +	 * (0 = disabled (default), 1 = enabled)
> > > > +	 */
> > > > +	bool smu_debug_mode;
> > > [Quan, Evan] Can you expand this to bit mask(as ppfeaturemask)? So that
> > in future we can add support for other debug features.
> > > >  };
> > 
> > OK.
> > 
> > > >
> > > >  struct i2c_adapter;
> > > > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > > b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > > index 048ca1673863..84016d22c075 100644
> > > > --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > > @@ -272,6 +272,11 @@ int smu_cmn_send_msg_without_waiting(struct
> > > > smu_context *smu,
> > > >  	__smu_cmn_send_msg(smu, msg_index, param);
> > > >  	res = 0;
> > > >  Out:
> > > > +	if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
> > > > +		amdgpu_device_halt(smu->adev);
> > > > +		BUG();
> > > [Quan, Evan] I agree amdgpu_device_halt() is a good idea. Christian and
> > Andrey can share you more insights about that.
> > > Do we still need the "BUG()" then?
> > 
> > The BUG() is used to explicitly notify users something went wrong.
> > Otherwise userspace may not know immediately.
> > FW guys request this in ticket.
> [Quan, Evan] Won't drm_dev_unplug() and pci_disable_device() used in amdgpu_device_halt throw some errors(on user's further attempt to communicate with our driver)?
> Also if the purpose is to raise user's concern, WARN() may be a more gentle way?

From my testing and observation, it depends on what the driver will do next. 
Probably trigger a page fault. If we don't connect a monitor but SSH access.
We don't know what happended until something like a page fault is triggered.

But here what I want to do is throwing the error immediately to userspace.
If using WARN(), the user need to poll dmesg to see if something went wrong.

Regards,
Lang

> BR
> Evan
> > 
> > Regards,
> > Lang
> > 
> > > BR
> > > Evan
> > > > +	}
> > > > +
> > > >  	return res;
> > > >  }
> > > >
> > > > @@ -288,9 +293,17 @@ int smu_cmn_send_msg_without_waiting(struct
> > > > smu_context *smu,
> > > >  int smu_cmn_wait_for_response(struct smu_context *smu)  {
> > > >  	u32 reg;
> > > > +	int res;
> > > >
> > > >  	reg = __smu_cmn_poll_stat(smu);
> > > > -	return __smu_cmn_reg2errno(smu, reg);
> > > > +	res = __smu_cmn_reg2errno(smu, reg);
> > > > +
> > > > +	if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
> > > > +		amdgpu_device_halt(smu->adev);
> > > > +		BUG();
> > > > +	}
> > > > +
> > > > +	return res;
> > > >  }
> > > >
> > > >  /**
> > > > @@ -357,6 +370,11 @@ int
> > smu_cmn_send_smc_msg_with_param(struct
> > > > smu_context *smu,
> > > >  	if (read_arg)
> > > >  		smu_cmn_read_arg(smu, read_arg);
> > > >  Out:
> > > > +	if (unlikely(smu->smu_debug_mode) && res) {
> > > > +		amdgpu_device_halt(smu->adev);
> > > > +		BUG();
> > > > +	}
> > > > +
> > > >  	mutex_unlock(&smu->message_lock);
> > > >  	return res;
> > > >  }
> > > > --
> > > > 2.25.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device
  2021-12-09 15:42     ` Christian König
@ 2021-12-10  3:47       ` Lang Yu
  2021-12-10 15:45         ` Andrey Grodzovsky
  0 siblings, 1 reply; 13+ messages in thread
From: Lang Yu @ 2021-12-10  3:47 UTC (permalink / raw)
  To: Christian KKKnig
  Cc: Alex Deucher, Andrey Grodzovsky, Lijo Lazar, Huang Rui, amd-gfx

On 12/09/ , Christian KKKnig wrote:
> Am 09.12.21 um 16:38 schrieb Andrey Grodzovsky:
> > 
> > On 2021-12-09 4:00 a.m., Christian König wrote:
> > > 
> > > 
> > > Am 09.12.21 um 09:49 schrieb Lang Yu:
> > > > It is useful to maintain error context when debugging
> > > > SW/FW issues. We introduce amdgpu_device_halt() for this
> > > > purpose. It will bring hardware to a kind of halt state,
> > > > so that no one can touch it any more.
> > > > 
> > > > Compare to a simple hang, the system will keep stable
> > > > at least for SSH access. Then it should be trivial to
> > > > inspect the hardware state and see what's going on.
> > > > 
> > > > Suggested-by: Christian Koenig <christian.koenig@amd.com>
> > > > Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> > > > Signed-off-by: Lang Yu <lang.yu@amd.com>
> > > > ---
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39
> > > > ++++++++++++++++++++++
> > > >   2 files changed, 41 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > index c5cfe2926ca1..3f5f8f62aa5c 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > @@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct
> > > > amdgpu_device *adev,
> > > >   void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
> > > >           struct amdgpu_ring *ring);
> > > >   +void amdgpu_device_halt(struct amdgpu_device *adev);
> > > > +
> > > >   /* atpx handler */
> > > >   #if defined(CONFIG_VGA_SWITCHEROO)
> > > >   void amdgpu_register_atpx_handler(void);
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > index a1c14466f23d..62216627cc83 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > @@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct
> > > > amdgpu_device *adev,
> > > >         amdgpu_asic_invalidate_hdp(adev, ring);
> > > >   }
> > > > +
> > > > +/**
> > > > + * amdgpu_device_halt() - bring hardware to some kind of halt state
> > > > + *
> > > > + * @adev: amdgpu_device pointer
> > > > + *
> > > > + * Bring hardware to some kind of halt state so that no one can
> > > > touch it
> > > > + * any more. It will help to maintain error context when error
> > > > occurred.
> > > > + * Compare to a simple hang, the system will keep stable at
> > > > least for SSH
> > > > + * access. Then it should be trivial to inspect the hardware state and
> > > > + * see what's going on. Implemented as following:
> > > > + *
> > > > + * 1. drm_dev_unplug() makes device inaccessible to user
> > > > space(IOCTLs, etc),
> > > > + *    clears all CPU mappings to device, disallows remappings through
> > > > page faults
> > > > + * 2. amdgpu_irq_disable_all() disables all interrupts
> > > > + * 3. amdgpu_fence_driver_hw_fini() signals all HW fences
> > > > + * 4. amdgpu_device_unmap_mmio() clears all MMIO mappings
> > > > + * 5. pci_disable_device() and pci_wait_for_pending_transaction()
> > > > + *    flush any in flight DMA operations
> > > > + * 6. set adev->no_hw_access to true
> > > > + */
> > > > +void amdgpu_device_halt(struct amdgpu_device *adev)
> > > > +{
> > > > +    struct pci_dev *pdev = adev->pdev;
> > > > +    struct drm_device *ddev = &adev->ddev;
> > > > +
> > > > +    drm_dev_unplug(ddev);
> > > > +
> > > > +    amdgpu_irq_disable_all(adev);
> > > > +
> > > > +    amdgpu_fence_driver_hw_fini(adev);
> > > > +
> > > > +    amdgpu_device_unmap_mmio(adev);
> > 
> > 
> > Note that this one will cause page fault on any subsequent MMIO access
> > (trough registers or by direct VRAM access)
> > 
> > 
> > > > 
> > > > +
> > > > +    pci_disable_device(pdev);
> > > > +    pci_wait_for_pending_transaction(pdev);
> > > > +
> > > > +    adev->no_hw_access = true;
> > > 
> > > I think we need to reorder this, e.g. set adev->no_hw_access much
> > > earlier for example. Andrey what do you think?
> > 
> > 
> > Earlier can be ok but at least after the last HW configuration we
> > actaully want to do like disabling IRQs.
> 
> My thinking was to at least do this before we unmap the MMIO to avoid the
> crash.
> 
> Additionally to that we maybe don't even want to do this for this case.

So we just do "adev->no_hw_access = true;" before
"amdgpu_device_unmap_mmio(adev);".

That can avoid potential registers access page faults.
But direct VRAM access will still trigger page faults.

For example, 
"cat /sys/class/drm/card0/device/pp_od_clk_voltage"
will call smu_cmn_update_table and can still trigger 
a page fault.

smu_cmn_update_table()
{
 ...
	if (drv2smu) {
		memcpy(table->cpu_addr, table_data, table_size);
 ...
}

Regards,
Lang

> Christian.
> 
> > 
> > 
> > Andrey
> > 
> > > 
> > > Apart from that sounds like the right idea to me.
> > > 
> > > Regards,
> > > Christian.
> > > 
> > > > +}
> > > 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
  2021-12-10  3:21         ` Lang Yu
@ 2021-12-10  7:04           ` Christian König
  2021-12-10  7:49             ` Lang Yu
  0 siblings, 1 reply; 13+ messages in thread
From: Christian König @ 2021-12-10  7:04 UTC (permalink / raw)
  To: Lang Yu, Quan, Evan
  Cc: Deucher, Alexander, Grodzovsky, Andrey, Lazar, Lijo, Huang, Ray, amd-gfx

Am 10.12.21 um 04:21 schrieb Lang Yu:
> On 12/10/ , Quan, Evan wrote:
>> [AMD Official Use Only]
>>
>>
>>
>>> -----Original Message-----
>>> From: Yu, Lang <Lang.Yu@amd.com>
>>> Sent: Friday, December 10, 2021 10:34 AM
>>> To: Quan, Evan <Evan.Quan@amd.com>
>>> Cc: amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey
>>> <Andrey.Grodzovsky@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Huang,
>>> Ray <Ray.Huang@amd.com>; Deucher, Alexander
>>> <Alexander.Deucher@amd.com>; Koenig, Christian
>>> <Christian.Koenig@amd.com>
>>> Subject: Re: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
>>>
>>> On 12/10/ , Quan, Evan wrote:
>>>> [AMD Official Use Only]
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
>>>>> Lang Yu
>>>>> Sent: Thursday, December 9, 2021 4:49 PM
>>>>> To: amd-gfx@lists.freedesktop.org
>>>>> Cc: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Lazar, Lijo
>>>>> <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>; Deucher,
>>>>> Alexander <Alexander.Deucher@amd.com>; Yu, Lang
>>> <Lang.Yu@amd.com>;
>>>>> Koenig, Christian <Christian.Koenig@amd.com>
>>>>> Subject: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
>>>>>
>>>>> SMU firmware guys expect the driver maintains error context and
>>>>> doesn't interact with SMU any more when SMU errors occurred.
>>>>> That will aid in debugging SMU firmware issues.
>>>>>
>>>>> Add SMU debug option support for this request, it can be enabled or
>>>>> disabled via amdgpu_smu_debug debugfs file.
>>>>> When enabled, it brings hardware to a kind of halt state so that no
>>>>> one can touch it any more in the envent of SMU errors.
>>>>>
>>>>> Currently, dirver interacts with SMU via sending messages.
>>>>> And threre are three ways to sending messages to SMU.
>>>>> Handle them respectively as following:
>>>>>
>>>>> 1, smu_cmn_send_smc_msg_with_param() for normal timeout cases
>>>>>
>>>>>    Halt on any error.
>>>>>
>>>>> 2,
>>> smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response()
>>>>> for longer timeout cases
>>>>>
>>>>>    Halt on errors apart from ETIME. Otherwise this way won't work.
>>>>>
>>>>> 3, smu_cmn_send_msg_without_waiting() for no waiting cases
>>>>>
>>>>>    Halt on errors apart from ETIME. Otherwise second way won't work.
>>>>>
>>>>> After halting, use BUG() to explicitly notify users.
>>>>>
>>>>> == Command Guide ==
>>>>>
>>>>> 1, enable SMU debug option
>>>>>
>>>>>   # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>
>>>>> 2, disable SMU debug option
>>>>>
>>>>>   # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>
>>>>> v4:
>>>>>   - Set to halt state instead of a simple hang.(Christian)
>>>>>
>>>>> v3:
>>>>>   - Use debugfs_create_bool().(Christian)
>>>>>   - Put variable into smu_context struct.
>>>>>   - Don't resend command when timeout.
>>>>>
>>>>> v2:
>>>>>   - Resend command when timeout.(Lijo)
>>>>>   - Use debugfs file instead of module parameter.
>>>>>
>>>>> Signed-off-by: Lang Yu <lang.yu@amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  3 +++
>>>>>   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h     |  5 +++++
>>>>>   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c      | 20
>>>>> +++++++++++++++++++-
>>>>>   3 files changed, 27 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>> index 164d6a9e9fbb..86cd888c7822 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct
>>> amdgpu_device
>>>>> *adev)
>>>>>   	if (!debugfs_initialized())
>>>>>   		return 0;
>>>>>
>>>>> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>>>>> +				  &adev->smu.smu_debug_mode);
>>>>> +
>>>>>   	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
>>>>>   				  &fops_ib_preempt);
>>>>>   	if (IS_ERR(ent)) {
>>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>> index f738f7dc20c9..50dbf5594a9d 100644
>>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>> @@ -569,6 +569,11 @@ struct smu_context
>>>>>   	struct smu_user_dpm_profile user_dpm_profile;
>>>>>
>>>>>   	struct stb_context stb_context;
>>>>> +	/*
>>>>> +	 * When enabled, it makes SMU errors fatal.
>>>>> +	 * (0 = disabled (default), 1 = enabled)
>>>>> +	 */
>>>>> +	bool smu_debug_mode;
>>>> [Quan, Evan] Can you expand this to bit mask(as ppfeaturemask)? So that
>>> in future we can add support for other debug features.
>>>>>   };
>>> OK.
>>>
>>>>>   struct i2c_adapter;
>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>> index 048ca1673863..84016d22c075 100644
>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>> @@ -272,6 +272,11 @@ int smu_cmn_send_msg_without_waiting(struct
>>>>> smu_context *smu,
>>>>>   	__smu_cmn_send_msg(smu, msg_index, param);
>>>>>   	res = 0;
>>>>>   Out:
>>>>> +	if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
>>>>> +		amdgpu_device_halt(smu->adev);
>>>>> +		BUG();
>>>> [Quan, Evan] I agree amdgpu_device_halt() is a good idea. Christian and
>>> Andrey can share you more insights about that.
>>>> Do we still need the "BUG()" then?
>>> The BUG() is used to explicitly notify users something went wrong.
>>> Otherwise userspace may not know immediately.
>>> FW guys request this in ticket.
>> [Quan, Evan] Won't drm_dev_unplug() and pci_disable_device() used in amdgpu_device_halt throw some errors(on user's further attempt to communicate with our driver)?
>> Also if the purpose is to raise user's concern, WARN() may be a more gentle way?
>  From my testing and observation, it depends on what the driver will do next.
> Probably trigger a page fault. If we don't connect a monitor but SSH access.
> We don't know what happended until something like a page fault is triggered.
>
> But here what I want to do is throwing the error immediately to userspace.
> If using WARN(), the user need to poll dmesg to see if something went wrong.

I agree with Evan as well. Please use WARN() or WARN_ON() here instead.

BUG() and BUG_ON() is only allowed when you prevent further data 
corruption by intentionally crashing the kernel thread. But that is 
really invasive because for example locks are not necessarily dropped 
and so can crash the whole system.

Regards,
Christian.

>
> Regards,
> Lang
>
>> BR
>> Evan
>>> Regards,
>>> Lang
>>>
>>>> BR
>>>> Evan
>>>>> +	}
>>>>> +
>>>>>   	return res;
>>>>>   }
>>>>>
>>>>> @@ -288,9 +293,17 @@ int smu_cmn_send_msg_without_waiting(struct
>>>>> smu_context *smu,
>>>>>   int smu_cmn_wait_for_response(struct smu_context *smu)  {
>>>>>   	u32 reg;
>>>>> +	int res;
>>>>>
>>>>>   	reg = __smu_cmn_poll_stat(smu);
>>>>> -	return __smu_cmn_reg2errno(smu, reg);
>>>>> +	res = __smu_cmn_reg2errno(smu, reg);
>>>>> +
>>>>> +	if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
>>>>> +		amdgpu_device_halt(smu->adev);
>>>>> +		BUG();
>>>>> +	}
>>>>> +
>>>>> +	return res;
>>>>>   }
>>>>>
>>>>>   /**
>>>>> @@ -357,6 +370,11 @@ int
>>> smu_cmn_send_smc_msg_with_param(struct
>>>>> smu_context *smu,
>>>>>   	if (read_arg)
>>>>>   		smu_cmn_read_arg(smu, read_arg);
>>>>>   Out:
>>>>> +	if (unlikely(smu->smu_debug_mode) && res) {
>>>>> +		amdgpu_device_halt(smu->adev);
>>>>> +		BUG();
>>>>> +	}
>>>>> +
>>>>>   	mutex_unlock(&smu->message_lock);
>>>>>   	return res;
>>>>>   }
>>>>> --
>>>>> 2.25.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
  2021-12-10  7:04           ` Christian König
@ 2021-12-10  7:49             ` Lang Yu
  0 siblings, 0 replies; 13+ messages in thread
From: Lang Yu @ 2021-12-10  7:49 UTC (permalink / raw)
  To: Christian KKKnig
  Cc: Grodzovsky, Andrey, Lazar, Lijo, amd-gfx, Huang, Ray, Deucher,
	Alexander, Quan, Evan

On 12/10/ , Christian KKKnig wrote:
> Am 10.12.21 um 04:21 schrieb Lang Yu:
> > On 12/10/ , Quan, Evan wrote:
> > > [AMD Official Use Only]
> > > 
> > > 
> > > 
> > > > -----Original Message-----
> > > > From: Yu, Lang <Lang.Yu@amd.com>
> > > > Sent: Friday, December 10, 2021 10:34 AM
> > > > To: Quan, Evan <Evan.Quan@amd.com>
> > > > Cc: amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey
> > > > <Andrey.Grodzovsky@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Huang,
> > > > Ray <Ray.Huang@amd.com>; Deucher, Alexander
> > > > <Alexander.Deucher@amd.com>; Koenig, Christian
> > > > <Christian.Koenig@amd.com>
> > > > Subject: Re: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> > > > 
> > > > On 12/10/ , Quan, Evan wrote:
> > > > > [AMD Official Use Only]
> > > > > 
> > > > > 
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
> > > > > > Lang Yu
> > > > > > Sent: Thursday, December 9, 2021 4:49 PM
> > > > > > To: amd-gfx@lists.freedesktop.org
> > > > > > Cc: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Lazar, Lijo
> > > > > > <Lijo.Lazar@amd.com>; Huang, Ray <Ray.Huang@amd.com>; Deucher,
> > > > > > Alexander <Alexander.Deucher@amd.com>; Yu, Lang
> > > > <Lang.Yu@amd.com>;
> > > > > > Koenig, Christian <Christian.Koenig@amd.com>
> > > > > > Subject: [PATCH 2/2] drm/amdgpu: add support for SMU debug option
> > > > > > 
> > > > > > SMU firmware guys expect the driver maintains error context and
> > > > > > doesn't interact with SMU any more when SMU errors occurred.
> > > > > > That will aid in debugging SMU firmware issues.
> > > > > > 
> > > > > > Add SMU debug option support for this request, it can be enabled or
> > > > > > disabled via amdgpu_smu_debug debugfs file.
> > > > > > When enabled, it brings hardware to a kind of halt state so that no
> > > > > > one can touch it any more in the envent of SMU errors.
> > > > > > 
> > > > > > Currently, dirver interacts with SMU via sending messages.
> > > > > > And threre are three ways to sending messages to SMU.
> > > > > > Handle them respectively as following:
> > > > > > 
> > > > > > 1, smu_cmn_send_smc_msg_with_param() for normal timeout cases
> > > > > > 
> > > > > >    Halt on any error.
> > > > > > 
> > > > > > 2,
> > > > smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response()
> > > > > > for longer timeout cases
> > > > > > 
> > > > > >    Halt on errors apart from ETIME. Otherwise this way won't work.
> > > > > > 
> > > > > > 3, smu_cmn_send_msg_without_waiting() for no waiting cases
> > > > > > 
> > > > > >    Halt on errors apart from ETIME. Otherwise second way won't work.
> > > > > > 
> > > > > > After halting, use BUG() to explicitly notify users.
> > > > > > 
> > > > > > == Command Guide ==
> > > > > > 
> > > > > > 1, enable SMU debug option
> > > > > > 
> > > > > >   # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> > > > > > 
> > > > > > 2, disable SMU debug option
> > > > > > 
> > > > > >   # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> > > > > > 
> > > > > > v4:
> > > > > >   - Set to halt state instead of a simple hang.(Christian)
> > > > > > 
> > > > > > v3:
> > > > > >   - Use debugfs_create_bool().(Christian)
> > > > > >   - Put variable into smu_context struct.
> > > > > >   - Don't resend command when timeout.
> > > > > > 
> > > > > > v2:
> > > > > >   - Resend command when timeout.(Lijo)
> > > > > >   - Use debugfs file instead of module parameter.
> > > > > > 
> > > > > > Signed-off-by: Lang Yu <lang.yu@amd.com>
> > > > > > ---
> > > > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  3 +++
> > > > > >   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h     |  5 +++++
> > > > > >   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c      | 20
> > > > > > +++++++++++++++++++-
> > > > > >   3 files changed, 27 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > > > index 164d6a9e9fbb..86cd888c7822 100644
> > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > > > > > @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct
> > > > amdgpu_device
> > > > > > *adev)
> > > > > >   	if (!debugfs_initialized())
> > > > > >   		return 0;
> > > > > > 
> > > > > > +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> > > > > > +				  &adev->smu.smu_debug_mode);
> > > > > > +
> > > > > >   	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
> > > > > >   				  &fops_ib_preempt);
> > > > > >   	if (IS_ERR(ent)) {
> > > > > > diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > > > b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > > > index f738f7dc20c9..50dbf5594a9d 100644
> > > > > > --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > > > +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > > > > > @@ -569,6 +569,11 @@ struct smu_context
> > > > > >   	struct smu_user_dpm_profile user_dpm_profile;
> > > > > > 
> > > > > >   	struct stb_context stb_context;
> > > > > > +	/*
> > > > > > +	 * When enabled, it makes SMU errors fatal.
> > > > > > +	 * (0 = disabled (default), 1 = enabled)
> > > > > > +	 */
> > > > > > +	bool smu_debug_mode;
> > > > > [Quan, Evan] Can you expand this to bit mask(as ppfeaturemask)? So that
> > > > in future we can add support for other debug features.
> > > > > >   };
> > > > OK.
> > > > 
> > > > > >   struct i2c_adapter;
> > > > > > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > > > > b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > > > > index 048ca1673863..84016d22c075 100644
> > > > > > --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > > > > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > > > > > @@ -272,6 +272,11 @@ int smu_cmn_send_msg_without_waiting(struct
> > > > > > smu_context *smu,
> > > > > >   	__smu_cmn_send_msg(smu, msg_index, param);
> > > > > >   	res = 0;
> > > > > >   Out:
> > > > > > +	if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
> > > > > > +		amdgpu_device_halt(smu->adev);
> > > > > > +		BUG();
> > > > > [Quan, Evan] I agree amdgpu_device_halt() is a good idea. Christian and
> > > > Andrey can share you more insights about that.
> > > > > Do we still need the "BUG()" then?
> > > > The BUG() is used to explicitly notify users something went wrong.
> > > > Otherwise userspace may not know immediately.
> > > > FW guys request this in ticket.
> > > [Quan, Evan] Won't drm_dev_unplug() and pci_disable_device() used in amdgpu_device_halt throw some errors(on user's further attempt to communicate with our driver)?
> > > Also if the purpose is to raise user's concern, WARN() may be a more gentle way?
> >  From my testing and observation, it depends on what the driver will do next.
> > Probably trigger a page fault. If we don't connect a monitor but SSH access.
> > We don't know what happended until something like a page fault is triggered.
> > 
> > But here what I want to do is throwing the error immediately to userspace.
> > If using WARN(), the user need to poll dmesg to see if something went wrong.
> 
> I agree with Evan as well. Please use WARN() or WARN_ON() here instead.
> 
> BUG() and BUG_ON() is only allowed when you prevent further data corruption
> by intentionally crashing the kernel thread. But that is really invasive
> because for example locks are not necessarily dropped and so can crash the
> whole system.

Ok, will use WARN().

Regards,
Lang

> Regards,
> Christian.
> 
> > 
> > Regards,
> > Lang
> > 
> > > BR
> > > Evan
> > > > Regards,
> > > > Lang
> > > > 
> > > > > BR
> > > > > Evan
> > > > > > +	}
> > > > > > +
> > > > > >   	return res;
> > > > > >   }
> > > > > > 
> > > > > > @@ -288,9 +293,17 @@ int smu_cmn_send_msg_without_waiting(struct
> > > > > > smu_context *smu,
> > > > > >   int smu_cmn_wait_for_response(struct smu_context *smu)  {
> > > > > >   	u32 reg;
> > > > > > +	int res;
> > > > > > 
> > > > > >   	reg = __smu_cmn_poll_stat(smu);
> > > > > > -	return __smu_cmn_reg2errno(smu, reg);
> > > > > > +	res = __smu_cmn_reg2errno(smu, reg);
> > > > > > +
> > > > > > +	if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) {
> > > > > > +		amdgpu_device_halt(smu->adev);
> > > > > > +		BUG();
> > > > > > +	}
> > > > > > +
> > > > > > +	return res;
> > > > > >   }
> > > > > > 
> > > > > >   /**
> > > > > > @@ -357,6 +370,11 @@ int
> > > > smu_cmn_send_smc_msg_with_param(struct
> > > > > > smu_context *smu,
> > > > > >   	if (read_arg)
> > > > > >   		smu_cmn_read_arg(smu, read_arg);
> > > > > >   Out:
> > > > > > +	if (unlikely(smu->smu_debug_mode) && res) {
> > > > > > +		amdgpu_device_halt(smu->adev);
> > > > > > +		BUG();
> > > > > > +	}
> > > > > > +
> > > > > >   	mutex_unlock(&smu->message_lock);
> > > > > >   	return res;
> > > > > >   }
> > > > > > --
> > > > > > 2.25.1
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device
  2021-12-10  3:47       ` Lang Yu
@ 2021-12-10 15:45         ` Andrey Grodzovsky
  0 siblings, 0 replies; 13+ messages in thread
From: Andrey Grodzovsky @ 2021-12-10 15:45 UTC (permalink / raw)
  To: Lang Yu, Christian KKKnig; +Cc: Alex Deucher, Lijo Lazar, Huang Rui, amd-gfx


On 2021-12-09 10:47 p.m., Lang Yu wrote:
> On 12/09/ , Christian KKKnig wrote:
>> Am 09.12.21 um 16:38 schrieb Andrey Grodzovsky:
>>> On 2021-12-09 4:00 a.m., Christian König wrote:
>>>>
>>>> Am 09.12.21 um 09:49 schrieb Lang Yu:
>>>>> It is useful to maintain error context when debugging
>>>>> SW/FW issues. We introduce amdgpu_device_halt() for this
>>>>> purpose. It will bring hardware to a kind of halt state,
>>>>> so that no one can touch it any more.
>>>>>
>>>>> Compare to a simple hang, the system will keep stable
>>>>> at least for SSH access. Then it should be trivial to
>>>>> inspect the hardware state and see what's going on.
>>>>>
>>>>> Suggested-by: Christian Koenig <christian.koenig@amd.com>
>>>>> Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>> Signed-off-by: Lang Yu <lang.yu@amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39
>>>>> ++++++++++++++++++++++
>>>>>    2 files changed, 41 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> index c5cfe2926ca1..3f5f8f62aa5c 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> @@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct
>>>>> amdgpu_device *adev,
>>>>>    void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
>>>>>            struct amdgpu_ring *ring);
>>>>>    +void amdgpu_device_halt(struct amdgpu_device *adev);
>>>>> +
>>>>>    /* atpx handler */
>>>>>    #if defined(CONFIG_VGA_SWITCHEROO)
>>>>>    void amdgpu_register_atpx_handler(void);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index a1c14466f23d..62216627cc83 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct
>>>>> amdgpu_device *adev,
>>>>>          amdgpu_asic_invalidate_hdp(adev, ring);
>>>>>    }
>>>>> +
>>>>> +/**
>>>>> + * amdgpu_device_halt() - bring hardware to some kind of halt state
>>>>> + *
>>>>> + * @adev: amdgpu_device pointer
>>>>> + *
>>>>> + * Bring hardware to some kind of halt state so that no one can
>>>>> touch it
>>>>> + * any more. It will help to maintain error context when error
>>>>> occurred.
>>>>> + * Compare to a simple hang, the system will keep stable at
>>>>> least for SSH
>>>>> + * access. Then it should be trivial to inspect the hardware state and
>>>>> + * see what's going on. Implemented as following:
>>>>> + *
>>>>> + * 1. drm_dev_unplug() makes device inaccessible to user
>>>>> space(IOCTLs, etc),
>>>>> + *    clears all CPU mappings to device, disallows remappings through
>>>>> page faults
>>>>> + * 2. amdgpu_irq_disable_all() disables all interrupts
>>>>> + * 3. amdgpu_fence_driver_hw_fini() signals all HW fencesamdgpu_device_unmap_mmio
>>>>>
>>>>> + * 4. amdgpu_device_unmap_mmio() clears all MMIO mappings
>>>>> + * 5. pci_disable_device() and pci_wait_for_pending_transaction()
>>>>> + *    flush any in flight DMA operations
>>>>> + * 6. set adev->no_hw_access to true
>>>>> + */
>>>>> +void amdgpu_device_halt(struct amdgpu_device *adev)
>>>>> +{
>>>>> +    struct pci_dev *pdev = adev->pdev;
>>>>> +    struct drm_device *ddev = &adev->ddev;
>>>>> +
>>>>> +    drm_dev_unplug(ddev);
>>>>> +
>>>>> +    amdgpu_irq_disable_all(adev);
>>>>> +
>>>>> +    amdgpu_fence_driver_hw_fini(adev);
>>>>> +
>>>>> +    amdgpu_device_unmap_mmio(adev);
>>>
>>> Note that this one will cause page fault on any subsequent MMIO access
>>> (trough registers or by direct VRAM access)
>>>
>>>
>>>>> +
>>>>> +    pci_disable_device(pdev);
>>>>> +    pci_wait_for_pending_transaction(pdev);
>>>>> +
>>>>> +    adev->no_hw_access = true;
>>>> I think we need to reorder this, e.g. set adev->no_hw_access much
>>>> earlier for example. Andrey what do you think?
>>>
>>> Earlier can be ok but at least after the last HW configuration we
>>> actaully want to do like disabling IRQs.
>> My thinking was to at least do this before we unmap the MMIO to avoid the
>> crash.
>>
>> Additionally to that we maybe don't even want to do this for this case.
> So we just do "adev->no_hw_access = true;" before
> "amdgpu_device_unmap_mmio(adev);".
>
> That can avoid potential registers access page faults.
> But direct VRAM access will still trigger page faults.
>
> For example,
> "cat /sys/class/drm/card0/device/pp_od_clk_voltage"
> will call smu_cmn_update_table and can still trigger
> a page fault.
>
> smu_cmn_update_table()
> {
>   ...
> 	if (drv2smu) {
> 		memcpy(table->cpu_addr, table_data, table_size);
>   ...
> }
>
> Regards,
> Lang


What Christian meant is to just drop amdgpu_device_unmap_mmio, unless 
you are worried
that we might somehow alter the SMU state from the driver side by direct 
VRAM access.

Andrey


>
>> Christian.
>>
>>>
>>> Andrey
>>>
>>>> Apart from that sounds like the right idea to me.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> +}

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-12-10 15:45 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-09  8:49 [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device Lang Yu
2021-12-09  8:49 ` [PATCH 2/2] drm/amdgpu: add support for SMU debug option Lang Yu
2021-12-10  2:07   ` Quan, Evan
2021-12-10  2:33     ` Lang Yu
2021-12-10  2:52       ` Quan, Evan
2021-12-10  3:21         ` Lang Yu
2021-12-10  7:04           ` Christian König
2021-12-10  7:49             ` Lang Yu
2021-12-09  9:00 ` [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device Christian König
2021-12-09 15:38   ` Andrey Grodzovsky
2021-12-09 15:42     ` Christian König
2021-12-10  3:47       ` Lang Yu
2021-12-10 15:45         ` Andrey Grodzovsky

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.