All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0
@ 2017-12-11 21:29 Marek Olšák
       [not found] ` <1513027772-32408-1-git-send-email-maraeo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Marek Olšák @ 2017-12-11 21:29 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

From: Marek Olšák <marek.olsak@amd.com>

Signed-off-by: Marek Olšák <marek.olsak@amd.com>
---

Is this really correct? I have no easy way to test it.

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8d03baa..56c41cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3018,20 +3018,24 @@ static int amdgpu_reset_sriov(struct amdgpu_device *adev, uint64_t *reset_flags,
  *
  * Attempt to reset the GPU if it has hung (all asics).
  * Returns 0 for success or an error on failure.
  */
 int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
 {
 	struct drm_atomic_state *state = NULL;
 	uint64_t reset_flags = 0;
 	int i, r, resched;
 
+	/* amdgpu.lockup_timeout=0 disables GPU reset. */
+	if (amdgpu_lockup_timeout == 0)
+		return 0;
+
 	if (!amdgpu_check_soft_reset(adev)) {
 		DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
 		return 0;
 	}
 
 	dev_info(adev->dev, "GPU reset begin!\n");
 
 	mutex_lock(&adev->lock_reset);
 	atomic_inc(&adev->gpu_reset_counter);
 	adev->in_gpu_reset = 1;
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* RE: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0
       [not found] ` <1513027772-32408-1-git-send-email-maraeo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-12-12  3:18   ` Liu, Monk
       [not found]     ` <BLUPR12MB04497DB5DEFE7C66CE3E702084340-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2017-12-12  9:01   ` Christian König
  1 sibling, 1 reply; 23+ messages in thread
From: Liu, Monk @ 2017-12-12  3:18 UTC (permalink / raw)
  To: Marek Olšák, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

NAK, you change break SRIOV logic:

Without lockup_timeout set, this gpu_recover() won't get called at all , unless your IB triggered invalid instruct and that IRQ invoked 
Amdgpu_gpu_recover(), by this cause you should disable the logic that in that IRQ instead of change gpu_recover() itself because 
For SRIOV we need gpu_recover() even lockup_timeout is zero 


-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Marek Ol?ák
Sent: 2017年12月12日 5:30
To: amd-gfx@lists.freedesktop.org
Subject: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0

From: Marek Olšák <marek.olsak@amd.com>

Signed-off-by: Marek Olšák <marek.olsak@amd.com>
---

Is this really correct? I have no easy way to test it.

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8d03baa..56c41cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3018,20 +3018,24 @@ static int amdgpu_reset_sriov(struct amdgpu_device *adev, uint64_t *reset_flags,
  *
  * Attempt to reset the GPU if it has hung (all asics).
  * Returns 0 for success or an error on failure.
  */
 int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)  {
 	struct drm_atomic_state *state = NULL;
 	uint64_t reset_flags = 0;
 	int i, r, resched;
 
+	/* amdgpu.lockup_timeout=0 disables GPU reset. */
+	if (amdgpu_lockup_timeout == 0)
+		return 0;
+
 	if (!amdgpu_check_soft_reset(adev)) {
 		DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
 		return 0;
 	}
 
 	dev_info(adev->dev, "GPU reset begin!\n");
 
 	mutex_lock(&adev->lock_reset);
 	atomic_inc(&adev->gpu_reset_counter);
 	adev->in_gpu_reset = 1;
--
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0
       [not found] ` <1513027772-32408-1-git-send-email-maraeo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-12-12  3:18   ` Liu, Monk
@ 2017-12-12  9:01   ` Christian König
       [not found]     ` <95bd3554-f9e3-6ad7-32e5-9921da23feaa-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 23+ messages in thread
From: Christian König @ 2017-12-12  9:01 UTC (permalink / raw)
  To: Marek Olšák, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Grodzovsky, Andrey

Am 11.12.2017 um 22:29 schrieb Marek Olšák:
> From: Marek Olšák <marek.olsak@amd.com>
>
> Signed-off-by: Marek Olšák <marek.olsak@amd.com>
> ---
>
> Is this really correct? I have no easy way to test it.

It's a step in the right direction, but I would rather vote for 
something else:

Instead of disabling the timeout by default we only disable the GPU 
reset/recovery.

The idea is to add a new parameter amdgpu_gpu_recovery which makes 
amdgpu_gpu_recover only prints out an error and doesn't touch the GPU at 
all (on bare metal systems).

Then we finally set the amdgpu_lockup_timeout to a non zero value by 
default.

Andrey could you take care of this when you have time?

Thanks,
Christian.

>
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 8d03baa..56c41cf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3018,20 +3018,24 @@ static int amdgpu_reset_sriov(struct amdgpu_device *adev, uint64_t *reset_flags,
>    *
>    * Attempt to reset the GPU if it has hung (all asics).
>    * Returns 0 for success or an error on failure.
>    */
>   int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
>   {
>   	struct drm_atomic_state *state = NULL;
>   	uint64_t reset_flags = 0;
>   	int i, r, resched;
>   
> +	/* amdgpu.lockup_timeout=0 disables GPU reset. */
> +	if (amdgpu_lockup_timeout == 0)
> +		return 0;
> +
>   	if (!amdgpu_check_soft_reset(adev)) {
>   		DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
>   		return 0;
>   	}
>   
>   	dev_info(adev->dev, "GPU reset begin!\n");
>   
>   	mutex_lock(&adev->lock_reset);
>   	atomic_inc(&adev->gpu_reset_counter);
>   	adev->in_gpu_reset = 1;

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0
       [not found]     ` <95bd3554-f9e3-6ad7-32e5-9921da23feaa-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-12-12 12:58       ` Andrey Grodzovsky
  2017-12-12 14:57       ` Marek Olšák
  2017-12-12 19:16       ` [PATCH] drm/amdgpu: Add gpu_recovery parameter Andrey Grodzovsky
  2 siblings, 0 replies; 23+ messages in thread
From: Andrey Grodzovsky @ 2017-12-12 12:58 UTC (permalink / raw)
  To: christian.koenig-5C7GfCeVMHo, Marek Olšák,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



On 12/12/2017 04:01 AM, Christian König wrote:
> Am 11.12.2017 um 22:29 schrieb Marek Olšák:
>> From: Marek Olšák <marek.olsak@amd.com>
>>
>> Signed-off-by: Marek Olšák <marek.olsak@amd.com>
>> ---
>>
>> Is this really correct? I have no easy way to test it.
>
> It's a step in the right direction, but I would rather vote for 
> something else:
>
> Instead of disabling the timeout by default we only disable the GPU 
> reset/recovery.
>
> The idea is to add a new parameter amdgpu_gpu_recovery which makes 
> amdgpu_gpu_recover only prints out an error and doesn't touch the GPU 
> at all (on bare metal systems).
>
> Then we finally set the amdgpu_lockup_timeout to a non zero value by 
> default.
>
> Andrey could you take care of this when you have time?
>
> Thanks,
> Christian.

Sure.

Thanks,
Andrey

>
>>
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 8d03baa..56c41cf 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3018,20 +3018,24 @@ static int amdgpu_reset_sriov(struct 
>> amdgpu_device *adev, uint64_t *reset_flags,
>>    *
>>    * Attempt to reset the GPU if it has hung (all asics).
>>    * Returns 0 for success or an error on failure.
>>    */
>>   int amdgpu_gpu_recover(struct amdgpu_device *adev, struct 
>> amdgpu_job *job)
>>   {
>>       struct drm_atomic_state *state = NULL;
>>       uint64_t reset_flags = 0;
>>       int i, r, resched;
>>   +    /* amdgpu.lockup_timeout=0 disables GPU reset. */
>> +    if (amdgpu_lockup_timeout == 0)
>> +        return 0;
>> +
>>       if (!amdgpu_check_soft_reset(adev)) {
>>           DRM_INFO("No hardware hang detected. Did some blocks 
>> stall?\n");
>>           return 0;
>>       }
>>         dev_info(adev->dev, "GPU reset begin!\n");
>>         mutex_lock(&adev->lock_reset);
>>       atomic_inc(&adev->gpu_reset_counter);
>>       adev->in_gpu_reset = 1;
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0
       [not found]     ` <95bd3554-f9e3-6ad7-32e5-9921da23feaa-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-12-12 12:58       ` Andrey Grodzovsky
@ 2017-12-12 14:57       ` Marek Olšák
       [not found]         ` <CAAxE2A72rODGx5SgsHutRQVag+_1o-s8oykXwbmJUZugh0-S4A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-12-12 19:16       ` [PATCH] drm/amdgpu: Add gpu_recovery parameter Andrey Grodzovsky
  2 siblings, 1 reply; 23+ messages in thread
From: Marek Olšák @ 2017-12-12 14:57 UTC (permalink / raw)
  To: Christian König; +Cc: Grodzovsky, Andrey, amd-gfx mailing list

On Tue, Dec 12, 2017 at 10:01 AM, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
> Am 11.12.2017 um 22:29 schrieb Marek Olšák:
>>
>> From: Marek Olšák <marek.olsak@amd.com>
>>
>> Signed-off-by: Marek Olšák <marek.olsak@amd.com>
>> ---
>>
>> Is this really correct? I have no easy way to test it.
>
>
> It's a step in the right direction, but I would rather vote for something
> else:
>
> Instead of disabling the timeout by default we only disable the GPU
> reset/recovery.
>
> The idea is to add a new parameter amdgpu_gpu_recovery which makes
> amdgpu_gpu_recover only prints out an error and doesn't touch the GPU at all
> (on bare metal systems).
>
> Then we finally set the amdgpu_lockup_timeout to a non zero value by
> default.
>
> Andrey could you take care of this when you have time?

I don't understand this.

Why can't we keep the previous behavior where amdgpu.lockup_timeout=0
disabled GPU reset? Why do we have to add another option for the same
thing?

Marek
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0
       [not found]     ` <BLUPR12MB04497DB5DEFE7C66CE3E702084340-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-12-12 15:02       ` Marek Olšák
       [not found]         ` <CAAxE2A6ZCHrkKM81vLBEubB-fLQCQuBtDYNjmTvDqEDKVLVDgw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Marek Olšák @ 2017-12-12 15:02 UTC (permalink / raw)
  To: Liu, Monk; +Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On Tue, Dec 12, 2017 at 4:18 AM, Liu, Monk <Monk.Liu@amd.com> wrote:
> NAK, you change break SRIOV logic:
>
> Without lockup_timeout set, this gpu_recover() won't get called at all , unless your IB triggered invalid instruct and that IRQ invoked
> Amdgpu_gpu_recover(), by this cause you should disable the logic that in that IRQ instead of change gpu_recover() itself because
> For SRIOV we need gpu_recover() even lockup_timeout is zero

The default value of 0 indicates that GPU reset isn't ready to be
enabled by default. That's what it means. Once the GPU reset works,
the default should be non-zero (e.g. 10000) and
amdgpu.lockup_timeout=0 should be used to disable all GPU resets in
order to be able do scandumps and debug GPU hangs.

Marek
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0
       [not found]         ` <CAAxE2A72rODGx5SgsHutRQVag+_1o-s8oykXwbmJUZugh0-S4A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-12-12 16:36           ` Christian König
       [not found]             ` <2f95a109-c168-804d-f2f9-4211449caad6-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Christian König @ 2017-12-12 16:36 UTC (permalink / raw)
  To: Marek Olšák, Christian König
  Cc: Grodzovsky, Andrey, amd-gfx mailing list

Am 12.12.2017 um 15:57 schrieb Marek Olšák:
> On Tue, Dec 12, 2017 at 10:01 AM, Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 11.12.2017 um 22:29 schrieb Marek Olšák:
>>> From: Marek Olšák <marek.olsak@amd.com>
>>>
>>> Signed-off-by: Marek Olšák <marek.olsak@amd.com>
>>> ---
>>>
>>> Is this really correct? I have no easy way to test it.
>>
>> It's a step in the right direction, but I would rather vote for something
>> else:
>>
>> Instead of disabling the timeout by default we only disable the GPU
>> reset/recovery.
>>
>> The idea is to add a new parameter amdgpu_gpu_recovery which makes
>> amdgpu_gpu_recover only prints out an error and doesn't touch the GPU at all
>> (on bare metal systems).
>>
>> Then we finally set the amdgpu_lockup_timeout to a non zero value by
>> default.
>>
>> Andrey could you take care of this when you have time?
> I don't understand this.
>
> Why can't we keep the previous behavior where amdgpu.lockup_timeout=0
> disabled GPU reset? Why do we have to add another option for the same
> thing?

lockup_timeout=0 never disabled the GPU reset, it just disabled the timeout.

You could still manually trigger a reset and also invalid commands, 
invalid register writes and requests from the SRIOV hypervisor could 
trigger this.

And as Monk explained GPU resets are mandatory for SRIOV, you can't 
disable them at all in this case.

Additional to that we probably want the error message that something 
timed out, but not touching the hardware in any way.

Regards,
Christian.

>
> Marek
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH] drm/amdgpu: Add gpu_recovery parameter
       [not found]     ` <95bd3554-f9e3-6ad7-32e5-9921da23feaa-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-12-12 12:58       ` Andrey Grodzovsky
  2017-12-12 14:57       ` Marek Olšák
@ 2017-12-12 19:16       ` Andrey Grodzovsky
       [not found]         ` <1513106162-19664-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
  2 siblings, 1 reply; 23+ messages in thread
From: Andrey Grodzovsky @ 2017-12-12 19:16 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, christian.koenig-5C7GfCeVMHo
  Cc: Andrey Grodzovsky, Monk.Liu-5C7GfCeVMHo, maraeo-Re5JQEeQqe8AvxtiuMwx3w

Add new parameter to control GPU recovery procedure.
Retire old way of disabling GPU recovery by setting lockup_timeout == 0 and
set default for lockup_timeout to 10s.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 8 ++++++--
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 3735500..26abe03 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -126,6 +126,7 @@ extern int amdgpu_param_buf_per_se;
 extern int amdgpu_job_hang_limit;
 extern int amdgpu_lbpw;
 extern int amdgpu_compute_multipipe;
+extern int amdgpu_gpu_recovery;
 
 #ifdef CONFIG_DRM_AMDGPU_SI
 extern int amdgpu_si_support;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8d03baa..d84b57a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3030,6 +3030,11 @@ int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
 		return 0;
 	}
 
+	if (!amdgpu_gpu_recovery) {
+		DRM_INFO("GPU recovery disabled.\n");
+		return 0;
+	}
+
 	dev_info(adev->dev, "GPU reset begin!\n");
 
 	mutex_lock(&adev->lock_reset);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 0b039bd..5c612e9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -90,7 +90,7 @@ int amdgpu_disp_priority = 0;
 int amdgpu_hw_i2c = 0;
 int amdgpu_pcie_gen2 = -1;
 int amdgpu_msi = -1;
-int amdgpu_lockup_timeout = 0;
+int amdgpu_lockup_timeout = 10000;
 int amdgpu_dpm = -1;
 int amdgpu_fw_load_type = -1;
 int amdgpu_aspm = -1;
@@ -128,6 +128,7 @@ int amdgpu_param_buf_per_se = 0;
 int amdgpu_job_hang_limit = 0;
 int amdgpu_lbpw = -1;
 int amdgpu_compute_multipipe = -1;
+int amdgpu_gpu_recovery = 1;
 
 MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in megabytes");
 module_param_named(vramlimit, amdgpu_vram_limit, int, 0600);
@@ -165,7 +166,7 @@ module_param_named(pcie_gen2, amdgpu_pcie_gen2, int, 0444);
 MODULE_PARM_DESC(msi, "MSI support (1 = enable, 0 = disable, -1 = auto)");
 module_param_named(msi, amdgpu_msi, int, 0444);
 
-MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default 0 = disable)");
+MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default 10000)");
 module_param_named(lockup_timeout, amdgpu_lockup_timeout, int, 0444);
 
 MODULE_PARM_DESC(dpm, "DPM support (1 = enable, 0 = disable, -1 = auto)");
@@ -280,6 +281,9 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444);
 MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be spread across pipes (1 = enable, 0 = disable, -1 = auto)");
 module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
 
+MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable (default) , 0 = disable");
+module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
+
 #ifdef CONFIG_DRM_AMDGPU_SI
 
 #if defined(CONFIG_DRM_RADEON) || defined(CONFIG_DRM_RADEON_MODULE)
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter
       [not found]         ` <1513106162-19664-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
@ 2017-12-12 19:26           ` Alex Deucher
  2017-12-13 12:20           ` Christian König
  2017-12-14  7:16           ` Liu, Monk
  2 siblings, 0 replies; 23+ messages in thread
From: Alex Deucher @ 2017-12-12 19:26 UTC (permalink / raw)
  To: Andrey Grodzovsky; +Cc: monk.liu, Marek Olsak, Christian Koenig, amd-gfx list

On Tue, Dec 12, 2017 at 2:16 PM, Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
> Add new parameter to control GPU recovery procedure.
> Retire old way of disabling GPU recovery by setting lockup_timeout == 0 and
> set default for lockup_timeout to 10s.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 8 ++++++--
>  3 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 3735500..26abe03 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -126,6 +126,7 @@ extern int amdgpu_param_buf_per_se;
>  extern int amdgpu_job_hang_limit;
>  extern int amdgpu_lbpw;
>  extern int amdgpu_compute_multipipe;
> +extern int amdgpu_gpu_recovery;
>
>  #ifdef CONFIG_DRM_AMDGPU_SI
>  extern int amdgpu_si_support;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 8d03baa..d84b57a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3030,6 +3030,11 @@ int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
>                 return 0;
>         }
>
> +       if (!amdgpu_gpu_recovery) {
> +               DRM_INFO("GPU recovery disabled.\n");
> +               return 0;
> +       }


Probably need and else here for the -1 (auto) case so we can disable
by default for non-SR-IOV and always keep it enabled for SR-IOV.


> +
>         dev_info(adev->dev, "GPU reset begin!\n");
>
>         mutex_lock(&adev->lock_reset);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 0b039bd..5c612e9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -90,7 +90,7 @@ int amdgpu_disp_priority = 0;
>  int amdgpu_hw_i2c = 0;
>  int amdgpu_pcie_gen2 = -1;
>  int amdgpu_msi = -1;
> -int amdgpu_lockup_timeout = 0;
> +int amdgpu_lockup_timeout = 10000;
>  int amdgpu_dpm = -1;
>  int amdgpu_fw_load_type = -1;
>  int amdgpu_aspm = -1;
> @@ -128,6 +128,7 @@ int amdgpu_param_buf_per_se = 0;
>  int amdgpu_job_hang_limit = 0;
>  int amdgpu_lbpw = -1;
>  int amdgpu_compute_multipipe = -1;
> +int amdgpu_gpu_recovery = 1;
>
>  MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in megabytes");
>  module_param_named(vramlimit, amdgpu_vram_limit, int, 0600);
> @@ -165,7 +166,7 @@ module_param_named(pcie_gen2, amdgpu_pcie_gen2, int, 0444);
>  MODULE_PARM_DESC(msi, "MSI support (1 = enable, 0 = disable, -1 = auto)");
>  module_param_named(msi, amdgpu_msi, int, 0444);
>
> -MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default 0 = disable)");
> +MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default 10000)");

Make this a separate change.

>  module_param_named(lockup_timeout, amdgpu_lockup_timeout, int, 0444);
>
>  MODULE_PARM_DESC(dpm, "DPM support (1 = enable, 0 = disable, -1 = auto)");
> @@ -280,6 +281,9 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444);
>  MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be spread across pipes (1 = enable, 0 = disable, -1 = auto)");
>  module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
>
> +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable (default) , 0 = disable");
> +module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
> +

set the default to -1 so we can have different default behavior in
different cases, e.g., different asics or environments (like SR-IOV).

Alex

>  #ifdef CONFIG_DRM_AMDGPU_SI
>
>  #if defined(CONFIG_DRM_RADEON) || defined(CONFIG_DRM_RADEON_MODULE)
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0
       [not found]             ` <2f95a109-c168-804d-f2f9-4211449caad6-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-12-12 19:57               ` Marek Olšák
  0 siblings, 0 replies; 23+ messages in thread
From: Marek Olšák @ 2017-12-12 19:57 UTC (permalink / raw)
  To: Christian König; +Cc: Grodzovsky, Andrey, amd-gfx mailing list

On Tue, Dec 12, 2017 at 5:36 PM, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
> Am 12.12.2017 um 15:57 schrieb Marek Olšák:
>>
>> On Tue, Dec 12, 2017 at 10:01 AM, Christian König
>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>
>>> Am 11.12.2017 um 22:29 schrieb Marek Olšák:
>>>>
>>>> From: Marek Olšák <marek.olsak@amd.com>
>>>>
>>>> Signed-off-by: Marek Olšák <marek.olsak@amd.com>
>>>> ---
>>>>
>>>> Is this really correct? I have no easy way to test it.
>>>
>>>
>>> It's a step in the right direction, but I would rather vote for something
>>> else:
>>>
>>> Instead of disabling the timeout by default we only disable the GPU
>>> reset/recovery.
>>>
>>> The idea is to add a new parameter amdgpu_gpu_recovery which makes
>>> amdgpu_gpu_recover only prints out an error and doesn't touch the GPU at
>>> all
>>> (on bare metal systems).
>>>
>>> Then we finally set the amdgpu_lockup_timeout to a non zero value by
>>> default.
>>>
>>> Andrey could you take care of this when you have time?
>>
>> I don't understand this.
>>
>> Why can't we keep the previous behavior where amdgpu.lockup_timeout=0
>> disabled GPU reset? Why do we have to add another option for the same
>> thing?
>
>
> lockup_timeout=0 never disabled the GPU reset, it just disabled the timeout.

It disabled the automatic reset before we had those interrupt callbacks.

>
> You could still manually trigger a reset and also invalid commands, invalid
> register writes and requests from the SRIOV hypervisor could trigger this.

That's OK. Manual resets should always be allowed.

>
> And as Monk explained GPU resets are mandatory for SRIOV, you can't disable
> them at all in this case.

What is preventing Monk from setting amdgpu.lockup_timeout > 0, which
should be the default state anyway?

Let's just say lockup_timeout=0 has undefined behavior with SRIOV.

>
> Additional to that we probably want the error message that something timed
> out, but not touching the hardware in any way.

Yes that is a fair point.

Marek
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter
       [not found]         ` <1513106162-19664-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
  2017-12-12 19:26           ` Alex Deucher
@ 2017-12-13 12:20           ` Christian König
       [not found]             ` <7daac1b8-38c4-0546-b16f-5d09f71fcd12-5C7GfCeVMHo@public.gmane.org>
  2017-12-14  7:16           ` Liu, Monk
  2 siblings, 1 reply; 23+ messages in thread
From: Christian König @ 2017-12-13 12:20 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Monk.Liu-5C7GfCeVMHo, maraeo-Re5JQEeQqe8AvxtiuMwx3w

Am 12.12.2017 um 20:16 schrieb Andrey Grodzovsky:
> Add new parameter to control GPU recovery procedure.
> Retire old way of disabling GPU recovery by setting lockup_timeout == 0 and
> set default for lockup_timeout to 10s.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 8 ++++++--
>   3 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 3735500..26abe03 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -126,6 +126,7 @@ extern int amdgpu_param_buf_per_se;
>   extern int amdgpu_job_hang_limit;
>   extern int amdgpu_lbpw;
>   extern int amdgpu_compute_multipipe;
> +extern int amdgpu_gpu_recovery;
>   
>   #ifdef CONFIG_DRM_AMDGPU_SI
>   extern int amdgpu_si_support;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 8d03baa..d84b57a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3030,6 +3030,11 @@ int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
>   		return 0;
>   	}
>   
> +	if (!amdgpu_gpu_recovery) {
> +		DRM_INFO("GPU recovery disabled.\n");
> +		return 0;
> +	}
> +

Please move this check into the caller of amdgpu_gpu_recover().

This way we can still trigger a GPU recovery manually or from the 
hypervisor under SRIOV.

Christian.

>   	dev_info(adev->dev, "GPU reset begin!\n");
>   
>   	mutex_lock(&adev->lock_reset);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 0b039bd..5c612e9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -90,7 +90,7 @@ int amdgpu_disp_priority = 0;
>   int amdgpu_hw_i2c = 0;
>   int amdgpu_pcie_gen2 = -1;
>   int amdgpu_msi = -1;
> -int amdgpu_lockup_timeout = 0;
> +int amdgpu_lockup_timeout = 10000;
>   int amdgpu_dpm = -1;
>   int amdgpu_fw_load_type = -1;
>   int amdgpu_aspm = -1;
> @@ -128,6 +128,7 @@ int amdgpu_param_buf_per_se = 0;
>   int amdgpu_job_hang_limit = 0;
>   int amdgpu_lbpw = -1;
>   int amdgpu_compute_multipipe = -1;
> +int amdgpu_gpu_recovery = 1;
>   
>   MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in megabytes");
>   module_param_named(vramlimit, amdgpu_vram_limit, int, 0600);
> @@ -165,7 +166,7 @@ module_param_named(pcie_gen2, amdgpu_pcie_gen2, int, 0444);
>   MODULE_PARM_DESC(msi, "MSI support (1 = enable, 0 = disable, -1 = auto)");
>   module_param_named(msi, amdgpu_msi, int, 0444);
>   
> -MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default 0 = disable)");
> +MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default 10000)");
>   module_param_named(lockup_timeout, amdgpu_lockup_timeout, int, 0444);
>   
>   MODULE_PARM_DESC(dpm, "DPM support (1 = enable, 0 = disable, -1 = auto)");
> @@ -280,6 +281,9 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444);
>   MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be spread across pipes (1 = enable, 0 = disable, -1 = auto)");
>   module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
>   
> +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable (default) , 0 = disable");
> +module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
> +
>   #ifdef CONFIG_DRM_AMDGPU_SI
>   
>   #if defined(CONFIG_DRM_RADEON) || defined(CONFIG_DRM_RADEON_MODULE)

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter
       [not found]             ` <7daac1b8-38c4-0546-b16f-5d09f71fcd12-5C7GfCeVMHo@public.gmane.org>
@ 2017-12-13 12:53               ` Andrey Grodzovsky
       [not found]                 ` <a0a2cf88-18a3-cdeb-19d4-2ff3e457b98d-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Andrey Grodzovsky @ 2017-12-13 12:53 UTC (permalink / raw)
  To: Christian König, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: maraeo-Re5JQEeQqe8AvxtiuMwx3w, Monk.Liu-5C7GfCeVMHo



On 12/13/2017 07:20 AM, Christian König wrote:
> Am 12.12.2017 um 20:16 schrieb Andrey Grodzovsky:
>> Add new parameter to control GPU recovery procedure.
>> Retire old way of disabling GPU recovery by setting lockup_timeout == 
>> 0 and
>> set default for lockup_timeout to 10s.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 8 ++++++--
>>   3 files changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 3735500..26abe03 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -126,6 +126,7 @@ extern int amdgpu_param_buf_per_se;
>>   extern int amdgpu_job_hang_limit;
>>   extern int amdgpu_lbpw;
>>   extern int amdgpu_compute_multipipe;
>> +extern int amdgpu_gpu_recovery;
>>     #ifdef CONFIG_DRM_AMDGPU_SI
>>   extern int amdgpu_si_support;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 8d03baa..d84b57a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3030,6 +3030,11 @@ int amdgpu_gpu_recover(struct amdgpu_device 
>> *adev, struct amdgpu_job *job)
>>           return 0;
>>       }
>>   +    if (!amdgpu_gpu_recovery) {
>> +        DRM_INFO("GPU recovery disabled.\n");
>> +        return 0;
>> +    }
>> +
>
> Please move this check into the caller of amdgpu_gpu_recover().
>
> This way we can still trigger a GPU recovery manually or from the 
> hypervisor under SRIOV.
>
> Christian.

Problem with this is that amdgpu_check_soft_reset will not be called, 
this function which prints which IP block was hung even when later we 
opt not to recover.
I suggest instead to add a bool force_reset parameter to 
amdgpu_gpu_recover which will override amdgpu_gpu_recovery and we can 
set it to true from amdgpu_debugfs_gpu_recover only.

Thanks,
Andrey

>
>>       dev_info(adev->dev, "GPU reset begin!\n");
>>         mutex_lock(&adev->lock_reset);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 0b039bd..5c612e9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -90,7 +90,7 @@ int amdgpu_disp_priority = 0;
>>   int amdgpu_hw_i2c = 0;
>>   int amdgpu_pcie_gen2 = -1;
>>   int amdgpu_msi = -1;
>> -int amdgpu_lockup_timeout = 0;
>> +int amdgpu_lockup_timeout = 10000;
>>   int amdgpu_dpm = -1;
>>   int amdgpu_fw_load_type = -1;
>>   int amdgpu_aspm = -1;
>> @@ -128,6 +128,7 @@ int amdgpu_param_buf_per_se = 0;
>>   int amdgpu_job_hang_limit = 0;
>>   int amdgpu_lbpw = -1;
>>   int amdgpu_compute_multipipe = -1;
>> +int amdgpu_gpu_recovery = 1;
>>     MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in 
>> megabytes");
>>   module_param_named(vramlimit, amdgpu_vram_limit, int, 0600);
>> @@ -165,7 +166,7 @@ module_param_named(pcie_gen2, amdgpu_pcie_gen2, 
>> int, 0444);
>>   MODULE_PARM_DESC(msi, "MSI support (1 = enable, 0 = disable, -1 = 
>> auto)");
>>   module_param_named(msi, amdgpu_msi, int, 0444);
>>   -MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms 
>> (default 0 = disable)");
>> +MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default 
>> 10000)");
>>   module_param_named(lockup_timeout, amdgpu_lockup_timeout, int, 0444);
>>     MODULE_PARM_DESC(dpm, "DPM support (1 = enable, 0 = disable, -1 = 
>> auto)");
>> @@ -280,6 +281,9 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444);
>>   MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be 
>> spread across pipes (1 = enable, 0 = disable, -1 = auto)");
>>   module_param_named(compute_multipipe, amdgpu_compute_multipipe, 
>> int, 0444);
>>   +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 
>> = enable (default) , 0 = disable");
>> +module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
>> +
>>   #ifdef CONFIG_DRM_AMDGPU_SI
>>     #if defined(CONFIG_DRM_RADEON) || defined(CONFIG_DRM_RADEON_MODULE)
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter
       [not found]                 ` <a0a2cf88-18a3-cdeb-19d4-2ff3e457b98d-5C7GfCeVMHo@public.gmane.org>
@ 2017-12-13 18:12                   ` Christian König
       [not found]                     ` <57efe9e8-51c2-caa4-6f54-fc1c99c626fa-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-12-14  7:19                   ` [PATCH] " Liu, Monk
  1 sibling, 1 reply; 23+ messages in thread
From: Christian König @ 2017-12-13 18:12 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Monk.Liu-5C7GfCeVMHo, maraeo-Re5JQEeQqe8AvxtiuMwx3w

Am 13.12.2017 um 13:53 schrieb Andrey Grodzovsky:
>
>
> On 12/13/2017 07:20 AM, Christian König wrote:
>> Am 12.12.2017 um 20:16 schrieb Andrey Grodzovsky:
>>> Add new parameter to control GPU recovery procedure.
>>> Retire old way of disabling GPU recovery by setting lockup_timeout 
>>> == 0 and
>>> set default for lockup_timeout to 10s.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 8 ++++++--
>>>   3 files changed, 12 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index 3735500..26abe03 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -126,6 +126,7 @@ extern int amdgpu_param_buf_per_se;
>>>   extern int amdgpu_job_hang_limit;
>>>   extern int amdgpu_lbpw;
>>>   extern int amdgpu_compute_multipipe;
>>> +extern int amdgpu_gpu_recovery;
>>>     #ifdef CONFIG_DRM_AMDGPU_SI
>>>   extern int amdgpu_si_support;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 8d03baa..d84b57a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3030,6 +3030,11 @@ int amdgpu_gpu_recover(struct amdgpu_device 
>>> *adev, struct amdgpu_job *job)
>>>           return 0;
>>>       }
>>>   +    if (!amdgpu_gpu_recovery) {
>>> +        DRM_INFO("GPU recovery disabled.\n");
>>> +        return 0;
>>> +    }
>>> +
>>
>> Please move this check into the caller of amdgpu_gpu_recover().
>>
>> This way we can still trigger a GPU recovery manually or from the 
>> hypervisor under SRIOV.
>>
>> Christian.
>
> Problem with this is that amdgpu_check_soft_reset will not be called, 
> this function which prints which IP block was hung even when later we 
> opt not to recover.
> I suggest instead to add a bool force_reset parameter to 
> amdgpu_gpu_recover which will override amdgpu_gpu_recovery and we can 
> set it to true from amdgpu_debugfs_gpu_recover only.

Good point and the solution sounds good to me as well.

Please go ahead with that,
Christian.

>
> Thanks,
> Andrey
>
>>
>>>       dev_info(adev->dev, "GPU reset begin!\n");
>>>         mutex_lock(&adev->lock_reset);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> index 0b039bd..5c612e9 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> @@ -90,7 +90,7 @@ int amdgpu_disp_priority = 0;
>>>   int amdgpu_hw_i2c = 0;
>>>   int amdgpu_pcie_gen2 = -1;
>>>   int amdgpu_msi = -1;
>>> -int amdgpu_lockup_timeout = 0;
>>> +int amdgpu_lockup_timeout = 10000;
>>>   int amdgpu_dpm = -1;
>>>   int amdgpu_fw_load_type = -1;
>>>   int amdgpu_aspm = -1;
>>> @@ -128,6 +128,7 @@ int amdgpu_param_buf_per_se = 0;
>>>   int amdgpu_job_hang_limit = 0;
>>>   int amdgpu_lbpw = -1;
>>>   int amdgpu_compute_multipipe = -1;
>>> +int amdgpu_gpu_recovery = 1;
>>>     MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in 
>>> megabytes");
>>>   module_param_named(vramlimit, amdgpu_vram_limit, int, 0600);
>>> @@ -165,7 +166,7 @@ module_param_named(pcie_gen2, amdgpu_pcie_gen2, 
>>> int, 0444);
>>>   MODULE_PARM_DESC(msi, "MSI support (1 = enable, 0 = disable, -1 = 
>>> auto)");
>>>   module_param_named(msi, amdgpu_msi, int, 0444);
>>>   -MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms 
>>> (default 0 = disable)");
>>> +MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default 
>>> 10000)");
>>>   module_param_named(lockup_timeout, amdgpu_lockup_timeout, int, 0444);
>>>     MODULE_PARM_DESC(dpm, "DPM support (1 = enable, 0 = disable, -1 
>>> = auto)");
>>> @@ -280,6 +281,9 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444);
>>>   MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be 
>>> spread across pipes (1 = enable, 0 = disable, -1 = auto)");
>>>   module_param_named(compute_multipipe, amdgpu_compute_multipipe, 
>>> int, 0444);
>>>   +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 
>>> = enable (default) , 0 = disable");
>>> +module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
>>> +
>>>   #ifdef CONFIG_DRM_AMDGPU_SI
>>>     #if defined(CONFIG_DRM_RADEON) || defined(CONFIG_DRM_RADEON_MODULE)
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2] drm/amdgpu: Add gpu_recovery parameter
       [not found]                     ` <57efe9e8-51c2-caa4-6f54-fc1c99c626fa-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-12-13 19:01                       ` Andrey Grodzovsky
       [not found]                         ` <1513191692-6034-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Andrey Grodzovsky @ 2017-12-13 19:01 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, christian.koenig-5C7GfCeVMHo
  Cc: Andrey Grodzovsky, Monk.Liu-5C7GfCeVMHo, maraeo-Re5JQEeQqe8AvxtiuMwx3w

Add new parameter to control GPU recovery procedure.
Retire old way of disabling GPU recovery by setting lockup_timeout == 0 and
set default for lockup_timeout to 10s.

v2:
Add auto logic where reset is disabled for bare metal and enabled
for SR-IOV.
Allow forced reset from debugfs.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 ++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 4 ++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      | 2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      | 2 +-
 8 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 3735500..d7f0263 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -126,6 +126,7 @@ extern int amdgpu_param_buf_per_se;
 extern int amdgpu_job_hang_limit;
 extern int amdgpu_lbpw;
 extern int amdgpu_compute_multipipe;
+extern int amdgpu_gpu_recovery;
 
 #ifdef CONFIG_DRM_AMDGPU_SI
 extern int amdgpu_si_support;
@@ -1879,7 +1880,7 @@ amdgpu_get_sdma_instance(struct amdgpu_ring *ring)
 #define amdgpu_psp_check_fw_loading_status(adev, i) (adev)->firmware.funcs->check_fw_loading_status((adev), (i))
 
 /* Common functions */
-int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job* job);
+int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job* job, bool force);
 bool amdgpu_need_backup(struct amdgpu_device *adev);
 void amdgpu_pci_config_reset(struct amdgpu_device *adev);
 bool amdgpu_need_post(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8d03baa..a074502 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3015,11 +3015,12 @@ static int amdgpu_reset_sriov(struct amdgpu_device *adev, uint64_t *reset_flags,
  *
  * @adev: amdgpu device pointer
  * @job: which job trigger hang
+ * @force forces reset regardless of amdgpu_gpu_recovery
  *
  * Attempt to reset the GPU if it has hung (all asics).
  * Returns 0 for success or an error on failure.
  */
-int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
+int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job, bool force)
 {
 	struct drm_atomic_state *state = NULL;
 	uint64_t reset_flags = 0;
@@ -3030,6 +3031,12 @@ int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
 		return 0;
 	}
 
+	if (!force && (amdgpu_gpu_recovery == 0 ||
+			(amdgpu_gpu_recovery == -1  && !amdgpu_sriov_vf(adev)))) {
+		DRM_INFO("GPU recovery disabled.\n");
+		return 0;
+	}
+
 	dev_info(adev->dev, "GPU reset begin!\n");
 
 	mutex_lock(&adev->lock_reset);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 0b039bd..b734cd6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -128,6 +128,7 @@ int amdgpu_param_buf_per_se = 0;
 int amdgpu_job_hang_limit = 0;
 int amdgpu_lbpw = -1;
 int amdgpu_compute_multipipe = -1;
+int amdgpu_gpu_recovery = -1; /* auto */
 
 MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in megabytes");
 module_param_named(vramlimit, amdgpu_vram_limit, int, 0600);
@@ -280,6 +281,9 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444);
 MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be spread across pipes (1 = enable, 0 = disable, -1 = auto)");
 module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
 
+MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto");
+module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
+
 #ifdef CONFIG_DRM_AMDGPU_SI
 
 #if defined(CONFIG_DRM_RADEON) || defined(CONFIG_DRM_RADEON_MODULE)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 1469963..854baf0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -705,7 +705,7 @@ static int amdgpu_debugfs_gpu_recover(struct seq_file *m, void *data)
 	struct amdgpu_device *adev = dev->dev_private;
 
 	seq_printf(m, "gpu recover\n");
-	amdgpu_gpu_recover(adev, NULL);
+	amdgpu_gpu_recover(adev, NULL, true);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index c340774..c43643e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -88,7 +88,7 @@ static void amdgpu_irq_reset_work_func(struct work_struct *work)
 						  reset_work);
 
 	if (!amdgpu_sriov_vf(adev))
-		amdgpu_gpu_recover(adev, NULL);
+		amdgpu_gpu_recover(adev, NULL, false);
 }
 
 /* Disable *all* interrupts */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 013c0a8..be8a437 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -37,7 +37,7 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
 		  atomic_read(&job->ring->fence_drv.last_seq),
 		  job->ring->fence_drv.sync_seq);
 
-	amdgpu_gpu_recover(job->adev, job);
+	amdgpu_gpu_recover(job->adev, job, false);
 }
 
 int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 71f5690..7ade56d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -253,7 +253,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 	}
 
 	/* Trigger recovery due to world switch failure */
-	amdgpu_gpu_recover(adev, NULL);
+	amdgpu_gpu_recover(adev, NULL, false);
 }
 
 static int xgpu_ai_set_mailbox_rcv_irq(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index df52824..e05823d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -521,7 +521,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
 	}
 
 	/* Trigger recovery due to world switch failure */
-	amdgpu_gpu_recover(adev, NULL);
+	amdgpu_gpu_recover(adev, NULL, false);
 }
 
 static int xgpu_vi_set_mailbox_rcv_irq(struct amdgpu_device *adev,
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* RE: [PATCH] drm/amdgpu: Add gpu_recovery parameter
       [not found]         ` <1513106162-19664-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
  2017-12-12 19:26           ` Alex Deucher
  2017-12-13 12:20           ` Christian König
@ 2017-12-14  7:16           ` Liu, Monk
       [not found]             ` <BLUPR12MB0449AF7089DA4708747E285D840A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2 siblings, 1 reply; 23+ messages in thread
From: Liu, Monk @ 2017-12-14  7:16 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Koenig, Christian
  Cc: Grodzovsky, Andrey, maraeo-Re5JQEeQqe8AvxtiuMwx3w

Andrey

You patch looks breaks the logic for SRIOV, please check function "xgpu_ai_mailbox_flr_work"
This function manually triggers GPU_RECOVER by the will of hypervisor.

Your check of :
+	if (!amdgpu_gpu_recovery) {
+		DRM_INFO("GPU recovery disabled.\n");
+		return 0;
+	}

Actually breaks the SRIOV logic 

I have two idea:
1) Please change to : If (!amdgpu_gpu_recover && !amdgpu_sriov_vf(adev))
2) please add another parameter "force_gpu_recover", and set it to true in driver init stage, and 
  In your check, you can change to: if(!amdgpu_gpu_recovery && !force_gpu_recover)

BR Monk

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Andrey Grodzovsky
Sent: 2017年12月13日 3:16
To: amd-gfx@lists.freedesktop.org; Koenig, Christian <Christian.Koenig@amd.com>
Cc: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Liu, Monk <Monk.Liu@amd.com>; maraeo@gmail.com
Subject: [PATCH] drm/amdgpu: Add gpu_recovery parameter

Add new parameter to control GPU recovery procedure.
Retire old way of disabling GPU recovery by setting lockup_timeout == 0 and set default for lockup_timeout to 10s.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 8 ++++++--
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 3735500..26abe03 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -126,6 +126,7 @@ extern int amdgpu_param_buf_per_se;  extern int amdgpu_job_hang_limit;  extern int amdgpu_lbpw;  extern int amdgpu_compute_multipipe;
+extern int amdgpu_gpu_recovery;
 
 #ifdef CONFIG_DRM_AMDGPU_SI
 extern int amdgpu_si_support;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8d03baa..d84b57a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3030,6 +3030,11 @@ int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
 		return 0;
 	}
 
+	if (!amdgpu_gpu_recovery) {
+		DRM_INFO("GPU recovery disabled.\n");
+		return 0;
+	}
+
 	dev_info(adev->dev, "GPU reset begin!\n");
 
 	mutex_lock(&adev->lock_reset);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 0b039bd..5c612e9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -90,7 +90,7 @@ int amdgpu_disp_priority = 0;  int amdgpu_hw_i2c = 0;  int amdgpu_pcie_gen2 = -1;  int amdgpu_msi = -1; -int amdgpu_lockup_timeout = 0;
+int amdgpu_lockup_timeout = 10000;
 int amdgpu_dpm = -1;
 int amdgpu_fw_load_type = -1;
 int amdgpu_aspm = -1;
@@ -128,6 +128,7 @@ int amdgpu_param_buf_per_se = 0;  int amdgpu_job_hang_limit = 0;  int amdgpu_lbpw = -1;  int amdgpu_compute_multipipe = -1;
+int amdgpu_gpu_recovery = 1;
 
 MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in megabytes");  module_param_named(vramlimit, amdgpu_vram_limit, int, 0600); @@ -165,7 +166,7 @@ module_param_named(pcie_gen2, amdgpu_pcie_gen2, int, 0444);  MODULE_PARM_DESC(msi, "MSI support (1 = enable, 0 = disable, -1 = auto)");  module_param_named(msi, amdgpu_msi, int, 0444);
 
-MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default 0 = disable)");
+MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default 
+10000)");
 module_param_named(lockup_timeout, amdgpu_lockup_timeout, int, 0444);
 
 MODULE_PARM_DESC(dpm, "DPM support (1 = enable, 0 = disable, -1 = auto)"); @@ -280,6 +281,9 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444);  MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be spread across pipes (1 = enable, 0 = disable, -1 = auto)");  module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
 
+MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = 
+enable (default) , 0 = disable"); module_param_named(gpu_recovery, 
+amdgpu_gpu_recovery, int, 0444);
+
 #ifdef CONFIG_DRM_AMDGPU_SI
 
 #if defined(CONFIG_DRM_RADEON) || defined(CONFIG_DRM_RADEON_MODULE)
--
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* RE: [PATCH] drm/amdgpu: Add gpu_recovery parameter
       [not found]                 ` <a0a2cf88-18a3-cdeb-19d4-2ff3e457b98d-5C7GfCeVMHo@public.gmane.org>
  2017-12-13 18:12                   ` Christian König
@ 2017-12-14  7:19                   ` Liu, Monk
       [not found]                     ` <BLUPR12MB0449000FB7842B0C60C0C65D840A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  1 sibling, 1 reply; 23+ messages in thread
From: Liu, Monk @ 2017-12-14  7:19 UTC (permalink / raw)
  To: Grodzovsky, Andrey, Koenig, Christian,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: maraeo-Re5JQEeQqe8AvxtiuMwx3w

> Problem with this is that amdgpu_check_soft_reset will not be called, this function which prints which IP block was hung even when later we opt not to recover.
I suggest instead to add a bool force_reset parameter to amdgpu_gpu_recover which will override amdgpu_gpu_recovery and we can set it to true from amdgpu_debugfs_gpu_recover only.

[ML] why you need "check_soft_reset" be called ? I think soft reset checking is useless totally ... because with TDR feature, the only thing can 
Tell  you if GPU hang is time out warning.

For soft checking, it only shows you if some IP is busy or not, but busy may not prove the engine is hang , it may just busy .... 


BR Monk

-----Original Message-----
From: Grodzovsky, Andrey 
Sent: 2017年12月13日 20:53
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Liu, Monk <Monk.Liu@amd.com>; maraeo@gmail.com
Subject: Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter



On 12/13/2017 07:20 AM, Christian König wrote:
> Am 12.12.2017 um 20:16 schrieb Andrey Grodzovsky:
>> Add new parameter to control GPU recovery procedure.
>> Retire old way of disabling GPU recovery by setting lockup_timeout ==
>> 0 and
>> set default for lockup_timeout to 10s.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 8 ++++++--
>>   3 files changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 3735500..26abe03 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -126,6 +126,7 @@ extern int amdgpu_param_buf_per_se;
>>   extern int amdgpu_job_hang_limit;
>>   extern int amdgpu_lbpw;
>>   extern int amdgpu_compute_multipipe;
>> +extern int amdgpu_gpu_recovery;
>>     #ifdef CONFIG_DRM_AMDGPU_SI
>>   extern int amdgpu_si_support;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 8d03baa..d84b57a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3030,6 +3030,11 @@ int amdgpu_gpu_recover(struct amdgpu_device 
>> *adev, struct amdgpu_job *job)
>>           return 0;
>>       }
>>   +    if (!amdgpu_gpu_recovery) {
>> +        DRM_INFO("GPU recovery disabled.\n");
>> +        return 0;
>> +    }
>> +
>
> Please move this check into the caller of amdgpu_gpu_recover().
>
> This way we can still trigger a GPU recovery manually or from the 
> hypervisor under SRIOV.
>
> Christian.

Problem with this is that amdgpu_check_soft_reset will not be called, this function which prints which IP block was hung even when later we opt not to recover.
I suggest instead to add a bool force_reset parameter to amdgpu_gpu_recover which will override amdgpu_gpu_recovery and we can set it to true from amdgpu_debugfs_gpu_recover only.

Thanks,
Andrey

>
>>       dev_info(adev->dev, "GPU reset begin!\n");
>>         mutex_lock(&adev->lock_reset); diff --git 
>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 0b039bd..5c612e9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -90,7 +90,7 @@ int amdgpu_disp_priority = 0;
>>   int amdgpu_hw_i2c = 0;
>>   int amdgpu_pcie_gen2 = -1;
>>   int amdgpu_msi = -1;
>> -int amdgpu_lockup_timeout = 0;
>> +int amdgpu_lockup_timeout = 10000;
>>   int amdgpu_dpm = -1;
>>   int amdgpu_fw_load_type = -1;
>>   int amdgpu_aspm = -1;
>> @@ -128,6 +128,7 @@ int amdgpu_param_buf_per_se = 0;
>>   int amdgpu_job_hang_limit = 0;
>>   int amdgpu_lbpw = -1;
>>   int amdgpu_compute_multipipe = -1;
>> +int amdgpu_gpu_recovery = 1;
>>     MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in 
>> megabytes");
>>   module_param_named(vramlimit, amdgpu_vram_limit, int, 0600); @@ 
>> -165,7 +166,7 @@ module_param_named(pcie_gen2, amdgpu_pcie_gen2, int, 
>> 0444);
>>   MODULE_PARM_DESC(msi, "MSI support (1 = enable, 0 = disable, -1 = 
>> auto)");
>>   module_param_named(msi, amdgpu_msi, int, 0444);
>>   -MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms 
>> (default 0 = disable)");
>> +MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default
>> 10000)");
>>   module_param_named(lockup_timeout, amdgpu_lockup_timeout, int, 
>> 0444);
>>     MODULE_PARM_DESC(dpm, "DPM support (1 = enable, 0 = disable, -1 = 
>> auto)"); @@ -280,6 +281,9 @@ module_param_named(lbpw, amdgpu_lbpw, 
>> int, 0444);
>>   MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be 
>> spread across pipes (1 = enable, 0 = disable, -1 = auto)");
>>   module_param_named(compute_multipipe, amdgpu_compute_multipipe, 
>> int, 0444);
>>   +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 
>> = enable (default) , 0 = disable");
>> +module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
>> +
>>   #ifdef CONFIG_DRM_AMDGPU_SI
>>     #if defined(CONFIG_DRM_RADEON) || 
>> defined(CONFIG_DRM_RADEON_MODULE)
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter
       [not found]                     ` <BLUPR12MB0449000FB7842B0C60C0C65D840A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-12-14  8:52                       ` Christian König
       [not found]                         ` <aa8dc97b-c27e-fe7d-77e9-b273c6ed2aed-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Christian König @ 2017-12-14  8:52 UTC (permalink / raw)
  To: Liu, Monk, Grodzovsky, Andrey, Koenig, Christian,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: maraeo-Re5JQEeQqe8AvxtiuMwx3w

Am 14.12.2017 um 08:19 schrieb Liu, Monk:
>> Problem with this is that amdgpu_check_soft_reset will not be called, this function which prints which IP block was hung even when later we opt not to recover.
> I suggest instead to add a bool force_reset parameter to amdgpu_gpu_recover which will override amdgpu_gpu_recovery and we can set it to true from amdgpu_debugfs_gpu_recover only.
>
> [ML] why you need "check_soft_reset" be called ? I think soft reset checking is useless totally ... because with TDR feature, the only thing can
> Tell  you if GPU hang is time out warning.

And that's exactly why we call it, we just want the warning in the logs.

Christian.

>
> For soft checking, it only shows you if some IP is busy or not, but busy may not prove the engine is hang , it may just busy ....
>
>
> BR Monk
>
> -----Original Message-----
> From: Grodzovsky, Andrey
> Sent: 2017年12月13日 20:53
> To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Liu, Monk <Monk.Liu@amd.com>; maraeo@gmail.com
> Subject: Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter
>
>
>
> On 12/13/2017 07:20 AM, Christian König wrote:
>> Am 12.12.2017 um 20:16 schrieb Andrey Grodzovsky:
>>> Add new parameter to control GPU recovery procedure.
>>> Retire old way of disabling GPU recovery by setting lockup_timeout ==
>>> 0 and
>>> set default for lockup_timeout to 10s.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 8 ++++++--
>>>    3 files changed, 12 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index 3735500..26abe03 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -126,6 +126,7 @@ extern int amdgpu_param_buf_per_se;
>>>    extern int amdgpu_job_hang_limit;
>>>    extern int amdgpu_lbpw;
>>>    extern int amdgpu_compute_multipipe;
>>> +extern int amdgpu_gpu_recovery;
>>>      #ifdef CONFIG_DRM_AMDGPU_SI
>>>    extern int amdgpu_si_support;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 8d03baa..d84b57a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3030,6 +3030,11 @@ int amdgpu_gpu_recover(struct amdgpu_device
>>> *adev, struct amdgpu_job *job)
>>>            return 0;
>>>        }
>>>    +    if (!amdgpu_gpu_recovery) {
>>> +        DRM_INFO("GPU recovery disabled.\n");
>>> +        return 0;
>>> +    }
>>> +
>> Please move this check into the caller of amdgpu_gpu_recover().
>>
>> This way we can still trigger a GPU recovery manually or from the
>> hypervisor under SRIOV.
>>
>> Christian.
> Problem with this is that amdgpu_check_soft_reset will not be called, this function which prints which IP block was hung even when later we opt not to recover.
> I suggest instead to add a bool force_reset parameter to amdgpu_gpu_recover which will override amdgpu_gpu_recovery and we can set it to true from amdgpu_debugfs_gpu_recover only.
>
> Thanks,
> Andrey
>
>>>        dev_info(adev->dev, "GPU reset begin!\n");
>>>          mutex_lock(&adev->lock_reset); diff --git
>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> index 0b039bd..5c612e9 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> @@ -90,7 +90,7 @@ int amdgpu_disp_priority = 0;
>>>    int amdgpu_hw_i2c = 0;
>>>    int amdgpu_pcie_gen2 = -1;
>>>    int amdgpu_msi = -1;
>>> -int amdgpu_lockup_timeout = 0;
>>> +int amdgpu_lockup_timeout = 10000;
>>>    int amdgpu_dpm = -1;
>>>    int amdgpu_fw_load_type = -1;
>>>    int amdgpu_aspm = -1;
>>> @@ -128,6 +128,7 @@ int amdgpu_param_buf_per_se = 0;
>>>    int amdgpu_job_hang_limit = 0;
>>>    int amdgpu_lbpw = -1;
>>>    int amdgpu_compute_multipipe = -1;
>>> +int amdgpu_gpu_recovery = 1;
>>>      MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in
>>> megabytes");
>>>    module_param_named(vramlimit, amdgpu_vram_limit, int, 0600); @@
>>> -165,7 +166,7 @@ module_param_named(pcie_gen2, amdgpu_pcie_gen2, int,
>>> 0444);
>>>    MODULE_PARM_DESC(msi, "MSI support (1 = enable, 0 = disable, -1 =
>>> auto)");
>>>    module_param_named(msi, amdgpu_msi, int, 0444);
>>>    -MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms
>>> (default 0 = disable)");
>>> +MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default
>>> 10000)");
>>>    module_param_named(lockup_timeout, amdgpu_lockup_timeout, int,
>>> 0444);
>>>      MODULE_PARM_DESC(dpm, "DPM support (1 = enable, 0 = disable, -1 =
>>> auto)"); @@ -280,6 +281,9 @@ module_param_named(lbpw, amdgpu_lbpw,
>>> int, 0444);
>>>    MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be
>>> spread across pipes (1 = enable, 0 = disable, -1 = auto)");
>>>    module_param_named(compute_multipipe, amdgpu_compute_multipipe,
>>> int, 0444);
>>>    +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1
>>> = enable (default) , 0 = disable");
>>> +module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
>>> +
>>>    #ifdef CONFIG_DRM_AMDGPU_SI
>>>      #if defined(CONFIG_DRM_RADEON) ||
>>> defined(CONFIG_DRM_RADEON_MODULE)
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter
       [not found]             ` <BLUPR12MB0449AF7089DA4708747E285D840A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-12-14 11:19               ` Andrey Grodzovsky
  0 siblings, 0 replies; 23+ messages in thread
From: Andrey Grodzovsky @ 2017-12-14 11:19 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Koenig, Christian
  Cc: maraeo-Re5JQEeQqe8AvxtiuMwx3w



On 12/14/2017 02:16 AM, Liu, Monk wrote:
> Andrey
>
> You patch looks breaks the logic for SRIOV, please check function "xgpu_ai_mailbox_flr_work"
> This function manually triggers GPU_RECOVER by the will of hypervisor.
>
> Your check of :
> +	if (!amdgpu_gpu_recovery) {
> +		DRM_INFO("GPU recovery disabled.\n");
> +		return 0;
> +	}
>
> Actually breaks the SRIOV logic
>
> I have two idea:
> 1) Please change to : If (!amdgpu_gpu_recover && !amdgpu_sriov_vf(adev))
> 2) please add another parameter "force_gpu_recover", and set it to true in driver init stage, and
>    In your check, you can change to: if(!amdgpu_gpu_recovery && !force_gpu_recover)
>
> BR Monk

This patch version is outdated, I think you already saw the later one.

Thanks,
Andrey

>
> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Andrey Grodzovsky
> Sent: 2017年12月13日 3:16
> To: amd-gfx@lists.freedesktop.org; Koenig, Christian <Christian.Koenig@amd.com>
> Cc: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Liu, Monk <Monk.Liu@amd.com>; maraeo@gmail.com
> Subject: [PATCH] drm/amdgpu: Add gpu_recovery parameter
>
> Add new parameter to control GPU recovery procedure.
> Retire old way of disabling GPU recovery by setting lockup_timeout == 0 and set default for lockup_timeout to 10s.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 8 ++++++--
>   3 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 3735500..26abe03 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -126,6 +126,7 @@ extern int amdgpu_param_buf_per_se;  extern int amdgpu_job_hang_limit;  extern int amdgpu_lbpw;  extern int amdgpu_compute_multipipe;
> +extern int amdgpu_gpu_recovery;
>   
>   #ifdef CONFIG_DRM_AMDGPU_SI
>   extern int amdgpu_si_support;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 8d03baa..d84b57a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3030,6 +3030,11 @@ int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
>   		return 0;
>   	}
>   
> +	if (!amdgpu_gpu_recovery) {
> +		DRM_INFO("GPU recovery disabled.\n");
> +		return 0;
> +	}
> +
>   	dev_info(adev->dev, "GPU reset begin!\n");
>   
>   	mutex_lock(&adev->lock_reset);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 0b039bd..5c612e9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -90,7 +90,7 @@ int amdgpu_disp_priority = 0;  int amdgpu_hw_i2c = 0;  int amdgpu_pcie_gen2 = -1;  int amdgpu_msi = -1; -int amdgpu_lockup_timeout = 0;
> +int amdgpu_lockup_timeout = 10000;
>   int amdgpu_dpm = -1;
>   int amdgpu_fw_load_type = -1;
>   int amdgpu_aspm = -1;
> @@ -128,6 +128,7 @@ int amdgpu_param_buf_per_se = 0;  int amdgpu_job_hang_limit = 0;  int amdgpu_lbpw = -1;  int amdgpu_compute_multipipe = -1;
> +int amdgpu_gpu_recovery = 1;
>   
>   MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in megabytes");  module_param_named(vramlimit, amdgpu_vram_limit, int, 0600); @@ -165,7 +166,7 @@ module_param_named(pcie_gen2, amdgpu_pcie_gen2, int, 0444);  MODULE_PARM_DESC(msi, "MSI support (1 = enable, 0 = disable, -1 = auto)");  module_param_named(msi, amdgpu_msi, int, 0444);
>   
> -MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default 0 = disable)");
> +MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default
> +10000)");
>   module_param_named(lockup_timeout, amdgpu_lockup_timeout, int, 0444);
>   
>   MODULE_PARM_DESC(dpm, "DPM support (1 = enable, 0 = disable, -1 = auto)"); @@ -280,6 +281,9 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444);  MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be spread across pipes (1 = enable, 0 = disable, -1 = auto)");  module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
>   
> +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 =
> +enable (default) , 0 = disable"); module_param_named(gpu_recovery,
> +amdgpu_gpu_recovery, int, 0444);
> +
>   #ifdef CONFIG_DRM_AMDGPU_SI
>   
>   #if defined(CONFIG_DRM_RADEON) || defined(CONFIG_DRM_RADEON_MODULE)
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter
       [not found]                         ` <aa8dc97b-c27e-fe7d-77e9-b273c6ed2aed-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-12-15 13:05                           ` Andrey Grodzovsky
       [not found]                             ` <627671a2-3b2d-3750-301e-8d9f40e36b48-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Andrey Grodzovsky @ 2017-12-15 13:05 UTC (permalink / raw)
  To: christian.koenig-5C7GfCeVMHo, Liu, Monk,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: maraeo-Re5JQEeQqe8AvxtiuMwx3w



On 12/14/2017 03:52 AM, Christian König wrote:
> Am 14.12.2017 um 08:19 schrieb Liu, Monk:
>>> Problem with this is that amdgpu_check_soft_reset will not be 
>>> called, this function which prints which IP block was hung even when 
>>> later we opt not to recover.
>> I suggest instead to add a bool force_reset parameter to 
>> amdgpu_gpu_recover which will override amdgpu_gpu_recovery and we can 
>> set it to true from amdgpu_debugfs_gpu_recover only.
>>
>> [ML] why you need "check_soft_reset" be called ? I think soft reset 
>> checking is useless totally ... because with TDR feature, the only 
>> thing can
>> Tell  you if GPU hang is time out warning.
>
> And that's exactly why we call it, we just want the warning in the logs.
>
> Christian.

Just a ping for any more comments or a RB.

Thanks,
Andrey

>
>>
>> For soft checking, it only shows you if some IP is busy or not, but 
>> busy may not prove the engine is hang , it may just busy ....
>>
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Grodzovsky, Andrey
>> Sent: 2017年12月13日 20:53
>> To: Koenig, Christian <Christian.Koenig@amd.com>; 
>> amd-gfx@lists.freedesktop.org
>> Cc: Liu, Monk <Monk.Liu@amd.com>; maraeo@gmail.com
>> Subject: Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter
>>
>>
>>
>> On 12/13/2017 07:20 AM, Christian König wrote:
>>> Am 12.12.2017 um 20:16 schrieb Andrey Grodzovsky:
>>>> Add new parameter to control GPU recovery procedure.
>>>> Retire old way of disabling GPU recovery by setting lockup_timeout ==
>>>> 0 and
>>>> set default for lockup_timeout to 10s.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 8 ++++++--
>>>>    3 files changed, 12 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> index 3735500..26abe03 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> @@ -126,6 +126,7 @@ extern int amdgpu_param_buf_per_se;
>>>>    extern int amdgpu_job_hang_limit;
>>>>    extern int amdgpu_lbpw;
>>>>    extern int amdgpu_compute_multipipe;
>>>> +extern int amdgpu_gpu_recovery;
>>>>      #ifdef CONFIG_DRM_AMDGPU_SI
>>>>    extern int amdgpu_si_support;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 8d03baa..d84b57a 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -3030,6 +3030,11 @@ int amdgpu_gpu_recover(struct amdgpu_device
>>>> *adev, struct amdgpu_job *job)
>>>>            return 0;
>>>>        }
>>>>    +    if (!amdgpu_gpu_recovery) {
>>>> +        DRM_INFO("GPU recovery disabled.\n");
>>>> +        return 0;
>>>> +    }
>>>> +
>>> Please move this check into the caller of amdgpu_gpu_recover().
>>>
>>> This way we can still trigger a GPU recovery manually or from the
>>> hypervisor under SRIOV.
>>>
>>> Christian.
>> Problem with this is that amdgpu_check_soft_reset will not be called, 
>> this function which prints which IP block was hung even when later we 
>> opt not to recover.
>> I suggest instead to add a bool force_reset parameter to 
>> amdgpu_gpu_recover which will override amdgpu_gpu_recovery and we can 
>> set it to true from amdgpu_debugfs_gpu_recover only.
>>
>> Thanks,
>> Andrey
>>
>>>>        dev_info(adev->dev, "GPU reset begin!\n");
>>>>          mutex_lock(&adev->lock_reset); diff --git
>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> index 0b039bd..5c612e9 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> @@ -90,7 +90,7 @@ int amdgpu_disp_priority = 0;
>>>>    int amdgpu_hw_i2c = 0;
>>>>    int amdgpu_pcie_gen2 = -1;
>>>>    int amdgpu_msi = -1;
>>>> -int amdgpu_lockup_timeout = 0;
>>>> +int amdgpu_lockup_timeout = 10000;
>>>>    int amdgpu_dpm = -1;
>>>>    int amdgpu_fw_load_type = -1;
>>>>    int amdgpu_aspm = -1;
>>>> @@ -128,6 +128,7 @@ int amdgpu_param_buf_per_se = 0;
>>>>    int amdgpu_job_hang_limit = 0;
>>>>    int amdgpu_lbpw = -1;
>>>>    int amdgpu_compute_multipipe = -1;
>>>> +int amdgpu_gpu_recovery = 1;
>>>>      MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in
>>>> megabytes");
>>>>    module_param_named(vramlimit, amdgpu_vram_limit, int, 0600); @@
>>>> -165,7 +166,7 @@ module_param_named(pcie_gen2, amdgpu_pcie_gen2, int,
>>>> 0444);
>>>>    MODULE_PARM_DESC(msi, "MSI support (1 = enable, 0 = disable, -1 =
>>>> auto)");
>>>>    module_param_named(msi, amdgpu_msi, int, 0444);
>>>>    -MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms
>>>> (default 0 = disable)");
>>>> +MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default
>>>> 10000)");
>>>>    module_param_named(lockup_timeout, amdgpu_lockup_timeout, int,
>>>> 0444);
>>>>      MODULE_PARM_DESC(dpm, "DPM support (1 = enable, 0 = disable, -1 =
>>>> auto)"); @@ -280,6 +281,9 @@ module_param_named(lbpw, amdgpu_lbpw,
>>>> int, 0444);
>>>>    MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be
>>>> spread across pipes (1 = enable, 0 = disable, -1 = auto)");
>>>>    module_param_named(compute_multipipe, amdgpu_compute_multipipe,
>>>> int, 0444);
>>>>    +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1
>>>> = enable (default) , 0 = disable");
>>>> +module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
>>>> +
>>>>    #ifdef CONFIG_DRM_AMDGPU_SI
>>>>      #if defined(CONFIG_DRM_RADEON) ||
>>>> defined(CONFIG_DRM_RADEON_MODULE)
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] drm/amdgpu: Add gpu_recovery parameter
       [not found]                         ` <1513191692-6034-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
@ 2017-12-15 13:19                           ` Christian König
  0 siblings, 0 replies; 23+ messages in thread
From: Christian König @ 2017-12-15 13:19 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Monk.Liu-5C7GfCeVMHo, maraeo-Re5JQEeQqe8AvxtiuMwx3w

Am 13.12.2017 um 20:01 schrieb Andrey Grodzovsky:
> Add new parameter to control GPU recovery procedure.
> Retire old way of disabling GPU recovery by setting lockup_timeout == 0 and
> set default for lockup_timeout to 10s.
>
> v2:
> Add auto logic where reset is disabled for bare metal and enabled
> for SR-IOV.
> Allow forced reset from debugfs.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 3 ++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 ++++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 4 ++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    | 2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 2 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      | 2 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      | 2 +-
>   8 files changed, 19 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 3735500..d7f0263 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -126,6 +126,7 @@ extern int amdgpu_param_buf_per_se;
>   extern int amdgpu_job_hang_limit;
>   extern int amdgpu_lbpw;
>   extern int amdgpu_compute_multipipe;
> +extern int amdgpu_gpu_recovery;
>   
>   #ifdef CONFIG_DRM_AMDGPU_SI
>   extern int amdgpu_si_support;
> @@ -1879,7 +1880,7 @@ amdgpu_get_sdma_instance(struct amdgpu_ring *ring)
>   #define amdgpu_psp_check_fw_loading_status(adev, i) (adev)->firmware.funcs->check_fw_loading_status((adev), (i))
>   
>   /* Common functions */
> -int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job* job);
> +int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job* job, bool force);
>   bool amdgpu_need_backup(struct amdgpu_device *adev);
>   void amdgpu_pci_config_reset(struct amdgpu_device *adev);
>   bool amdgpu_need_post(struct amdgpu_device *adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 8d03baa..a074502 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3015,11 +3015,12 @@ static int amdgpu_reset_sriov(struct amdgpu_device *adev, uint64_t *reset_flags,
>    *
>    * @adev: amdgpu device pointer
>    * @job: which job trigger hang
> + * @force forces reset regardless of amdgpu_gpu_recovery
>    *
>    * Attempt to reset the GPU if it has hung (all asics).
>    * Returns 0 for success or an error on failure.
>    */
> -int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
> +int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job, bool force)
>   {
>   	struct drm_atomic_state *state = NULL;
>   	uint64_t reset_flags = 0;
> @@ -3030,6 +3031,12 @@ int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
>   		return 0;
>   	}
>   
> +	if (!force && (amdgpu_gpu_recovery == 0 ||
> +			(amdgpu_gpu_recovery == -1  && !amdgpu_sriov_vf(adev)))) {
> +		DRM_INFO("GPU recovery disabled.\n");
> +		return 0;
> +	}
> +
>   	dev_info(adev->dev, "GPU reset begin!\n");
>   
>   	mutex_lock(&adev->lock_reset);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 0b039bd..b734cd6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -128,6 +128,7 @@ int amdgpu_param_buf_per_se = 0;
>   int amdgpu_job_hang_limit = 0;
>   int amdgpu_lbpw = -1;
>   int amdgpu_compute_multipipe = -1;
> +int amdgpu_gpu_recovery = -1; /* auto */
>   
>   MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in megabytes");
>   module_param_named(vramlimit, amdgpu_vram_limit, int, 0600);
> @@ -280,6 +281,9 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444);
>   MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be spread across pipes (1 = enable, 0 = disable, -1 = auto)");
>   module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
>   
> +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto");
> +module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
> +
>   #ifdef CONFIG_DRM_AMDGPU_SI
>   
>   #if defined(CONFIG_DRM_RADEON) || defined(CONFIG_DRM_RADEON_MODULE)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 1469963..854baf0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -705,7 +705,7 @@ static int amdgpu_debugfs_gpu_recover(struct seq_file *m, void *data)
>   	struct amdgpu_device *adev = dev->dev_private;
>   
>   	seq_printf(m, "gpu recover\n");
> -	amdgpu_gpu_recover(adev, NULL);
> +	amdgpu_gpu_recover(adev, NULL, true);
>   
>   	return 0;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> index c340774..c43643e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> @@ -88,7 +88,7 @@ static void amdgpu_irq_reset_work_func(struct work_struct *work)
>   						  reset_work);
>   
>   	if (!amdgpu_sriov_vf(adev))
> -		amdgpu_gpu_recover(adev, NULL);
> +		amdgpu_gpu_recover(adev, NULL, false);
>   }
>   
>   /* Disable *all* interrupts */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 013c0a8..be8a437 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -37,7 +37,7 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>   		  atomic_read(&job->ring->fence_drv.last_seq),
>   		  job->ring->fence_drv.sync_seq);
>   
> -	amdgpu_gpu_recover(job->adev, job);
> +	amdgpu_gpu_recover(job->adev, job, false);
>   }
>   
>   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> index 71f5690..7ade56d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> @@ -253,7 +253,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>   	}
>   
>   	/* Trigger recovery due to world switch failure */
> -	amdgpu_gpu_recover(adev, NULL);
> +	amdgpu_gpu_recover(adev, NULL, false);
>   }
>   
>   static int xgpu_ai_set_mailbox_rcv_irq(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> index df52824..e05823d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> @@ -521,7 +521,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
>   	}
>   
>   	/* Trigger recovery due to world switch failure */
> -	amdgpu_gpu_recover(adev, NULL);
> +	amdgpu_gpu_recover(adev, NULL, false);
>   }
>   
>   static int xgpu_vi_set_mailbox_rcv_irq(struct amdgpu_device *adev,

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0
       [not found]         ` <CAAxE2A6ZCHrkKM81vLBEubB-fLQCQuBtDYNjmTvDqEDKVLVDgw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-12-18  7:54           ` Liu, Monk
  0 siblings, 0 replies; 23+ messages in thread
From: Liu, Monk @ 2017-12-18  7:54 UTC (permalink / raw)
  To: Marek Olšák; +Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Lockup_timeout = 0 doesn't indicate GPU reset isn't ready, kernel/amdgpu never tell you that, instead if means there is no Timeout of jobs 
So no warning, no gpu recover triggered by time out event, but that doesn't mean gpu recover cannot be triggered, e.g. for SRIOV we can
Trigger gpu recover by hypervisor.

Your patch shouldn't and cannot break exist logics, that's very simple rule ...
If you insist your change, at least make sure it doesn't change any logic of SRIOV and that's not hard for you, just add "if (!amdgpu_sriov_vf(adev))" checking 
Prior to your path, although I didn't encourage such ugly actions...


-----Original Message-----
From: Marek Olšák [mailto:maraeo@gmail.com] 
Sent: Tuesday, December 12, 2017 11:02 PM
To: Liu, Monk <Monk.Liu@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0

On Tue, Dec 12, 2017 at 4:18 AM, Liu, Monk <Monk.Liu@amd.com> wrote:
> NAK, you change break SRIOV logic:
>
> Without lockup_timeout set, this gpu_recover() won't get called at all 
> , unless your IB triggered invalid instruct and that IRQ invoked 
> Amdgpu_gpu_recover(), by this cause you should disable the logic that 
> in that IRQ instead of change gpu_recover() itself because For SRIOV 
> we need gpu_recover() even lockup_timeout is zero

The default value of 0 indicates that GPU reset isn't ready to be enabled by default. That's what it means. Once the GPU reset works, the default should be non-zero (e.g. 10000) and
amdgpu.lockup_timeout=0 should be used to disable all GPU resets in order to be able do scandumps and debug GPU hangs.

Marek
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH] drm/amdgpu: Add gpu_recovery parameter
       [not found]                             ` <627671a2-3b2d-3750-301e-8d9f40e36b48-5C7GfCeVMHo@public.gmane.org>
@ 2017-12-18  8:58                               ` Liu, Monk
       [not found]                                 ` <BLUPR12MB04490D199016A56EE0155829840E0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Liu, Monk @ 2017-12-18  8:58 UTC (permalink / raw)
  To: Grodzovsky, Andrey, Koenig, Christian,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: maraeo-Re5JQEeQqe8AvxtiuMwx3w

You can add my RB

But to be honest, the current bare-metal GPU recover approach still look not good enough especially that soft_rest checking parts:
1) not all engine/IP on all version are implemented for this, and it's very very time cost to imple them all 
2) like I said before, it only shows you if the engine is busy, but busy doesn't mean hang, we should rely one time out to judge a hang, although I know even time out still cannot prove engine hang, 
  But anyway we need a rule, so I vote for leveraging time out totally and kick soft check out 


We can reduce all those codes of soft resets, and call whole Asic reset instead of those soft reset as long as a job hit time out (when amd_gpu_recocovery=1)

BR Monk

-----Original Message-----
From: Grodzovsky, Andrey 
Sent: Friday, December 15, 2017 9:05 PM
To: Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Cc: maraeo@gmail.com
Subject: Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter



On 12/14/2017 03:52 AM, Christian König wrote:
> Am 14.12.2017 um 08:19 schrieb Liu, Monk:
>>> Problem with this is that amdgpu_check_soft_reset will not be 
>>> called, this function which prints which IP block was hung even when 
>>> later we opt not to recover.
>> I suggest instead to add a bool force_reset parameter to 
>> amdgpu_gpu_recover which will override amdgpu_gpu_recovery and we can 
>> set it to true from amdgpu_debugfs_gpu_recover only.
>>
>> [ML] why you need "check_soft_reset" be called ? I think soft reset 
>> checking is useless totally ... because with TDR feature, the only 
>> thing can Tell  you if GPU hang is time out warning.
>
> And that's exactly why we call it, we just want the warning in the logs.
>
> Christian.

Just a ping for any more comments or a RB.

Thanks,
Andrey

>
>>
>> For soft checking, it only shows you if some IP is busy or not, but 
>> busy may not prove the engine is hang , it may just busy ....
>>
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Grodzovsky, Andrey
>> Sent: 2017年12月13日 20:53
>> To: Koenig, Christian <Christian.Koenig@amd.com>; 
>> amd-gfx@lists.freedesktop.org
>> Cc: Liu, Monk <Monk.Liu@amd.com>; maraeo@gmail.com
>> Subject: Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter
>>
>>
>>
>> On 12/13/2017 07:20 AM, Christian König wrote:
>>> Am 12.12.2017 um 20:16 schrieb Andrey Grodzovsky:
>>>> Add new parameter to control GPU recovery procedure.
>>>> Retire old way of disabling GPU recovery by setting lockup_timeout 
>>>> ==
>>>> 0 and
>>>> set default for lockup_timeout to 10s.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 8 ++++++--
>>>>    3 files changed, 12 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> index 3735500..26abe03 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> @@ -126,6 +126,7 @@ extern int amdgpu_param_buf_per_se;
>>>>    extern int amdgpu_job_hang_limit;
>>>>    extern int amdgpu_lbpw;
>>>>    extern int amdgpu_compute_multipipe;
>>>> +extern int amdgpu_gpu_recovery;
>>>>      #ifdef CONFIG_DRM_AMDGPU_SI
>>>>    extern int amdgpu_si_support;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 8d03baa..d84b57a 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -3030,6 +3030,11 @@ int amdgpu_gpu_recover(struct amdgpu_device 
>>>> *adev, struct amdgpu_job *job)
>>>>            return 0;
>>>>        }
>>>>    +    if (!amdgpu_gpu_recovery) {
>>>> +        DRM_INFO("GPU recovery disabled.\n");
>>>> +        return 0;
>>>> +    }
>>>> +
>>> Please move this check into the caller of amdgpu_gpu_recover().
>>>
>>> This way we can still trigger a GPU recovery manually or from the 
>>> hypervisor under SRIOV.
>>>
>>> Christian.
>> Problem with this is that amdgpu_check_soft_reset will not be called, 
>> this function which prints which IP block was hung even when later we 
>> opt not to recover.
>> I suggest instead to add a bool force_reset parameter to 
>> amdgpu_gpu_recover which will override amdgpu_gpu_recovery and we can 
>> set it to true from amdgpu_debugfs_gpu_recover only.
>>
>> Thanks,
>> Andrey
>>
>>>>        dev_info(adev->dev, "GPU reset begin!\n");
>>>>          mutex_lock(&adev->lock_reset); diff --git 
>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> index 0b039bd..5c612e9 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> @@ -90,7 +90,7 @@ int amdgpu_disp_priority = 0;
>>>>    int amdgpu_hw_i2c = 0;
>>>>    int amdgpu_pcie_gen2 = -1;
>>>>    int amdgpu_msi = -1;
>>>> -int amdgpu_lockup_timeout = 0;
>>>> +int amdgpu_lockup_timeout = 10000;
>>>>    int amdgpu_dpm = -1;
>>>>    int amdgpu_fw_load_type = -1;
>>>>    int amdgpu_aspm = -1;
>>>> @@ -128,6 +128,7 @@ int amdgpu_param_buf_per_se = 0;
>>>>    int amdgpu_job_hang_limit = 0;
>>>>    int amdgpu_lbpw = -1;
>>>>    int amdgpu_compute_multipipe = -1;
>>>> +int amdgpu_gpu_recovery = 1;
>>>>      MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in 
>>>> megabytes");
>>>>    module_param_named(vramlimit, amdgpu_vram_limit, int, 0600); @@
>>>> -165,7 +166,7 @@ module_param_named(pcie_gen2, amdgpu_pcie_gen2, 
>>>> int, 0444);
>>>>    MODULE_PARM_DESC(msi, "MSI support (1 = enable, 0 = disable, -1 
>>>> = auto)");
>>>>    module_param_named(msi, amdgpu_msi, int, 0444);
>>>>    -MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms 
>>>> (default 0 = disable)");
>>>> +MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms 
>>>> +(default
>>>> 10000)");
>>>>    module_param_named(lockup_timeout, amdgpu_lockup_timeout, int, 
>>>> 0444);
>>>>      MODULE_PARM_DESC(dpm, "DPM support (1 = enable, 0 = disable, 
>>>> -1 = auto)"); @@ -280,6 +281,9 @@ module_param_named(lbpw, 
>>>> amdgpu_lbpw, int, 0444);
>>>>    MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be 
>>>> spread across pipes (1 = enable, 0 = disable, -1 = auto)");
>>>>    module_param_named(compute_multipipe, amdgpu_compute_multipipe, 
>>>> int, 0444);
>>>>    +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, 
>>>> (1 = enable (default) , 0 = disable");
>>>> +module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
>>>> +
>>>>    #ifdef CONFIG_DRM_AMDGPU_SI
>>>>      #if defined(CONFIG_DRM_RADEON) ||
>>>> defined(CONFIG_DRM_RADEON_MODULE)
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter
       [not found]                                 ` <BLUPR12MB04490D199016A56EE0155829840E0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-12-18 14:57                                   ` Deucher, Alexander
  0 siblings, 0 replies; 23+ messages in thread
From: Deucher, Alexander @ 2017-12-18 14:57 UTC (permalink / raw)
  To: Liu, Monk, Grodzovsky, Andrey, Koenig, Christian,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: maraeo-Re5JQEeQqe8AvxtiuMwx3w


[-- Attachment #1.1: Type: text/plain, Size: 7888 bytes --]

We should start with full asic reset and then work back to enable soft resets for the IPs.


Alex

________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Liu, Monk <Monk.Liu@amd.com>
Sent: Monday, December 18, 2017 3:58:13 AM
To: Grodzovsky, Andrey; Koenig, Christian; amd-gfx@lists.freedesktop.org
Cc: maraeo@gmail.com
Subject: RE: [PATCH] drm/amdgpu: Add gpu_recovery parameter

You can add my RB

But to be honest, the current bare-metal GPU recover approach still look not good enough especially that soft_rest checking parts:
1) not all engine/IP on all version are implemented for this, and it's very very time cost to imple them all
2) like I said before, it only shows you if the engine is busy, but busy doesn't mean hang, we should rely one time out to judge a hang, although I know even time out still cannot prove engine hang,
  But anyway we need a rule, so I vote for leveraging time out totally and kick soft check out


We can reduce all those codes of soft resets, and call whole Asic reset instead of those soft reset as long as a job hit time out (when amd_gpu_recocovery=1)

BR Monk

-----Original Message-----
From: Grodzovsky, Andrey
Sent: Friday, December 15, 2017 9:05 PM
To: Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Cc: maraeo@gmail.com
Subject: Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter



On 12/14/2017 03:52 AM, Christian König wrote:
> Am 14.12.2017 um 08:19 schrieb Liu, Monk:
>>> Problem with this is that amdgpu_check_soft_reset will not be
>>> called, this function which prints which IP block was hung even when
>>> later we opt not to recover.
>> I suggest instead to add a bool force_reset parameter to
>> amdgpu_gpu_recover which will override amdgpu_gpu_recovery and we can
>> set it to true from amdgpu_debugfs_gpu_recover only.
>>
>> [ML] why you need "check_soft_reset" be called ? I think soft reset
>> checking is useless totally ... because with TDR feature, the only
>> thing can Tell  you if GPU hang is time out warning.
>
> And that's exactly why we call it, we just want the warning in the logs.
>
> Christian.

Just a ping for any more comments or a RB.

Thanks,
Andrey

>
>>
>> For soft checking, it only shows you if some IP is busy or not, but
>> busy may not prove the engine is hang , it may just busy ....
>>
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Grodzovsky, Andrey
>> Sent: 2017年12月13日 20:53
>> To: Koenig, Christian <Christian.Koenig@amd.com>;
>> amd-gfx@lists.freedesktop.org
>> Cc: Liu, Monk <Monk.Liu@amd.com>; maraeo@gmail.com
>> Subject: Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter
>>
>>
>>
>> On 12/13/2017 07:20 AM, Christian König wrote:
>>> Am 12.12.2017 um 20:16 schrieb Andrey Grodzovsky:
>>>> Add new parameter to control GPU recovery procedure.
>>>> Retire old way of disabling GPU recovery by setting lockup_timeout
>>>> ==
>>>> 0 and
>>>> set default for lockup_timeout to 10s.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 8 ++++++--
>>>>    3 files changed, 12 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> index 3735500..26abe03 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> @@ -126,6 +126,7 @@ extern int amdgpu_param_buf_per_se;
>>>>    extern int amdgpu_job_hang_limit;
>>>>    extern int amdgpu_lbpw;
>>>>    extern int amdgpu_compute_multipipe;
>>>> +extern int amdgpu_gpu_recovery;
>>>>      #ifdef CONFIG_DRM_AMDGPU_SI
>>>>    extern int amdgpu_si_support;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 8d03baa..d84b57a 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -3030,6 +3030,11 @@ int amdgpu_gpu_recover(struct amdgpu_device
>>>> *adev, struct amdgpu_job *job)
>>>>            return 0;
>>>>        }
>>>>    +    if (!amdgpu_gpu_recovery) {
>>>> +        DRM_INFO("GPU recovery disabled.\n");
>>>> +        return 0;
>>>> +    }
>>>> +
>>> Please move this check into the caller of amdgpu_gpu_recover().
>>>
>>> This way we can still trigger a GPU recovery manually or from the
>>> hypervisor under SRIOV.
>>>
>>> Christian.
>> Problem with this is that amdgpu_check_soft_reset will not be called,
>> this function which prints which IP block was hung even when later we
>> opt not to recover.
>> I suggest instead to add a bool force_reset parameter to
>> amdgpu_gpu_recover which will override amdgpu_gpu_recovery and we can
>> set it to true from amdgpu_debugfs_gpu_recover only.
>>
>> Thanks,
>> Andrey
>>
>>>>        dev_info(adev->dev, "GPU reset begin!\n");
>>>>          mutex_lock(&adev->lock_reset); diff --git
>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> index 0b039bd..5c612e9 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> @@ -90,7 +90,7 @@ int amdgpu_disp_priority = 0;
>>>>    int amdgpu_hw_i2c = 0;
>>>>    int amdgpu_pcie_gen2 = -1;
>>>>    int amdgpu_msi = -1;
>>>> -int amdgpu_lockup_timeout = 0;
>>>> +int amdgpu_lockup_timeout = 10000;
>>>>    int amdgpu_dpm = -1;
>>>>    int amdgpu_fw_load_type = -1;
>>>>    int amdgpu_aspm = -1;
>>>> @@ -128,6 +128,7 @@ int amdgpu_param_buf_per_se = 0;
>>>>    int amdgpu_job_hang_limit = 0;
>>>>    int amdgpu_lbpw = -1;
>>>>    int amdgpu_compute_multipipe = -1;
>>>> +int amdgpu_gpu_recovery = 1;
>>>>      MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in
>>>> megabytes");
>>>>    module_param_named(vramlimit, amdgpu_vram_limit, int, 0600); @@
>>>> -165,7 +166,7 @@ module_param_named(pcie_gen2, amdgpu_pcie_gen2,
>>>> int, 0444);
>>>>    MODULE_PARM_DESC(msi, "MSI support (1 = enable, 0 = disable, -1
>>>> = auto)");
>>>>    module_param_named(msi, amdgpu_msi, int, 0444);
>>>>    -MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms
>>>> (default 0 = disable)");
>>>> +MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms
>>>> +(default
>>>> 10000)");
>>>>    module_param_named(lockup_timeout, amdgpu_lockup_timeout, int,
>>>> 0444);
>>>>      MODULE_PARM_DESC(dpm, "DPM support (1 = enable, 0 = disable,
>>>> -1 = auto)"); @@ -280,6 +281,9 @@ module_param_named(lbpw,
>>>> amdgpu_lbpw, int, 0444);
>>>>    MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be
>>>> spread across pipes (1 = enable, 0 = disable, -1 = auto)");
>>>>    module_param_named(compute_multipipe, amdgpu_compute_multipipe,
>>>> int, 0444);
>>>>    +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism,
>>>> (1 = enable (default) , 0 = disable");
>>>> +module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
>>>> +
>>>>    #ifdef CONFIG_DRM_AMDGPU_SI
>>>>      #if defined(CONFIG_DRM_RADEON) ||
>>>> defined(CONFIG_DRM_RADEON_MODULE)
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[-- Attachment #1.2: Type: text/html, Size: 12094 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-12-18 14:57 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-11 21:29 [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0 Marek Olšák
     [not found] ` <1513027772-32408-1-git-send-email-maraeo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-12-12  3:18   ` Liu, Monk
     [not found]     ` <BLUPR12MB04497DB5DEFE7C66CE3E702084340-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-12-12 15:02       ` Marek Olšák
     [not found]         ` <CAAxE2A6ZCHrkKM81vLBEubB-fLQCQuBtDYNjmTvDqEDKVLVDgw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-12-18  7:54           ` Liu, Monk
2017-12-12  9:01   ` Christian König
     [not found]     ` <95bd3554-f9e3-6ad7-32e5-9921da23feaa-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-12-12 12:58       ` Andrey Grodzovsky
2017-12-12 14:57       ` Marek Olšák
     [not found]         ` <CAAxE2A72rODGx5SgsHutRQVag+_1o-s8oykXwbmJUZugh0-S4A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-12-12 16:36           ` Christian König
     [not found]             ` <2f95a109-c168-804d-f2f9-4211449caad6-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-12-12 19:57               ` Marek Olšák
2017-12-12 19:16       ` [PATCH] drm/amdgpu: Add gpu_recovery parameter Andrey Grodzovsky
     [not found]         ` <1513106162-19664-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
2017-12-12 19:26           ` Alex Deucher
2017-12-13 12:20           ` Christian König
     [not found]             ` <7daac1b8-38c4-0546-b16f-5d09f71fcd12-5C7GfCeVMHo@public.gmane.org>
2017-12-13 12:53               ` Andrey Grodzovsky
     [not found]                 ` <a0a2cf88-18a3-cdeb-19d4-2ff3e457b98d-5C7GfCeVMHo@public.gmane.org>
2017-12-13 18:12                   ` Christian König
     [not found]                     ` <57efe9e8-51c2-caa4-6f54-fc1c99c626fa-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-12-13 19:01                       ` [PATCH v2] " Andrey Grodzovsky
     [not found]                         ` <1513191692-6034-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
2017-12-15 13:19                           ` Christian König
2017-12-14  7:19                   ` [PATCH] " Liu, Monk
     [not found]                     ` <BLUPR12MB0449000FB7842B0C60C0C65D840A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-12-14  8:52                       ` Christian König
     [not found]                         ` <aa8dc97b-c27e-fe7d-77e9-b273c6ed2aed-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-12-15 13:05                           ` Andrey Grodzovsky
     [not found]                             ` <627671a2-3b2d-3750-301e-8d9f40e36b48-5C7GfCeVMHo@public.gmane.org>
2017-12-18  8:58                               ` Liu, Monk
     [not found]                                 ` <BLUPR12MB04490D199016A56EE0155829840E0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-12-18 14:57                                   ` Deucher, Alexander
2017-12-14  7:16           ` Liu, Monk
     [not found]             ` <BLUPR12MB0449AF7089DA4708747E285D840A0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-12-14 11:19               ` Andrey Grodzovsky

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.