* [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
@ 2018-02-28  7:27 Monk Liu
From: Monk Liu @ 2018-02-28  7:27 UTC
  To: amd-gfx@lists.freedesktop.org; +Cc: Monk Liu

Sometimes the GPU is switched to other VFs and won't switch
back soon, so the KIQ register access will not signal within
a short period. Instead of busy-waiting a long time
(MAX_KIQ_REG_WAIT) and returning a timeout, we can sleep 5ms
and try again later (in non-IRQ context).

And since the wait in kiq_rreg/wreg is a busy wait,
MAX_KIQ_REG_WAIT shouldn't be set to a long time; 10ms is
more appropriate.

If the GPU is already in the reset state, don't retry the KIQ
register access, otherwise it would hang forever because the
KIQ is usually already dead by then.

v2:
replace schedule() with msleep() for the wait

Change-Id: I8fc807ce85a8d30d2b50153f3f3a6eda344ef994
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index b832651..1672f5b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -22,7 +22,7 @@
  */
 
 #include "amdgpu.h"
-#define MAX_KIQ_REG_WAIT	100000000 /* in usecs */
+#define MAX_KIQ_REG_WAIT	10000 /* in usecs, 10ms */
 
 uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev)
 {
@@ -152,9 +152,14 @@ uint32_t amdgpu_virt_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
 	amdgpu_ring_commit(ring);
 	spin_unlock_irqrestore(&kiq->ring_lock, flags);
 
+retry_read:
 	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
 	if (r < 1) {
 		DRM_ERROR("wait for kiq fence error: %ld\n", r);
+		if (!in_interrupt() && !adev->in_gpu_reset) {
+			msleep(5);
+			goto retry_read;
+		}
 		return ~0;
 	}
 	val = adev->wb.wb[adev->virt.reg_val_offs];
@@ -179,9 +184,15 @@ void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
 	amdgpu_ring_commit(ring);
 	spin_unlock_irqrestore(&kiq->ring_lock, flags);
 
+retry_write:
 	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
-	if (r < 1)
+	if (r < 1) {
 		DRM_ERROR("wait for kiq fence error: %ld\n", r);
+		if (!in_interrupt() && !adev->in_gpu_reset) {
+			msleep(5);
+			goto retry_write;
+		}
+	}
 }
 
 /**
-- 
2.7.4


* RE: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
@ 2018-03-01  5:52 Liu, Monk
From: Liu, Monk @ 2018-03-01  5:52 UTC
  To: Liu, Monk, amd-gfx@lists.freedesktop.org

Anyone review this?

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Monk Liu
Sent: 28 February 2018 15:28
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Monk <Monk.Liu@amd.com>
Subject: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)

[...]

* Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
@ 2018-03-01  8:12 Christian König
From: Christian König @ 2018-03-01  8:12 UTC
  To: Liu, Monk, amd-gfx@lists.freedesktop.org

From a coding style perspective, a define for the 5ms timeout would be
nice to have, and it might be better to use a do { } while loop instead
of the goto, but those are really minor issues.

But from the technical perspective I can't fully judge whether this is
a good idea or not, because I don't know deeply enough how the KIQ works.

Christian.

On 01.03.2018 at 06:52, Liu, Monk wrote:
> Anyone review this?
> [...]


* RE: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
@ 2018-03-01  8:21 Liu, Monk
From: Liu, Monk @ 2018-03-01  8:21 UTC
  To: Koenig, Christian, amd-gfx@lists.freedesktop.org

Hi Christian,

The KIQ is used by the kernel, and we submit commands on it without the
GPU scheduler. In an SRIOV environment, register access must go through
the KIQ in non-exclusive mode, because we don't want one VF to access
MMIO during the period when that VF is not occupying the GPU.

But since we currently use a busy-polling wait on the KIQ fence, the
latency is too high. The KIQ is part of the CP, so it may also be
serving other VFs, and in that situation a KIQ job for a given VF may
need 1~2 seconds to complete, which is far too long for a busy wait.

So my patch lets this busy wait bail out for a moment when not in IRQ
context. Another concern is that if the KIQ belongs to the current VF
but that VF has already died, then no KIQ job will ever signal, so we
still need this bail-out-and-retry scheme to make sure the CPU is not
stuck busy-polling forever.
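
For reference, a paraphrased sketch of the busy-polling wait in
question (based on amdgpu_fence_wait_polling; simplified, not the
verbatim driver source):

	signed long amdgpu_fence_wait_polling(struct amdgpu_ring *ring,
					      uint32_t wait_seq,
					      signed long timeout)
	{
		uint32_t seq;

		do {
			seq = amdgpu_fence_read(ring);	/* last signaled seq */
			udelay(5);			/* busy delay, no sleep */
			timeout -= 5;
		} while ((int32_t)(wait_seq - seq) > 0 && timeout > 0);

		return timeout > 0 ? timeout : 0;	/* <1 means timed out */
	}

The CPU never yields inside this loop, which is why a long
MAX_KIQ_REG_WAIT burns CPU time whenever the VF is not scheduled on
the GPU.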

Thanks
/Monk

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 1 March 2018 16:13
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)

[...]

On 01.03.2018 at 06:52, Liu, Monk wrote:
> Anyone review this?
> [...]


* Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
@ 2018-03-01  8:40 Ding, Pixel
From: Ding, Pixel @ 2018-03-01  8:40 UTC
  To: Liu, Monk, Koenig, Christian, amd-gfx@lists.freedesktop.org

Reviewed-by: Pixel Ding <Pixel.Ding@amd.com>

— 
Sincerely Yours,
Pixel


On 01/03/2018, 4:21 PM, "amd-gfx on behalf of Liu, Monk" <amd-gfx-bounces@lists.freedesktop.org on behalf of Monk.Liu@amd.com> wrote:

    [...]

* Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
@ 2018-03-01  8:42 Christian König
From: Christian König @ 2018-03-01  8:42 UTC
  To: Ding, Pixel, Liu, Monk, amd-gfx@lists.freedesktop.org

Sounds reasonable.

I would add a define for the 5ms timeout; with that done, the patch is
Reviewed-by: Christian König <christian.koenig@amd.com> as well.

Regards,
Christian.

On 01.03.2018 at 09:40, Ding, Pixel wrote:
> Reviewed-by: Pixel Ding <Pixel.Ding@amd.com>
> [...]

* Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
@ 2018-03-01 15:50 Felix Kuehling
From: Felix Kuehling @ 2018-03-01 15:50 UTC
  To: amd-gfx@lists.freedesktop.org, Liu, Monk

On 2018-02-28 02:27 AM, Monk Liu wrote:
> [...]
> @@ -152,9 +152,14 @@ uint32_t amdgpu_virt_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
>  	amdgpu_ring_commit(ring);
>  	spin_unlock_irqrestore(&kiq->ring_lock, flags);
>  
> +retry_read:
>  	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
>  	if (r < 1) {
>  		DRM_ERROR("wait for kiq fence error: %ld\n", r);
> +		if (!in_interrupt() && !adev->in_gpu_reset) {

You should check in_atomic() here, because it's invalid to sleep in
atomic context (e.g. while holding a spin lock) even when not in an
interrupt. This seems to happen a lot for indirect register access,
e.g. soc15_pcie_rreg.

Regards,
  Felix
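
A sketch of the condition Felix is proposing here (his suggestion as
stated, not the code that was merged):

	/* sketch of the proposed check; not merged code */
	if (!in_interrupt() && !in_atomic() && !adev->in_gpu_reset) {
		msleep(5);
		goto retry_read;
	}
	return ~0;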

> [...]


* RE: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
@ 2018-03-02  9:29 Liu, Monk
From: Liu, Monk @ 2018-03-02  9:29 UTC
  To: Kuehling, Felix, amd-gfx@lists.freedesktop.org

in_atomic() is not encouraged for judging whether sleeping is
possible; see its macro definition:

#define in_atomic() (preempt_count() != 0)
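
A paraphrased illustration of why that definition is unreliable (not
verbatim kernel source): on a non-preemptible kernel, i.e. with
CONFIG_PREEMPT_COUNT disabled, spin_lock() does not raise the preempt
count, so the check cannot see a held lock:

	spin_lock(&some_lock);	/* some_lock is a hypothetical spinlock */
	if (!in_atomic()) {
		/* can still be reached on !CONFIG_PREEMPT_COUNT kernels,
		 * even though sleeping here would be a bug */
		msleep(5);
	}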


/Monk

-----Original Message-----
From: Kuehling, Felix 
Sent: 1 March 2018 23:50
To: amd-gfx@lists.freedesktop.org; Liu, Monk <Monk.Liu@amd.com>
Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)

[...]

* Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
@ 2018-03-02 20:47 Felix Kuehling
From: Felix Kuehling @ 2018-03-02 20:47 UTC
  To: Liu, Monk, amd-gfx@lists.freedesktop.org

On 2018-03-02 04:29 AM, Liu, Monk wrote:
> in_atomic() is not encouraged for judging whether sleeping is
> possible; see its macro definition:
>
> #define in_atomic() (preempt_count() != 0)

OK. But my point is still that you're not testing the right thing when
you check in_interrupt(). The comment before the in_atomic macro
definition states the limitations and says "do not use in driver code".
Unfortunately it doesn't suggest any alternative. I think in_interrupt
is actually worse, because it misses even more cases than in_atomic.

Regards,
  Felix

> [...]

* Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
@ 2018-03-03 13:38 Christian König
From: Christian König @ 2018-03-03 13:38 UTC
  To: Felix Kuehling, Liu, Monk, amd-gfx@lists.freedesktop.org

On 02.03.2018 at 21:47, Felix Kuehling wrote:
> On 2018-03-02 04:29 AM, Liu, Monk wrote:
>> in_atomic() is not encouraged for judging whether sleeping is
>> possible; see its macro definition:
>>
>> #define in_atomic() (preempt_count() != 0)
> OK. But my point is still that you're not testing the right thing when
> you check in_interrupt(). The comment before the in_atomic macro
> definition states the limitations and says "do not use in driver code".
> Unfortunately it doesn't suggest any alternative. I think in_interrupt
> is actually worse, because it misses even more cases than in_atomic.

Thinking about this, Felix seems to be absolutely right.

So we need to revert this patch, since you can't reliably detect in a
driver whether sleeping is allowed or not.

Regards,
Christian.

> [...]

* RE: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
@ 2018-03-05  3:47 Liu, Monk
From: Liu, Monk @ 2018-03-05  3:47 UTC
  To: Koenig, Christian, Kuehling, Felix, amd-gfx@lists.freedesktop.org

Do you guys have a correct way to judge this?

It looks to me like there won't be any issue:
1) there is no spin lock or atomic code wrapping amdgpu_virt_kiq_rreg
   (see amdgpu_mm_rreg)
2) in_interrupt() at least prevents sleeping in IRQ context, so why is
   it worse?

For SRIOV this patch is needed, and on bare metal the kiq_rreg/wreg
routines won't be hit. I disagree with reverting it unless you prove
it is dangerous or give a solid example of how it breaks things.

/Monk

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 3 March 2018 21:38
To: Kuehling, Felix <Felix.Kuehling@amd.com>; Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)

[...]

* RE: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
@ 2018-03-05  4:03 Liu, Monk
From: Liu, Monk @ 2018-03-05  4:03 UTC
  To: Kuehling, Felix, amd-gfx@lists.freedesktop.org

> You should check in_atomic here. Because it's invalid to sleep in
> atomic context (e.g. while holding a spin lock) even when not in an
> interrupt. This seems to happen a lot for indirect register access,
> e.g. soc15_pcie_rreg.

For this case you are right, but in_atomic() cannot check what you
want, because it fails to detect atomic context on a non-preemptible
kernel.

I have another method: inside amdgpu_virt_kiq_rreg() we can test
whether adev->pcie_idx_lock is held. Do you think that is enough?

Apart from pcie_idx_lock, I haven't found any other spin lock wrapping
the kiq rreg path so far.

/Monk

-----Original Message-----
From: Kuehling, Felix 
Sent: 1 March 2018 23:50
To: amd-gfx@lists.freedesktop.org; Liu, Monk <Monk.Liu@amd.com>
Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)

[...]

* RE: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
@ 2018-03-05  4:20 Liu, Monk
From: Liu, Monk @ 2018-03-05  4:20 UTC
  To: Koenig, Christian, Kuehling, Felix, amd-gfx@lists.freedesktop.org

When there are 16 VFs/VMs on one GPU, we can easily hit the kernel's
"soft lockup - CPU stuck for 22s!" warning from the kiq_rreg/wreg
routines. That's why I need this patch to let the thread sleep a while
and try again.

If you insist on reverting this patch, please give me a solution;
otherwise I don't see how reverting it makes things better.

/Monk

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 3 March 2018 21:38
To: Kuehling, Felix <Felix.Kuehling@amd.com>; Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)

[...]

* Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
       [not found]                     ` <BLUPR12MB0449708DD427D1E744992FB584DA0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-03-05  7:57                       ` Christian König
       [not found]                         ` <1df4fa9f-1a5c-f005-cfa8-e4346d7892c1-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2018-03-05  7:57 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, Kuehling, Felix,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> otherwise I don't see how it is better by reverting it
Well it's better to revert it for now because it seems to create more 
problems than it solves.

To approach this issue properly, I suggest the following:
1. Revert the original patch.

2. Stop waiting too long for writes. E.g. use a separate timeout (20ms
maybe?) to wait for the write, then do a WARN_ON_ONCE when we time out.

3. To the read function add an "if (!in_interrupt()) may_sleep();" and
then retest. That should at least print a nice warning when called from
atomic context (steps 2 and 3 are sketched right after this list).

4. Test the whole thing and try to fix all warnings about atomic
contexts raised by the may_sleep();

5. Reapply the original patch, but this time only for the read function,
not the write function.
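
Since "may_sleep()" is not an actual kernel helper (the naming gets sorted out later in this thread), the sketch below reads it as the kernel's might_sleep() debug annotation. It is only an illustration of steps 2 and 3 on top of the reverted code; the MAX_KIQ_REG_WAIT_WRITE name, the 20ms value and the hunk positions are assumptions, not code from any real patch:

 #include "amdgpu.h"
 #define MAX_KIQ_REG_WAIT	10000 /* in usecs, 10ms */
+/* step 2: writes get their own, shorter timeout */
+#define MAX_KIQ_REG_WAIT_WRITE	20000 /* in usecs, 20ms */

 uint32_t amdgpu_virt_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
 {
+	/* step 3: warn on debug kernels when entered from atomic context */
+	if (!in_interrupt())
+		might_sleep();
 	...

 void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
 {
 	...
-	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
-	if (r < 1)
-		DRM_ERROR("wait for kiq fence error: %ld\n", r);
+	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT_WRITE);
+	/* step 2: no retry loop for writes, just complain once on timeout */
+	WARN_ON_ONCE(r < 1);
 }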

Regards,
Christian.

On 05.03.2018 at 05:20, Liu, Monk wrote:
> When there are 16 VF/VM on one GPU, we can easily hit "sys lockup to 22s" kernel error/warning introduced by kiq_rreg/wreg routine
> That's why I must use this patch to let thread sleep a while and try again,
>
> If you insist reverting this patch please give me a solution, otherwise I don't see how it is better by reverting it
>
> /Monk
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2018年3月3日 21:38
> To: Kuehling, Felix <Felix.Kuehling@amd.com>; Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
>
> Am 02.03.2018 um 21:47 schrieb Felix Kuehling:
>> On 2018-03-02 04:29 AM, Liu, Monk wrote:
>>> In_atomic() isnot encouraged to be used to judge if sleep is
>>> possible, see the macros of it
>>>
>>> #define in_atomic() (preept_count() != 0)
>> OK. But my point is still that you're not testing the right thing when
>> you check in_interrupt(). The comment before the in_atomic macro
>> definition states the limitations and says "do not use in driver code".
>> Unfortunately it doesn't suggest any alternative. I think in_interrupt
>> is actually worse, because it misses even more cases than in_atomic.
> Thinking about this, Felix seems to be absolutely right.
>
> So we need to revert this patch since you can't reliable detect in a driver if sleeping is allowed or not.
>
> Regards,
> Christian.
>
>> Regards,
>>     Felix
>>
>>> /Monk
>>>
>>> -----Original Message-----
>>> From: Kuehling, Felix
>>> Sent: 2018年3月1日 23:50
>>> To: amd-gfx@lists.freedesktop.org; Liu, Monk <Monk.Liu@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in
>>> IRQ(v2)
>>>
>>> On 2018-02-28 02:27 AM, Monk Liu wrote:
>>>> sometimes GPU is switched to other VFs and won't swich back soon, so
>>>> the kiq reg access will not signal within a short period, instead of
>>>> busy waiting a long time(MAX_KEQ_REG_WAIT) and returning TMO we can
>>>> istead sleep 5ms and try again later (non irq context)
>>>>
>>>> And since the waiting in kiq_r/weg is busy wait, so MAX_KIQ_REG_WAIT
>>>> shouldn't set to a long time, set it to 10ms is more appropriate.
>>>>
>>>> if gpu already in reset state, don't retry the KIQ reg access
>>>> otherwise it would always hang because KIQ was already die usually.
>>>>
>>>> v2:
>>>> replace schedule() with msleep() for the wait
>>>>
>>>> Change-Id: I8fc807ce85a8d30d2b50153f3f3a6eda344ef994
>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 15 +++++++++++++--
>>>>    1 file changed, 13 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>> index b832651..1672f5b 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>> @@ -22,7 +22,7 @@
>>>>     */
>>>>    
>>>>    #include "amdgpu.h"
>>>> -#define MAX_KIQ_REG_WAIT	100000000 /* in usecs */
>>>> +#define MAX_KIQ_REG_WAIT	10000 /* in usecs, 10ms */
>>>>    
>>>>    uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev)  { @@ -152,9
>>>> +152,14 @@ uint32_t amdgpu_virt_kiq_rreg(struct amdgpu_device *adev,
>>>> +uint32_t reg)
>>>>    	amdgpu_ring_commit(ring);
>>>>    	spin_unlock_irqrestore(&kiq->ring_lock, flags);
>>>>    
>>>> +retry_read:
>>>>    	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
>>>>    	if (r < 1) {
>>>>    		DRM_ERROR("wait for kiq fence error: %ld\n", r);
>>>> +		if (!in_interrupt() && !adev->in_gpu_reset) {
>>> You should check in_atomic here. Because it's invalid to sleep in atomic context (e.g. while holding a spin lock) even when not in an interrupt.
>>> This seems to happen a lot for indirect register access, e.g.
>>> soc15_pcie_rreg.
>>>
>>> Regards,
>>>     Felix
>>>
>>>> +			msleep(5);
>>>> +			goto retry_read;
>>>> +		}
>>>>    		return ~0;
>>>>    	}
>>>>    	val = adev->wb.wb[adev->virt.reg_val_offs];
>>>> @@ -179,9 +184,15 @@ void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
>>>>    	amdgpu_ring_commit(ring);
>>>>    	spin_unlock_irqrestore(&kiq->ring_lock, flags);
>>>>    
>>>> +retry_write:
>>>>    	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
>>>> -	if (r < 1)
>>>> +	if (r < 1) {
>>>>    		DRM_ERROR("wait for kiq fence error: %ld\n", r);
>>>> +		if (!in_interrupt() && !adev->in_gpu_reset) {
>>>> +			msleep(5);
>>>> +			goto retry_write;
>>>> +		}
>>>> +	}
>>>>    }
>>>>    
>>>>    /**
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
       [not found]                         ` <1df4fa9f-1a5c-f005-cfa8-e4346d7892c1-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-03-05  8:08                           ` Liu, Monk
       [not found]                             ` <BLUPR12MB0449FFE3490C8EA3394D41DD84DA0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Liu, Monk @ 2018-03-05  8:08 UTC (permalink / raw)
  To: Koenig, Christian, Kuehling, Felix,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> To approach this issue properly, I suggest the following:
> 1. Revert the original patch.
>
> 2. Stop waiting too long for writes. E.g. use a separate timeout (20ms
> maybe?) to wait for the write, then do a WARN_ON_ONCE when we time out.

Cannot do that; 20ms is not enough. Sometimes you need 10 seconds, since other VFs may be doing bad things like occupying the GPU intentionally, or they may be doing TDR, so
I don't think separating read and write is a good idea; they should be treated equally.

> 3. To the read function add an "if (!in_interrupt()) may_sleep();" and then retest. That should at least print a nice warning when called from atomic context.

Sorry, what is may_sleep()?

> 4. Test the whole thing and try to fix all warnings about atomic contexts raised by the may_sleep();
>
> 5. Reapply the original patch, but this time only for the read function, not the write function.


From the current LKG code, the only spin lock that may wrap kiq_rreg/wreg() is pcie_idx_lock, and that lock is only used during init().
Since init() runs in exclusive mode under SRIOV, this means:
1)  register access does not go through KIQ (see amdgpu_mm_rreg)
2)  those functions only live in the bif_medium_grain_xxx parts (vi.c and nbio_v6.c), and they won't be hit under SRIOV (we return at the head if SRIOV is detected)
So I don't think this spin lock can cause trouble...
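
For illustration, the hazard Felix described is a caller that takes a spin lock around the KIQ access; with the retry path in place, msleep(5) would then run while the lock is held, i.e. sleep in atomic context. This is a hypothetical pattern, not a claim about any specific call site:

	spin_lock_irqsave(&adev->pcie_idx_lock, flags);
	/* if the fence times out, the retry path calls msleep(5) here,
	 * which is invalid while a spin lock is held */
	val = amdgpu_virt_kiq_rreg(adev, reg);
	spin_unlock_irqrestore(&adev->pcie_idx_lock, flags);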

/Monk



-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 2018-03-05 15:57
To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)

> otherwise I don't see how it is better by reverting it
Well it's better to revert it for now because it seems to create more problems than it solves.

To better approach this issue I suggest to do the following:
1. Revert the original patch.

2. Stop waiting to long for writes. E.g. use a separate timeout (20ms
maybe?) to wait for the write. Then do a WARN_ON_ONCE when we timeout.

3. To the read function add a "if (!in_intterupt()) may_sleep();" and then retest. That should at least print a nice warning when called from atomic context.

4. Test the whole thing and try to fix all warnings about atomic contexts from the may_sleep();

5. Reapply the original patch, but this time only for the read function, not the write function.

Regards,
Christian.

On 05.03.2018 at 05:20, Liu, Monk wrote:
> When there are 16 VF/VM on one GPU, we can easily hit "sys lockup to 
> 22s" kernel error/warning introduced by kiq_rreg/wreg routine That's 
> why I must use this patch to let thread sleep a while and try again,
>
> If you insist reverting this patch please give me a solution, 
> otherwise I don't see how it is better by reverting it
>
> /Monk
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2018年3月3日 21:38
> To: Kuehling, Felix <Felix.Kuehling@amd.com>; Liu, Monk 
> <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in 
> IRQ(v2)
>
> Am 02.03.2018 um 21:47 schrieb Felix Kuehling:
>> On 2018-03-02 04:29 AM, Liu, Monk wrote:
>>> In_atomic() isnot encouraged to be used to judge if sleep is 
>>> possible, see the macros of it
>>>
>>> #define in_atomic() (preept_count() != 0)
>> OK. But my point is still that you're not testing the right thing 
>> when you check in_interrupt(). The comment before the in_atomic macro 
>> definition states the limitations and says "do not use in driver code".
>> Unfortunately it doesn't suggest any alternative. I think 
>> in_interrupt is actually worse, because it misses even more cases than in_atomic.
> Thinking about this, Felix seems to be absolutely right.
>
> So we need to revert this patch since you can't reliable detect in a driver if sleeping is allowed or not.
>
> Regards,
> Christian.
>
>> Regards,
>>     Felix
>>
>>> /Monk
>>>
>>> -----Original Message-----
>>> From: Kuehling, Felix
>>> Sent: 2018年3月1日 23:50
>>> To: amd-gfx@lists.freedesktop.org; Liu, Monk <Monk.Liu@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in
>>> IRQ(v2)
>>>
>>> On 2018-02-28 02:27 AM, Monk Liu wrote:
>>>> sometimes GPU is switched to other VFs and won't swich back soon, 
>>>> so the kiq reg access will not signal within a short period, 
>>>> instead of busy waiting a long time(MAX_KEQ_REG_WAIT) and returning 
>>>> TMO we can istead sleep 5ms and try again later (non irq context)
>>>>
>>>> And since the waiting in kiq_r/weg is busy wait, so 
>>>> MAX_KIQ_REG_WAIT shouldn't set to a long time, set it to 10ms is more appropriate.
>>>>
>>>> if gpu already in reset state, don't retry the KIQ reg access 
>>>> otherwise it would always hang because KIQ was already die usually.
>>>>
>>>> v2:
>>>> replace schedule() with msleep() for the wait
>>>>
>>>> Change-Id: I8fc807ce85a8d30d2b50153f3f3a6eda344ef994
>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 15 +++++++++++++--
>>>>    1 file changed, 13 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>> index b832651..1672f5b 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>> @@ -22,7 +22,7 @@
>>>>     */
>>>>    
>>>>    #include "amdgpu.h"
>>>> -#define MAX_KIQ_REG_WAIT	100000000 /* in usecs */
>>>> +#define MAX_KIQ_REG_WAIT	10000 /* in usecs, 10ms */
>>>>    
>>>>    uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev)  { @@ 
>>>> -152,9
>>>> +152,14 @@ uint32_t amdgpu_virt_kiq_rreg(struct amdgpu_device 
>>>> +*adev, uint32_t reg)
>>>>    	amdgpu_ring_commit(ring);
>>>>    	spin_unlock_irqrestore(&kiq->ring_lock, flags);
>>>>    
>>>> +retry_read:
>>>>    	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
>>>>    	if (r < 1) {
>>>>    		DRM_ERROR("wait for kiq fence error: %ld\n", r);
>>>> +		if (!in_interrupt() && !adev->in_gpu_reset) {
>>> You should check in_atomic here. Because it's invalid to sleep in atomic context (e.g. while holding a spin lock) even when not in an interrupt.
>>> This seems to happen a lot for indirect register access, e.g.
>>> soc15_pcie_rreg.
>>>
>>> Regards,
>>>     Felix
>>>
>>>> +			msleep(5);
>>>> +			goto retry_read;
>>>> +		}
>>>>    		return ~0;
>>>>    	}
>>>>    	val = adev->wb.wb[adev->virt.reg_val_offs];
>>>> @@ -179,9 +184,15 @@ void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
>>>>    	amdgpu_ring_commit(ring);
>>>>    	spin_unlock_irqrestore(&kiq->ring_lock, flags);
>>>>    
>>>> +retry_write:
>>>>    	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
>>>> -	if (r < 1)
>>>> +	if (r < 1) {
>>>>    		DRM_ERROR("wait for kiq fence error: %ld\n", r);
>>>> +		if (!in_interrupt() && !adev->in_gpu_reset) {
>>>> +			msleep(5);
>>>> +			goto retry_write;
>>>> +		}
>>>> +	}
>>>>    }
>>>>    
>>>>    /**
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
       [not found]                             ` <BLUPR12MB0449FFE3490C8EA3394D41DD84DA0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-03-05 11:20                               ` Christian König
       [not found]                                 ` <48b49b85-495f-c5bc-7d8b-447537a94c8f-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2018-03-05 11:20 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, Kuehling, Felix,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 05.03.2018 at 09:08, Liu, Monk wrote:
> To better approach this issue I suggest to do the following:
> 1. Revert the original patch.
>
> 2. Stop waiting to long for writes. E.g. use a separate timeout (20ms
> maybe?) to wait for the write. Then do a WARN_ON_ONCE when we timeout.
> Cannot do that 20ms is not enough, sometimes you need 10 seconds since other VFs may doing bad things like occupying GPU intentionally or they are doing TDR, so
> I don't think separate read and write is good idea, they should be treated equally

Well, the question is whether separating reads & writes would actually help.

>
> 3. To the read function add a "if (!in_intterupt()) may_sleep();" and then retest. That should at least print a nice warning when called from atomic context.
> Sorry what is may_sleep() ??
>
> 4. Test the whole thing and try to fix all warnings about atomic contexts from the may_sleep();
>
> 5. Reapply the original patch, but this time only for the read function, not the write function.
>
>
>  From current LKG code, the only one spin lock may wrapping the kiq_rreg/wreg() is the pcie_idx_lock, and this lock is only used during init(),
> Since init() is run under the case of exclusive mode for SRIOV, which means:
> 1)  register access is not go through KIQ (see admgpu_mm_reg)
> 2)  those functions are only in bif_medium_grain_xxx part (vi.c and nbio_v6.c) , and they won't hit under SRIOV ( we return in the head if SRIOV detect)
> So I don' think this spin_lock may cause trouble...

OK, in this case let's keep the patch for now, but please provide a new
patch which adds an "if (!in_interrupt()) may_sleep();" to both the read
and the write function.

This way we should at least catch problems early on.
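
A minimal sketch of that follow-up patch, again reading "may_sleep()" as the kernel's might_sleep() annotation (the exact hunk positions are illustrative):

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ uint32_t amdgpu_virt_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
+	/* catch atomic-context callers early on debug kernels */
+	if (!in_interrupt())
+		might_sleep();
 	spin_lock_irqsave(&kiq->ring_lock, flags);
@@ void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
+	/* same check for the write path */
+	if (!in_interrupt())
+		might_sleep();
 	spin_lock_irqsave(&kiq->ring_lock, flags);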

Christian.

>
> /Monk
>
>
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2018年3月5日 15:57
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
>
>> otherwise I don't see how it is better by reverting it
> Well it's better to revert it for now because it seems to create more problems than it solves.
>
> To better approach this issue I suggest to do the following:
> 1. Revert the original patch.
>
> 2. Stop waiting to long for writes. E.g. use a separate timeout (20ms
> maybe?) to wait for the write. Then do a WARN_ON_ONCE when we timeout.
>
> 3. To the read function add a "if (!in_intterupt()) may_sleep();" and then retest. That should at least print a nice warning when called from atomic context.
>
> 4. Test the whole thing and try to fix all warnings about atomic contexts from the may_sleep();
>
> 5. Reapply the original patch, but this time only for the read function, not the write function.
>
> Regards,
> Christian.
>
> Am 05.03.2018 um 05:20 schrieb Liu, Monk:
>> When there are 16 VF/VM on one GPU, we can easily hit "sys lockup to
>> 22s" kernel error/warning introduced by kiq_rreg/wreg routine That's
>> why I must use this patch to let thread sleep a while and try again,
>>
>> If you insist reverting this patch please give me a solution,
>> otherwise I don't see how it is better by reverting it
>>
>> /Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: 2018年3月3日 21:38
>> To: Kuehling, Felix <Felix.Kuehling@amd.com>; Liu, Monk
>> <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in
>> IRQ(v2)
>>
>> Am 02.03.2018 um 21:47 schrieb Felix Kuehling:
>>> On 2018-03-02 04:29 AM, Liu, Monk wrote:
>>>> In_atomic() isnot encouraged to be used to judge if sleep is
>>>> possible, see the macros of it
>>>>
>>>> #define in_atomic() (preept_count() != 0)
>>> OK. But my point is still that you're not testing the right thing
>>> when you check in_interrupt(). The comment before the in_atomic macro
>>> definition states the limitations and says "do not use in driver code".
>>> Unfortunately it doesn't suggest any alternative. I think
>>> in_interrupt is actually worse, because it misses even more cases than in_atomic.
>> Thinking about this, Felix seems to be absolutely right.
>>
>> So we need to revert this patch since you can't reliable detect in a driver if sleeping is allowed or not.
>>
>> Regards,
>> Christian.
>>
>>> Regards,
>>>      Felix
>>>
>>>> /Monk
>>>>
>>>> -----Original Message-----
>>>> From: Kuehling, Felix
>>>> Sent: 2018年3月1日 23:50
>>>> To: amd-gfx@lists.freedesktop.org; Liu, Monk <Monk.Liu@amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in
>>>> IRQ(v2)
>>>>
>>>> On 2018-02-28 02:27 AM, Monk Liu wrote:
>>>>> sometimes GPU is switched to other VFs and won't swich back soon,
>>>>> so the kiq reg access will not signal within a short period,
>>>>> instead of busy waiting a long time(MAX_KEQ_REG_WAIT) and returning
>>>>> TMO we can istead sleep 5ms and try again later (non irq context)
>>>>>
>>>>> And since the waiting in kiq_r/weg is busy wait, so
>>>>> MAX_KIQ_REG_WAIT shouldn't set to a long time, set it to 10ms is more appropriate.
>>>>>
>>>>> if gpu already in reset state, don't retry the KIQ reg access
>>>>> otherwise it would always hang because KIQ was already die usually.
>>>>>
>>>>> v2:
>>>>> replace schedule() with msleep() for the wait
>>>>>
>>>>> Change-Id: I8fc807ce85a8d30d2b50153f3f3a6eda344ef994
>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 15 +++++++++++++--
>>>>>     1 file changed, 13 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>>> index b832651..1672f5b 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>>> @@ -22,7 +22,7 @@
>>>>>      */
>>>>>     
>>>>>     #include "amdgpu.h"
>>>>> -#define MAX_KIQ_REG_WAIT	100000000 /* in usecs */
>>>>> +#define MAX_KIQ_REG_WAIT	10000 /* in usecs, 10ms */
>>>>>     
>>>>>     uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev)  { @@
>>>>> -152,9
>>>>> +152,14 @@ uint32_t amdgpu_virt_kiq_rreg(struct amdgpu_device
>>>>> +*adev, uint32_t reg)
>>>>>     	amdgpu_ring_commit(ring);
>>>>>     	spin_unlock_irqrestore(&kiq->ring_lock, flags);
>>>>>     
>>>>> +retry_read:
>>>>>     	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
>>>>>     	if (r < 1) {
>>>>>     		DRM_ERROR("wait for kiq fence error: %ld\n", r);
>>>>> +		if (!in_interrupt() && !adev->in_gpu_reset) {
>>>> You should check in_atomic here. Because it's invalid to sleep in atomic context (e.g. while holding a spin lock) even when not in an interrupt.
>>>> This seems to happen a lot for indirect register access, e.g.
>>>> soc15_pcie_rreg.
>>>>
>>>> Regards,
>>>>      Felix
>>>>
>>>>> +			msleep(5);
>>>>> +			goto retry_read;
>>>>> +		}
>>>>>     		return ~0;
>>>>>     	}
>>>>>     	val = adev->wb.wb[adev->virt.reg_val_offs];
>>>>> @@ -179,9 +184,15 @@ void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
>>>>>     	amdgpu_ring_commit(ring);
>>>>>     	spin_unlock_irqrestore(&kiq->ring_lock, flags);
>>>>>     
>>>>> +retry_write:
>>>>>     	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
>>>>> -	if (r < 1)
>>>>> +	if (r < 1) {
>>>>>     		DRM_ERROR("wait for kiq fence error: %ld\n", r);
>>>>> +		if (!in_interrupt() && !adev->in_gpu_reset) {
>>>>> +			msleep(5);
>>>>> +			goto retry_write;
>>>>> +		}
>>>>> +	}
>>>>>     }
>>>>>     
>>>>>     /**
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
       [not found]                                 ` <48b49b85-495f-c5bc-7d8b-447537a94c8f-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-03-05 11:27                                   ` Liu, Monk
       [not found]                                     ` <BLUPR12MB0449F42E6D448FA2C5A7BD2F84DA0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Liu, Monk @ 2018-03-05 11:27 UTC (permalink / raw)
  To: Koenig, Christian, Kuehling, Felix,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Hi Christian

I couldn't find this "may_sleep()" in my 4.13 kernel, did I miss something?

Thanks 
/Monk

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 2018-03-05 19:21
To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)

On 05.03.2018 at 09:08, Liu, Monk wrote:
> To better approach this issue I suggest to do the following:
> 1. Revert the original patch.
>
> 2. Stop waiting to long for writes. E.g. use a separate timeout (20ms
> maybe?) to wait for the write. Then do a WARN_ON_ONCE when we timeout.
> Cannot do that 20ms is not enough, sometimes you need 10 seconds since 
> other VFs may doing bad things like occupying GPU intentionally or 
> they are doing TDR, so I don't think separate read and write is good 
> idea, they should be treated equally

Well the question is if separating read&writes would actually help.

>
> 3. To the read function add a "if (!in_intterupt()) may_sleep();" and then retest. That should at least print a nice warning when called from atomic context.
> Sorry what is may_sleep() ??
>
> 4. Test the whole thing and try to fix all warnings about atomic 
> contexts from the may_sleep();
>
> 5. Reapply the original patch, but this time only for the read function, not the write function.
>
>
>  From current LKG code, the only one spin lock may wrapping the 
> kiq_rreg/wreg() is the pcie_idx_lock, and this lock is only used during init(), Since init() is run under the case of exclusive mode for SRIOV, which means:
> 1)  register access is not go through KIQ (see admgpu_mm_reg)
> 2)  those functions are only in bif_medium_grain_xxx part (vi.c and 
> nbio_v6.c) , and they won't hit under SRIOV ( we return in the head if SRIOV detect) So I don' think this spin_lock may cause trouble...

Ok in this case let's keep the patch for now, but please provide a new patch which adds "if (!in_intterupt()) may_sleep();" in both the read and write function.

This way we should at least catch problems early on.

Christian.

>
> /Monk
>
>
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2018年3月5日 15:57
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; 
> amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in 
> IRQ(v2)
>
>> otherwise I don't see how it is better by reverting it
> Well it's better to revert it for now because it seems to create more problems than it solves.
>
> To better approach this issue I suggest to do the following:
> 1. Revert the original patch.
>
> 2. Stop waiting to long for writes. E.g. use a separate timeout (20ms
> maybe?) to wait for the write. Then do a WARN_ON_ONCE when we timeout.
>
> 3. To the read function add a "if (!in_intterupt()) may_sleep();" and then retest. That should at least print a nice warning when called from atomic context.
>
> 4. Test the whole thing and try to fix all warnings about atomic 
> contexts from the may_sleep();
>
> 5. Reapply the original patch, but this time only for the read function, not the write function.
>
> Regards,
> Christian.
>
> Am 05.03.2018 um 05:20 schrieb Liu, Monk:
>> When there are 16 VF/VM on one GPU, we can easily hit "sys lockup to 
>> 22s" kernel error/warning introduced by kiq_rreg/wreg routine That's 
>> why I must use this patch to let thread sleep a while and try again,
>>
>> If you insist reverting this patch please give me a solution, 
>> otherwise I don't see how it is better by reverting it
>>
>> /Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: 2018年3月3日 21:38
>> To: Kuehling, Felix <Felix.Kuehling@amd.com>; Liu, Monk 
>> <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in
>> IRQ(v2)
>>
>> Am 02.03.2018 um 21:47 schrieb Felix Kuehling:
>>> On 2018-03-02 04:29 AM, Liu, Monk wrote:
>>>> In_atomic() isnot encouraged to be used to judge if sleep is 
>>>> possible, see the macros of it
>>>>
>>>> #define in_atomic() (preept_count() != 0)
>>> OK. But my point is still that you're not testing the right thing 
>>> when you check in_interrupt(). The comment before the in_atomic 
>>> macro definition states the limitations and says "do not use in driver code".
>>> Unfortunately it doesn't suggest any alternative. I think 
>>> in_interrupt is actually worse, because it misses even more cases than in_atomic.
>> Thinking about this, Felix seems to be absolutely right.
>>
>> So we need to revert this patch since you can't reliable detect in a driver if sleeping is allowed or not.
>>
>> Regards,
>> Christian.
>>
>>> Regards,
>>>      Felix
>>>
>>>> /Monk
>>>>
>>>> -----Original Message-----
>>>> From: Kuehling, Felix
>>>> Sent: 2018年3月1日 23:50
>>>> To: amd-gfx@lists.freedesktop.org; Liu, Monk <Monk.Liu@amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in
>>>> IRQ(v2)
>>>>
>>>> On 2018-02-28 02:27 AM, Monk Liu wrote:
>>>>> sometimes GPU is switched to other VFs and won't swich back soon, 
>>>>> so the kiq reg access will not signal within a short period, 
>>>>> instead of busy waiting a long time(MAX_KEQ_REG_WAIT) and 
>>>>> returning TMO we can istead sleep 5ms and try again later (non irq 
>>>>> context)
>>>>>
>>>>> And since the waiting in kiq_r/weg is busy wait, so 
>>>>> MAX_KIQ_REG_WAIT shouldn't set to a long time, set it to 10ms is more appropriate.
>>>>>
>>>>> if gpu already in reset state, don't retry the KIQ reg access 
>>>>> otherwise it would always hang because KIQ was already die usually.
>>>>>
>>>>> v2:
>>>>> replace schedule() with msleep() for the wait
>>>>>
>>>>> Change-Id: I8fc807ce85a8d30d2b50153f3f3a6eda344ef994
>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 15 +++++++++++++--
>>>>>     1 file changed, 13 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>>> index b832651..1672f5b 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>>> @@ -22,7 +22,7 @@
>>>>>      */
>>>>>     
>>>>>     #include "amdgpu.h"
>>>>> -#define MAX_KIQ_REG_WAIT	100000000 /* in usecs */
>>>>> +#define MAX_KIQ_REG_WAIT	10000 /* in usecs, 10ms */
>>>>>     
>>>>>     uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev)  { @@
>>>>> -152,9
>>>>> +152,14 @@ uint32_t amdgpu_virt_kiq_rreg(struct amdgpu_device 
>>>>> +*adev, uint32_t reg)
>>>>>     	amdgpu_ring_commit(ring);
>>>>>     	spin_unlock_irqrestore(&kiq->ring_lock, flags);
>>>>>     
>>>>> +retry_read:
>>>>>     	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
>>>>>     	if (r < 1) {
>>>>>     		DRM_ERROR("wait for kiq fence error: %ld\n", r);
>>>>> +		if (!in_interrupt() && !adev->in_gpu_reset) {
>>>> You should check in_atomic here. Because it's invalid to sleep in atomic context (e.g. while holding a spin lock) even when not in an interrupt.
>>>> This seems to happen a lot for indirect register access, e.g.
>>>> soc15_pcie_rreg.
>>>>
>>>> Regards,
>>>>      Felix
>>>>
>>>>> +			msleep(5);
>>>>> +			goto retry_read;
>>>>> +		}
>>>>>     		return ~0;
>>>>>     	}
>>>>>     	val = adev->wb.wb[adev->virt.reg_val_offs];
>>>>> @@ -179,9 +184,15 @@ void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
>>>>>     	amdgpu_ring_commit(ring);
>>>>>     	spin_unlock_irqrestore(&kiq->ring_lock, flags);
>>>>>     
>>>>> +retry_write:
>>>>>     	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
>>>>> -	if (r < 1)
>>>>> +	if (r < 1) {
>>>>>     		DRM_ERROR("wait for kiq fence error: %ld\n", r);
>>>>> +		if (!in_interrupt() && !adev->in_gpu_reset) {
>>>>> +			msleep(5);
>>>>> +			goto retry_write;
>>>>> +		}
>>>>> +	}
>>>>>     }
>>>>>     
>>>>>     /**
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
       [not found]                                     ` <BLUPR12MB0449F42E6D448FA2C5A7BD2F84DA0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-03-05 11:30                                       ` Christian König
  0 siblings, 0 replies; 18+ messages in thread
From: Christian König @ 2018-03-05 11:30 UTC (permalink / raw)
  To: Liu, Monk, Kuehling, Felix, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Hi Monk,

then I must have gotten the name wrong. Maybe "might_sleep()" or something
like that?

It's just a checker which raises a nice warning when you try to sleep in 
atomic context on debug kernels.
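
For reference, a tiny self-contained sketch of that behaviour (demo_lock and my_blocking_read are made-up names; the check only fires when CONFIG_DEBUG_ATOMIC_SLEEP is enabled):

#include <linux/kernel.h>
#include <linux/delay.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_lock);	/* made-up lock, just for the demo */

static void my_blocking_read(void)	/* made-up helper that may sleep */
{
	might_sleep();	/* warns "BUG: sleeping function called from invalid context" if atomic */
	msleep(5);
}

static void demo(void)
{
	my_blocking_read();	/* fine: process context, no locks held */

	spin_lock(&demo_lock);	/* enter atomic context */
	my_blocking_read();	/* triggers the might_sleep() splat */
	spin_unlock(&demo_lock);
}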

Christian.

On 05.03.2018 at 12:27, Liu, Monk wrote:
> Hi Christian
>
> I couln't find this "may_sleep()" in my 4.13kernel , did I miss something ??
>
> Thanks
> /Monk
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2018年3月5日 19:21
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)
>
> Am 05.03.2018 um 09:08 schrieb Liu, Monk:
>> To better approach this issue I suggest to do the following:
>> 1. Revert the original patch.
>>
>> 2. Stop waiting to long for writes. E.g. use a separate timeout (20ms
>> maybe?) to wait for the write. Then do a WARN_ON_ONCE when we timeout.
>> Cannot do that 20ms is not enough, sometimes you need 10 seconds since
>> other VFs may doing bad things like occupying GPU intentionally or
>> they are doing TDR, so I don't think separate read and write is good
>> idea, they should be treated equally
> Well the question is if separating read&writes would actually help.
>
>> 3. To the read function add a "if (!in_intterupt()) may_sleep();" and then retest. That should at least print a nice warning when called from atomic context.
>> Sorry what is may_sleep() ??
>>
>> 4. Test the whole thing and try to fix all warnings about atomic
>> contexts from the may_sleep();
>>
>> 5. Reapply the original patch, but this time only for the read function, not the write function.
>>
>>
>>   From current LKG code, the only one spin lock may wrapping the
>> kiq_rreg/wreg() is the pcie_idx_lock, and this lock is only used during init(), Since init() is run under the case of exclusive mode for SRIOV, which means:
>> 1)  register access is not go through KIQ (see admgpu_mm_reg)
>> 2)  those functions are only in bif_medium_grain_xxx part (vi.c and
>> nbio_v6.c) , and they won't hit under SRIOV ( we return in the head if SRIOV detect) So I don' think this spin_lock may cause trouble...
> Ok in this case let's keep the patch for now, but please provide a new patch which adds "if (!in_intterupt()) may_sleep();" in both the read and write function.
>
> This way we should at least catch problems early on.
>
> Christian.
>
>> /Monk
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: 2018年3月5日 15:57
>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>;
>> amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in
>> IRQ(v2)
>>
>>> otherwise I don't see how it is better by reverting it
>> Well it's better to revert it for now because it seems to create more problems than it solves.
>>
>> To better approach this issue I suggest to do the following:
>> 1. Revert the original patch.
>>
>> 2. Stop waiting to long for writes. E.g. use a separate timeout (20ms
>> maybe?) to wait for the write. Then do a WARN_ON_ONCE when we timeout.
>>
>> 3. To the read function add a "if (!in_intterupt()) may_sleep();" and then retest. That should at least print a nice warning when called from atomic context.
>>
>> 4. Test the whole thing and try to fix all warnings about atomic
>> contexts from the may_sleep();
>>
>> 5. Reapply the original patch, but this time only for the read function, not the write function.
>>
>> Regards,
>> Christian.
>>
>> Am 05.03.2018 um 05:20 schrieb Liu, Monk:
>>> When there are 16 VF/VM on one GPU, we can easily hit "sys lockup to
>>> 22s" kernel error/warning introduced by kiq_rreg/wreg routine That's
>>> why I must use this patch to let thread sleep a while and try again,
>>>
>>> If you insist reverting this patch please give me a solution,
>>> otherwise I don't see how it is better by reverting it
>>>
>>> /Monk
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>> Sent: 2018年3月3日 21:38
>>> To: Kuehling, Felix <Felix.Kuehling@amd.com>; Liu, Monk
>>> <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in
>>> IRQ(v2)
>>>
>>> Am 02.03.2018 um 21:47 schrieb Felix Kuehling:
>>>> On 2018-03-02 04:29 AM, Liu, Monk wrote:
>>>>> In_atomic() isnot encouraged to be used to judge if sleep is
>>>>> possible, see the macros of it
>>>>>
>>>>> #define in_atomic() (preept_count() != 0)
>>>> OK. But my point is still that you're not testing the right thing
>>>> when you check in_interrupt(). The comment before the in_atomic
>>>> macro definition states the limitations and says "do not use in driver code".
>>>> Unfortunately it doesn't suggest any alternative. I think
>>>> in_interrupt is actually worse, because it misses even more cases than in_atomic.
>>> Thinking about this, Felix seems to be absolutely right.
>>>
>>> So we need to revert this patch since you can't reliable detect in a driver if sleeping is allowed or not.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Regards,
>>>>       Felix
>>>>
>>>>> /Monk
>>>>>
>>>>> -----Original Message-----
>>>>> From: Kuehling, Felix
>>>>> Sent: 2018年3月1日 23:50
>>>>> To: amd-gfx@lists.freedesktop.org; Liu, Monk <Monk.Liu@amd.com>
>>>>> Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in
>>>>> IRQ(v2)
>>>>>
>>>>> On 2018-02-28 02:27 AM, Monk Liu wrote:
>>>>>> sometimes GPU is switched to other VFs and won't swich back soon,
>>>>>> so the kiq reg access will not signal within a short period,
>>>>>> instead of busy waiting a long time(MAX_KEQ_REG_WAIT) and
>>>>>> returning TMO we can istead sleep 5ms and try again later (non irq
>>>>>> context)
>>>>>>
>>>>>> And since the waiting in kiq_r/weg is busy wait, so
>>>>>> MAX_KIQ_REG_WAIT shouldn't set to a long time, set it to 10ms is more appropriate.
>>>>>>
>>>>>> if gpu already in reset state, don't retry the KIQ reg access
>>>>>> otherwise it would always hang because KIQ was already die usually.
>>>>>>
>>>>>> v2:
>>>>>> replace schedule() with msleep() for the wait
>>>>>>
>>>>>> Change-Id: I8fc807ce85a8d30d2b50153f3f3a6eda344ef994
>>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>>> ---
>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 15 +++++++++++++--
>>>>>>      1 file changed, 13 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>>>> index b832651..1672f5b 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>>>>>> @@ -22,7 +22,7 @@
>>>>>>       */
>>>>>>      
>>>>>>      #include "amdgpu.h"
>>>>>> -#define MAX_KIQ_REG_WAIT	100000000 /* in usecs */
>>>>>> +#define MAX_KIQ_REG_WAIT	10000 /* in usecs, 10ms */
>>>>>>      
>>>>>>      uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev)  { @@
>>>>>> -152,9
>>>>>> +152,14 @@ uint32_t amdgpu_virt_kiq_rreg(struct amdgpu_device
>>>>>> +*adev, uint32_t reg)
>>>>>>      	amdgpu_ring_commit(ring);
>>>>>>      	spin_unlock_irqrestore(&kiq->ring_lock, flags);
>>>>>>      
>>>>>> +retry_read:
>>>>>>      	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
>>>>>>      	if (r < 1) {
>>>>>>      		DRM_ERROR("wait for kiq fence error: %ld\n", r);
>>>>>> +		if (!in_interrupt() && !adev->in_gpu_reset) {
>>>>> You should check in_atomic here. Because it's invalid to sleep in atomic context (e.g. while holding a spin lock) even when not in an interrupt.
>>>>> This seems to happen a lot for indirect register access, e.g.
>>>>> soc15_pcie_rreg.
>>>>>
>>>>> Regards,
>>>>>       Felix
>>>>>
>>>>>> +			msleep(5);
>>>>>> +			goto retry_read;
>>>>>> +		}
>>>>>>      		return ~0;
>>>>>>      	}
>>>>>>      	val = adev->wb.wb[adev->virt.reg_val_offs];
>>>>>> @@ -179,9 +184,15 @@ void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
>>>>>>      	amdgpu_ring_commit(ring);
>>>>>>      	spin_unlock_irqrestore(&kiq->ring_lock, flags);
>>>>>>      
>>>>>> +retry_write:
>>>>>>      	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
>>>>>> -	if (r < 1)
>>>>>> +	if (r < 1) {
>>>>>>      		DRM_ERROR("wait for kiq fence error: %ld\n", r);
>>>>>> +		if (!in_interrupt() && !adev->in_gpu_reset) {
>>>>>> +			msleep(5);
>>>>>> +			goto retry_write;
>>>>>> +		}
>>>>>> +	}
>>>>>>      }
>>>>>>      
>>>>>>      /**
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-03-05 11:30 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-28  7:27 [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2) Monk Liu
     [not found] ` <1519802862-9513-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-03-01  5:52   ` Liu, Monk
     [not found]     ` <BLUPR12MB0449438D2E411F534071ED3684C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-03-01  8:12       ` Christian König
     [not found]         ` <6362f78f-7b22-2fc1-5ce3-805190c4ffb0-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-03-01  8:21           ` Liu, Monk
     [not found]             ` <BLUPR12MB0449EAA8B40EDA5D6BC762E284C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-03-01  8:40               ` Ding, Pixel
     [not found]                 ` <7624B23B-CE3F-4608-B9BF-F01DB29C8503-5C7GfCeVMHo@public.gmane.org>
2018-03-01  8:42                   ` Christian König
2018-03-01 15:50   ` Felix Kuehling
     [not found]     ` <3d12195d-44d3-d9b1-82cd-482f49534ec5-5C7GfCeVMHo@public.gmane.org>
2018-03-02  9:29       ` Liu, Monk
     [not found]         ` <BLUPR12MB04498B47E0BAFFF2A80A4CE684C50-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-03-02 20:47           ` Felix Kuehling
     [not found]             ` <e8a28819-c7fa-5b64-6b5b-723299d517f4-5C7GfCeVMHo@public.gmane.org>
2018-03-03 13:38               ` Christian König
     [not found]                 ` <ecccef01-fdab-0810-030a-cc0f0f0ee814-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-03-05  3:47                   ` Liu, Monk
2018-03-05  4:20                   ` Liu, Monk
     [not found]                     ` <BLUPR12MB0449708DD427D1E744992FB584DA0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-03-05  7:57                       ` Christian König
     [not found]                         ` <1df4fa9f-1a5c-f005-cfa8-e4346d7892c1-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-03-05  8:08                           ` Liu, Monk
     [not found]                             ` <BLUPR12MB0449FFE3490C8EA3394D41DD84DA0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-03-05 11:20                               ` Christian König
     [not found]                                 ` <48b49b85-495f-c5bc-7d8b-447537a94c8f-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-03-05 11:27                                   ` Liu, Monk
     [not found]                                     ` <BLUPR12MB0449F42E6D448FA2C5A7BD2F84DA0-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-03-05 11:30                                       ` Christian König
2018-03-05  4:03       ` Liu, Monk

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.