* [PATCH] drm/etnaviv: bring back progress check in job timeout handler
@ 2018-06-27 14:34 Lucas Stach
  2018-06-27 16:15 ` Fabio Estevam
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Lucas Stach @ 2018-06-27 14:34 UTC (permalink / raw)
  To: Russell King; +Cc: kernel, etnaviv, dri-devel, patchwork-lst

When the hangcheck handler was replaced by the DRM scheduler timeout
handling we dropped the forward progress check, as this might allow
clients to hog the GPU for a long time with a big job.

It turns out that even reasonably well behaved clients like the
Armada Xorg driver occasionally trip over the 500ms timeout. Bring
back the forward progress check to get rid of the userspace regression.

We would still like to fix userspace to submit smaller batches
if possible, but that is for another day.

Fixes: 6d7a20c07760 (drm/etnaviv: replace hangcheck with scheduler timeout)
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
---
 drivers/gpu/drm/etnaviv/etnaviv_gpu.h   |  3 +++
 drivers/gpu/drm/etnaviv/etnaviv_sched.c | 24 ++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gpu.h b/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
index dd430f0f8ff5..90f17ff7888e 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
+++ b/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
@@ -131,6 +131,9 @@ struct etnaviv_gpu {
 	struct work_struct sync_point_work;
 	int sync_point_event;
 
+	/* hang detection */
+	u32 hangcheck_dma_addr;
+
 	void __iomem *mmio;
 	int irq;
 
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index a74eb57af15b..50d6b88cb7aa 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -10,6 +10,7 @@
 #include "etnaviv_gem.h"
 #include "etnaviv_gpu.h"
 #include "etnaviv_sched.h"
+#include "state.xml.h"
 
 static int etnaviv_job_hang_limit = 0;
 module_param_named(job_hang_limit, etnaviv_job_hang_limit, int , 0444);
@@ -85,6 +86,29 @@ static void etnaviv_sched_timedout_job(struct drm_sched_job *sched_job)
 {
 	struct etnaviv_gem_submit *submit = to_etnaviv_submit(sched_job);
 	struct etnaviv_gpu *gpu = submit->gpu;
+	u32 dma_addr;
+	int change;
+
+	/*
+	 * If the GPU managed to complete this jobs fence, the timout is
+	 * spurious. Bail out.
+	 */
+	if (fence_completed(gpu, submit->out_fence->seqno))
+		return;
+
+	/*
+	 * If the GPU is still making forward progress on the front-end (which
+	 * should never loop) we shift out the timeout to give it a chance to
+	 * finish the job.
+	 */
+	dma_addr = gpu_read(gpu, VIVS_FE_DMA_ADDRESS);
+	change = dma_addr - gpu->hangcheck_dma_addr;
+	if (change < 0 || change > 16) {
+		gpu->hangcheck_dma_addr = dma_addr;
+		schedule_delayed_work(&sched_job->work_tdr,
+				      sched_job->sched->timeout);
+		return;
+	}
 
 	/* block scheduler */
 	kthread_park(gpu->sched.thread);
-- 
2.17.1

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: [PATCH] drm/etnaviv: bring back progress check in job timeout handler
  2018-06-27 14:34 [PATCH] drm/etnaviv: bring back progress check in job timeout handler Lucas Stach
@ 2018-06-27 16:15 ` Fabio Estevam
  2018-06-27 17:25 ` Eric Anholt
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Fabio Estevam @ 2018-06-27 16:15 UTC (permalink / raw)
  To: Lucas Stach
  Cc: The etnaviv authors, Russell King, DRI mailing list,
	patchwork-lst, Sascha Hauer

Hi Lucas,

On Wed, Jun 27, 2018 at 11:34 AM, Lucas Stach <l.stach@pengutronix.de> wrote:
> When the hangcheck handler was replaced by the DRM scheduler timeout
> handling we dropped the forward progress check, as this might allow
> clients to hog the GPU for a long time with a big job.
>
> It turns out that even reasonably well behaved clients like the
> Armada Xorg driver occasionally trip over the 500ms timeout. Bring
> back the forward progress check to get rid of the userspace regression.
>
> We would still like to fix userspace to submit smaller batches
> if possible, but that is for another day.
>
> Fixes: 6d7a20c07760 (drm/etnaviv: replace hangcheck with scheduler timeout)

Maybe you could add a Reported-by tag from Russell?

> Signed-off-by: Lucas Stach <l.stach@pengutronix.de>

> +       /*
> +        * If the GPU managed to complete this jobs fence, the timout is

s/timout/timeout

Thanks


* Re: [PATCH] drm/etnaviv: bring back progress check in job timeout handler
  2018-06-27 14:34 [PATCH] drm/etnaviv: bring back progress check in job timeout handler Lucas Stach
  2018-06-27 16:15 ` Fabio Estevam
@ 2018-06-27 17:25 ` Eric Anholt
  2018-06-28  8:40   ` Lucas Stach
  2018-07-03  9:26 ` Lucas Stach
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 9+ messages in thread
From: Eric Anholt @ 2018-06-27 17:25 UTC (permalink / raw)
  To: Lucas Stach, Russell King; +Cc: dri-devel, etnaviv, kernel, patchwork-lst


Lucas Stach <l.stach@pengutronix.de> writes:

> When the hangcheck handler was replaced by the DRM scheduler timeout
> handling we dropped the forward progress check, as this might allow
> clients to hog the GPU for a long time with a big job.
>
> It turns out that even reasonably well behaved clients like the
> Armada Xorg driver occasionally trip over the 500ms timeout. Bring
> back the forward progress check to get rid of the userspace regression.
>
> We would still like to fix userspace to submit smaller batches
> if possible, but that is for another day.
>
> Fixes: 6d7a20c07760 (drm/etnaviv: replace hangcheck with scheduler timeout)
> Signed-off-by: Lucas Stach <l.stach@pengutronix.de>

I was just wondering if there was a way to do this with the scheduler (I
had a similar issue with GTF-GLES2.gtf.GL.acos.acos_float_vert_xvary),
and this looks correct.

As far as I can see, the fence_completed check shouldn't be necessary,
since you'll get a cancel_delayed_work_sync() once the job finish
happens, so you're only really protecting from a timeout not detecting
progress in between fence signal and job finish, but we expect job
finish to be quick.

Regardless,

Reviewed-by: Eric Anholt <eric@anholt.net>




* Re: [PATCH] drm/etnaviv: bring back progress check in job timeout handler
  2018-06-27 17:25 ` Eric Anholt
@ 2018-06-28  8:40   ` Lucas Stach
  2018-06-28 17:35     ` Eric Anholt
  0 siblings, 1 reply; 9+ messages in thread
From: Lucas Stach @ 2018-06-28  8:40 UTC (permalink / raw)
  To: Eric Anholt, Russell King; +Cc: dri-devel, etnaviv, kernel, patchwork-lst

On Wednesday, 27.06.2018 at 10:25 -0700, Eric Anholt wrote:
> Lucas Stach <l.stach@pengutronix.de> writes:
> 
> > When the hangcheck handler was replaced by the DRM scheduler timeout
> > handling we dropped the forward progress check, as this might allow
> > clients to hog the GPU for a long time with a big job.
> > 
> > It turns out that even reasonably well behaved clients like the
> > Armada Xorg driver occasionally trip over the 500ms timeout. Bring
> > back the forward progress check to get rid of the userspace regression.
> > 
> > We would still like to fix userspace to submit smaller batches
> > if possible, but that is for another day.
> > 
> > Fixes: 6d7a20c07760 (drm/etnaviv: replace hangcheck with scheduler timeout)
> > Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
> 
> I was just wondering if there was a way to do this with the scheduler (I
> had a similar issue with GTF-GLES2.gtf.GL.acos.acos_float_vert_xvary),
> and this looks correct.

What are you thinking about? A forward progress check at sub-fence
granularity is always going to be GPU specific. The only thing that
could be shunted to the scheduler is rearming of the timer. We could do
this by changing the return type of timedout_job to something that
allows us to indicate a false-positive to the scheduler.

> As far as I can see, the fence_completed check shouldn't be necessary,
> since you'll get a cancel_delayed_work_sync() once the job finish
> happens, so you're only really protecting from a timeout not detecting
> progress in between fence signal and job finish, but we expect job
> finish to be quick.

Yes, it's really only guarding against this small window. Still I would
like to skip the overhead of stopping and restarting the whole
scheduler in case the job managed to finish in this window. That's
probably something that could even be moved in common scheduler code,
but probably as part of a follow on cleanup, instead of this stable
patch.

> Regardless,
> 
> > Reviewed-by: Eric Anholt <eric@anholt.net>

Thanks,
Lucas


* Re: [PATCH] drm/etnaviv: bring back progress check in job timeout handler
  2018-06-28  8:40   ` Lucas Stach
@ 2018-06-28 17:35     ` Eric Anholt
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Anholt @ 2018-06-28 17:35 UTC (permalink / raw)
  To: Lucas Stach, Russell King; +Cc: dri-devel, etnaviv, kernel, patchwork-lst


Lucas Stach <l.stach@pengutronix.de> writes:

> On Wednesday, 27.06.2018 at 10:25 -0700, Eric Anholt wrote:
>> Lucas Stach <l.stach@pengutronix.de> writes:
>> 
>> > When the hangcheck handler was replaced by the DRM scheduler timeout
>> > handling we dropped the forward progress check, as this might allow
>> > clients to hog the GPU for a long time with a big job.
>> > 
>> > It turns out that even reasonably well behaved clients like the
>> > Armada Xorg driver occasionally trip over the 500ms timeout. Bring
>> > back the forward progress check to get rid of the userspace regression.
>> > 
>> > We would still like to fix userspace to submit smaller batches
>> > if possible, but that is for another day.
>> > 
>> > Fixes: 6d7a20c07760 (drm/etnaviv: replace hangcheck with scheduler timeout)
>> > Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
>> 
>> I was just wondering if there was a way to do this with the scheduler (I
>> had a similar issue with GTF-GLES2.gtf.GL.acos.acos_float_vert_xvary),
>> and this looks correct.
>
> What are you thinking about? A forward progress check at sub-fence
> granularity is always going to be GPU specific. The only thing that
> could be shunted to the scheduler is rearming of the timer. We could do
> this by changing the return type of timedout_job to something that
> allows us to indicate a false-positive to the scheduler.

That sounds like a nice future improvement.




* Re: [PATCH] drm/etnaviv: bring back progress check in job timeout handler
  2018-06-27 14:34 [PATCH] drm/etnaviv: bring back progress check in job timeout handler Lucas Stach
  2018-06-27 16:15 ` Fabio Estevam
  2018-06-27 17:25 ` Eric Anholt
@ 2018-07-03  9:26 ` Lucas Stach
  2018-07-03  9:44 ` Russell King - ARM Linux
  2018-07-11 20:14 ` Russell King - ARM Linux
  4 siblings, 0 replies; 9+ messages in thread
From: Lucas Stach @ 2018-07-03  9:26 UTC (permalink / raw)
  To: Russell King; +Cc: dri-devel, etnaviv, kernel, patchwork-lst

Hi Russell,

can you tell me if you will be able to provide some test feedback for
this patch during this week? I would like to get this into a fixes pull
towards the end of the week and would feel a bit more confident to do
so knowing that this really fixes the issue you are hitting on GC600.

Regards,
Lucas

On Wednesday, 27.06.2018 at 16:34 +0200, Lucas Stach wrote:
> When the hangcheck handler was replaced by the DRM scheduler timeout
> handling we dropped the forward progress check, as this might allow
> clients to hog the GPU for a long time with a big job.
> 
> It turns out that even reasonably well behaved clients like the
> Armada Xorg driver occasionally trip over the 500ms timeout. Bring
> back the forward progress check to get rid of the userspace regression.
> 
> We would still like to fix userspace to submit smaller batches
> if possible, but that is for another day.
> 
> Fixes: 6d7a20c07760 (drm/etnaviv: replace hangcheck with scheduler timeout)
> Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
> ---
>  drivers/gpu/drm/etnaviv/etnaviv_gpu.h   |  3 +++
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c | 24 ++++++++++++++++++++++++
>  2 files changed, 27 insertions(+)
> 
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gpu.h b/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
> index dd430f0f8ff5..90f17ff7888e 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
> @@ -131,6 +131,9 @@ struct etnaviv_gpu {
>  	struct work_struct sync_point_work;
>  	int sync_point_event;
>  
> +	/* hang detection */
> +	u32 hangcheck_dma_addr;
> +
>  	void __iomem *mmio;
>  	int irq;
>  
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> index a74eb57af15b..50d6b88cb7aa 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> @@ -10,6 +10,7 @@
>  #include "etnaviv_gem.h"
>  #include "etnaviv_gpu.h"
>  #include "etnaviv_sched.h"
> +#include "state.xml.h"
>  
>  static int etnaviv_job_hang_limit = 0;
>  module_param_named(job_hang_limit, etnaviv_job_hang_limit, int , 0444);
> @@ -85,6 +86,29 @@ static void etnaviv_sched_timedout_job(struct drm_sched_job *sched_job)
>  {
>  	struct etnaviv_gem_submit *submit = to_etnaviv_submit(sched_job);
>  	struct etnaviv_gpu *gpu = submit->gpu;
> +	u32 dma_addr;
> +	int change;
> +
> +	/*
> +	 * If the GPU managed to complete this jobs fence, the timout is
> +	 * spurious. Bail out.
> +	 */
> +	if (fence_completed(gpu, submit->out_fence->seqno))
> +		return;
> +
> +	/*
> +	 * If the GPU is still making forward progress on the front-end (which
> +	 * should never loop) we shift out the timeout to give it a chance to
> +	 * finish the job.
> +	 */
> +	dma_addr = gpu_read(gpu, VIVS_FE_DMA_ADDRESS);
> +	change = dma_addr - gpu->hangcheck_dma_addr;
> +	if (change < 0 || change > 16) {
> +		gpu->hangcheck_dma_addr = dma_addr;
> +		schedule_delayed_work(&sched_job->work_tdr,
> +				      sched_job->sched->timeout);
> +		return;
> +	}
>  
>  	/* block scheduler */
>  	kthread_park(gpu->sched.thread);


* Re: [PATCH] drm/etnaviv: bring back progress check in job timeout handler
  2018-06-27 14:34 [PATCH] drm/etnaviv: bring back progress check in job timeout handler Lucas Stach
                   ` (2 preceding siblings ...)
  2018-07-03  9:26 ` Lucas Stach
@ 2018-07-03  9:44 ` Russell King - ARM Linux
  2018-07-03  9:54   ` Lucas Stach
  2018-07-11 20:14 ` Russell King - ARM Linux
  4 siblings, 1 reply; 9+ messages in thread
From: Russell King - ARM Linux @ 2018-07-03  9:44 UTC (permalink / raw)
  To: Lucas Stach; +Cc: kernel, etnaviv, dri-devel, patchwork-lst

On Wed, Jun 27, 2018 at 04:34:27PM +0200, Lucas Stach wrote:
> When the hangcheck handler was replaced by the DRM scheduler timeout
> handling we dropped the forward progress check, as this might allow
> clients to hog the GPU for a long time with a big job.
> 
> It turns out that even reasonably well behaved clients like the
> Armada Xorg driver occasionally trip over the 500ms timeout. Bring
> back the forward progress check to get rid of the userspace regression.
> 
> We would still like to fix userspace to submit smaller batches
> if possible, but that is for another day.
> 
> Fixes: 6d7a20c07760 (drm/etnaviv: replace hangcheck with scheduler timeout)
> Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
> ---
>  drivers/gpu/drm/etnaviv/etnaviv_gpu.h   |  3 +++
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c | 24 ++++++++++++++++++++++++
>  2 files changed, 27 insertions(+)
> 
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gpu.h b/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
> index dd430f0f8ff5..90f17ff7888e 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
> @@ -131,6 +131,9 @@ struct etnaviv_gpu {
>  	struct work_struct sync_point_work;
>  	int sync_point_event;
>  
> +	/* hang detection */
> +	u32 hangcheck_dma_addr;
> +
>  	void __iomem *mmio;
>  	int irq;
>  
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> index a74eb57af15b..50d6b88cb7aa 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> @@ -10,6 +10,7 @@
>  #include "etnaviv_gem.h"
>  #include "etnaviv_gpu.h"
>  #include "etnaviv_sched.h"
> +#include "state.xml.h"
>  
>  static int etnaviv_job_hang_limit = 0;
>  module_param_named(job_hang_limit, etnaviv_job_hang_limit, int , 0444);
> @@ -85,6 +86,29 @@ static void etnaviv_sched_timedout_job(struct drm_sched_job *sched_job)
>  {
>  	struct etnaviv_gem_submit *submit = to_etnaviv_submit(sched_job);
>  	struct etnaviv_gpu *gpu = submit->gpu;
> +	u32 dma_addr;
> +	int change;
> +
> +	/*
> +	 * If the GPU managed to complete this jobs fence, the timout is
> +	 * spurious. Bail out.
> +	 */
> +	if (fence_completed(gpu, submit->out_fence->seqno))
> +		return;
> +
> +	/*
> +	 * If the GPU is still making forward progress on the front-end (which
> +	 * should never loop) we shift out the timeout to give it a chance to
> +	 * finish the job.
> +	 */
> +	dma_addr = gpu_read(gpu, VIVS_FE_DMA_ADDRESS);
> +	change = dma_addr - gpu->hangcheck_dma_addr;
> +	if (change < 0 || change > 16) {
> +		gpu->hangcheck_dma_addr = dma_addr;
> +		schedule_delayed_work(&sched_job->work_tdr,
> +				      sched_job->sched->timeout);
> +		return;
> +	}

Doesn't this patch, by ignoring the timeout, have the effect of completely
disabling the job timeout after its first instance of firing?  From my
reading of the gpu scheduler code, this seems to be the case.

work_tdr is only armed when:
- a job completes and there is another job waiting (we're waiting for it...)
- a job is started, and is the first job on the ring (which won't happen)
- drm_sched_job_recovery() is called (which it isn't)

So, what would appear to happen is we timeout the job, but detect we
have progress, so the timeout is ignored.  The timeout is not re-armed,
so if there was some progress and /then/ the GPU gets stuck, there is
nothing to re-check for progress or timeout.

I suspect that will completely defeat the timeout mechanism, since the
first (and only) timeout will always update the progress check and cause
the timeout to be ignored.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


* Re: [PATCH] drm/etnaviv: bring back progress check in job timeout handler
  2018-07-03  9:44 ` Russell King - ARM Linux
@ 2018-07-03  9:54   ` Lucas Stach
  0 siblings, 0 replies; 9+ messages in thread
From: Lucas Stach @ 2018-07-03  9:54 UTC (permalink / raw)
  To: Russell King - ARM Linux; +Cc: kernel, etnaviv, dri-devel, patchwork-lst

On Tuesday, 03.07.2018 at 10:44 +0100, Russell King - ARM Linux wrote:
> On Wed, Jun 27, 2018 at 04:34:27PM +0200, Lucas Stach wrote:
> > When the hangcheck handler was replaced by the DRM scheduler timeout
> > handling we dropped the forward progress check, as this might allow
> > clients to hog the GPU for a long time with a big job.
> > 
> > It turns out that even reasonably well behaved clients like the
> > Armada Xorg driver occasionally trip over the 500ms timeout. Bring
> > back the forward progress check to get rid of the userspace regression.
> > 
> > We would still like to fix userspace to submit smaller batches
> > if possible, but that is for another day.
> > 
> > Fixes: 6d7a20c07760 (drm/etnaviv: replace hangcheck with scheduler timeout)
> > Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
> > ---
> >  drivers/gpu/drm/etnaviv/etnaviv_gpu.h   |  3 +++
> >  drivers/gpu/drm/etnaviv/etnaviv_sched.c | 24 ++++++++++++++++++++++++
> >  2 files changed, 27 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gpu.h b/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
> > index dd430f0f8ff5..90f17ff7888e 100644
> > --- a/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
> > +++ b/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
> > @@ -131,6 +131,9 @@ struct etnaviv_gpu {
> >  	struct work_struct sync_point_work;
> >  	int sync_point_event;
> >  
> > +	/* hang detection */
> > +	u32 hangcheck_dma_addr;
> > +
> >  	void __iomem *mmio;
> >  	int irq;
> >  
> > diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > index a74eb57af15b..50d6b88cb7aa 100644
> > --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > @@ -10,6 +10,7 @@
> >  #include "etnaviv_gem.h"
> >  #include "etnaviv_gpu.h"
> >  #include "etnaviv_sched.h"
> > +#include "state.xml.h"
> >  
> >  static int etnaviv_job_hang_limit = 0;
> >  module_param_named(job_hang_limit, etnaviv_job_hang_limit, int , 0444);
> > @@ -85,6 +86,29 @@ static void etnaviv_sched_timedout_job(struct drm_sched_job *sched_job)
> >  {
> >  	struct etnaviv_gem_submit *submit = to_etnaviv_submit(sched_job);
> >  	struct etnaviv_gpu *gpu = submit->gpu;
> > +	u32 dma_addr;
> > +	int change;
> > +
> > +	/*
> > +	 * If the GPU managed to complete this jobs fence, the timout is
> > +	 * spurious. Bail out.
> > +	 */
> > +	if (fence_completed(gpu, submit->out_fence->seqno))
> > +		return;
> > +
> > +	/*
> > +	 * If the GPU is still making forward progress on the front-end (which
> > +	 * should never loop) we shift out the timeout to give it a chance to
> > +	 * finish the job.
> > +	 */
> > +	dma_addr = gpu_read(gpu, VIVS_FE_DMA_ADDRESS);
> > +	change = dma_addr - gpu->hangcheck_dma_addr;
> > +	if (change < 0 || change > 16) {
> > +		gpu->hangcheck_dma_addr = dma_addr;
> > +		schedule_delayed_work(&sched_job->work_tdr,
> > +				      sched_job->sched->timeout);
> > +		return;
> > +	}
> 
> Doesn't this patch, by ignoring the timeout, have the effect of completely
> disabling the job timeout after its first instance of firing?  From my
> reading of the gpu scheduler code, this seems to be the case.

The schedule_delayed_work() in the code above is what rearms the
timeout in case we detect progress, so it should fire again.

> work_tdr is only armed when:
> - a job completes and there is another job waiting (we're waiting for it...)
> - a job is started, and is the first job on the ring (which won't happen)
> - drm_sched_job_recovery() is called (which it isn't)
> 
> So, what would appear to happen is we timeout the job, but detect we
> have progress, so the timeout is ignored.  The timeout is not re-armed,
> so if there was some progress and /then/ the GPU gets stuck, there is
> nothing to re-check for progress or timeout.
> 
> I suspect that will completely defeat the timeout mechanism, since the
> first (and only) timeout will always update the progress check and cause
> the timeout to be ignored.
> 


* Re: [PATCH] drm/etnaviv: bring back progress check in job timeout handler
  2018-06-27 14:34 [PATCH] drm/etnaviv: bring back progress check in job timeout handler Lucas Stach
                   ` (3 preceding siblings ...)
  2018-07-03  9:44 ` Russell King - ARM Linux
@ 2018-07-11 20:14 ` Russell King - ARM Linux
  4 siblings, 0 replies; 9+ messages in thread
From: Russell King - ARM Linux @ 2018-07-11 20:14 UTC (permalink / raw)
  To: Lucas Stach; +Cc: kernel, etnaviv, dri-devel, patchwork-lst

On Wed, Jun 27, 2018 at 04:34:27PM +0200, Lucas Stach wrote:
> When the hangcheck handler was replaced by the DRM scheduler timeout
> handling we dropped the forward progress check, as this might allow
> clients to hog the GPU for a long time with a big job.
> 
> It turns out that even reasonably well behaved clients like the
> Armada Xorg driver occasionally trip over the 500ms timeout. Bring
> back the forward progress check to get rid of the userspace regression.
> 
> We would still like to fix userspace to submit smaller batches
> if possible, but that is for another day.
> 
> Fixes: 6d7a20c07760 (drm/etnaviv: replace hangcheck with scheduler timeout)
> Signed-off-by: Lucas Stach <l.stach@pengutronix.de>

Tested-by: Russell King <rmk+kernel@armlinux.org.uk>

Thanks.

> ---
>  drivers/gpu/drm/etnaviv/etnaviv_gpu.h   |  3 +++
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c | 24 ++++++++++++++++++++++++
>  2 files changed, 27 insertions(+)
> 
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gpu.h b/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
> index dd430f0f8ff5..90f17ff7888e 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gpu.h
> @@ -131,6 +131,9 @@ struct etnaviv_gpu {
>  	struct work_struct sync_point_work;
>  	int sync_point_event;
>  
> +	/* hang detection */
> +	u32 hangcheck_dma_addr;
> +
>  	void __iomem *mmio;
>  	int irq;
>  
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> index a74eb57af15b..50d6b88cb7aa 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> @@ -10,6 +10,7 @@
>  #include "etnaviv_gem.h"
>  #include "etnaviv_gpu.h"
>  #include "etnaviv_sched.h"
> +#include "state.xml.h"
>  
>  static int etnaviv_job_hang_limit = 0;
>  module_param_named(job_hang_limit, etnaviv_job_hang_limit, int , 0444);
> @@ -85,6 +86,29 @@ static void etnaviv_sched_timedout_job(struct drm_sched_job *sched_job)
>  {
>  	struct etnaviv_gem_submit *submit = to_etnaviv_submit(sched_job);
>  	struct etnaviv_gpu *gpu = submit->gpu;
> +	u32 dma_addr;
> +	int change;
> +
> +	/*
> +	 * If the GPU managed to complete this jobs fence, the timout is
> +	 * spurious. Bail out.
> +	 */
> +	if (fence_completed(gpu, submit->out_fence->seqno))
> +		return;
> +
> +	/*
> +	 * If the GPU is still making forward progress on the front-end (which
> +	 * should never loop) we shift out the timeout to give it a chance to
> +	 * finish the job.
> +	 */
> +	dma_addr = gpu_read(gpu, VIVS_FE_DMA_ADDRESS);
> +	change = dma_addr - gpu->hangcheck_dma_addr;
> +	if (change < 0 || change > 16) {
> +		gpu->hangcheck_dma_addr = dma_addr;
> +		schedule_delayed_work(&sched_job->work_tdr,
> +				      sched_job->sched->timeout);
> +		return;
> +	}
>  
>  	/* block scheduler */
>  	kthread_park(gpu->sched.thread);
> -- 
> 2.17.1
> 

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 13.8Mbps down 630kbps up
According to speedtest.net: 13Mbps down 490kbps up

