dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v1] drm/panfrost: Sync IRQ by job's timeout handler
@ 2023-07-17  6:52 Dmitry Osipenko
  2023-07-17  7:05 ` Boris Brezillon
  0 siblings, 1 reply; 8+ messages in thread
From: Dmitry Osipenko @ 2023-07-17  6:52 UTC (permalink / raw)
  To: Rob Herring, Steven Price, Boris Brezillon
  Cc: kernel, linux-kernel, dri-devel

Panfrost IRQ handler may stuck for a long time, for example this happens
when there is a bad HDMI connection and HDMI handler takes a long time to
finish processing, holding Panfrost. Make Panfrost's job timeout handler
to sync IRQ before checking fence signal status in order to prevent
spurious job timeouts due to a slow IRQ processing.

Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index dbc597ab46fb..a356163da22d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -713,6 +713,8 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
 	struct panfrost_device *pfdev = job->pfdev;
 	int js = panfrost_job_get_slot(job);
 
+	synchronize_irq(pfdev->js->irq);
+
 	/*
 	 * If the GPU managed to complete this jobs fence, the timeout is
 	 * spurious. Bail out.
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v1] drm/panfrost: Sync IRQ by job's timeout handler
  2023-07-17  6:52 [PATCH v1] drm/panfrost: Sync IRQ by job's timeout handler Dmitry Osipenko
@ 2023-07-17  7:05 ` Boris Brezillon
  2023-07-17  7:20   ` Dmitry Osipenko
  0 siblings, 1 reply; 8+ messages in thread
From: Boris Brezillon @ 2023-07-17  7:05 UTC (permalink / raw)
  To: Dmitry Osipenko; +Cc: kernel, linux-kernel, dri-devel, Steven Price

Hi Dmitry,

On Mon, 17 Jul 2023 09:52:54 +0300
Dmitry Osipenko <dmitry.osipenko@collabora.com> wrote:

> Panfrost IRQ handler may stuck for a long time, for example this happens
> when there is a bad HDMI connection and HDMI handler takes a long time to
> finish processing, holding Panfrost. Make Panfrost's job timeout handler
> to sync IRQ before checking fence signal status in order to prevent
> spurious job timeouts due to a slow IRQ processing.

Feels like the problem should be fixed in the HDMI encoder driver
instead, so it doesn't stall the whole system when processing its
IRQs (use threaded irqs, maybe). I honestly don't think blocking in the
job timeout path to flush IRQs is a good strategy.

Regards,

Boris

> 
> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
> ---
>  drivers/gpu/drm/panfrost/panfrost_job.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index dbc597ab46fb..a356163da22d 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -713,6 +713,8 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>  	struct panfrost_device *pfdev = job->pfdev;
>  	int js = panfrost_job_get_slot(job);
>  
> +	synchronize_irq(pfdev->js->irq);
> +
>  	/*
>  	 * If the GPU managed to complete this jobs fence, the timeout is
>  	 * spurious. Bail out.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v1] drm/panfrost: Sync IRQ by job's timeout handler
  2023-07-17  7:05 ` Boris Brezillon
@ 2023-07-17  7:20   ` Dmitry Osipenko
  2023-07-17  7:49     ` Boris Brezillon
  0 siblings, 1 reply; 8+ messages in thread
From: Dmitry Osipenko @ 2023-07-17  7:20 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: kernel, linux-kernel, dri-devel, Steven Price

Hi,

On 7/17/23 10:05, Boris Brezillon wrote:
> Hi Dmitry,
> 
> On Mon, 17 Jul 2023 09:52:54 +0300
> Dmitry Osipenko <dmitry.osipenko@collabora.com> wrote:
> 
>> Panfrost IRQ handler may stuck for a long time, for example this happens
>> when there is a bad HDMI connection and HDMI handler takes a long time to
>> finish processing, holding Panfrost. Make Panfrost's job timeout handler
>> to sync IRQ before checking fence signal status in order to prevent
>> spurious job timeouts due to a slow IRQ processing.
> 
> Feels like the problem should be fixed in the HDMI encoder driver
> instead, so it doesn't stall the whole system when processing its
> IRQs (use threaded irqs, maybe). I honestly don't think blocking in the
> job timeout path to flush IRQs is a good strategy.

The syncing is necessary to have for correctness regardless of whether
it's HDMI problem or something else, there could be other reasons for
CPU to delay IRQ processing. It's wrong to say that hw is hung, while
it's not.

-- 
Best regards,
Dmitry


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v1] drm/panfrost: Sync IRQ by job's timeout handler
  2023-07-17  7:20   ` Dmitry Osipenko
@ 2023-07-17  7:49     ` Boris Brezillon
  2023-07-17  8:06       ` Steven Price
  0 siblings, 1 reply; 8+ messages in thread
From: Boris Brezillon @ 2023-07-17  7:49 UTC (permalink / raw)
  To: Dmitry Osipenko; +Cc: kernel, linux-kernel, dri-devel, Steven Price

On Mon, 17 Jul 2023 10:20:02 +0300
Dmitry Osipenko <dmitry.osipenko@collabora.com> wrote:

> Hi,
> 
> On 7/17/23 10:05, Boris Brezillon wrote:
> > Hi Dmitry,
> > 
> > On Mon, 17 Jul 2023 09:52:54 +0300
> > Dmitry Osipenko <dmitry.osipenko@collabora.com> wrote:
> >   
> >> Panfrost IRQ handler may stuck for a long time, for example this happens
> >> when there is a bad HDMI connection and HDMI handler takes a long time to
> >> finish processing, holding Panfrost. Make Panfrost's job timeout handler
> >> to sync IRQ before checking fence signal status in order to prevent
> >> spurious job timeouts due to a slow IRQ processing.  
> > 
> > Feels like the problem should be fixed in the HDMI encoder driver
> > instead, so it doesn't stall the whole system when processing its
> > IRQs (use threaded irqs, maybe). I honestly don't think blocking in the
> > job timeout path to flush IRQs is a good strategy.  
> 
> The syncing is necessary to have for correctness regardless of whether
> it's HDMI problem or something else, there could be other reasons for
> CPU to delay IRQ processing. It's wrong to say that hw is hung, while
> it's not.

Well, hardware is effectively hung, if not indefinitely, at least
temporarily. All you do here is block in the timeout handler path
waiting for the GPU interrupt handlers to finish, handler that's
probably waiting in the queue, because the raw HDMI handler is blocking
it somehow. So, in the end, you might just be delaying the time of HWR a
bit more. I know it's not GPU's fault in that case, and the job could
have finished in time if the HDMI encoder hadn't stall the interrupt
handling pipeline, but I'm not sure we should care for that specific
situation. And more importantly, if it took more than 500ms to get a
frame rendered (or, in that case, to get the event that a frame is
rendered), you already lost, so I'm not sure correctness matters:
rendering didn't make it in time, and the watchdog kicked in to try and
unblock the situation. Feels like we're just papering over an HDMI
encoder driver bug here, really.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v1] drm/panfrost: Sync IRQ by job's timeout handler
  2023-07-17  7:49     ` Boris Brezillon
@ 2023-07-17  8:06       ` Steven Price
  2023-07-17  8:49         ` Boris Brezillon
  0 siblings, 1 reply; 8+ messages in thread
From: Steven Price @ 2023-07-17  8:06 UTC (permalink / raw)
  To: Boris Brezillon, Dmitry Osipenko; +Cc: kernel, linux-kernel, dri-devel

On 17/07/2023 08:49, Boris Brezillon wrote:
> On Mon, 17 Jul 2023 10:20:02 +0300
> Dmitry Osipenko <dmitry.osipenko@collabora.com> wrote:
> 
>> Hi,
>>
>> On 7/17/23 10:05, Boris Brezillon wrote:
>>> Hi Dmitry,
>>>
>>> On Mon, 17 Jul 2023 09:52:54 +0300
>>> Dmitry Osipenko <dmitry.osipenko@collabora.com> wrote:
>>>   
>>>> Panfrost IRQ handler may stuck for a long time, for example this happens
>>>> when there is a bad HDMI connection and HDMI handler takes a long time to
>>>> finish processing, holding Panfrost. Make Panfrost's job timeout handler
>>>> to sync IRQ before checking fence signal status in order to prevent
>>>> spurious job timeouts due to a slow IRQ processing.  
>>>
>>> Feels like the problem should be fixed in the HDMI encoder driver
>>> instead, so it doesn't stall the whole system when processing its
>>> IRQs (use threaded irqs, maybe). I honestly don't think blocking in the
>>> job timeout path to flush IRQs is a good strategy.  
>>
>> The syncing is necessary to have for correctness regardless of whether
>> it's HDMI problem or something else, there could be other reasons for
>> CPU to delay IRQ processing. It's wrong to say that hw is hung, while
>> it's not.
> 
> Well, hardware is effectively hung, if not indefinitely, at least
> temporarily. All you do here is block in the timeout handler path
> waiting for the GPU interrupt handlers to finish, handler that's
> probably waiting in the queue, because the raw HDMI handler is blocking
> it somehow. So, in the end, you might just be delaying the time of HWR a
> bit more. I know it's not GPU's fault in that case, and the job could
> have finished in time if the HDMI encoder hadn't stall the interrupt
> handling pipeline, but I'm not sure we should care for that specific
> situation. And more importantly, if it took more than 500ms to get a
> frame rendered (or, in that case, to get the event that a frame is
> rendered), you already lost, so I'm not sure correctness matters:
> rendering didn't make it in time, and the watchdog kicked in to try and
> unblock the situation. Feels like we're just papering over an HDMI
> encoder driver bug here, really.

TLDR; I don't see any major downsides and it stops the GPU getting the 
blame for something that isn't its fault.

I guess the question is whether panfrost should work on a system which 
has terrible IRQ latency. At the moment we have a synchronize_irq() call 
in panfrost_reset() which effectively does the same thing, but with all 
the overhead/spew of resetting the GPU.

Of course in the case Dmitry is actually talking about - it does seem 
like the HDMI encoder has a bug which needs fixing. There are plenty of 
other things that will break if IRQ latency gets that bad.

I do wonder if it makes sense to only synchronize when it's needed, 
e.g.:

----8<---
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index dbc597ab46fb..d96266b74e5c 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -720,6 +720,12 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
 	if (dma_fence_is_signaled(job->done_fence))
 		return DRM_GPU_SCHED_STAT_NOMINAL;
 
+	/* Synchronize with the IRQ handler in case the IRQ latency is bad */
+	synchronize_irq(pfdev->js->irq);
+	/* Recheck if the job is now complete */
+	if (dma_fence_is_signaled(job->done_fence))
+		return DRM_GPU_SCHED_STAT_NOMINAL;
+
 	dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
 		js,
 		job_read(pfdev, JS_CONFIG(js)),
----8<---

I don't have any data as to how often we hit the case where the DRM 
scheduler calls the timeout but we've already signalled - so the extra 
check might be overkill.

Steve

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v1] drm/panfrost: Sync IRQ by job's timeout handler
  2023-07-17  8:06       ` Steven Price
@ 2023-07-17  8:49         ` Boris Brezillon
  2023-07-17  8:59           ` Steven Price
  0 siblings, 1 reply; 8+ messages in thread
From: Boris Brezillon @ 2023-07-17  8:49 UTC (permalink / raw)
  To: Steven Price; +Cc: kernel, dri-devel, Dmitry Osipenko, linux-kernel

On Mon, 17 Jul 2023 09:06:56 +0100
Steven Price <steven.price@arm.com> wrote:

> On 17/07/2023 08:49, Boris Brezillon wrote:
> > On Mon, 17 Jul 2023 10:20:02 +0300
> > Dmitry Osipenko <dmitry.osipenko@collabora.com> wrote:
> >   
> >> Hi,
> >>
> >> On 7/17/23 10:05, Boris Brezillon wrote:  
> >>> Hi Dmitry,
> >>>
> >>> On Mon, 17 Jul 2023 09:52:54 +0300
> >>> Dmitry Osipenko <dmitry.osipenko@collabora.com> wrote:
> >>>     
> >>>> Panfrost IRQ handler may stuck for a long time, for example this happens
> >>>> when there is a bad HDMI connection and HDMI handler takes a long time to
> >>>> finish processing, holding Panfrost. Make Panfrost's job timeout handler
> >>>> to sync IRQ before checking fence signal status in order to prevent
> >>>> spurious job timeouts due to a slow IRQ processing.    
> >>>
> >>> Feels like the problem should be fixed in the HDMI encoder driver
> >>> instead, so it doesn't stall the whole system when processing its
> >>> IRQs (use threaded irqs, maybe). I honestly don't think blocking in the
> >>> job timeout path to flush IRQs is a good strategy.    
> >>
> >> The syncing is necessary to have for correctness regardless of whether
> >> it's HDMI problem or something else, there could be other reasons for
> >> CPU to delay IRQ processing. It's wrong to say that hw is hung, while
> >> it's not.  
> > 
> > Well, hardware is effectively hung, if not indefinitely, at least
> > temporarily. All you do here is block in the timeout handler path
> > waiting for the GPU interrupt handlers to finish, handler that's
> > probably waiting in the queue, because the raw HDMI handler is blocking
> > it somehow. So, in the end, you might just be delaying the time of HWR a
> > bit more. I know it's not GPU's fault in that case, and the job could
> > have finished in time if the HDMI encoder hadn't stall the interrupt
> > handling pipeline, but I'm not sure we should care for that specific
> > situation. And more importantly, if it took more than 500ms to get a
> > frame rendered (or, in that case, to get the event that a frame is
> > rendered), you already lost, so I'm not sure correctness matters:
> > rendering didn't make it in time, and the watchdog kicked in to try and
> > unblock the situation. Feels like we're just papering over an HDMI
> > encoder driver bug here, really.  
> 
> TLDR; I don't see any major downsides and it stops the GPU getting the 
> blame for something that isn't its fault.

True, but doing that will also give the impression that things run fine,
but very slowly, which would put the blame on the userspace driver :P.

> 
> I guess the question is whether panfrost should work on a system which 
> has terrible IRQ latency. At the moment we have a synchronize_irq() call 
> in panfrost_reset() which effectively does the same thing, but with all 
> the overhead/spew of resetting the GPU.

Unless I'm mistaken, the synchronize_irq() in panfrost_reset() is
mostly here to make sure there's no race between the interrupt
handler and the reset logic (we mask interrupts, and then synchronize,
guaranteeing that the interrupt handler won't be running after that
point), and it happens after we've printed the error message, so the
user knows something was blocked at least.

> 
> Of course in the case Dmitry is actually talking about - it does seem 
> like the HDMI encoder has a bug which needs fixing. There are plenty of 
> other things that will break if IRQ latency gets that bad.

Yes, that's my point. The GPU driver is the only one to complain right
now, but the HDMI encoder behavior could be impacting other parts of
the system. Silently ignoring those weirdnesses sounds like a terrible
idea.

> 
> I do wonder if it makes sense to only synchronize when it's needed, 
> e.g.:
> 
> ----8<---
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index dbc597ab46fb..d96266b74e5c 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -720,6 +720,12 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>  	if (dma_fence_is_signaled(job->done_fence))
>  		return DRM_GPU_SCHED_STAT_NOMINAL;
>  
> +	/* Synchronize with the IRQ handler in case the IRQ latency is bad */
> +	synchronize_irq(pfdev->js->irq);
> +	/* Recheck if the job is now complete */
> +	if (dma_fence_is_signaled(job->done_fence))
> +		return DRM_GPU_SCHED_STAT_NOMINAL;
> +
>  	dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
>  		js,
>  		job_read(pfdev, JS_CONFIG(js)),
> ----8<---
> 
> I don't have any data as to how often we hit the case where the DRM 
> scheduler calls the timeout but we've already signalled - so the extra 
> check might be overkill.

Right, it's not so much about the overhead of the synchronize_irq()
call (even though my first reply complained about that :-)), but more
about silently ignoring system misbehaviors. So I guess I'd be fine with
a version printing a dev_warn("Unexpectedly high interrupt latency")
when synchronize_irq() unblocks the situation, which means you'd still
have to do it in two steps.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v1] drm/panfrost: Sync IRQ by job's timeout handler
  2023-07-17  8:49         ` Boris Brezillon
@ 2023-07-17  8:59           ` Steven Price
  2023-07-17 13:30             ` Dmitry Osipenko
  0 siblings, 1 reply; 8+ messages in thread
From: Steven Price @ 2023-07-17  8:59 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: kernel, dri-devel, Dmitry Osipenko, linux-kernel

On 17/07/2023 09:49, Boris Brezillon wrote:
> On Mon, 17 Jul 2023 09:06:56 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 17/07/2023 08:49, Boris Brezillon wrote:
>>> On Mon, 17 Jul 2023 10:20:02 +0300
>>> Dmitry Osipenko <dmitry.osipenko@collabora.com> wrote:
>>>   
>>>> Hi,
>>>>
>>>> On 7/17/23 10:05, Boris Brezillon wrote:  
>>>>> Hi Dmitry,
>>>>>
>>>>> On Mon, 17 Jul 2023 09:52:54 +0300
>>>>> Dmitry Osipenko <dmitry.osipenko@collabora.com> wrote:
>>>>>     
>>>>>> Panfrost IRQ handler may stuck for a long time, for example this happens
>>>>>> when there is a bad HDMI connection and HDMI handler takes a long time to
>>>>>> finish processing, holding Panfrost. Make Panfrost's job timeout handler
>>>>>> to sync IRQ before checking fence signal status in order to prevent
>>>>>> spurious job timeouts due to a slow IRQ processing.    
>>>>>
>>>>> Feels like the problem should be fixed in the HDMI encoder driver
>>>>> instead, so it doesn't stall the whole system when processing its
>>>>> IRQs (use threaded irqs, maybe). I honestly don't think blocking in the
>>>>> job timeout path to flush IRQs is a good strategy.    
>>>>
>>>> The syncing is necessary to have for correctness regardless of whether
>>>> it's HDMI problem or something else, there could be other reasons for
>>>> CPU to delay IRQ processing. It's wrong to say that hw is hung, while
>>>> it's not.  
>>>
>>> Well, hardware is effectively hung, if not indefinitely, at least
>>> temporarily. All you do here is block in the timeout handler path
>>> waiting for the GPU interrupt handlers to finish, handler that's
>>> probably waiting in the queue, because the raw HDMI handler is blocking
>>> it somehow. So, in the end, you might just be delaying the time of HWR a
>>> bit more. I know it's not GPU's fault in that case, and the job could
>>> have finished in time if the HDMI encoder hadn't stall the interrupt
>>> handling pipeline, but I'm not sure we should care for that specific
>>> situation. And more importantly, if it took more than 500ms to get a
>>> frame rendered (or, in that case, to get the event that a frame is
>>> rendered), you already lost, so I'm not sure correctness matters:
>>> rendering didn't make it in time, and the watchdog kicked in to try and
>>> unblock the situation. Feels like we're just papering over an HDMI
>>> encoder driver bug here, really.  
>>
>> TLDR; I don't see any major downsides and it stops the GPU getting the 
>> blame for something that isn't its fault.
> 
> True, but doing that will also give the impression that things run fine,
> but very slowly, which would put the blame on the userspace driver :P.

Maybe I'm tainted by years of the kernel driver getting the blame
because it was the one that printed the message ;p

>>
>> I guess the question is whether panfrost should work on a system which 
>> has terrible IRQ latency. At the moment we have a synchronize_irq() call 
>> in panfrost_reset() which effectively does the same thing, but with all 
>> the overhead/spew of resetting the GPU.
> 
> Unless I'm mistaken, the synchronize_irq() in panfrost_reset() is
> mostly here to make sure there's no race between the interrupt
> handler and the reset logic (we mask interrupts, and then synchronize,
> guaranteeing that the interrupt handler won't be running after that
> point), and it happens after we've printed the error message, so the
> user knows something was blocked at least.

Yes the synchronize_irq() in panfrost_reset() is there to avoid a real
race - but it has the side effect of flushing out the IRQ and therefore
the job gets completed successfully. And in the high IRQ latency
situation makes the actual reset redundant.

>>
>> Of course in the case Dmitry is actually talking about - it does seem 
>> like the HDMI encoder has a bug which needs fixing. There are plenty of 
>> other things that will break if IRQ latency gets that bad.
> 
> Yes, that's my point. The GPU driver is the only one to complain right
> now, but the HDMI encoder behavior could be impacting other parts of
> the system. Silently ignoring those weirdnesses sounds like a terrible
> idea.

Agreed - but making it look like a GPU driver bug isn't good either.

>>
>> I do wonder if it makes sense to only synchronize when it's needed, 
>> e.g.:
>>
>> ----8<---
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
>> index dbc597ab46fb..d96266b74e5c 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
>> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
>> @@ -720,6 +720,12 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>>  	if (dma_fence_is_signaled(job->done_fence))
>>  		return DRM_GPU_SCHED_STAT_NOMINAL;
>>  
>> +	/* Synchronize with the IRQ handler in case the IRQ latency is bad */
>> +	synchronize_irq(pfdev->js->irq);
>> +	/* Recheck if the job is now complete */
>> +	if (dma_fence_is_signaled(job->done_fence))
>> +		return DRM_GPU_SCHED_STAT_NOMINAL;
>> +
>>  	dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
>>  		js,
>>  		job_read(pfdev, JS_CONFIG(js)),
>> ----8<---
>>
>> I don't have any data as to how often we hit the case where the DRM 
>> scheduler calls the timeout but we've already signalled - so the extra 
>> check might be overkill.
> 
> Right, it's not so much about the overhead of the synchronize_irq()
> call (even though my first reply complained about that :-)), but more
> about silently ignoring system misbehaviors. So I guess I'd be fine with
> a version printing a dev_warn("Unexpectedly high interrupt latency")
> when synchronize_irq() unblocks the situation, which means you'd still
> have to do it in two steps.

I like this idea - it still warns so it's obvious there's something
wrong with the system, and it makes it clear it's not the GPU's fault.

Steve



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v1] drm/panfrost: Sync IRQ by job's timeout handler
  2023-07-17  8:59           ` Steven Price
@ 2023-07-17 13:30             ` Dmitry Osipenko
  0 siblings, 0 replies; 8+ messages in thread
From: Dmitry Osipenko @ 2023-07-17 13:30 UTC (permalink / raw)
  To: Steven Price, Boris Brezillon; +Cc: kernel, linux-kernel, dri-devel

17.07.2023 11:59, Steven Price пишет:
> On 17/07/2023 09:49, Boris Brezillon wrote:
>> On Mon, 17 Jul 2023 09:06:56 +0100
>> Steven Price <steven.price@arm.com> wrote:
>>
>>> On 17/07/2023 08:49, Boris Brezillon wrote:
>>>> On Mon, 17 Jul 2023 10:20:02 +0300
>>>> Dmitry Osipenko <dmitry.osipenko@collabora.com> wrote:
>>>>   
>>>>> Hi,
>>>>>
>>>>> On 7/17/23 10:05, Boris Brezillon wrote:  
>>>>>> Hi Dmitry,
>>>>>>
>>>>>> On Mon, 17 Jul 2023 09:52:54 +0300
>>>>>> Dmitry Osipenko <dmitry.osipenko@collabora.com> wrote:
>>>>>>     
>>>>>>> Panfrost IRQ handler may stuck for a long time, for example this happens
>>>>>>> when there is a bad HDMI connection and HDMI handler takes a long time to
>>>>>>> finish processing, holding Panfrost. Make Panfrost's job timeout handler
>>>>>>> to sync IRQ before checking fence signal status in order to prevent
>>>>>>> spurious job timeouts due to a slow IRQ processing.    
>>>>>>
>>>>>> Feels like the problem should be fixed in the HDMI encoder driver
>>>>>> instead, so it doesn't stall the whole system when processing its
>>>>>> IRQs (use threaded irqs, maybe). I honestly don't think blocking in the
>>>>>> job timeout path to flush IRQs is a good strategy.    
>>>>>
>>>>> The syncing is necessary to have for correctness regardless of whether
>>>>> it's HDMI problem or something else, there could be other reasons for
>>>>> CPU to delay IRQ processing. It's wrong to say that hw is hung, while
>>>>> it's not.  
>>>>
>>>> Well, hardware is effectively hung, if not indefinitely, at least
>>>> temporarily. All you do here is block in the timeout handler path
>>>> waiting for the GPU interrupt handlers to finish, handler that's
>>>> probably waiting in the queue, because the raw HDMI handler is blocking
>>>> it somehow. So, in the end, you might just be delaying the time of HWR a
>>>> bit more. I know it's not GPU's fault in that case, and the job could
>>>> have finished in time if the HDMI encoder hadn't stall the interrupt
>>>> handling pipeline, but I'm not sure we should care for that specific
>>>> situation. And more importantly, if it took more than 500ms to get a
>>>> frame rendered (or, in that case, to get the event that a frame is
>>>> rendered), you already lost, so I'm not sure correctness matters:
>>>> rendering didn't make it in time, and the watchdog kicked in to try and
>>>> unblock the situation. Feels like we're just papering over an HDMI
>>>> encoder driver bug here, really.  
>>>
>>> TLDR; I don't see any major downsides and it stops the GPU getting the 
>>> blame for something that isn't its fault.
>>
>> True, but doing that will also give the impression that things run fine,
>> but very slowly, which would put the blame on the userspace driver :P.
> 
> Maybe I'm tainted by years of the kernel driver getting the blame
> because it was the one that printed the message ;p
> 
>>>
>>> I guess the question is whether panfrost should work on a system which 
>>> has terrible IRQ latency. At the moment we have a synchronize_irq() call 
>>> in panfrost_reset() which effectively does the same thing, but with all 
>>> the overhead/spew of resetting the GPU.
>>
>> Unless I'm mistaken, the synchronize_irq() in panfrost_reset() is
>> mostly here to make sure there's no race between the interrupt
>> handler and the reset logic (we mask interrupts, and then synchronize,
>> guaranteeing that the interrupt handler won't be running after that
>> point), and it happens after we've printed the error message, so the
>> user knows something was blocked at least.
> 
> Yes the synchronize_irq() in panfrost_reset() is there to avoid a real
> race - but it has the side effect of flushing out the IRQ and therefore
> the job gets completed successfully. And in the high IRQ latency
> situation makes the actual reset redundant.
> 
>>>
>>> Of course in the case Dmitry is actually talking about - it does seem 
>>> like the HDMI encoder has a bug which needs fixing. There are plenty of 
>>> other things that will break if IRQ latency gets that bad.
>>
>> Yes, that's my point. The GPU driver is the only one to complain right
>> now, but the HDMI encoder behavior could be impacting other parts of
>> the system. Silently ignoring those weirdnesses sounds like a terrible
>> idea.
> 
> Agreed - but making it look like a GPU driver bug isn't good either.
> 
>>>
>>> I do wonder if it makes sense to only synchronize when it's needed, 
>>> e.g.:
>>>
>>> ----8<---
>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
>>> index dbc597ab46fb..d96266b74e5c 100644
>>> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
>>> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
>>> @@ -720,6 +720,12 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>>>  	if (dma_fence_is_signaled(job->done_fence))
>>>  		return DRM_GPU_SCHED_STAT_NOMINAL;
>>>  
>>> +	/* Synchronize with the IRQ handler in case the IRQ latency is bad */
>>> +	synchronize_irq(pfdev->js->irq);
>>> +	/* Recheck if the job is now complete */
>>> +	if (dma_fence_is_signaled(job->done_fence))
>>> +		return DRM_GPU_SCHED_STAT_NOMINAL;
>>> +
>>>  	dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
>>>  		js,
>>>  		job_read(pfdev, JS_CONFIG(js)),
>>> ----8<---
>>>
>>> I don't have any data as to how often we hit the case where the DRM 
>>> scheduler calls the timeout but we've already signalled - so the extra 
>>> check might be overkill.
>>
>> Right, it's not so much about the overhead of the synchronize_irq()
>> call (even though my first reply complained about that :-)), but more
>> about silently ignoring system misbehaviors. So I guess I'd be fine with
>> a version printing a dev_warn("Unexpectedly high interrupt latency")
>> when synchronize_irq() unblocks the situation, which means you'd still
>> have to do it in two steps.
> 
> I like this idea - it still warns so it's obvious there's something
> wrong with the system, and it makes it clear it's not the GPU's fault.

Like that idea too, thanks for the suggestions! Will prepare v2

-- 
Best regards,
Dmitry


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-07-17 13:31 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-17  6:52 [PATCH v1] drm/panfrost: Sync IRQ by job's timeout handler Dmitry Osipenko
2023-07-17  7:05 ` Boris Brezillon
2023-07-17  7:20   ` Dmitry Osipenko
2023-07-17  7:49     ` Boris Brezillon
2023-07-17  8:06       ` Steven Price
2023-07-17  8:49         ` Boris Brezillon
2023-07-17  8:59           ` Steven Price
2023-07-17 13:30             ` Dmitry Osipenko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).