* [PATCH] drm/panfrost: fix runtime pm imbalance on error
@ 2020-05-20 11:05 ` Dinghao Liu
0 siblings, 0 replies; 10+ messages in thread
From: Dinghao Liu @ 2020-05-20 11:05 UTC (permalink / raw)
To: dinghao.liu, kjlu
Cc: Rob Herring, Tomeu Vizoso, Steven Price, Alyssa Rosenzweig,
David Airlie, Daniel Vetter, dri-devel, linux-kernel
pm_runtime_get_sync() increments the runtime PM usage counter even
when the call returns an error code. Thus a matching decrement is needed
on the error handling path to keep the counter balanced.
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
---
drivers/gpu/drm/panfrost/panfrost_job.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 7914b1570841..5719e356c969 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -146,8 +146,10 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
 	int ret;
 
 	ret = pm_runtime_get_sync(pfdev->dev);
-	if (ret < 0)
+	if (ret < 0) {
+		pm_runtime_put_sync_autosuspend(pfdev->dev);
 		return;
+	}
 
 	if (WARN_ON(job_read(pfdev, JS_COMMAND_NEXT(js)))) {
 		pm_runtime_put_sync_autosuspend(pfdev->dev);
--
2.17.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] drm/panfrost: fix runtime pm imbalance on error
2020-05-20 11:05 ` Dinghao Liu
@ 2020-05-20 14:02 ` Steven Price
-1 siblings, 0 replies; 10+ messages in thread
From: Steven Price @ 2020-05-20 14:02 UTC (permalink / raw)
To: Dinghao Liu, kjlu
Cc: Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig, David Airlie,
Daniel Vetter, dri-devel, linux-kernel
On 20/05/2020 12:05, Dinghao Liu wrote:
> pm_runtime_get_sync() increments the runtime PM usage counter even
> the call returns an error code. Thus a pairing decrement is needed
> on the error handling path to keep the counter balanced.
>
> Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Actually I think we have the opposite problem. To be honest we don't
handle this situation very well. By the time panfrost_job_hw_submit() is
called the job has already been added to the pfdev->jobs array, so it's
considered submitted even if it never actually lands on the hardware. So
in the case of this function bailing out early we will then (eventually)
hit a timeout and trigger a GPU reset.
panfrost_job_timedout() iterates through the pfdev->jobs array and calls
pm_runtime_put_noidle() for each job it finds. So there's no imbalance
here that I can see.
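As a rough illustration of the accounting described above, here is a small
user-space model. It is not driver code: struct fake_dev, fake_get_sync(),
fake_put_noidle() and submit_then_timeout() are hypothetical stand-ins. The
model assumes, as stated in this thread, that the get increments the usage
counter even when it fails and that the timeout handler drops one reference
for each tracked job.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a device's runtime PM usage counter. */
struct fake_dev {
	int usage_count;
	bool resume_fails;	/* force the "get" to report an error */
};

/* Models pm_runtime_get_sync(): the counter is bumped even on failure. */
static int fake_get_sync(struct fake_dev *dev)
{
	dev->usage_count++;
	return dev->resume_fails ? -1 : 0;
}

/* Models pm_runtime_put_noidle(): drop one reference. */
static void fake_put_noidle(struct fake_dev *dev)
{
	dev->usage_count--;
}

/*
 * Submit one job and then let it time out.  If put_on_error is true the
 * submit path drops the reference itself when the get fails (what the
 * posted patch does); the timeout handler always drops one reference for
 * the tracked job.  Returns the final usage count.
 */
int submit_then_timeout(bool resume_fails, bool put_on_error)
{
	struct fake_dev dev = { .usage_count = 0, .resume_fails = resume_fails };

	/* The job is already tracked before the submit function runs. */
	if (fake_get_sync(&dev) < 0 && put_on_error)
		fake_put_noidle(&dev);

	/* Timeout path: one put for the tracked job. */
	fake_put_noidle(&dev);

	return dev.usage_count;
}
```

In this model submit_then_timeout(true, false) ends at 0, while
submit_then_timeout(true, true) ends at -1, which is the "opposite problem":
the extra put on the error path is what unbalances the counter.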
Have you actually observed the situation where pm_runtime_get_sync()
returns a failure?
HOWEVER, it appears that by bailing out early the call to
panfrost_devfreq_record_busy() is never made, which as far as I can see
means that there may be an extra call to panfrost_devfreq_record_idle()
when the jobs have timed out, which could underflow the counter.
But equally looking at panfrost_job_timedout(), we only call
panfrost_devfreq_record_idle() *once* even though multiple jobs might be
processed.
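The two mismatches described above can be sketched with a plain counter.
This is only a model under stated assumptions: struct fake_gpu,
model_submit(), model_timeout() and MODEL_SLOTS are hypothetical names, and
the busy counter is treated as a simple int (the real counter may behave
differently, for example by refusing to go negative).

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_SLOTS 2	/* hypothetical; stands in for the job slot count */

struct fake_gpu {
	bool tracked[MODEL_SLOTS];	/* models the per-slot jobs array */
	int busy_count;			/* models the devfreq busy counter */
};

/*
 * Model a submit.  The job is tracked before the submit function runs.
 * With busy_first the busy record happens before any early return (the
 * proposed ordering); otherwise it is skipped when the submit bails out.
 */
void model_submit(struct fake_gpu *gpu, int slot, bool bails_out,
		  bool busy_first)
{
	gpu->tracked[slot] = true;
	if (busy_first || !bails_out)
		gpu->busy_count++;
}

/*
 * Model the timeout handler.  With idle_per_job one idle record is made
 * for each tracked job (the proposed change); otherwise a single idle
 * record is made no matter how many jobs were tracked.
 */
int model_timeout(struct fake_gpu *gpu, bool idle_per_job)
{
	for (int i = 0; i < MODEL_SLOTS; i++) {
		if (!gpu->tracked[i])
			continue;
		if (idle_per_job)
			gpu->busy_count--;
		gpu->tracked[i] = false;
	}
	if (!idle_per_job)
		gpu->busy_count--;

	return gpu->busy_count;
}
```

With the existing ordering, one job that bails out early leaves the model at
-1 after the timeout, and two successfully submitted jobs leave it at 1. With
busy recorded first and idle recorded per job, both cases return to 0.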
There's a completely untested patch below which in theory should fix that...
Steve
----8<---
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 7914b1570841..f9519afca29d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -145,6 +145,8 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
 	u64 jc_head = job->jc;
 	int ret;
 
+	panfrost_devfreq_record_busy(pfdev);
+
 	ret = pm_runtime_get_sync(pfdev->dev);
 	if (ret < 0)
 		return;
@@ -155,7 +157,6 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
 	}
 
 	cfg = panfrost_mmu_as_get(pfdev, &job->file_priv->mmu);
-	panfrost_devfreq_record_busy(pfdev);
 
 	job_write(pfdev, JS_HEAD_NEXT_LO(js), jc_head & 0xFFFFFFFF);
 	job_write(pfdev, JS_HEAD_NEXT_HI(js), jc_head >> 32);
@@ -410,12 +411,12 @@ static void panfrost_job_timedout(struct drm_sched_job *sched_job)
 	for (i = 0; i < NUM_JOB_SLOTS; i++) {
 		if (pfdev->jobs[i]) {
 			pm_runtime_put_noidle(pfdev->dev);
+			panfrost_devfreq_record_idle(pfdev);
 			pfdev->jobs[i] = NULL;
 		}
 	}
 	spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
 
-	panfrost_devfreq_record_idle(pfdev);
 	panfrost_device_reset(pfdev);
 
 	for (i = 0; i < NUM_JOB_SLOTS; i++)
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: Re: [PATCH] drm/panfrost: fix runtime pm imbalance on error
2020-05-20 14:02 ` Steven Price
@ 2020-05-21 7:00 ` dinghao.liu
-1 siblings, 0 replies; 10+ messages in thread
From: dinghao.liu @ 2020-05-21 7:00 UTC (permalink / raw)
To: Steven Price
Cc: kjlu, Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig, David Airlie,
Daniel Vetter, dri-devel, linux-kernel
Hi Steve,
There are two bail-out points in panfrost_job_hw_submit(): one is
the error path after pm_runtime_get_sync() fails, the other is
the error path after the WARN_ON() in the if statement. The PM
imbalance fixed in this patch is between these two paths. I think the
caller of panfrost_job_hw_submit() cannot detect this imbalance
from outside the function.
panfrost_job_timedout() calls pm_runtime_put_noidle() for every job it
finds, but all jobs are added to pfdev->jobs just before calling
panfrost_job_hw_submit(). Therefore I think the imbalance still exists.
But I'm not sure whether we should add a pm_runtime_put on the error path
after pm_runtime_get_sync(), or remove the pm_runtime_put on the error path
after WARN_ON().
As for the panfrost_devfreq_record_busy() problem, this may be a
new bug that needs an independent patch.
Regards,
Dinghao
> On 20/05/2020 12:05, Dinghao Liu wrote:
> > pm_runtime_get_sync() increments the runtime PM usage counter even
> > the call returns an error code. Thus a pairing decrement is needed
> > on the error handling path to keep the counter balanced.
> >
> > Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
>
> Actually I think we have the opposite problem. To be honest we don't
> handle this situation very well. By the time panfrost_job_hw_submit() is
> called the job has already been added to the pfdev->jobs array, so it's
> considered submitted even if it never actually lands on the hardware. So
> in the case of this function bailing out early we will then (eventually)
> hit a timeout and trigger a GPU reset.
>
> panfrost_job_timedout() iterates through the pfdev->jobs array and calls
> pm_runtime_put_noidle() for each job it finds. So there's no inbalance
> here that I can see.
>
> Have you actually observed the situation where pm_runtime_get_sync()
> returns a failure?
>
> HOWEVER, it appears that by bailing out early the call to
> panfrost_devfreq_record_busy() is never made, which as far as I can see
> means that there may be an extra call to panfrost_devfreq_record_idle()
> when the jobs have timed out. Which could underflow the counter.
>
> But equally looking at panfrost_job_timedout(), we only call
> panfrost_devfreq_record_idle() *once* even though multiple jobs might be
> processed.
>
> There's a completely untested patch below which in theory should fix that...
>
> Steve
>
> ----8<---
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c
> b/drivers/gpu/drm/panfrost/panfrost_job.c
> index 7914b1570841..f9519afca29d 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -145,6 +145,8 @@ static void panfrost_job_hw_submit(struct
> panfrost_job *job, int js)
> u64 jc_head = job->jc;
> int ret;
>
> + panfrost_devfreq_record_busy(pfdev);
> +
> ret = pm_runtime_get_sync(pfdev->dev);
> if (ret < 0)
> return;
> @@ -155,7 +157,6 @@ static void panfrost_job_hw_submit(struct
> panfrost_job *job, int js)
> }
>
> cfg = panfrost_mmu_as_get(pfdev, &job->file_priv->mmu);
> - panfrost_devfreq_record_busy(pfdev);
>
> job_write(pfdev, JS_HEAD_NEXT_LO(js), jc_head & 0xFFFFFFFF);
> job_write(pfdev, JS_HEAD_NEXT_HI(js), jc_head >> 32);
> @@ -410,12 +411,12 @@ static void panfrost_job_timedout(struct
> drm_sched_job *sched_job)
> for (i = 0; i < NUM_JOB_SLOTS; i++) {
> if (pfdev->jobs[i]) {
> pm_runtime_put_noidle(pfdev->dev);
> + panfrost_devfreq_record_idle(pfdev);
> pfdev->jobs[i] = NULL;
> }
> }
> spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
>
> - panfrost_devfreq_record_idle(pfdev);
> panfrost_device_reset(pfdev);
>
> for (i = 0; i < NUM_JOB_SLOTS; i++)
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] drm/panfrost: fix runtime pm imbalance on error
2020-05-21 7:00 ` dinghao.liu
@ 2020-05-22 13:09 ` Steven Price
-1 siblings, 0 replies; 10+ messages in thread
From: Steven Price @ 2020-05-22 13:09 UTC (permalink / raw)
To: dinghao.liu
Cc: kjlu, Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig, David Airlie,
Daniel Vetter, dri-devel, linux-kernel
On 21/05/2020 08:00, dinghao.liu@zju.edu.cn wrote:
> Hi Steve,
>
> There are two bailing out points in panfrost_job_hw_submit(): one is
> the error path beginning from pm_runtime_get_sync(), the other one is
> the error path beginning from WARN_ON() in the if statement. The pm
> imbalance fixed in this patch is between these two paths. I think the
> caller of panfrost_job_hw_submit() cannot distinguish this imbalance
> outside this function.
My point is the caller expects panfrost_job_hw_submit() to increase the
PM reference count. Since panfrost_job_hw_submit() cannot return an
error (it's void return) we cannot signal to the caller that the
reference hasn't been taken.
> panfrost_job_timedout() calls pm_runtime_put_noidle() for every job it
> finds, but all jobs are added to the pfdev->jobs just before calling
> panfrost_job_hw_submit(). Therefore I think the imbalance still exists.
My point's exactly that - the "jobs are added to pfdev->jobs just before
calling panfrost_job_hw_submit()". Since we don't have a way for
panfrost_job_hw_submit() to fail it must unconditionally take any
references that will then be freed later on.
> But I'm not very sure if we should add pm_runtime_put on the error path
> after pm_runtime_get_sync(), or remove pm_runtime_put one the error path
> after WARN_ON().
The pm_runtime_put after the WARN_ON() is a bug. Sorry this is probably
what confused you - clearly the WARN_ON() situation is never meant to
happen in the first place, so hopefully this isn't actually possible.
Feel free to send a patch removing it! ;)
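The rule stated above, that the submit function must leave exactly one PM
reference held on every exit path because it cannot report failure, can be
written down as a tiny model. This is a sketch only: enum submit_outcome and
refs_held_after_submit() are hypothetical names, and the counter is a local
int rather than the real runtime PM state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of the modelled submit function. */
enum submit_outcome {
	SUBMIT_OK,		/* job written to the hardware */
	SUBMIT_GET_FAILED,	/* the runtime PM get reported an error */
	SUBMIT_SLOT_BUSY,	/* the "should never happen" warning path */
};

/*
 * Returns how many PM references the modelled submit leaves held.
 * put_after_warn selects whether the warning path drops the reference
 * before returning, which is the behaviour identified as a bug above.
 */
int refs_held_after_submit(enum submit_outcome outcome, bool put_after_warn)
{
	int usage_count = 0;

	usage_count++;	/* the get increments even when it fails */
	if (outcome == SUBMIT_GET_FAILED)
		return usage_count;

	if (outcome == SUBMIT_SLOT_BUSY) {
		if (put_after_warn)
			usage_count--;
		return usage_count;
	}

	return usage_count;
}
```

With put_after_warn set to false every outcome leaves one reference for the
later cleanup to drop; with it set to true the warning path leaves none, so
the later put would take the count below zero.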
> As for the problem about panfrost_devfreq_record_busy(), this may be a
> new bug and requires independent patch to fix it.
Indeed, I'll post a proper patch for that later - I just spotted it
while looking at the code.
Thanks,
Steve
> Regards,
> Dinghao
>
>
>> On 20/05/2020 12:05, Dinghao Liu wrote:
>>> pm_runtime_get_sync() increments the runtime PM usage counter even
>>> the call returns an error code. Thus a pairing decrement is needed
>>> on the error handling path to keep the counter balanced.
>>>
>>> Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
>>
>> Actually I think we have the opposite problem. To be honest we don't
>> handle this situation very well. By the time panfrost_job_hw_submit() is
>> called the job has already been added to the pfdev->jobs array, so it's
>> considered submitted even if it never actually lands on the hardware. So
>> in the case of this function bailing out early we will then (eventually)
>> hit a timeout and trigger a GPU reset.
>>
>> panfrost_job_timedout() iterates through the pfdev->jobs array and calls
>> pm_runtime_put_noidle() for each job it finds. So there's no inbalance
>> here that I can see.
>>
>> Have you actually observed the situation where pm_runtime_get_sync()
>> returns a failure?
>>
>> HOWEVER, it appears that by bailing out early the call to
>> panfrost_devfreq_record_busy() is never made, which as far as I can see
>> means that there may be an extra call to panfrost_devfreq_record_idle()
>> when the jobs have timed out. Which could underflow the counter.
>>
>> But equally looking at panfrost_job_timedout(), we only call
>> panfrost_devfreq_record_idle() *once* even though multiple jobs might be
>> processed.
>>
>> There's a completely untested patch below which in theory should fix that...
>>
>> Steve
>>
>> ----8<---
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c
>> b/drivers/gpu/drm/panfrost/panfrost_job.c
>> index 7914b1570841..f9519afca29d 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
>> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
>> @@ -145,6 +145,8 @@ static void panfrost_job_hw_submit(struct
>> panfrost_job *job, int js)
>> u64 jc_head = job->jc;
>> int ret;
>>
>> + panfrost_devfreq_record_busy(pfdev);
>> +
>> ret = pm_runtime_get_sync(pfdev->dev);
>> if (ret < 0)
>> return;
>> @@ -155,7 +157,6 @@ static void panfrost_job_hw_submit(struct
>> panfrost_job *job, int js)
>> }
>>
>> cfg = panfrost_mmu_as_get(pfdev, &job->file_priv->mmu);
>> - panfrost_devfreq_record_busy(pfdev);
>>
>> job_write(pfdev, JS_HEAD_NEXT_LO(js), jc_head & 0xFFFFFFFF);
>> job_write(pfdev, JS_HEAD_NEXT_HI(js), jc_head >> 32);
>> @@ -410,12 +411,12 @@ static void panfrost_job_timedout(struct
>> drm_sched_job *sched_job)
>> for (i = 0; i < NUM_JOB_SLOTS; i++) {
>> if (pfdev->jobs[i]) {
>> pm_runtime_put_noidle(pfdev->dev);
>> + panfrost_devfreq_record_idle(pfdev);
>> pfdev->jobs[i] = NULL;
>> }
>> }
>> spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
>>
>> - panfrost_devfreq_record_idle(pfdev);
>> panfrost_device_reset(pfdev);
>>
>> for (i = 0; i < NUM_JOB_SLOTS; i++)
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Re: [PATCH] drm/panfrost: fix runtime pm imbalance on error
2020-05-22 13:09 ` Steven Price
@ 2020-05-22 13:23 ` dinghao.liu
-1 siblings, 0 replies; 10+ messages in thread
From: dinghao.liu @ 2020-05-22 13:23 UTC (permalink / raw)
To: Steven Price
Cc: kjlu, Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig, David Airlie,
Daniel Vetter, dri-devel, linux-kernel
Thank you for your further explanation! It's all clear to me now and I
will write a new patch to fix this imbalance.
Regards,
Dinghao
> On 21/05/2020 08:00, dinghao.liu@zju.edu.cn wrote:
> > Hi Steve,
> >
> > There are two bailing out points in panfrost_job_hw_submit(): one is
> > the error path beginning from pm_runtime_get_sync(), the other one is
> > the error path beginning from WARN_ON() in the if statement. The pm
> > imbalance fixed in this patch is between these two paths. I think the
> > caller of panfrost_job_hw_submit() cannot distinguish this imbalance
> > outside this function.
>
> My point is the caller expects panfrost_job_hw_submit() to increase the
> PM reference count. Since panfrost_job_hw_submit() cannot return an
> error (it's void return) we cannot signal to the caller that the
> reference hasn't been taken.
>
> > panfrost_job_timedout() calls pm_runtime_put_noidle() for every job it
> > finds, but all jobs are added to the pfdev->jobs just before calling
> > panfrost_job_hw_submit(). Therefore I think the imbalance still exists.
>
> My point's exactly that - the "jobs are added to pfdev->jobs just before
> calling panfrost_job_hw_submit()". Since we don't have a way for
> panfrost_job_hw_submit() to fail it must unconditionally take any
> references that will then be freed later on.
>
> > But I'm not very sure if we should add pm_runtime_put on the error path
> > after pm_runtime_get_sync(), or remove pm_runtime_put one the error path
> > after WARN_ON().
>
> The pm_runtime_put after the WARN_ON() is a bug. Sorry this is probably
> what confused you - clearly the WARN_ON() situation is never meant to
> happen in the first place, so hopefully this isn't actually possible.
>
> Feel free to send a patch removing it! ;)
>
> > As for the problem about panfrost_devfreq_record_busy(), this may be a
> > new bug and requires independent patch to fix it.
>
> Indeed, I'll post a proper patch for that later - I just spotted it
> while looking at the code.
>
> Thanks,
>
> Steve
>
> > Regards,
> > Dinghao
> >
> >
> >> On 20/05/2020 12:05, Dinghao Liu wrote:
> >>> pm_runtime_get_sync() increments the runtime PM usage counter even
> >>> the call returns an error code. Thus a pairing decrement is needed
> >>> on the error handling path to keep the counter balanced.
> >>>
> >>> Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
> >>
> >> Actually I think we have the opposite problem. To be honest we don't
> >> handle this situation very well. By the time panfrost_job_hw_submit() is
> >> called the job has already been added to the pfdev->jobs array, so it's
> >> considered submitted even if it never actually lands on the hardware. So
> >> in the case of this function bailing out early we will then (eventually)
> >> hit a timeout and trigger a GPU reset.
> >>
> >> panfrost_job_timedout() iterates through the pfdev->jobs array and calls
> >> pm_runtime_put_noidle() for each job it finds. So there's no inbalance
> >> here that I can see.
> >>
> >> Have you actually observed the situation where pm_runtime_get_sync()
> >> returns a failure?
> >>
> >> HOWEVER, it appears that by bailing out early the call to
> >> panfrost_devfreq_record_busy() is never made, which as far as I can see
> >> means that there may be an extra call to panfrost_devfreq_record_idle()
> >> when the jobs have timed out. Which could underflow the counter.
> >>
> >> But equally looking at panfrost_job_timedout(), we only call
> >> panfrost_devfreq_record_idle() *once* even though multiple jobs might be
> >> processed.
> >>
> >> There's a completely untested patch below which in theory should fix that...
> >>
> >> Steve
> >>
> >> ----8<---
> >> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c
> >> b/drivers/gpu/drm/panfrost/panfrost_job.c
> >> index 7914b1570841..f9519afca29d 100644
> >> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> >> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> >> @@ -145,6 +145,8 @@ static void panfrost_job_hw_submit(struct
> >> panfrost_job *job, int js)
> >> u64 jc_head = job->jc;
> >> int ret;
> >>
> >> + panfrost_devfreq_record_busy(pfdev);
> >> +
> >> ret = pm_runtime_get_sync(pfdev->dev);
> >> if (ret < 0)
> >> return;
> >> @@ -155,7 +157,6 @@ static void panfrost_job_hw_submit(struct
> >> panfrost_job *job, int js)
> >> }
> >>
> >> cfg = panfrost_mmu_as_get(pfdev, &job->file_priv->mmu);
> >> - panfrost_devfreq_record_busy(pfdev);
> >>
> >> job_write(pfdev, JS_HEAD_NEXT_LO(js), jc_head & 0xFFFFFFFF);
> >> job_write(pfdev, JS_HEAD_NEXT_HI(js), jc_head >> 32);
> >> @@ -410,12 +411,12 @@ static void panfrost_job_timedout(struct
> >> drm_sched_job *sched_job)
> >> for (i = 0; i < NUM_JOB_SLOTS; i++) {
> >> if (pfdev->jobs[i]) {
> >> pm_runtime_put_noidle(pfdev->dev);
> >> + panfrost_devfreq_record_idle(pfdev);
> >> pfdev->jobs[i] = NULL;
> >> }
> >> }
> >> spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
> >>
> >> - panfrost_devfreq_record_idle(pfdev);
> >> panfrost_device_reset(pfdev);
> >>
> >> for (i = 0; i < NUM_JOB_SLOTS; i++)
* Re: Re: [PATCH] drm/panfrost: fix runtime pm imbalance on error
@ 2020-05-22 13:23 ` dinghao.liu
From: dinghao.liu @ 2020-05-22 13:23 UTC (permalink / raw)
To: Steven Price
Cc: Tomeu Vizoso, David Airlie, kjlu, linux-kernel, dri-devel,
Alyssa Rosenzweig
Thank you for your further explanation! It's all clear to me now, and I
will write a new patch to fix this imbalance.
Regards,
Dinghao
> On 21/05/2020 08:00, dinghao.liu@zju.edu.cn wrote:
> > Hi Steve,
> >
> > There are two bailing out points in panfrost_job_hw_submit(): one is
> > the error path beginning from pm_runtime_get_sync(), the other one is
> > the error path beginning from WARN_ON() in the if statement. The pm
> > imbalance fixed in this patch is between these two paths. I think the
> > caller of panfrost_job_hw_submit() cannot distinguish this imbalance
> > outside this function.
>
> My point is the caller expects panfrost_job_hw_submit() to increase the
> PM reference count. Since panfrost_job_hw_submit() cannot return an
> error (it's void return) we cannot signal to the caller that the
> reference hasn't been taken.
>
> > panfrost_job_timedout() calls pm_runtime_put_noidle() for every job it
> > finds, but all jobs are added to the pfdev->jobs just before calling
> > panfrost_job_hw_submit(). Therefore I think the imbalance still exists.
>
> My point's exactly that - the "jobs are added to pfdev->jobs just before
> calling panfrost_job_hw_submit()". Since we don't have a way for
> panfrost_job_hw_submit() to fail it must unconditionally take any
> references that will then be freed later on.
>
> > But I'm not very sure if we should add pm_runtime_put on the error path
> > after pm_runtime_get_sync(), or remove pm_runtime_put on the error path
> > after WARN_ON().
>
> The pm_runtime_put after the WARN_ON() is a bug. Sorry this is probably
> what confused you - clearly the WARN_ON() situation is never meant to
> happen in the first place, so hopefully this isn't actually possible.
>
> Feel free to send a patch removing it! ;)
>
> > As for the problem about panfrost_devfreq_record_busy(), this may be a
> > new bug and requires an independent patch to fix it.
>
> Indeed, I'll post a proper patch for that later - I just spotted it
> while looking at the code.
>
> Thanks,
>
> Steve
>
> > Regards,
> > Dinghao
> >
> >
> >> On 20/05/2020 12:05, Dinghao Liu wrote:
> >>> pm_runtime_get_sync() increments the runtime PM usage counter even
> >>> if the call returns an error code. Thus a pairing decrement is needed
> >>> on the error handling path to keep the counter balanced.
> >>>
> >>> Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
> >>
> >> Actually I think we have the opposite problem. To be honest we don't
> >> handle this situation very well. By the time panfrost_job_hw_submit() is
> >> called the job has already been added to the pfdev->jobs array, so it's
> >> considered submitted even if it never actually lands on the hardware. So
> >> in the case of this function bailing out early we will then (eventually)
> >> hit a timeout and trigger a GPU reset.
> >>
> >> panfrost_job_timedout() iterates through the pfdev->jobs array and calls
> >> pm_runtime_put_noidle() for each job it finds. So there's no imbalance
> >> here that I can see.
> >>
> >> Have you actually observed the situation where pm_runtime_get_sync()
> >> returns a failure?
> >>
> >> HOWEVER, it appears that by bailing out early the call to
> >> panfrost_devfreq_record_busy() is never made, which as far as I can see
> >> means that there may be an extra call to panfrost_devfreq_record_idle()
> >> when the jobs have timed out. Which could underflow the counter.
> >>
> >> But equally looking at panfrost_job_timedout(), we only call
> >> panfrost_devfreq_record_idle() *once* even though multiple jobs might be
> >> processed.
> >>
> >> There's a completely untested patch below which in theory should fix that...
> >>
> >> Steve
> >>
> >> ----8<---
> >> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c
> >> b/drivers/gpu/drm/panfrost/panfrost_job.c
> >> index 7914b1570841..f9519afca29d 100644
> >> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> >> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> >> @@ -145,6 +145,8 @@ static void panfrost_job_hw_submit(struct
> >> panfrost_job *job, int js)
> >> u64 jc_head = job->jc;
> >> int ret;
> >>
> >> + panfrost_devfreq_record_busy(pfdev);
> >> +
> >> ret = pm_runtime_get_sync(pfdev->dev);
> >> if (ret < 0)
> >> return;
> >> @@ -155,7 +157,6 @@ static void panfrost_job_hw_submit(struct
> >> panfrost_job *job, int js)
> >> }
> >>
> >> cfg = panfrost_mmu_as_get(pfdev, &job->file_priv->mmu);
> >> - panfrost_devfreq_record_busy(pfdev);
> >>
> >> job_write(pfdev, JS_HEAD_NEXT_LO(js), jc_head & 0xFFFFFFFF);
> >> job_write(pfdev, JS_HEAD_NEXT_HI(js), jc_head >> 32);
> >> @@ -410,12 +411,12 @@ static void panfrost_job_timedout(struct
> >> drm_sched_job *sched_job)
> >> for (i = 0; i < NUM_JOB_SLOTS; i++) {
> >> if (pfdev->jobs[i]) {
> >> pm_runtime_put_noidle(pfdev->dev);
> >> + panfrost_devfreq_record_idle(pfdev);
> >> pfdev->jobs[i] = NULL;
> >> }
> >> }
> >> spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
> >>
> >> - panfrost_devfreq_record_idle(pfdev);
> >> panfrost_device_reset(pfdev);
> >>
> >> for (i = 0; i < NUM_JOB_SLOTS; i++)
Thread overview: 10+ messages
2020-05-20 11:05 [PATCH] drm/panfrost: fix runtime pm imbalance on error Dinghao Liu
2020-05-20 11:05 ` Dinghao Liu
2020-05-20 14:02 ` Steven Price
2020-05-20 14:02 ` Steven Price
2020-05-21 7:00 ` dinghao.liu
2020-05-21 7:00 ` dinghao.liu
2020-05-22 13:09 ` Steven Price
2020-05-22 13:09 ` Steven Price
2020-05-22 13:23 ` dinghao.liu
2020-05-22 13:23 ` dinghao.liu