* [PATCH] drm/panfrost: fix runtime pm imbalance on error
From: Dinghao Liu @ 2020-05-20 11:05 UTC
To: dinghao.liu, kjlu
Cc: Rob Herring, Tomeu Vizoso, Steven Price, Alyssa Rosenzweig,
    David Airlie, Daniel Vetter, dri-devel, linux-kernel

pm_runtime_get_sync() increments the runtime PM usage counter even
when the call returns an error code. Thus a pairing decrement is needed
on the error handling path to keep the counter balanced.

Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 7914b1570841..5719e356c969 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -146,8 +146,10 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
 	int ret;
 
 	ret = pm_runtime_get_sync(pfdev->dev);
-	if (ret < 0)
+	if (ret < 0) {
+		pm_runtime_put_sync_autosuspend(pfdev->dev);
 		return;
+	}
 
 	if (WARN_ON(job_read(pfdev, JS_COMMAND_NEXT(js)))) {
 		pm_runtime_put_sync_autosuspend(pfdev->dev);
-- 
2.17.1
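
The counter behaviour described in the commit message can be sketched in
userspace C. This is only a model, not kernel code: struct mock_dev,
mock_get_sync(), mock_put_noidle() and leaked_refs_after_failed_get() are
hypothetical names invented for illustration, and the only property taken
from the patch is that the usage counter is incremented even when the call
returns an error.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a device's runtime PM state. */
struct mock_dev {
	int usage_count;	/* models the runtime PM usage counter */
	int resume_fails;	/* non-zero: simulate a failing resume */
};

/*
 * Models the behaviour the patch describes for pm_runtime_get_sync():
 * the counter is bumped first and stays bumped even when the resume
 * step reports an error.
 */
static int mock_get_sync(struct mock_dev *dev)
{
	dev->usage_count++;
	if (dev->resume_fails)
		return -EIO;
	return 0;
}

/* Models a put that only drops the reference. */
static void mock_put_noidle(struct mock_dev *dev)
{
	assert(dev->usage_count > 0);
	dev->usage_count--;
}

/* Returns the counter left behind when a failed get has no matching put. */
static int leaked_refs_after_failed_get(void)
{
	struct mock_dev dev = { .usage_count = 0, .resume_fails = 1 };

	if (mock_get_sync(&dev) < 0)
		return dev.usage_count;	/* error path returns without a put */

	mock_put_noidle(&dev);
	return dev.usage_count;
}
```

In this model a failed get leaves one reference outstanding, so somebody
has to drop it. Whether that should happen on the error path itself or
later in another function is the question the rest of the thread debates.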
* Re: [PATCH] drm/panfrost: fix runtime pm imbalance on error
From: Steven Price @ 2020-05-20 14:02 UTC
To: Dinghao Liu, kjlu
Cc: Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig, David Airlie,
    Daniel Vetter, dri-devel, linux-kernel

On 20/05/2020 12:05, Dinghao Liu wrote:
> pm_runtime_get_sync() increments the runtime PM usage counter even
> when the call returns an error code. Thus a pairing decrement is needed
> on the error handling path to keep the counter balanced.
>
> Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>

Actually I think we have the opposite problem. To be honest we don't
handle this situation very well. By the time panfrost_job_hw_submit() is
called the job has already been added to the pfdev->jobs array, so it's
considered submitted even if it never actually lands on the hardware. So
in the case of this function bailing out early we will then (eventually)
hit a timeout and trigger a GPU reset.

panfrost_job_timedout() iterates through the pfdev->jobs array and calls
pm_runtime_put_noidle() for each job it finds. So there's no imbalance
here that I can see.

Have you actually observed the situation where pm_runtime_get_sync()
returns a failure?

HOWEVER, it appears that by bailing out early the call to
panfrost_devfreq_record_busy() is never made, which as far as I can see
means that there may be an extra call to panfrost_devfreq_record_idle()
when the jobs have timed out, which could underflow the counter.

But equally looking at panfrost_job_timedout(), we only call
panfrost_devfreq_record_idle() *once* even though multiple jobs might be
processed.

There's a completely untested patch below which in theory should fix that...

Steve

----8<---
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 7914b1570841..f9519afca29d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -145,6 +145,8 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
 	u64 jc_head = job->jc;
 	int ret;
 
+	panfrost_devfreq_record_busy(pfdev);
+
 	ret = pm_runtime_get_sync(pfdev->dev);
 	if (ret < 0)
 		return;
@@ -155,7 +157,6 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
 	}
 
 	cfg = panfrost_mmu_as_get(pfdev, &job->file_priv->mmu);
-	panfrost_devfreq_record_busy(pfdev);
 
 	job_write(pfdev, JS_HEAD_NEXT_LO(js), jc_head & 0xFFFFFFFF);
 	job_write(pfdev, JS_HEAD_NEXT_HI(js), jc_head >> 32);
@@ -410,12 +411,12 @@ static void panfrost_job_timedout(struct drm_sched_job *sched_job)
 	for (i = 0; i < NUM_JOB_SLOTS; i++) {
 		if (pfdev->jobs[i]) {
 			pm_runtime_put_noidle(pfdev->dev);
+			panfrost_devfreq_record_idle(pfdev);
 			pfdev->jobs[i] = NULL;
 		}
 	}
 	spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
 
-	panfrost_devfreq_record_idle(pfdev);
 	panfrost_device_reset(pfdev);
 
 	for (i = 0; i < NUM_JOB_SLOTS; i++)
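
The busy/idle accounting that the untested patch above aims for can be
modelled in plain C. Everything here is an assumption made for
illustration: struct mock_gpu, DEVFREQ_SLOTS, record_busy(), record_idle(),
submit(), timedout() and busy_after_timeout() are hypothetical, and the
devfreq state is reduced to a single integer counter. The sketch marks the
device busy before any early return and records one idle per job found at
timeout.

```c
#include <assert.h>

#define DEVFREQ_SLOTS 3	/* hypothetical slot count for this sketch */

struct mock_gpu {
	int busy_count;			/* models the devfreq busy counter */
	int jobs[DEVFREQ_SLOTS];	/* non-zero: a job occupies the slot */
};

static void record_busy(struct mock_gpu *gpu) { gpu->busy_count++; }
static void record_idle(struct mock_gpu *gpu) { gpu->busy_count--; }

/* Submit in the proposed order: mark busy before anything can bail out. */
static void submit(struct mock_gpu *gpu, int slot, int pm_fails)
{
	gpu->jobs[slot] = 1;	/* in the driver the caller queues the job */
	record_busy(gpu);
	if (pm_fails)
		return;		/* early exit, busy is already recorded */
	/* ... the hardware would be programmed here ... */
}

/* Timeout in the proposed form: one idle per job found. */
static void timedout(struct mock_gpu *gpu)
{
	for (int i = 0; i < DEVFREQ_SLOTS; i++) {
		if (gpu->jobs[i]) {
			record_idle(gpu);
			gpu->jobs[i] = 0;
		}
	}
}

/* One normal submit, one that bails out early, then a timeout. */
static int busy_after_timeout(void)
{
	struct mock_gpu gpu = { 0 };

	submit(&gpu, 0, 0);
	submit(&gpu, 1, 1);
	timedout(&gpu);
	return gpu.busy_count;
}
```

With both changes the model's counter returns to zero. Moving record_busy()
after the early return, or calling record_idle() once outside the loop,
would leave it unbalanced in this sketch.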
* Re: Re: [PATCH] drm/panfrost: fix runtime pm imbalance on error
From: dinghao.liu @ 2020-05-21 7:00 UTC
To: Steven Price
Cc: kjlu, Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig, David Airlie,
    Daniel Vetter, dri-devel, linux-kernel

Hi Steve,

There are two bail-out points in panfrost_job_hw_submit(): one is
the error path beginning from pm_runtime_get_sync(), the other one is
the error path beginning from WARN_ON() in the if statement. The pm
imbalance fixed in this patch is between these two paths. I think the
caller of panfrost_job_hw_submit() cannot distinguish this imbalance
outside this function.

panfrost_job_timedout() calls pm_runtime_put_noidle() for every job it
finds, but all jobs are added to pfdev->jobs just before calling
panfrost_job_hw_submit(). Therefore I think the imbalance still exists.

But I'm not sure whether we should add pm_runtime_put on the error path
after pm_runtime_get_sync(), or remove pm_runtime_put on the error path
after WARN_ON().

As for the problem with panfrost_devfreq_record_busy(), this may be a
new bug and requires an independent patch to fix it.

Regards,
Dinghao

> On 20/05/2020 12:05, Dinghao Liu wrote:
> > pm_runtime_get_sync() increments the runtime PM usage counter even
> > when the call returns an error code. Thus a pairing decrement is needed
> > on the error handling path to keep the counter balanced.
> >
> > Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
>
> Actually I think we have the opposite problem. To be honest we don't
> handle this situation very well. By the time panfrost_job_hw_submit() is
> called the job has already been added to the pfdev->jobs array, so it's
> considered submitted even if it never actually lands on the hardware. So
> in the case of this function bailing out early we will then (eventually)
> hit a timeout and trigger a GPU reset.
* Re: [PATCH] drm/panfrost: fix runtime pm imbalance on error
From: Steven Price @ 2020-05-22 13:09 UTC
To: dinghao.liu
Cc: kjlu, Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig, David Airlie,
    Daniel Vetter, dri-devel, linux-kernel

On 21/05/2020 08:00, dinghao.liu@zju.edu.cn wrote:
> Hi Steve,
>
> There are two bail-out points in panfrost_job_hw_submit(): one is
> the error path beginning from pm_runtime_get_sync(), the other one is
> the error path beginning from WARN_ON() in the if statement. The pm
> imbalance fixed in this patch is between these two paths. I think the
> caller of panfrost_job_hw_submit() cannot distinguish this imbalance
> outside this function.

My point is the caller expects panfrost_job_hw_submit() to increase the
PM reference count. Since panfrost_job_hw_submit() cannot return an
error (it's void return) we cannot signal to the caller that the
reference hasn't been taken.

> panfrost_job_timedout() calls pm_runtime_put_noidle() for every job it
> finds, but all jobs are added to pfdev->jobs just before calling
> panfrost_job_hw_submit(). Therefore I think the imbalance still exists.

My point's exactly that - the "jobs are added to pfdev->jobs just before
calling panfrost_job_hw_submit()". Since we don't have a way for
panfrost_job_hw_submit() to fail it must unconditionally take any
references that will then be freed later on.

> But I'm not sure whether we should add pm_runtime_put on the error path
> after pm_runtime_get_sync(), or remove pm_runtime_put on the error path
> after WARN_ON().

The pm_runtime_put after the WARN_ON() is a bug. Sorry this is probably
what confused you - clearly the WARN_ON() situation is never meant to
happen in the first place, so hopefully this isn't actually possible.

Feel free to send a patch removing it! ;)

> As for the problem with panfrost_devfreq_record_busy(), this may be a
> new bug and requires an independent patch to fix it.

Indeed, I'll post a proper patch for that later - I just spotted it
while looking at the code.

Thanks,

Steve

> Regards,
> Dinghao
>
>
>> On 20/05/2020 12:05, Dinghao Liu wrote:
>>> pm_runtime_get_sync() increments the runtime PM usage counter even
>>> when the call returns an error code. Thus a pairing decrement is needed
>>> on the error handling path to keep the counter balanced.
>>>
>>> Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
>>
>> Actually I think we have the opposite problem. To be honest we don't
>> handle this situation very well. By the time panfrost_job_hw_submit() is
>> called the job has already been added to the pfdev->jobs array, so it's
>> considered submitted even if it never actually lands on the hardware. So
>> in the case of this function bailing out early we will then (eventually)
>> hit a timeout and trigger a GPU reset.
>>
>> panfrost_job_timedout() iterates through the pfdev->jobs array and calls
>> pm_runtime_put_noidle() for each job it finds. So there's no imbalance
>> here that I can see.
>>
>> Have you actually observed the situation where pm_runtime_get_sync()
>> returns a failure?
>>
>> HOWEVER, it appears that by bailing out early the call to
>> panfrost_devfreq_record_busy() is never made, which as far as I can see
>> means that there may be an extra call to panfrost_devfreq_record_idle()
>> when the jobs have timed out, which could underflow the counter.
>>
>> But equally looking at panfrost_job_timedout(), we only call
>> panfrost_devfreq_record_idle() *once* even though multiple jobs might be
>> processed.
>>
>> There's a completely untested patch below which in theory should fix that...
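
Steven's argument that the submit path must keep its reference
unconditionally can be checked with a small userspace model. The names
struct mock_pm_gpu, PM_SLOTS, submit_job(), handle_timeout() and
balance_after_timeout() are hypothetical, and the model assumes, as
described in the thread, that every job queued before submit gets exactly
one put from the timeout handler. The drop_on_warn flag selects the
behaviour Steven calls a bug, where the WARN_ON() bail-out drops the
reference itself.

```c
#include <assert.h>

#define PM_SLOTS 3	/* hypothetical slot count for this sketch */

struct mock_pm_gpu {
	int usage_count;	/* models the runtime PM usage counter */
	int jobs[PM_SLOTS];	/* non-zero: a job is queued in the slot */
};

/*
 * Models the submit path. The reference is always counted; slot_busy
 * stands in for the WARN_ON() condition, and drop_on_warn makes that
 * bail-out drop the reference again.
 */
static void submit_job(struct mock_pm_gpu *gpu, int slot, int slot_busy,
		       int drop_on_warn)
{
	gpu->jobs[slot] = 1;	/* in the driver the caller queues the job */
	gpu->usage_count++;

	if (slot_busy) {
		if (drop_on_warn)
			gpu->usage_count--;
		return;
	}
	/* ... the hardware would be programmed here ... */
}

/* Models the timeout handler: one put per queued job. */
static void handle_timeout(struct mock_pm_gpu *gpu)
{
	for (int i = 0; i < PM_SLOTS; i++) {
		if (gpu->jobs[i]) {
			gpu->usage_count--;
			gpu->jobs[i] = 0;
		}
	}
}

/* A job that hits the bail-out and is later cleaned up by the timeout. */
static int balance_after_timeout(int drop_on_warn)
{
	struct mock_pm_gpu gpu = { 0 };

	submit_job(&gpu, 0, 1, drop_on_warn);
	handle_timeout(&gpu);
	return gpu.usage_count;
}
```

In this model, keeping the reference on the bail-out path leaves the
counter balanced after the timeout, while dropping it there makes the
later put go one below zero. That matches the conclusion in the thread
that the put after WARN_ON() is the part to remove.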
* Re: Re: [PATCH] drm/panfrost: fix runtime pm imbalance on error
From: dinghao.liu @ 2020-05-22 13:23 UTC
To: Steven Price
Cc: kjlu, Rob Herring, Tomeu Vizoso, Alyssa Rosenzweig, David Airlie,
    Daniel Vetter, dri-devel, linux-kernel

Thank you for your further explanation! It's all clear to me now and I
will write a new patch to fix this imbalance.

Regards,
Dinghao

> On 21/05/2020 08:00, dinghao.liu@zju.edu.cn wrote:
> > Hi Steve,
> >
> > There are two bail-out points in panfrost_job_hw_submit(): one is
> > the error path beginning from pm_runtime_get_sync(), the other one is
> > the error path beginning from WARN_ON() in the if statement. The pm
> > imbalance fixed in this patch is between these two paths. I think the
> > caller of panfrost_job_hw_submit() cannot distinguish this imbalance
> > outside this function.
>
> My point is the caller expects panfrost_job_hw_submit() to increase the
> PM reference count. Since panfrost_job_hw_submit() cannot return an
> error (it's void return) we cannot signal to the caller that the
> reference hasn't been taken.
>
> > panfrost_job_timedout() calls pm_runtime_put_noidle() for every job it
> > finds, but all jobs are added to pfdev->jobs just before calling
> > panfrost_job_hw_submit(). Therefore I think the imbalance still exists.
>
> My point's exactly that - the "jobs are added to pfdev->jobs just before
> calling panfrost_job_hw_submit()". Since we don't have a way for
> panfrost_job_hw_submit() to fail it must unconditionally take any
> references that will then be freed later on.
>
> > But I'm not sure whether we should add pm_runtime_put on the error path
> > after pm_runtime_get_sync(), or remove pm_runtime_put on the error path
> > after WARN_ON().
>
> The pm_runtime_put after the WARN_ON() is a bug.
Thread overview: 10+ messages (newest: 2020-05-23 9:34 UTC)
2020-05-20 11:05 [PATCH] drm/panfrost: fix runtime pm imbalance on error Dinghao Liu
2020-05-20 14:02 ` Steven Price
2020-05-21  7:00   ` dinghao.liu
2020-05-22 13:09     ` Steven Price
2020-05-22 13:23       ` dinghao.liu