From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: <linuxarm@huawei.com>, <mauro.chehab@huawei.com>,
Andy Gross <agross@kernel.org>,
Bjorn Andersson <bjorn.andersson@linaro.org>,
"Hans Verkuil" <hans.verkuil@cisco.com>,
Mauro Carvalho Chehab <mchehab@kernel.org>,
Stanimir Varbanov <stanimir.varbanov@linaro.org>,
<linux-arm-msm@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
<linux-media@vger.kernel.org>
Subject: Re: [PATCH 03/25] media: venus: Rework error fail recover logic
Date: Wed, 5 May 2021 12:05:02 +0100
Message-ID: <20210505120502.000047e0@Huawei.com>
In-Reply-To: <419e346f01af5423485202d624fc144756bd2b11.1620207353.git.mchehab+huawei@kernel.org>
On Wed, 5 May 2021 11:41:53 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> The Venus code has a sort of watchdog that attempts to recover
> from IP errors, implemented as a delayed work job, which
> calls venus_sys_error_handler().
>
> Right now, it has several issues:
>
> 1. It assumes that PM runtime resume never fails
>
> 2. It internally runs two while() loops that also assume that
> PM runtime will never fail to go idle:
>
> while (pm_runtime_active(core->dev_dec) || pm_runtime_active(core->dev_enc))
> msleep(10);
>
> ...
>
> while (core->pmdomains[0] && pm_runtime_active(core->pmdomains[0]))
> usleep_range(1000, 1500);
>
> 3. It uses an OR to merge all return codes and then report to the user
>
> 4. If the hardware never recovers, it keeps running on every 10ms,
> flooding the syslog with 2 messages (so, up to 200 messages
> per second).
>
> Rework the code, in order to prevent that, by:
>
> 1. check the return code from PM runtime resume;
> 2. don't let the while() loops run forever;
> 3. store the failed event;
> 4. use warn ratelimited when it fails to recover.
>
> Fixes: af2c3834c8ca ("[media] media: venus: adding core part and helper functions")
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
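On issue 3 in the commit message, a minimal userspace sketch of why OR-ing
negative errno values gives a misleading result, next to a "keep the first
error" alternative. The helper names merge_with_or() and first_error() are
made up for illustration and are not part of the driver; the patch itself
records the first failure as a message string rather than as a code.

```c
#include <errno.h>

/* Pattern being removed: OR all return codes together. */
static int merge_with_or(int a, int b)
{
	return a | b;
}

/* Illustrative alternative: keep the first non-zero return code. */
static int first_error(int a, int b)
{
	return a ? a : b;
}
```

With the Linux values EINVAL == 22 and ENOMEM == 12, merge_with_or(-EINVAL,
-ENOMEM) evaluates to -2, which reads as -ENOENT even though neither call
returned that. first_error() reports -EINVAL for the same inputs.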
Trivial comments inline. Otherwise, based on no knowledge at all of the
actual hardware, the fix looks sane.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
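For reference, the bounded wait described in the commit message can be
sketched in plain userspace C as below. This is only an illustration under
assumptions: wait_for_idle(), struct fake_dev and fake_dev_active() are
hypothetical names, the delay is a caller-supplied callback instead of
msleep()/usleep_range(), and unlike the loops in the patch this version
returns -ETIMEDOUT when the attempts run out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Poll is_active(ctx) at most max_attempts times, calling delay(ctx)
 * after each poll that finds the device still active. Returns 0 as soon
 * as a poll finds it idle, or -ETIMEDOUT if every poll found it active
 * (including the case max_attempts == 0, where nothing is polled).
 */
static int wait_for_idle(bool (*is_active)(void *ctx),
			 void (*delay)(void *ctx),
			 void *ctx, int max_attempts)
{
	int i;

	for (i = 0; i < max_attempts; i++) {
		if (!is_active(ctx))
			return 0;
		if (delay)
			delay(ctx);
	}
	return -ETIMEDOUT;
}

/* Hypothetical stand-in for a device that goes idle after N polls. */
struct fake_dev {
	int polls_until_idle;
};

static bool fake_dev_active(void *ctx)
{
	struct fake_dev *dev = ctx;

	if (dev->polls_until_idle == 0)
		return false;
	dev->polls_until_idle--;
	return true;
}
```

A fake_dev that needs 3 polls succeeds with max_attempts = 10, while one
that needs 20 polls gets -ETIMEDOUT after exactly 10 polls instead of
spinning forever.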
> ---
> drivers/media/platform/qcom/venus/core.c | 59 +++++++++++++++++++-----
> 1 file changed, 47 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/media/platform/qcom/venus/core.c b/drivers/media/platform/qcom/venus/core.c
> index 54bac7ec14c5..4d0482743c0a 100644
> --- a/drivers/media/platform/qcom/venus/core.c
> +++ b/drivers/media/platform/qcom/venus/core.c
> @@ -78,22 +78,32 @@ static const struct hfi_core_ops venus_core_ops = {
> .event_notify = venus_event_notify,
> };
>
> +#define RPM_WAIT_FOR_IDLE_MAX_ATTEMPTS 10
> +
> static void venus_sys_error_handler(struct work_struct *work)
> {
> struct venus_core *core =
> container_of(work, struct venus_core, work.work);
> - int ret = 0;
> + int ret, i, max_attempts = RPM_WAIT_FOR_IDLE_MAX_ATTEMPTS;
> + bool failed = false;
> + const char *err_msg = "";
>
> - pm_runtime_get_sync(core->dev);
> + ret = pm_runtime_get_sync(core->dev);
> + if (ret < 0) {
> + err_msg = "resume runtime PM\n";
I think this will end up printing two newlines, as the format string used
later already ends in "%s\n". The same applies to the other err_msg strings.
> + max_attempts = 0;
> + failed = true;
> + }
>
> hfi_core_deinit(core, true);
>
> - dev_warn(core->dev, "system error has occurred, starting recovery!\n");
> -
> mutex_lock(&core->lock);
>
> - while (pm_runtime_active(core->dev_dec) || pm_runtime_active(core->dev_enc))
> + for (i = 0; i < max_attempts; i++) {
> + if (!pm_runtime_active(core->dev_dec) && !pm_runtime_active(core->dev_enc))
> + break;
> msleep(10);
> + }
>
> venus_shutdown(core);
>
> @@ -101,31 +111,56 @@ static void venus_sys_error_handler(struct work_struct *work)
>
> pm_runtime_put_sync(core->dev);
>
> - while (core->pmdomains[0] && pm_runtime_active(core->pmdomains[0]))
> + for (i = 0; i < max_attempts; i++) {
> + if (!core->pmdomains[0] || !pm_runtime_active(core->pmdomains[0]))
> + break;
> usleep_range(1000, 1500);
> + }
>
> hfi_reinit(core);
>
> - pm_runtime_get_sync(core->dev);
> + ret = pm_runtime_get_sync(core->dev);
> + if (ret < 0) {
> + err_msg = "resume runtime PM\n";
> + max_attempts = 0;
This is after the last use of max_attempts, so no point in setting it to zero.
> + failed = true;
> + }
>
> - ret |= venus_boot(core);
> - ret |= hfi_core_resume(core, true);
> + ret = venus_boot(core);
> + if (ret && !failed) {
> + err_msg = "boot Venus\n";
> + failed = true;
> + }
> +
> + ret = hfi_core_resume(core, true);
> + if (ret && !failed) {
> + err_msg = "resume HFI\n";
> + failed = true;
> + }
>
> enable_irq(core->irq);
>
> mutex_unlock(&core->lock);
>
> - ret |= hfi_core_init(core);
> + ret = hfi_core_init(core);
> + if (ret && !failed) {
> + err_msg = "init HFI\n";
> + failed = true;
> + }
>
> pm_runtime_put_sync(core->dev);
>
> - if (ret) {
> + if (failed) {
> disable_irq_nosync(core->irq);
> - dev_warn(core->dev, "recovery failed (%d)\n", ret);
> + dev_warn_ratelimited(core->dev,
> + "System error has occurred, recovery failed to %s\n",
> + err_msg);
> schedule_delayed_work(&core->work, msecs_to_jiffies(10));
> return;
> }
>
> + dev_warn(core->dev, "system error has occurred (recovered)\n");
> +
> mutex_lock(&core->lock);
> core->sys_error = false;
> mutex_unlock(&core->lock);
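To illustrate the double newline noted inline on the err_msg strings, here
is a small userspace sketch. It uses snprintf() with the same format string
as the dev_warn_ratelimited() call in the patch; format_failure() and
trailing_newlines() are hypothetical helpers written for this example only.

```c
#include <stdio.h>
#include <string.h>

/* Count the '\n' characters at the end of a string. */
static int trailing_newlines(const char *s)
{
	size_t len = strlen(s);
	int n = 0;

	while (len > 0 && s[len - 1] == '\n') {
		n++;
		len--;
	}
	return n;
}

/* Format the failure message with the format string used in the patch. */
static int format_failure(char *buf, size_t size, const char *err_msg)
{
	return snprintf(buf, size,
			"System error has occurred, recovery failed to %s\n",
			err_msg);
}
```

Passing "resume runtime PM\n" as err_msg leaves two trailing newlines in the
buffer; passing "resume runtime PM" leaves one. Dropping the "\n" from each
err_msg assignment would be one way to address it.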