From: Anson Huang <anson.huang@nxp.com>
To: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>,
	Peng Fan <peng.fan@nxp.com>,
	Viresh Kumar <viresh.kumar@linaro.org>,
	Jacky Bai <ping.bai@nxp.com>,
	"linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>,
	Ye Li <ye.li@nxp.com>
Subject: RE: About CPU hot-plug stress test failed in cpufreq driver
Date: Tue, 10 Dec 2019 05:39:32 +0000	[thread overview]
Message-ID: <DB3PR0402MB3916F3981F1A6131F60F5BDFF55B0@DB3PR0402MB3916.eurprd04.prod.outlook.com> (raw)
In-Reply-To: <E8129DCF-DE0A-4497-A475-E3876E884DE5@nxp.com>

Hi, Rafael/Viresh
	Let me correct one thing: v4.19 can also hit this case. We don't use the cpufreq-dt driver on v4.19, so "policy->dvfs_possible_from_any_cpu" is false; the cpufreq_this_cpu_can_update() check in dbs_update_util_handler() then takes effect and greatly narrows the race window, though I am NOT sure it avoids the race completely.
	The current cpufreq_update_util() allows one CPU to perform the util update on behalf of another CPU; this is intentional per commit 674e75411fc2 ("sched: cpufreq: Allow remote cpufreq callbacks"). But I think it can race with CPU hot-plug: if a CPU is going offline and has already finished cpufreq_dbs_governor_stop(), it can still reach cpufreq_update_util() to do the util update for another online CPU, and irq work then gets queued on the offlining CPU, which is unexpected. Since "policy->dvfs_possible_from_any_cpu" is always TRUE for the cpufreq-dt driver, dbs_update_util_handler() has no other check for this scenario, so the issue can occur.
	Do you think this race condition exists?
	I applied the patch below for both schedutil and the other cpufreq governors; the test has passed more than 5000 iterations and is still running.

diff --git a/drivers/cpufreq/cpufreq-dt.c b/drivers/cpufreq/cpufreq-dt.c
index d2b5f06..68421c7 100644
--- a/drivers/cpufreq/cpufreq-dt.c
+++ b/drivers/cpufreq/cpufreq-dt.c
@@ -273,7 +273,7 @@ static int cpufreq_init(struct cpufreq_policy *policy)
                transition_latency = CPUFREQ_ETERNAL;

        policy->cpuinfo.transition_latency = transition_latency;
-       policy->dvfs_possible_from_any_cpu = true;
+       //policy->dvfs_possible_from_any_cpu = true;

        dev_pm_opp_of_register_em(policy->cpus);

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 86800b4..cc5b4a0 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -140,6 +140,9 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy, u64 time,
        if (!sugov_update_next_freq(sg_policy, time, next_freq))
                return;

+       if (!cpufreq_this_cpu_can_update(sg_policy->policy))
+               return;
+

Anson

> Subject: Re: About CPU hot-plug stress test failed in cpufreq driver
> 
> Could it be caused by “policy->dvfs_possible_from_any_cpu” always being
> TRUE, so that platforms using the cpufreq-dt driver hit this issue? I
> think I was mistaken: v4.19 does NOT show the issue on i.MX
> platforms because we don’t use the cpufreq-dt driver there, while on v5.4 we switched to it.
> I can verify this again tomorrow and see whether v4.19 also hits the issue
> when policy->dvfs_possible_from_any_cpu is forced to TRUE.
> 
> Hi, Viresh
>        When you tried to reproduce this on your side, was the platform you
> used running the cpufreq-dt driver or not?
> 
> From Anson's iPhone 6
> 
> 
> >> On Dec 9, 2019, at 20:44, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >>
> >> On Mon, Dec 9, 2019 at 1:32 PM Anson Huang <anson.huang@nxp.com>
> wrote:
> >>
> >>
> >>
> >> From Anson's iPhone 6
> >>
> >>
> >>>> On Dec 9, 2019, at 19:23, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >>>>
> >>>> On Mon, Dec 9, 2019 at 11:57 AM Anson Huang
> <anson.huang@nxp.com> wrote:
> >>>>
> >>>> Forgot to mention that the patch below easily reproduces the
> panic() on our platforms on v5.4, which I think is unexpected: policy->cpus
> has already been updated after governor stop, yet irq work still gets queued
> on the offlined CPU.
> >>>>
> >>>> static void dbs_update_util_handler(struct update_util_data *data,
> >>>>                                     u64 time, unsigned int flags)
> >>>> +       if (!cpumask_test_cpu(smp_processor_id(), policy_dbs->policy->cpus))
> >>>> +               panic("...irq work on offline cpu %d\n", smp_processor_id());
> >>>>      irq_work_queue(&policy_dbs->irq_work);
> >>>
> >>> Yes, that is unexpected.
> >>>
> >>> In cpufreq_offline(), we have:
> >>>
> >>>  down_write(&policy->rwsem);
> >>>  if (has_target())
> >>>      cpufreq_stop_governor(policy);
> >>>
> >>>  cpumask_clear_cpu(cpu, policy->cpus);
> >>>
> >>> and cpufreq_stop_governor() calls policy->governor->stop(policy)
> >>> which is cpufreq_dbs_governor_stop().
> >>>
> >>> That calls gov_clear_update_util(policy_dbs->policy) first, which
> >>> invokes cpufreq_remove_update_util_hook() for each CPU in
> >>> policy->cpus and synchronizes RCU, so after that point none of the
> >>> policy->cpus is expected to run dbs_update_util_handler().
> >>>
> >>> policy->cpus is updated next and the governor is started again with
> >>> the new policy->cpus.  Because the offline CPU is not there, it is
> >>> not expected to run dbs_update_util_handler() again.
> >>>
> >>> Do you only get the original error when one of the CPUs goes back online?
> >>
> >> No, sometimes I also got this error while a CPU is being taken offline.
> >>
> >> But the point is NOT that dbs_update_util_handler() is called during
> >> governor stop; it is that this function runs on a CPU which has
> >> already finished the governor stop function,
> >
> > Yes, it is, and which should not be possible as per the above.
> >
> > The offline CPU is not there in policy->cpus when
> > cpufreq_dbs_governor_start() is called for the policy, so its
> > cpufreq_update_util_data pointer is not set (it is NULL at that time).
> > Therefore it is not expected to run dbs_update_util_handler() until it
> > is turned back online.
> >
> >> I thought the original expectation was that this function would ONLY be
> executed on the CPU whose frequency needs scaling?
> >> Is this correct?
> >
> > Yes, it is.
> >
> >> v4.19 follows this expectation while v5.4 does NOT.
> >
> > As per the kernel code, they both do.
> >
> >> The only thing I can imagine is that changes in the kernel/sched/ folder
> cause this difference, but I still need more time to figure out which changes
> cause it; if you have any suggestion, please advise, thanks!
> >
> > The CPU offline/online (hotplug) rework was done after 4.19 IIRC and
> > that changed the way online works.  Now, it runs on the CPU going
> > online and previously it ran on the CPU "asking" the other one to go
> > online.  That may be what makes the difference (if my recollection of
> > the time frame is correct).
