From: Anson Huang <anson.huang@nxp.com>
To: "Rafael J. Wysocki" <rjw@rjwysocki.net>, Peng Fan <peng.fan@nxp.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>,
Jacky Bai <ping.bai@nxp.com>,
"linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>
Subject: RE: About CPU hot-plug stress test failed in cpufreq driver
Date: Thu, 5 Dec 2019 13:18:35 +0000 [thread overview]
Message-ID: <DB3PR0402MB39163BEC18FF81B06D1093BBF55C0@DB3PR0402MB3916.eurprd04.prod.outlook.com> (raw)
In-Reply-To: <4087016.QifdzW7851@kreacher>
> -----Original Message-----
> From: Rafael J. Wysocki <rjw@rjwysocki.net>
> Sent: Thursday, December 5, 2019 6:48 PM
> To: Anson Huang <anson.huang@nxp.com>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>; Jacky Bai <ping.bai@nxp.com>;
> linux-pm@vger.kernel.org
> Subject: Re: About CPU hot-plug stress test failed in cpufreq driver
>
> On Thursday, December 5, 2019 9:53:20 AM CET Anson Huang wrote:
> > Hi, Rafael
> > This issue is very weird, the irq_work used in cpufreq_governor.c is
> very
> > simple, ONLY one entry to claim the irq_work, and
> cpufreq_governor's irq_work
> > is a private irq_work structure, no other drivers use it. I added some
> trace
> > event in cpufreq_governor.c and irq_work.c, every time, the issue
> happened at
> > the point of CPU1/2/3 all off, and CPU1 start ON line, but when CPU1
> tried to
> > sync the irq_work in cpufreq_dbs_governor_stop(), the irq_work
> shows that
> > previous work is pending on CPU3 which is offline, I also had the
> trace event
> > in irq_work_claim(), but no any log shows the cpufreq_governor
> irq_work is
> > claimed on CPU3 after CPU3 offline, below is the debug patch I added
> and the
> > log on 2 consoles:
> > If I understand it correctly, the irq work used in cpufreq_governor
> ONLY has
> > one entry of calling irq_work_queue() which will be ONLY claimed on
> the CPU
> > calling the irq_work_queue(), but from trace result, I have NOT see
> where
> > CPU3 could call irq_work_queue() after it finishes the irq work sync
> before
> > offline.
>
> Right.
>
> Which means that this particular irq_work only runs on the CPU that has run
> irq_work_queue() for it.
>
> > Could it something wrong related to cache maintain during CPU
> hotplug?
>
> I'm not sure what is going on, but I do agree that it is weird enough. :-)
>
> [cut]
>
> > LOG on console 1 which does CPU1/2/3 offline and online stress test:
> > CPUHotplug: 4575 times remaining
> > [ 1047.401185] CPU1: shutdown
> > [ 1047.403917] psci: CPU1 killed.
> > [ 1047.449153] CPU2: shutdown
> > [ 1047.451880] psci: CPU2 killed.
> > [ 1047.501131] CPU3: shutdown
> > [ 1047.503857] psci: CPU3 killed.
> > [ 1048.541939] Detected VIPT I-cache on CPU1 [ 1048.541983] GICv3:
> > CPU1: found redistributor 1 region 0:0x0000000051b20000 [ 1048.542050]
> > CPU1: Booted secondary processor 0x0000000001 [0x410fd042] [
> > 1048.585024] Detected VIPT I-cache on CPU2 [ 1048.585061] GICv3: CPU2:
> > found redistributor 2 region 0:0x0000000051b40000 [ 1048.585121] CPU2:
> > Booted secondary processor 0x0000000002 [0x410fd042] [ 1048.645070]
> > Detected VIPT I-cache on CPU3 [ 1048.645112] GICv3: CPU3: found
> > redistributor 3 region 0:0x0000000051b60000 [ 1048.645181] CPU3:
> > Booted secondary processor 0x0000000003 [0x410fd042]
> > CPUHotplug: 4574 times remaining
> > [ 1049.769187] CPU1: shutdown
> > [ 1049.771913] psci: CPU1 killed.
> > [ 1049.809126] CPU2: shutdown
> > [ 1049.811856] psci: CPU2 killed.
> > [ 1049.853135] CPU3: shutdown
> > [ 1049.855868] psci: CPU3 killed.
> >
> > Waiting here forever.....
> >
> > LOG on console 2 which enables the trace events I added upper:
> > sed-4591 [003] d..4 1049.705561: cpu_frequency_irq_claim:
> cpu_id=3, flag=3
> > sed-4591 [003] dNh1 1049.705604:
> > cpu_frequency_irq_run_list: cpu_id=3, flag=0
>
> So here CPU3 runs an IRQ work, presumably the cpufreq governor's one.
>
> After that its raised_list should be empty and it doesn't claim any IRQ works
> going forward.
>
> > <idle>-0 [001] d.s2 1049.716308: cpu_frequency_irq_work:
> cpu_id=1, cpu=-1
> > <idle>-0 [001] d.s2 1049.716319: cpu_frequency_irq_claim:
> cpu_id=1, flag=3
> > <idle>-0 [001] dNH2 1049.716338: cpu_frequency_irq_run_list:
> cpu_id=1, flag=0
>
> And now CPU1 runs the cpufreq governor IRQ work, so it sets work->cpu to 1
> and then to -1 (when flushing raised_list).
>
> > <idle>-0 [002] d.s2 1049.728303: cpu_frequency_irq_work:
> cpu_id=2, cpu=-1
> > <idle>-0 [002] d.s2 1049.728307: cpu_frequency_irq_claim:
> cpu_id=2, flag=3
> > <idle>-0 [002] dNH2 1049.728320: cpu_frequency_irq_run_list:
> cpu_id=2, flag=0
> > <idle>-0 [001] d.s2 1049.740305: cpu_frequency_irq_work:
> cpu_id=1, cpu=-1
> > <idle>-0 [001] d.s2 1049.740307: cpu_frequency_irq_claim:
> cpu_id=1, flag=3
> > <idle>-0 [001] dNH2 1049.740319: cpu_frequency_irq_run_list:
> cpu_id=1, flag=0
> > <idle>-0 [001] d.s2 1049.752305: cpu_frequency_irq_work:
> cpu_id=1, cpu=-1
> > <idle>-0 [001] d.s2 1049.752307: cpu_frequency_irq_claim:
> cpu_id=1, flag=3
> > <idle>-0 [001] dNH2 1049.752316: cpu_frequency_irq_run_list:
> cpu_id=1, flag=0
> > cpuhp/1-13 [001] .... 1049.768340: cpu_frequency_irq_work_sync:
> cpu_id=1, cpu=-1, flag=0
> > cpuhp/1-13 [001] d..4 1049.768681: cpu_frequency_irq_work:
> cpu_id=1, cpu=-1
> > cpuhp/1-13 [001] d..4 1049.768683: cpu_frequency_irq_claim:
> cpu_id=1, flag=3
> > cpuhp/1-13 [001] dNh1 1049.768698: cpu_frequency_irq_run_list:
> cpu_id=1, flag=0
> > smp_test.sh-734 [000] ...1 1049.771903: cpu_frequency_irq_claim:
> cpu_id=0, flag=7
> > smp_test.sh-734 [000] dNh1 1049.775009: cpu_frequency_irq_run_list:
> cpu_id=0, flag=4
> > smp_test.sh-734 [000] ...1 1049.776084: cpu_frequency_irq_claim:
> cpu_id=0, flag=7
> > smp_test.sh-734 [000] dNh. 1049.776392: cpu_frequency_irq_run_list:
> cpu_id=0, flag=4
> > smp_test.sh-734 [000] d..2 1049.779093: cpu_frequency_irq_work:
> cpu_id=0, cpu=-1
> > smp_test.sh-734 [000] d..2 1049.779103: cpu_frequency_irq_claim:
> cpu_id=0, flag=3
> > <idle>-0 [000] dNh2 1049.779162: cpu_frequency_irq_run_list:
> cpu_id=0, flag=0
> > <idle>-0 [000] d.s2 1049.792305: cpu_frequency_irq_work:
> cpu_id=0, cpu=-1
> > <idle>-0 [000] d.s2 1049.792315: cpu_frequency_irq_claim:
> cpu_id=0, flag=3
> > <idle>-0 [000] dNH2 1049.792329: cpu_frequency_irq_run_list:
> cpu_id=0, flag=0
> > cpuhp/2-18 [002] .... 1049.808315: cpu_frequency_irq_work_sync:
> cpu_id=2, cpu=-1, flag=0
> > cpuhp/2-18 [002] d..4 1049.808642: cpu_frequency_irq_work:
> cpu_id=2, cpu=-1
> > cpuhp/2-18 [002] d..4 1049.808645: cpu_frequency_irq_claim:
> cpu_id=2, flag=3
> > cpuhp/2-18 [002] dNh1 1049.808658: cpu_frequency_irq_run_list:
> cpu_id=2, flag=0
> > smp_test.sh-734 [000] ...1 1049.811848: cpu_frequency_irq_claim:
> cpu_id=0, flag=7
> > smp_test.sh-734 [000] dNh1 1049.814949: cpu_frequency_irq_run_list:
> cpu_id=0, flag=4
> > smp_test.sh-734 [000] ...1 1049.815988: cpu_frequency_irq_claim:
> cpu_id=0, flag=7
> > smp_test.sh-734 [000] dNh1 1049.816321: cpu_frequency_irq_run_list:
> cpu_id=0, flag=4
> > smp_test.sh-734 [000] d..3 1049.818936: cpu_frequency_irq_work:
> cpu_id=0, cpu=-1
> > smp_test.sh-734 [000] d..3 1049.818946: cpu_frequency_irq_claim:
> cpu_id=0, flag=3
> > smp_test.sh-734 [000] dNh2 1049.818973: cpu_frequency_irq_run_list:
> cpu_id=0, flag=0
> > <idle>-0 [000] d.s4 1049.832308: cpu_frequency_irq_work:
> cpu_id=0, cpu=-1
> > <idle>-0 [000] d.s4 1049.832317: cpu_frequency_irq_claim:
> cpu_id=0, flag=3
> > <idle>-0 [000] dNH3 1049.832332: cpu_frequency_irq_run_list:
> cpu_id=0, flag=0
> > cpuhp/3-23 [003] .... 1049.852314: cpu_frequency_irq_work_sync:
> cpu_id=3, cpu=-1, flag=0
> >
> > [Anson] when CPU3 offline, the irq work sync is successfully, no irq
> > work pending any more;
> >
> > smp_test.sh-734 [000] ...1 1049.855859: cpu_frequency_irq_claim:
> cpu_id=0, flag=7
> > smp_test.sh-734 [000] dNh1 1049.858958: cpu_frequency_irq_run_list:
> cpu_id=0, flag=4
> > smp_test.sh-734 [000] ...1 1049.859990: cpu_frequency_irq_claim:
> cpu_id=0, flag=7
> > smp_test.sh-734 [000] dNh. 1049.860346: cpu_frequency_irq_run_list:
> cpu_id=0, flag=4
> > <idle>-0 [001] d.h1 1050.896329: cpu_frequency_irq_run_list:
> cpu_id=1, flag=4
> > cpuhp/1-13 [001] .... 1050.916319: cpu_frequency_irq_work_sync:
> cpu_id=1, cpu=3, flag=3
> >
> > [Anson] we can see when CPU1 start online and tried to sync irq work,
> > found it is pending on CPU3 which is offline, and in this period, no
> > irq work claimed by cpufreq_governor,
>
> So I'm wondering how it is possible at all that work->cpu value is 3 at this
> point.
>
> The last CPU that wrote to work->cpu was CPU0 and the written value was -1,
> and
> CPU3 saw that value when it was running irq_work_sync().
>
> There is no sane way by which work->cpu can be equal to 3 from CPU1's
> perspective, because the last value written to it by CPU1 itself was -1 and the
> last value written to it by any other CPU also was -1.
>
> Moreover, after CPU3 had updated it last time (and the last value written to
> it by CPU3 had been -1), other CPUs, *including* CPU1, updated it too (and
> that for multiple times).
>
> So the only theory that can explain why CPU1 sees 3 in there when it is going
> online appears to be some silent memory corruption.
>
> That said, have you tried to make the READ_ONCE() change suggested a
> while ago?
Below patch does NOT work using READ_ONCE() if I did the change correctly:
@@ -212,7 +208,7 @@ void irq_work_sync(struct irq_work *work)
{
lockdep_assert_irqs_enabled();
- while (work->flags & IRQ_WORK_BUSY)
+ while (READ_ONCE(work->flags) & IRQ_WORK_BUSY)
cpu_relax();
}
LOG:
CPUHotplug: 4937 times remaining
[ 214.837047] CPU1: shutdown
[ 214.839781] psci: CPU1 killed.
[ 214.877041] CPU2: shutdown
[ 214.879767] psci: CPU2 killed.
[ 214.917026] CPU3: shutdown
[ 214.919758] psci: CPU3 killed.
[ 215.957816] Detected VIPT I-cache on CPU1
[ 215.957860] GICv3: CPU1: found redistributor 1 region 0:0x0000000051b20000
[ 215.957930] CPU1: Booted secondary processor 0x0000000001 [0x410fd042]
[ 216.001025] Detected VIPT I-cache on CPU2
[ 216.001064] GICv3: CPU2: found redistributor 2 region 0:0x0000000051b40000
[ 216.001126] CPU2: Booted secondary processor 0x0000000002 [0x410fd042]
[ 216.068960] Detected VIPT I-cache on CPU3
[ 216.069004] GICv3: CPU3: found redistributor 3 region 0:0x0000000051b60000
[ 216.069076] CPU3: Booted secondary processor 0x0000000003 [0x410fd042]
CPUHotplug: 4936 times remaining
[ 217.201055] CPU1: shutdown
[ 217.203779] psci: CPU1 killed.
[ 400.506869] audit: type=1006 audit(1573738201.312:3): pid=1332 uid=0 old-aui1
[ 4000.600430] audit: type=1006 audit(1573741801.408:4): pid=1352 uid=0 old-aui1
[ 7600.687496] audit: type=1006 audit(1573745401.492:5): pid=1371 uid=0 old-aui1
cpuhp/1-13 [001] .... 217.200231: cpu_frequency_irq_work_sync: cpu_id=1, cpu=-1, flag=0
smp_test.sh-741 [002] ...1 217.203770: cpu_frequency_irq_claim: cpu_id=2, flag=7
smp_test.sh-741 [002] d.h1 217.206873: cpu_frequency_irq_run_list: cpu_id=2, flag=4
smp_test.sh-741 [002] ...1 217.206893: cpu_frequency_irq_claim: cpu_id=2, flag=7
smp_test.sh-741 [002] dNh. 217.208222: cpu_frequency_irq_run_list: cpu_id=2, flag=4
cpuhp/2-18 [002] .... 217.248206: cpu_frequency_irq_work_sync: cpu_id=2, cpu=1, flag=3
[Anson] this time, the irq work is pending on CPU1 which is offline.
kauditd-31 [000] ...1 400.519304: cpu_frequency_irq_claim: cpu_id=0, flag=7
<idle>-0 [000] dNh1 400.520231: cpu_frequency_irq_run_list: cpu_id=0, flag=4
kauditd-31 [003] ...1 4000.612845: cpu_frequency_irq_claim: cpu_id=3, flag=7
crond-1352 [003] d.h. 4000.616221: cpu_frequency_irq_run_list: cpu_id=3, flag=4
kauditd-31 [000] ...1 7600.699988: cpu_frequency_irq_claim: cpu_id=0, flag=7
<idle>-0 [000] dNh1 7600.700205: cpu_frequency_irq_run_list: cpu_id=0, flag=4
root@imx8qxpmek:~#
next prev parent reply other threads:[~2019-12-05 13:18 UTC|newest]
Thread overview: 57+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <DB3PR0402MB391626A8ECFDC182C6EDCF8DF54E0@DB3PR0402MB3916.eurprd04.prod.outlook.com>
2019-11-21 9:35 ` About CPU hot-plug stress test failed in cpufreq driver Viresh Kumar
2019-11-21 10:13 ` Anson Huang
2019-11-21 10:53 ` Rafael J. Wysocki
2019-11-21 10:56 ` Rafael J. Wysocki
2019-11-22 5:15 ` Anson Huang
2019-11-22 9:59 ` Rafael J. Wysocki
2019-11-25 6:05 ` Anson Huang
2019-11-25 9:43 ` Anson Huang
2019-11-26 6:18 ` Viresh Kumar
2019-11-26 8:22 ` Anson Huang
2019-11-26 8:25 ` Viresh Kumar
2019-11-25 12:44 ` Rafael J. Wysocki
2019-11-26 8:57 ` Rafael J. Wysocki
2019-11-29 11:39 ` Rafael J. Wysocki
2019-11-29 13:44 ` Anson Huang
2019-12-05 8:53 ` Anson Huang
2019-12-05 10:48 ` Rafael J. Wysocki
2019-12-05 13:18 ` Anson Huang [this message]
2019-12-05 15:52 ` Rafael J. Wysocki
2019-12-09 10:31 ` Peng Fan
2019-12-09 10:37 ` Anson Huang
2019-12-09 10:56 ` Anson Huang
2019-12-09 11:23 ` Rafael J. Wysocki
2019-12-09 12:32 ` Anson Huang
2019-12-09 12:44 ` Rafael J. Wysocki
2019-12-09 14:18 ` Anson Huang
2019-12-10 5:39 ` Anson Huang
2019-12-10 5:53 ` Peng Fan
2019-12-10 7:05 ` Viresh Kumar
2019-12-10 8:22 ` Rafael J. Wysocki
2019-12-10 8:29 ` Anson Huang
2019-12-10 8:36 ` Viresh Kumar
2019-12-10 8:37 ` Peng Fan
2019-12-10 8:37 ` Rafael J. Wysocki
2019-12-10 8:43 ` Peng Fan
2019-12-10 8:45 ` Anson Huang
2019-12-10 8:50 ` Rafael J. Wysocki
2019-12-10 8:51 ` Anson Huang
2019-12-10 10:39 ` Rafael J. Wysocki
2019-12-10 10:54 ` Rafael J. Wysocki
2019-12-11 5:08 ` Anson Huang
2019-12-11 8:59 ` Peng Fan
2019-12-11 9:36 ` Rafael J. Wysocki
2019-12-11 9:43 ` Peng Fan
2019-12-11 9:52 ` Rafael J. Wysocki
2019-12-11 10:11 ` Peng Fan
2019-12-10 10:54 ` Viresh Kumar
2019-12-10 11:07 ` Rafael J. Wysocki
2019-12-10 8:57 ` Viresh Kumar
2019-12-10 11:03 ` Rafael J. Wysocki
2019-12-10 9:04 ` Rafael J. Wysocki
2019-12-10 8:31 ` Viresh Kumar
2019-12-10 8:12 ` Rafael J. Wysocki
2019-12-05 11:00 ` Viresh Kumar
2019-12-05 11:10 ` Rafael J. Wysocki
2019-12-05 11:17 ` Viresh Kumar
2019-11-21 10:37 ` Rafael J. Wysocki
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=DB3PR0402MB39163BEC18FF81B06D1093BBF55C0@DB3PR0402MB3916.eurprd04.prod.outlook.com \
--to=anson.huang@nxp.com \
--cc=linux-pm@vger.kernel.org \
--cc=peng.fan@nxp.com \
--cc=ping.bai@nxp.com \
--cc=rjw@rjwysocki.net \
--cc=viresh.kumar@linaro.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).