From: "Rafael J. Wysocki" <rjw@rjwysocki.net>
To: Anson Huang <anson.huang@nxp.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>,
Jacky Bai <ping.bai@nxp.com>,
"linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>
Subject: Re: About CPU hot-plug stress test failed in cpufreq driver
Date: Thu, 05 Dec 2019 11:48:22 +0100 [thread overview]
Message-ID: <4087016.QifdzW7851@kreacher> (raw)
In-Reply-To: <DB3PR0402MB39165E1B832597ADBAB241AAF55C0@DB3PR0402MB3916.eurprd04.prod.outlook.com>
On Thursday, December 5, 2019 9:53:20 AM CET Anson Huang wrote:
> Hi, Rafael
> This issue is very weird, the irq_work used in cpufreq_governor.c is very
> simple, ONLY one entry to claim the irq_work, and cpufreq_governor's irq_work
> is a private irq_work structure, no other drivers use it. I added some trace
> event in cpufreq_governor.c and irq_work.c, every time, the issue happened at
> the point of CPU1/2/3 all off, and CPU1 start ON line, but when CPU1 tried to
> sync the irq_work in cpufreq_dbs_governor_stop(), the irq_work shows that
> previous work is pending on CPU3 which is offline, I also had the trace event
> in irq_work_claim(), but no any log shows the cpufreq_governor irq_work is
> claimed on CPU3 after CPU3 offline, below is the debug patch I added and the
> log on 2 consoles:
> If I understand it correctly, the irq work used in cpufreq_governor ONLY has
> one entry of calling irq_work_queue() which will be ONLY claimed on the CPU
> calling the irq_work_queue(), but from trace result, I have NOT see where
> CPU3 could call irq_work_queue() after it finishes the irq work sync before
> offline.
Right.
Which means that this particular irq_work only runs on the CPU that has
run irq_work_queue() for it.
> Could it something wrong related to cache maintain during CPU hotplug?
I'm not sure what is going on, but I do agree that it is weird enough. :-)
[cut]
> LOG on console 1 which does CPU1/2/3 offline and online stress test:
> CPUHotplug: 4575 times remaining
> [ 1047.401185] CPU1: shutdown
> [ 1047.403917] psci: CPU1 killed.
> [ 1047.449153] CPU2: shutdown
> [ 1047.451880] psci: CPU2 killed.
> [ 1047.501131] CPU3: shutdown
> [ 1047.503857] psci: CPU3 killed.
> [ 1048.541939] Detected VIPT I-cache on CPU1
> [ 1048.541983] GICv3: CPU1: found redistributor 1 region 0:0x0000000051b20000
> [ 1048.542050] CPU1: Booted secondary processor 0x0000000001 [0x410fd042]
> [ 1048.585024] Detected VIPT I-cache on CPU2
> [ 1048.585061] GICv3: CPU2: found redistributor 2 region 0:0x0000000051b40000
> [ 1048.585121] CPU2: Booted secondary processor 0x0000000002 [0x410fd042]
> [ 1048.645070] Detected VIPT I-cache on CPU3
> [ 1048.645112] GICv3: CPU3: found redistributor 3 region 0:0x0000000051b60000
> [ 1048.645181] CPU3: Booted secondary processor 0x0000000003 [0x410fd042]
> CPUHotplug: 4574 times remaining
> [ 1049.769187] CPU1: shutdown
> [ 1049.771913] psci: CPU1 killed.
> [ 1049.809126] CPU2: shutdown
> [ 1049.811856] psci: CPU2 killed.
> [ 1049.853135] CPU3: shutdown
> [ 1049.855868] psci: CPU3 killed.
>
> Waiting here forever.....
>
> LOG on console 2 which enables the trace events I added upper:
> sed-4591 [003] d..4 1049.705561: cpu_frequency_irq_claim: cpu_id=3, flag=3
> sed-4591 [003] dNh1 1049.705604: cpu_frequency_irq_run_list: cpu_id=3, flag=0
So here CPU3 runs an IRQ work, presumably the cpufreq governor's one.
After that its raised_list should be empty and it doesn't claim any IRQ works going
forward.
> <idle>-0 [001] d.s2 1049.716308: cpu_frequency_irq_work: cpu_id=1, cpu=-1
> <idle>-0 [001] d.s2 1049.716319: cpu_frequency_irq_claim: cpu_id=1, flag=3
> <idle>-0 [001] dNH2 1049.716338: cpu_frequency_irq_run_list: cpu_id=1, flag=0
And now CPU1 runs the cpufreq governor IRQ work, so it sets work->cpu to 1 and
then to -1 (when flushing raised_list).
> <idle>-0 [002] d.s2 1049.728303: cpu_frequency_irq_work: cpu_id=2, cpu=-1
> <idle>-0 [002] d.s2 1049.728307: cpu_frequency_irq_claim: cpu_id=2, flag=3
> <idle>-0 [002] dNH2 1049.728320: cpu_frequency_irq_run_list: cpu_id=2, flag=0
> <idle>-0 [001] d.s2 1049.740305: cpu_frequency_irq_work: cpu_id=1, cpu=-1
> <idle>-0 [001] d.s2 1049.740307: cpu_frequency_irq_claim: cpu_id=1, flag=3
> <idle>-0 [001] dNH2 1049.740319: cpu_frequency_irq_run_list: cpu_id=1, flag=0
> <idle>-0 [001] d.s2 1049.752305: cpu_frequency_irq_work: cpu_id=1, cpu=-1
> <idle>-0 [001] d.s2 1049.752307: cpu_frequency_irq_claim: cpu_id=1, flag=3
> <idle>-0 [001] dNH2 1049.752316: cpu_frequency_irq_run_list: cpu_id=1, flag=0
> cpuhp/1-13 [001] .... 1049.768340: cpu_frequency_irq_work_sync: cpu_id=1, cpu=-1, flag=0
> cpuhp/1-13 [001] d..4 1049.768681: cpu_frequency_irq_work: cpu_id=1, cpu=-1
> cpuhp/1-13 [001] d..4 1049.768683: cpu_frequency_irq_claim: cpu_id=1, flag=3
> cpuhp/1-13 [001] dNh1 1049.768698: cpu_frequency_irq_run_list: cpu_id=1, flag=0
> smp_test.sh-734 [000] ...1 1049.771903: cpu_frequency_irq_claim: cpu_id=0, flag=7
> smp_test.sh-734 [000] dNh1 1049.775009: cpu_frequency_irq_run_list: cpu_id=0, flag=4
> smp_test.sh-734 [000] ...1 1049.776084: cpu_frequency_irq_claim: cpu_id=0, flag=7
> smp_test.sh-734 [000] dNh. 1049.776392: cpu_frequency_irq_run_list: cpu_id=0, flag=4
> smp_test.sh-734 [000] d..2 1049.779093: cpu_frequency_irq_work: cpu_id=0, cpu=-1
> smp_test.sh-734 [000] d..2 1049.779103: cpu_frequency_irq_claim: cpu_id=0, flag=3
> <idle>-0 [000] dNh2 1049.779162: cpu_frequency_irq_run_list: cpu_id=0, flag=0
> <idle>-0 [000] d.s2 1049.792305: cpu_frequency_irq_work: cpu_id=0, cpu=-1
> <idle>-0 [000] d.s2 1049.792315: cpu_frequency_irq_claim: cpu_id=0, flag=3
> <idle>-0 [000] dNH2 1049.792329: cpu_frequency_irq_run_list: cpu_id=0, flag=0
> cpuhp/2-18 [002] .... 1049.808315: cpu_frequency_irq_work_sync: cpu_id=2, cpu=-1, flag=0
> cpuhp/2-18 [002] d..4 1049.808642: cpu_frequency_irq_work: cpu_id=2, cpu=-1
> cpuhp/2-18 [002] d..4 1049.808645: cpu_frequency_irq_claim: cpu_id=2, flag=3
> cpuhp/2-18 [002] dNh1 1049.808658: cpu_frequency_irq_run_list: cpu_id=2, flag=0
> smp_test.sh-734 [000] ...1 1049.811848: cpu_frequency_irq_claim: cpu_id=0, flag=7
> smp_test.sh-734 [000] dNh1 1049.814949: cpu_frequency_irq_run_list: cpu_id=0, flag=4
> smp_test.sh-734 [000] ...1 1049.815988: cpu_frequency_irq_claim: cpu_id=0, flag=7
> smp_test.sh-734 [000] dNh1 1049.816321: cpu_frequency_irq_run_list: cpu_id=0, flag=4
> smp_test.sh-734 [000] d..3 1049.818936: cpu_frequency_irq_work: cpu_id=0, cpu=-1
> smp_test.sh-734 [000] d..3 1049.818946: cpu_frequency_irq_claim: cpu_id=0, flag=3
> smp_test.sh-734 [000] dNh2 1049.818973: cpu_frequency_irq_run_list: cpu_id=0, flag=0
> <idle>-0 [000] d.s4 1049.832308: cpu_frequency_irq_work: cpu_id=0, cpu=-1
> <idle>-0 [000] d.s4 1049.832317: cpu_frequency_irq_claim: cpu_id=0, flag=3
> <idle>-0 [000] dNH3 1049.832332: cpu_frequency_irq_run_list: cpu_id=0, flag=0
> cpuhp/3-23 [003] .... 1049.852314: cpu_frequency_irq_work_sync: cpu_id=3, cpu=-1, flag=0
>
> [Anson] when CPU3 offline, the irq work sync is successfully, no irq work pending any more;
>
> smp_test.sh-734 [000] ...1 1049.855859: cpu_frequency_irq_claim: cpu_id=0, flag=7
> smp_test.sh-734 [000] dNh1 1049.858958: cpu_frequency_irq_run_list: cpu_id=0, flag=4
> smp_test.sh-734 [000] ...1 1049.859990: cpu_frequency_irq_claim: cpu_id=0, flag=7
> smp_test.sh-734 [000] dNh. 1049.860346: cpu_frequency_irq_run_list: cpu_id=0, flag=4
> <idle>-0 [001] d.h1 1050.896329: cpu_frequency_irq_run_list: cpu_id=1, flag=4
> cpuhp/1-13 [001] .... 1050.916319: cpu_frequency_irq_work_sync: cpu_id=1, cpu=3, flag=3
>
> [Anson] we can see when CPU1 start online and tried to sync irq work, found
> it is pending on CPU3 which is offline, and in this period, no irq work
> claimed by cpufreq_governor,
So I'm wondering how it is possible at all that work->cpu value is 3 at this
point.
The last CPU that wrote to work->cpu was CPU0 and the written value was -1, and
CPU3 saw that value when it was running irq_work_sync().
There is no sane way by which work->cpu can be equal to 3 from CPU1's perspective,
because the last value written to it by CPU1 itself was -1 and the last value
written to it by any other CPU also was -1.
Moreover, after CPU3 had updated it last time (and the last value written to it
by CPU3 had been -1), other CPUs, *including* CPU1, updated it too (and that
for multiple times).
So the only theory that can explain why CPU1 sees 3 in there when it is going
online appears to be some silent memory corruption.
That said, have you tried to make the READ_ONCE() change suggested a while ago?
next prev parent reply other threads:[~2019-12-05 10:48 UTC|newest]
Thread overview: 57+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <DB3PR0402MB391626A8ECFDC182C6EDCF8DF54E0@DB3PR0402MB3916.eurprd04.prod.outlook.com>
2019-11-21 9:35 ` About CPU hot-plug stress test failed in cpufreq driver Viresh Kumar
2019-11-21 10:13 ` Anson Huang
2019-11-21 10:53 ` Rafael J. Wysocki
2019-11-21 10:56 ` Rafael J. Wysocki
2019-11-22 5:15 ` Anson Huang
2019-11-22 9:59 ` Rafael J. Wysocki
2019-11-25 6:05 ` Anson Huang
2019-11-25 9:43 ` Anson Huang
2019-11-26 6:18 ` Viresh Kumar
2019-11-26 8:22 ` Anson Huang
2019-11-26 8:25 ` Viresh Kumar
2019-11-25 12:44 ` Rafael J. Wysocki
2019-11-26 8:57 ` Rafael J. Wysocki
2019-11-29 11:39 ` Rafael J. Wysocki
2019-11-29 13:44 ` Anson Huang
2019-12-05 8:53 ` Anson Huang
2019-12-05 10:48 ` Rafael J. Wysocki [this message]
2019-12-05 13:18 ` Anson Huang
2019-12-05 15:52 ` Rafael J. Wysocki
2019-12-09 10:31 ` Peng Fan
2019-12-09 10:37 ` Anson Huang
2019-12-09 10:56 ` Anson Huang
2019-12-09 11:23 ` Rafael J. Wysocki
2019-12-09 12:32 ` Anson Huang
2019-12-09 12:44 ` Rafael J. Wysocki
2019-12-09 14:18 ` Anson Huang
2019-12-10 5:39 ` Anson Huang
2019-12-10 5:53 ` Peng Fan
2019-12-10 7:05 ` Viresh Kumar
2019-12-10 8:22 ` Rafael J. Wysocki
2019-12-10 8:29 ` Anson Huang
2019-12-10 8:36 ` Viresh Kumar
2019-12-10 8:37 ` Peng Fan
2019-12-10 8:37 ` Rafael J. Wysocki
2019-12-10 8:43 ` Peng Fan
2019-12-10 8:45 ` Anson Huang
2019-12-10 8:50 ` Rafael J. Wysocki
2019-12-10 8:51 ` Anson Huang
2019-12-10 10:39 ` Rafael J. Wysocki
2019-12-10 10:54 ` Rafael J. Wysocki
2019-12-11 5:08 ` Anson Huang
2019-12-11 8:59 ` Peng Fan
2019-12-11 9:36 ` Rafael J. Wysocki
2019-12-11 9:43 ` Peng Fan
2019-12-11 9:52 ` Rafael J. Wysocki
2019-12-11 10:11 ` Peng Fan
2019-12-10 10:54 ` Viresh Kumar
2019-12-10 11:07 ` Rafael J. Wysocki
2019-12-10 8:57 ` Viresh Kumar
2019-12-10 11:03 ` Rafael J. Wysocki
2019-12-10 9:04 ` Rafael J. Wysocki
2019-12-10 8:31 ` Viresh Kumar
2019-12-10 8:12 ` Rafael J. Wysocki
2019-12-05 11:00 ` Viresh Kumar
2019-12-05 11:10 ` Rafael J. Wysocki
2019-12-05 11:17 ` Viresh Kumar
2019-11-21 10:37 ` Rafael J. Wysocki
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4087016.QifdzW7851@kreacher \
--to=rjw@rjwysocki.net \
--cc=anson.huang@nxp.com \
--cc=linux-pm@vger.kernel.org \
--cc=ping.bai@nxp.com \
--cc=viresh.kumar@linaro.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).