linux-pm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Rafael J. Wysocki" <rafael@kernel.org>
To: Anson Huang <anson.huang@nxp.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>,
	Peng Fan <peng.fan@nxp.com>,
	Viresh Kumar <viresh.kumar@linaro.org>,
	Jacky Bai <ping.bai@nxp.com>,
	"linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>
Subject: Re: About CPU hot-plug stress test failed in cpufreq driver
Date: Thu, 5 Dec 2019 16:52:37 +0100	[thread overview]
Message-ID: <CAJZ5v0j_op4vELkEQzJEuXij71p8SGSFSbo0m813W5Gq1ZdYoQ@mail.gmail.com> (raw)
In-Reply-To: <DB3PR0402MB39163BEC18FF81B06D1093BBF55C0@DB3PR0402MB3916.eurprd04.prod.outlook.com>

On Thu, Dec 5, 2019 at 2:18 PM Anson Huang <anson.huang@nxp.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Rafael J. Wysocki <rjw@rjwysocki.net>
> > Sent: Thursday, December 5, 2019 6:48 PM
> > To: Anson Huang <anson.huang@nxp.com>
> > Cc: Viresh Kumar <viresh.kumar@linaro.org>; Jacky Bai <ping.bai@nxp.com>;
> > linux-pm@vger.kernel.org
> > Subject: Re: About CPU hot-plug stress test failed in cpufreq driver
> >
> > On Thursday, December 5, 2019 9:53:20 AM CET Anson Huang wrote:
> > > Hi, Rafael
> > >     This issue is very weird, the irq_work used in cpufreq_governor.c is
> > very
> > >     simple, ONLY one entry to claim the irq_work, and
> > cpufreq_governor's irq_work
> > >     is a private irq_work structure, no other drivers use it. I added some
> > trace
> > >     event in cpufreq_governor.c and irq_work.c, every time, the issue
> > happened at
> > >     the point of CPU1/2/3 all off, and CPU1 start ON line, but when CPU1
> > tried to
> > >     sync the irq_work in cpufreq_dbs_governor_stop(), the irq_work
> > shows that
> > >     previous work is pending on CPU3 which is offline, I also had the
> > trace event
> > >     in irq_work_claim(),  but no any log shows the cpufreq_governor
> > irq_work is
> > >     claimed on CPU3 after CPU3 offline, below is the debug patch I added
> > and the
> > >     log on 2 consoles:
> > >     If I understand it correctly, the irq work used in cpufreq_governor
> > ONLY has
> > >     one entry of calling irq_work_queue() which will be ONLY claimed on
> > the CPU
> > >     calling the irq_work_queue(), but from trace result, I have NOT see
> > where
> > >     CPU3 could call irq_work_queue() after it finishes the irq work sync
> > before
> > >     offline.
> >
> > Right.
> >
> > Which means that this particular irq_work only runs on the CPU that has run
> > irq_work_queue() for it.
> >
> > >     Could it something wrong related to cache maintain during CPU
> > hotplug?
> >
> > I'm not sure what is going on, but I do agree that it is weird enough. :-)
> >
> > [cut]
> >
> > > LOG on console 1 which does CPU1/2/3 offline and online stress test:
> > > CPUHotplug: 4575 times remaining
> > > [ 1047.401185] CPU1: shutdown
> > > [ 1047.403917] psci: CPU1 killed.
> > > [ 1047.449153] CPU2: shutdown
> > > [ 1047.451880] psci: CPU2 killed.
> > > [ 1047.501131] CPU3: shutdown
> > > [ 1047.503857] psci: CPU3 killed.
> > > [ 1048.541939] Detected VIPT I-cache on CPU1 [ 1048.541983] GICv3:
> > > CPU1: found redistributor 1 region 0:0x0000000051b20000 [ 1048.542050]
> > > CPU1: Booted secondary processor 0x0000000001 [0x410fd042] [
> > > 1048.585024] Detected VIPT I-cache on CPU2 [ 1048.585061] GICv3: CPU2:
> > > found redistributor 2 region 0:0x0000000051b40000 [ 1048.585121] CPU2:
> > > Booted secondary processor 0x0000000002 [0x410fd042] [ 1048.645070]
> > > Detected VIPT I-cache on CPU3 [ 1048.645112] GICv3: CPU3: found
> > > redistributor 3 region 0:0x0000000051b60000 [ 1048.645181] CPU3:
> > > Booted secondary processor 0x0000000003 [0x410fd042]
> > > CPUHotplug: 4574 times remaining
> > > [ 1049.769187] CPU1: shutdown
> > > [ 1049.771913] psci: CPU1 killed.
> > > [ 1049.809126] CPU2: shutdown
> > > [ 1049.811856] psci: CPU2 killed.
> > > [ 1049.853135] CPU3: shutdown
> > > [ 1049.855868] psci: CPU3 killed.
> > >
> > > Waiting here forever.....
> > >
> > > LOG on console 2 which enables the trace events I added upper:
> > >              sed-4591  [003] d..4  1049.705561: cpu_frequency_irq_claim:
> > cpu_id=3, flag=3
> > >              sed-4591  [003] dNh1  1049.705604:
> > > cpu_frequency_irq_run_list: cpu_id=3, flag=0
> >
> > So here CPU3 runs an IRQ work, presumably the cpufreq governor's one.
> >
> > After that its raised_list should be empty and it doesn't claim any IRQ works
> > going forward.
> >
> > >           <idle>-0     [001] d.s2  1049.716308: cpu_frequency_irq_work:
> > cpu_id=1, cpu=-1
> > >           <idle>-0     [001] d.s2  1049.716319: cpu_frequency_irq_claim:
> > cpu_id=1, flag=3
> > >           <idle>-0     [001] dNH2  1049.716338: cpu_frequency_irq_run_list:
> > cpu_id=1, flag=0
> >
> > And now CPU1 runs the cpufreq governor IRQ work, so it sets work->cpu to 1
> > and then to -1 (when flushing raised_list).
> >
> > >           <idle>-0     [002] d.s2  1049.728303: cpu_frequency_irq_work:
> > cpu_id=2, cpu=-1
> > >           <idle>-0     [002] d.s2  1049.728307: cpu_frequency_irq_claim:
> > cpu_id=2, flag=3
> > >           <idle>-0     [002] dNH2  1049.728320: cpu_frequency_irq_run_list:
> > cpu_id=2, flag=0
> > >           <idle>-0     [001] d.s2  1049.740305: cpu_frequency_irq_work:
> > cpu_id=1, cpu=-1
> > >           <idle>-0     [001] d.s2  1049.740307: cpu_frequency_irq_claim:
> > cpu_id=1, flag=3
> > >           <idle>-0     [001] dNH2  1049.740319: cpu_frequency_irq_run_list:
> > cpu_id=1, flag=0
> > >           <idle>-0     [001] d.s2  1049.752305: cpu_frequency_irq_work:
> > cpu_id=1, cpu=-1
> > >           <idle>-0     [001] d.s2  1049.752307: cpu_frequency_irq_claim:
> > cpu_id=1, flag=3
> > >           <idle>-0     [001] dNH2  1049.752316: cpu_frequency_irq_run_list:
> > cpu_id=1, flag=0
> > >          cpuhp/1-13    [001] ....  1049.768340: cpu_frequency_irq_work_sync:
> > cpu_id=1, cpu=-1, flag=0
> > >          cpuhp/1-13    [001] d..4  1049.768681: cpu_frequency_irq_work:
> > cpu_id=1, cpu=-1
> > >          cpuhp/1-13    [001] d..4  1049.768683: cpu_frequency_irq_claim:
> > cpu_id=1, flag=3
> > >          cpuhp/1-13    [001] dNh1  1049.768698: cpu_frequency_irq_run_list:
> > cpu_id=1, flag=0
> > >      smp_test.sh-734   [000] ...1  1049.771903: cpu_frequency_irq_claim:
> > cpu_id=0, flag=7
> > >      smp_test.sh-734   [000] dNh1  1049.775009: cpu_frequency_irq_run_list:
> > cpu_id=0, flag=4
> > >      smp_test.sh-734   [000] ...1  1049.776084: cpu_frequency_irq_claim:
> > cpu_id=0, flag=7
> > >      smp_test.sh-734   [000] dNh.  1049.776392: cpu_frequency_irq_run_list:
> > cpu_id=0, flag=4
> > >      smp_test.sh-734   [000] d..2  1049.779093: cpu_frequency_irq_work:
> > cpu_id=0, cpu=-1
> > >      smp_test.sh-734   [000] d..2  1049.779103: cpu_frequency_irq_claim:
> > cpu_id=0, flag=3
> > >           <idle>-0     [000] dNh2  1049.779162: cpu_frequency_irq_run_list:
> > cpu_id=0, flag=0
> > >           <idle>-0     [000] d.s2  1049.792305: cpu_frequency_irq_work:
> > cpu_id=0, cpu=-1
> > >           <idle>-0     [000] d.s2  1049.792315: cpu_frequency_irq_claim:
> > cpu_id=0, flag=3
> > >           <idle>-0     [000] dNH2  1049.792329: cpu_frequency_irq_run_list:
> > cpu_id=0, flag=0
> > >          cpuhp/2-18    [002] ....  1049.808315: cpu_frequency_irq_work_sync:
> > cpu_id=2, cpu=-1, flag=0
> > >          cpuhp/2-18    [002] d..4  1049.808642: cpu_frequency_irq_work:
> > cpu_id=2, cpu=-1
> > >          cpuhp/2-18    [002] d..4  1049.808645: cpu_frequency_irq_claim:
> > cpu_id=2, flag=3
> > >          cpuhp/2-18    [002] dNh1  1049.808658: cpu_frequency_irq_run_list:
> > cpu_id=2, flag=0
> > >      smp_test.sh-734   [000] ...1  1049.811848: cpu_frequency_irq_claim:
> > cpu_id=0, flag=7
> > >      smp_test.sh-734   [000] dNh1  1049.814949: cpu_frequency_irq_run_list:
> > cpu_id=0, flag=4
> > >      smp_test.sh-734   [000] ...1  1049.815988: cpu_frequency_irq_claim:
> > cpu_id=0, flag=7
> > >      smp_test.sh-734   [000] dNh1  1049.816321: cpu_frequency_irq_run_list:
> > cpu_id=0, flag=4
> > >      smp_test.sh-734   [000] d..3  1049.818936: cpu_frequency_irq_work:
> > cpu_id=0, cpu=-1
> > >      smp_test.sh-734   [000] d..3  1049.818946: cpu_frequency_irq_claim:
> > cpu_id=0, flag=3
> > >      smp_test.sh-734   [000] dNh2  1049.818973: cpu_frequency_irq_run_list:
> > cpu_id=0, flag=0
> > >           <idle>-0     [000] d.s4  1049.832308: cpu_frequency_irq_work:
> > cpu_id=0, cpu=-1
> > >           <idle>-0     [000] d.s4  1049.832317: cpu_frequency_irq_claim:
> > cpu_id=0, flag=3
> > >           <idle>-0     [000] dNH3  1049.832332: cpu_frequency_irq_run_list:
> > cpu_id=0, flag=0
> > >          cpuhp/3-23    [003] ....  1049.852314: cpu_frequency_irq_work_sync:
> > cpu_id=3, cpu=-1, flag=0
> > >
> > > [Anson] when CPU3 offline, the irq work sync is successfully, no irq
> > > work pending any more;
> > >
> > >      smp_test.sh-734   [000] ...1  1049.855859: cpu_frequency_irq_claim:
> > cpu_id=0, flag=7
> > >      smp_test.sh-734   [000] dNh1  1049.858958: cpu_frequency_irq_run_list:
> > cpu_id=0, flag=4
> > >      smp_test.sh-734   [000] ...1  1049.859990: cpu_frequency_irq_claim:
> > cpu_id=0, flag=7
> > >      smp_test.sh-734   [000] dNh.  1049.860346: cpu_frequency_irq_run_list:
> > cpu_id=0, flag=4
> > >           <idle>-0     [001] d.h1  1050.896329: cpu_frequency_irq_run_list:
> > cpu_id=1, flag=4
> > >          cpuhp/1-13    [001] ....  1050.916319: cpu_frequency_irq_work_sync:
> > cpu_id=1, cpu=3, flag=3
> > >
> > > [Anson] we can see when CPU1 start online and tried to sync irq work,
> > > found it is pending on CPU3 which is offline, and in this period, no
> > > irq work claimed by cpufreq_governor,
> >
> > So I'm wondering how it is possible at all that work->cpu value is 3 at this
> > point.
> >
> > The last CPU that wrote to work->cpu was CPU0 and the written value was -1,
> > and
> > CPU3 saw that value when it was running irq_work_sync().
> >
> > There is no sane way by which work->cpu can be equal to 3 from CPU1's
> > perspective, because the last value written to it by CPU1 itself was -1 and the
> > last value written to it by any other CPU also was -1.
> >
> > Moreover, after CPU3 had updated it last time (and the last value written to
> > it by CPU3 had been -1), other CPUs, *including* CPU1, updated it too (and
> > that for multiple times).
> >
> > So the only theory that can explain why CPU1 sees 3 in there when it is going
> > online appears to be some silent memory corruption.
> >
> > That said, have you tried to make the READ_ONCE() change suggested a
> > while ago?
>
> Below patch does NOT work using READ_ONCE() if I did the change correctly:

OK

> @@ -212,7 +208,7 @@ void irq_work_sync(struct irq_work *work)
>  {
>         lockdep_assert_irqs_enabled();
>
> -       while (work->flags & IRQ_WORK_BUSY)
> +       while (READ_ONCE(work->flags) & IRQ_WORK_BUSY)

You also may try using test_bit() instead of the raw read, but anyway
at this point I would start talking to the arch/HW people if I were
you.

>                 cpu_relax();
>  }
>
> LOG:
> CPUHotplug: 4937 times remaining
> [  214.837047] CPU1: shutdown
> [  214.839781] psci: CPU1 killed.
> [  214.877041] CPU2: shutdown
> [  214.879767] psci: CPU2 killed.
> [  214.917026] CPU3: shutdown
> [  214.919758] psci: CPU3 killed.
> [  215.957816] Detected VIPT I-cache on CPU1
> [  215.957860] GICv3: CPU1: found redistributor 1 region 0:0x0000000051b20000
> [  215.957930] CPU1: Booted secondary processor 0x0000000001 [0x410fd042]
> [  216.001025] Detected VIPT I-cache on CPU2
> [  216.001064] GICv3: CPU2: found redistributor 2 region 0:0x0000000051b40000
> [  216.001126] CPU2: Booted secondary processor 0x0000000002 [0x410fd042]
> [  216.068960] Detected VIPT I-cache on CPU3
> [  216.069004] GICv3: CPU3: found redistributor 3 region 0:0x0000000051b60000
> [  216.069076] CPU3: Booted secondary processor 0x0000000003 [0x410fd042]
> CPUHotplug: 4936 times remaining
> [  217.201055] CPU1: shutdown
> [  217.203779] psci: CPU1 killed.
> [  400.506869] audit: type=1006 audit(1573738201.312:3): pid=1332 uid=0 old-aui1
> [ 4000.600430] audit: type=1006 audit(1573741801.408:4): pid=1352 uid=0 old-aui1
> [ 7600.687496] audit: type=1006 audit(1573745401.492:5): pid=1371 uid=0 old-aui1
>
>
>          cpuhp/1-13    [001] ....   217.200231: cpu_frequency_irq_work_sync: cpu_id=1, cpu=-1, flag=0
>      smp_test.sh-741   [002] ...1   217.203770: cpu_frequency_irq_claim: cpu_id=2, flag=7
>      smp_test.sh-741   [002] d.h1   217.206873: cpu_frequency_irq_run_list: cpu_id=2, flag=4
>      smp_test.sh-741   [002] ...1   217.206893: cpu_frequency_irq_claim: cpu_id=2, flag=7
>      smp_test.sh-741   [002] dNh.   217.208222: cpu_frequency_irq_run_list: cpu_id=2, flag=4
>          cpuhp/2-18    [002] ....   217.248206: cpu_frequency_irq_work_sync: cpu_id=2, cpu=1, flag=3
>
> [Anson] this time, the irq work is pending on CPU1 which is offline.
>
>          kauditd-31    [000] ...1   400.519304: cpu_frequency_irq_claim: cpu_id=0, flag=7
>           <idle>-0     [000] dNh1   400.520231: cpu_frequency_irq_run_list: cpu_id=0, flag=4
>          kauditd-31    [003] ...1  4000.612845: cpu_frequency_irq_claim: cpu_id=3, flag=7
>            crond-1352  [003] d.h.  4000.616221: cpu_frequency_irq_run_list: cpu_id=3, flag=4
>          kauditd-31    [000] ...1  7600.699988: cpu_frequency_irq_claim: cpu_id=0, flag=7
>           <idle>-0     [000] dNh1  7600.700205: cpu_frequency_irq_run_list: cpu_id=0, flag=4
> root@imx8qxpmek:~#
>
>

  reply	other threads:[~2019-12-05 15:52 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <DB3PR0402MB391626A8ECFDC182C6EDCF8DF54E0@DB3PR0402MB3916.eurprd04.prod.outlook.com>
2019-11-21  9:35 ` About CPU hot-plug stress test failed in cpufreq driver Viresh Kumar
2019-11-21 10:13   ` Anson Huang
2019-11-21 10:53     ` Rafael J. Wysocki
2019-11-21 10:56       ` Rafael J. Wysocki
2019-11-22  5:15         ` Anson Huang
2019-11-22  9:59           ` Rafael J. Wysocki
2019-11-25  6:05             ` Anson Huang
2019-11-25  9:43               ` Anson Huang
2019-11-26  6:18                 ` Viresh Kumar
2019-11-26  8:22                   ` Anson Huang
2019-11-26  8:25                     ` Viresh Kumar
2019-11-25 12:44               ` Rafael J. Wysocki
2019-11-26  8:57                 ` Rafael J. Wysocki
2019-11-29 11:39                 ` Rafael J. Wysocki
2019-11-29 13:44                   ` Anson Huang
2019-12-05  8:53                     ` Anson Huang
2019-12-05 10:48                       ` Rafael J. Wysocki
2019-12-05 13:18                         ` Anson Huang
2019-12-05 15:52                           ` Rafael J. Wysocki [this message]
2019-12-09 10:31                             ` Peng Fan
2019-12-09 10:37                             ` Anson Huang
2019-12-09 10:56                               ` Anson Huang
2019-12-09 11:23                                 ` Rafael J. Wysocki
2019-12-09 12:32                                   ` Anson Huang
2019-12-09 12:44                                     ` Rafael J. Wysocki
2019-12-09 14:18                                       ` Anson Huang
2019-12-10  5:39                                         ` Anson Huang
2019-12-10  5:53                                       ` Peng Fan
2019-12-10  7:05                                         ` Viresh Kumar
2019-12-10  8:22                                           ` Rafael J. Wysocki
2019-12-10  8:29                                             ` Anson Huang
2019-12-10  8:36                                               ` Viresh Kumar
2019-12-10  8:37                                                 ` Peng Fan
2019-12-10  8:37                                               ` Rafael J. Wysocki
2019-12-10  8:43                                                 ` Peng Fan
2019-12-10  8:45                                                 ` Anson Huang
2019-12-10  8:50                                                   ` Rafael J. Wysocki
2019-12-10  8:51                                                     ` Anson Huang
2019-12-10 10:39                                                       ` Rafael J. Wysocki
2019-12-10 10:54                                                         ` Rafael J. Wysocki
2019-12-11  5:08                                                           ` Anson Huang
2019-12-11  8:59                                                           ` Peng Fan
2019-12-11  9:36                                                             ` Rafael J. Wysocki
2019-12-11  9:43                                                               ` Peng Fan
2019-12-11  9:52                                                                 ` Rafael J. Wysocki
2019-12-11 10:11                                                                   ` Peng Fan
2019-12-10 10:54                                                         ` Viresh Kumar
2019-12-10 11:07                                                           ` Rafael J. Wysocki
2019-12-10  8:57                                                     ` Viresh Kumar
2019-12-10 11:03                                                       ` Rafael J. Wysocki
2019-12-10  9:04                                                     ` Rafael J. Wysocki
2019-12-10  8:31                                             ` Viresh Kumar
2019-12-10  8:12                                         ` Rafael J. Wysocki
2019-12-05 11:00                       ` Viresh Kumar
2019-12-05 11:10                         ` Rafael J. Wysocki
2019-12-05 11:17                           ` Viresh Kumar
2019-11-21 10:37   ` Rafael J. Wysocki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJZ5v0j_op4vELkEQzJEuXij71p8SGSFSbo0m813W5Gq1ZdYoQ@mail.gmail.com \
    --to=rafael@kernel.org \
    --cc=anson.huang@nxp.com \
    --cc=linux-pm@vger.kernel.org \
    --cc=peng.fan@nxp.com \
    --cc=ping.bai@nxp.com \
    --cc=rjw@rjwysocki.net \
    --cc=viresh.kumar@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).