linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
From: Nathan Lynch <nathanl@linux.ibm.com>
To: linuxppc-dev@lists.ozlabs.org
Cc: Gautham R Shenoy <ego.lkml@gmail.com>,
	Nicholas Piggin <npiggin@gmail.com>
Subject: missing doorbell interrupt when onlining cpu
Date: Wed, 04 Sep 2019 17:51:19 -0500	[thread overview]
Message-ID: <87zhjjr7yw.fsf@linux.ibm.com> (raw)

I'm hoping for some help investigating a behavior I see when doing cpu
hotplug under load on P9 and P8 LPARs. Occasionally, while coming online
a cpu will seem to get "stuck" in idle, with a pending doorbell
interrupt unserviced (cpu 12 here):

cpuhp/12-70    [012] 46133.602202: cpuhp_enter:          cpu: 0012 target: 205 step: 174 (0xc000000000028920s)
 load.sh-8201  [014] 46133.602248: sched_waking:         comm=cpuhp/12 pid=70 prio=120 target_cpu=012
 load.sh-8201  [014] 46133.602251: smp_send_reschedule:  (c000000000052868) cpu=12
  <idle>-0     [012] 46133.602252: do_idle:              (c000000000162e08)
 load.sh-8201  [014] 46133.602252: smp_muxed_ipi_message_pass: (c0000000000527e8) cpu=12 msg=1
 load.sh-8201  [014] 46133.602253: doorbell_core_ipi:    (c00000000004d3e8) cpu=12
  <idle>-0     [012] 46133.602257: arch_cpu_idle:        (c000000000022d08)
  <idle>-0     [012] 46133.602259: pseries_lpar_idle:    (c0000000000d43c8)

This leaves the task initiating the online blocked in a state like this:

[<0>] __switch_to+0x2dc/0x430
[<0>] __cpuhp_kick_ap+0x78/0xa0
[<0>] cpuhp_kick_ap+0x60/0xf0
[<0>] cpuhp_invoke_callback+0xf4/0x780
[<0>] _cpu_up+0x138/0x260
[<0>] do_cpu_up+0x130/0x160
[<0>] cpu_subsys_online+0x68/0xe0
[<0>] device_online+0xb4/0x120
[<0>] online_store+0xb4/0xc0
[<0>] dev_attr_store+0x3c/0x60
[<0>] sysfs_kf_write+0x70/0xb0
[<0>] kernfs_fop_write+0x17c/0x250
[<0>] __vfs_write+0x40/0x80
[<0>] vfs_write+0xd4/0x250
[<0>] ksys_write+0x74/0x130
[<0>] system_call+0x5c/0x70

This trace is from a 5.2.10 kernel, and I've observed the problem on a
4.12 vendor kernel as well.

The issue always occurs before the cpu has completed all the cpuhp
callbacks that need to run on that cpu. Often it occurs before it even
runs a task (rcu_sched, migration, or cpuhp kthreads are the first to
run). But sometimes it will have run a task or two, as in this case.

It seems specific to doorbell i.e. intra-core IPIs; I have not observed
IPIs between cores getting dropped.

sysrq-l gets the newly onlined cpu unstuck.

The cpu can get in this state even after servicing doorbells earlier in
the online process.

This is using the default cede offline state, not stop-self (which I
haven't tried).

Ideas?

             reply	other threads:[~2019-09-04 22:53 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-04 22:51 Nathan Lynch [this message]
2019-09-04 23:18 ` missing doorbell interrupt when onlining cpu Nathan Lynch
2019-09-10 23:04   ` Nathan Lynch

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87zhjjr7yw.fsf@linux.ibm.com \
    --to=nathanl@linux.ibm.com \
    --cc=ego.lkml@gmail.com \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=npiggin@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).