linuxppc-dev.lists.ozlabs.org archive mirror
* missing doorbell interrupt when onlining cpu
@ 2019-09-04 22:51 Nathan Lynch
  2019-09-04 23:18 ` Nathan Lynch
From: Nathan Lynch @ 2019-09-04 22:51 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Gautham R Shenoy, Nicholas Piggin

I'm hoping for some help investigating a behavior I see when doing cpu
hotplug under load on P9 and P8 LPARs. Occasionally a cpu that is coming
online appears to get "stuck" in idle, with a pending doorbell interrupt
left unserviced (cpu 12 here):

cpuhp/12-70    [012] 46133.602202: cpuhp_enter:          cpu: 0012 target: 205 step: 174 (0xc000000000028920s)
 load.sh-8201  [014] 46133.602248: sched_waking:         comm=cpuhp/12 pid=70 prio=120 target_cpu=012
 load.sh-8201  [014] 46133.602251: smp_send_reschedule:  (c000000000052868) cpu=12
  <idle>-0     [012] 46133.602252: do_idle:              (c000000000162e08)
 load.sh-8201  [014] 46133.602252: smp_muxed_ipi_message_pass: (c0000000000527e8) cpu=12 msg=1
 load.sh-8201  [014] 46133.602253: doorbell_core_ipi:    (c00000000004d3e8) cpu=12
  <idle>-0     [012] 46133.602257: arch_cpu_idle:        (c000000000022d08)
  <idle>-0     [012] 46133.602259: pseries_lpar_idle:    (c0000000000d43c8)

This leaves the task initiating the online blocked in a state like this:

[<0>] __switch_to+0x2dc/0x430
[<0>] __cpuhp_kick_ap+0x78/0xa0
[<0>] cpuhp_kick_ap+0x60/0xf0
[<0>] cpuhp_invoke_callback+0xf4/0x780
[<0>] _cpu_up+0x138/0x260
[<0>] do_cpu_up+0x130/0x160
[<0>] cpu_subsys_online+0x68/0xe0
[<0>] device_online+0xb4/0x120
[<0>] online_store+0xb4/0xc0
[<0>] dev_attr_store+0x3c/0x60
[<0>] sysfs_kf_write+0x70/0xb0
[<0>] kernfs_fop_write+0x17c/0x250
[<0>] __vfs_write+0x40/0x80
[<0>] vfs_write+0xd4/0x250
[<0>] ksys_write+0x74/0x130
[<0>] system_call+0x5c/0x70

This trace is from a 5.2.10 kernel, and I've observed the problem on a
4.12 vendor kernel as well.
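
For anyone who wants to scan a longer capture for the same pattern, here
is a rough sketch of a filter over trace text in the format shown above.
It is not part of the original report. The function name and the
"idle entry" heuristic are my assumptions: a doorbell counts as
unanswered if the target cpu logs nothing afterwards except do_idle,
arch_cpu_idle, or pseries_lpar_idle.

```python
import re

# Matches lines such as:
#   load.sh-8201  [014] 46133.602253: doorbell_core_ipi:    (c00000000004d3e8) cpu=12
LINE_RE = re.compile(
    r"^\s*(?P<comm>.+?)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<event>\w+):\s*(?P<rest>.*)$"
)
TARGET_RE = re.compile(r"\bcpu=(\d+)")

# Assumption: these probe names only show the cpu heading into idle,
# so they do not count as evidence that a doorbell was serviced.
IDLE_EVENTS = {"do_idle", "arch_cpu_idle", "pseries_lpar_idle"}


def unanswered_doorbells(lines):
    """Return {target_cpu: timestamp} for doorbells with no later
    non-idle event logged on the target cpu (hypothetical helper)."""
    pending = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip anything that is not a trace record
        cpu = int(m.group("cpu"))
        event = m.group("event")
        if event not in IDLE_EVENTS:
            # The cpu did real work, so an earlier doorbell got through.
            pending.pop(cpu, None)
        if event == "doorbell_core_ipi":
            target = TARGET_RE.search(m.group("rest"))
            if target:
                pending[int(target.group(1))] = m.group("ts")
    return pending
```

Fed the eight trace lines above, this flags cpu 12 with the timestamp of
the doorbell_core_ipi record. It only sees what the probes log, so an
empty result on a short capture does not prove much.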

The issue always occurs before the cpu has completed all the cpuhp
callbacks that need to run on that cpu. Often it occurs before it even
runs a task (rcu_sched, migration, or cpuhp kthreads are the first to
run). But sometimes it will have run a task or two, as in this case.

It seems specific to doorbell (i.e. intra-core) IPIs; I have not observed
IPIs between cores getting dropped.
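
To check whether a given sender/target pair falls in the intra-core case,
something like the following sketch could help. It is not from the
original report. It assumes the sysfs file
topology/thread_siblings_list describes the set of cpus reachable by
doorbell, which I have not verified for every LPAR configuration. The
helper names are made up for illustration.

```python
def parse_cpu_list(text):
    """Expand a kernel cpu list such as "8-15" or "0,2-3" into a set."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def same_core(sender, target, sysfs_root="/sys/devices/system/cpu"):
    """True if `sender` is listed as a thread sibling of `target`.

    Assumption: thread siblings are exactly the cpus that would be
    reached with a doorbell rather than a cross-core IPI.
    """
    path = f"{sysfs_root}/cpu{target}/topology/thread_siblings_list"
    with open(path) as f:
        return sender in parse_cpu_list(f.read())
```

For the trace above the call would be same_core(14, 12). The sysfs_root
parameter exists only so the helper can be pointed at a copied directory
tree when testing away from the affected machine.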

sysrq-l gets the newly onlined cpu unstuck.

The cpu can get in this state even after servicing doorbells earlier in
the online process.

This is using the default cede offline state, not stop-self (which I
haven't tried).
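
For reference, the online-then-nudge sequence could be scripted roughly
as below. This is an illustrative sketch and not the load script used for
the trace. The function name, the timeout value, and the use of a helper
thread are all assumptions. It writes "1" to the cpu's sysfs online file
(the write that ends up blocked in __cpuhp_kick_ap in the stack above)
and, if that has not returned within the timeout, writes "l" to
/proc/sysrq-trigger. It needs root on a real system.

```python
import threading


def online_cpu(cpu, timeout=10.0,
               sysfs_root="/sys/devices/system/cpu",
               sysrq_path="/proc/sysrq-trigger"):
    """Online `cpu`; return True if it completed without a sysrq-l nudge.

    Hypothetical helper: the paths are parameters so it can be pointed
    at scratch files when trying it out.
    """
    online_path = f"{sysfs_root}/cpu{cpu}/online"

    def write_online():
        # This write blocks for as long as the hotplug operation takes.
        with open(online_path, "w") as f:
            f.write("1")

    worker = threading.Thread(target=write_online, daemon=True)
    worker.start()
    worker.join(timeout)
    if not worker.is_alive():
        return True

    # The online attempt looks stuck: request backtraces on all cpus,
    # which in my testing gets the new cpu out of idle.
    with open(sysrq_path, "w") as f:
        f.write("l")
    worker.join(timeout)
    return False
```

A False return means only that the write took longer than the timeout,
so the timeout should be generous on a heavily loaded system.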

Ideas?

