All of lore.kernel.org
 help / color / mirror / Atom feed
* Perf hotplug lockup in v4.9-rc8
@ 2016-12-07 13:53 Mark Rutland
  2016-12-07 14:30 ` Mark Rutland
  2016-12-07 17:53 ` Mark Rutland
  0 siblings, 2 replies; 17+ messages in thread
From: Mark Rutland @ 2016-12-07 13:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Sebastian Andrzej Siewior, jeremy.linton

Hi all

Jeremy noticed a kernel lockup on arm64 when the perf tool was used in
parallel with hotplug, which I've reproduced on arm64 and x86(-64) with
v4.9-rc8. In both cases I'm using defconfig; I've tried enabling lockdep
but it was silent for arm64 and x86.

I haven't yet tested earlier kernels and I'm not sure how long this has
been around for; I'm currently building a v4.8 defconfig to compare
with. In the meantime, info dump below.

I'm running:

$ while true; do ./perf stat -e cycles true; done

In parallel with the following script:

---
#!/bin/bash

CPUS=3;

ONLINEFMT="/sys/devices/system/cpu/cpu%s/online";

while true; do
	# Don't bother trying CPU0
	CPU=$((($RANDOM % $CPUS) + 1));
	ONLINEFILE=$(printf $ONLINEFMT $CPU);
	echo $(( $RANDOM % 2 )) > $ONLINEFILE;
done
---

After a few successful runs, the perf tool locks up, as does my process
monitor. On arm64 I'm able to kill the perf tool and make forward
progress, but on x86 my SSH session is dead at this point.

A short while later, I get an RCU stall warning:

[ 4134.743008] INFO: rcu_sched self-detected stall on CPU[ 4134.744023] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 4134.744028]  0-...: (26000 ticks this GP) idle=699/140000000000001/0 softirq=84552/84553 fqs=6491
[ 4134.744029]
[ 4134.744030] (detected by 1, t=26002 jiffies, g=42991, c=42990, q=437)
[ 4134.744032] Task dump for CPU 0:
[ 4134.744033] perf            R
[ 4134.744034]   running task    13936  5195   3415 0x00000008
 ffffc90005ed7d30
[ 4134.744042]  ffffffff81095268 ffff8801b73d9040 00000000001d9040 ffff8801a830ead0
[ 4134.744044]  0000000000000139 ffffffff8114454f ffff8801a830ead0 ffff8801b2918400
[ 4134.744047]  ffff8801b2918400 ffff8801afa2e000 ffff8801b2918408Call Trace:
[ 4134.744055]  [<ffffffff81095268>] ? sched_clock_local+0x18/0x80
[ 4134.744058]  [<ffffffff8114454f>] ? perf_install_in_context+0x7f/0x160
[ 4134.744062]  [<ffffffff81a8fb29>] ? _raw_spin_unlock_irq+0x29/0x40
[ 4134.744064]  [<ffffffff8114454f>] ? perf_install_in_context+0x7f/0x160
[ 4134.744066]  [<ffffffff8114b790>] ? __perf_event_enable+0x140/0x140
[ 4134.744068]  [<ffffffff8114e8b2>] ? SYSC_perf_event_open+0x512/0xf20
[ 4134.744070]  [<ffffffff81151df9>] ? SyS_perf_event_open+0x9/0x10
[ 4134.744072]  [<ffffffff81a9026a>] ? entry_SYSCALL_64_fastpath+0x18/0xad
[ 4134.743008]  0-...: (26000 ticks this GP) idle=699/140000000000001/0 softirq=84552/84553 fqs=6519
[ 4134.743008]   (t=26123 jiffies g=42991 c=42990 q=437)
[ 4134.743008] Task dump for CPU 0:
[ 4134.743008] perf            R  running task    13936  5195   3415 0x00000008
[ 4134.743008]  ffff8801b7203cf8 ffffffff81091cc7 0000000000000000 0000000000000000
[ 4134.743008]  0000000000000086 ffff8801b7203d10 ffffffff81094db4 ffffffff82062180
[ 4134.743008]  ffff8801b7203d40 ffffffff81156f72 ffff8801b73d92c0 ffffffff82062180
[ 4134.743008] Call Trace:
[ 4134.743008]  <IRQ> [ 4134.743008]  [<ffffffff81091cc7>] sched_show_task+0x117/0x1c0
[ 4134.743008]  [<ffffffff81094db4>] dump_cpu_task+0x34/0x40
[ 4134.743008]  [<ffffffff81156f72>] rcu_dump_cpu_stacks+0x88/0xc4
[ 4134.743008]  [<ffffffff810d25c7>] rcu_check_callbacks+0x8a7/0xa00
[ 4134.743008]  [<ffffffff810aae4d>] ? trace_hardirqs_off+0xd/0x10
[ 4134.743008]  [<ffffffff81066860>] ? raise_softirq+0x110/0x180
[ 4134.743008]  [<ffffffff810e8dc0>] ? tick_sched_do_timer+0x30/0x30
[ 4134.743008]  [<ffffffff810d7eca>] update_process_times+0x2a/0x50
[ 4134.743008]  [<ffffffff810e8651>] tick_sched_handle.isra.15+0x31/0x40
[ 4134.743008]  [<ffffffff810e8df8>] tick_sched_timer+0x38/0x70
[ 4134.743008]  [<ffffffff810d8f0f>] __hrtimer_run_queues+0xef/0x510
[ 4134.743008]  [<ffffffff810d96f2>] hrtimer_interrupt+0xb2/0x1d0
[ 4134.743008]  [<ffffffff8104b171>] hpet_interrupt_handler+0x11/0x30
[ 4134.743008]  [<ffffffff810c4c77>] __handle_irq_event_percpu+0x37/0x330
[ 4134.743008]  [<ffffffff810c4f8e>] handle_irq_event_percpu+0x1e/0x50
[ 4134.743008]  [<ffffffff810c4ff4>] handle_irq_event+0x34/0x60
[ 4134.743008]  [<ffffffff810c8217>] handle_edge_irq+0x87/0x140
[ 4134.743008]  [<ffffffff8101e486>] handle_irq+0xa6/0x130
[ 4134.743008]  [<ffffffff8101dafe>] do_IRQ+0x5e/0x120
[ 4134.743008]  [<ffffffff81a90c09>] common_interrupt+0x89/0x89
[ 4134.743008]  <EOI> [ 4134.743008]  [<ffffffff81a8fb29>] ? _raw_spin_unlock_irq+0x29/0x40
[ 4134.743008]  [<ffffffff8114454f>] perf_install_in_context+0x7f/0x160
[ 4134.743008]  [<ffffffff8114b790>] ? __perf_event_enable+0x140/0x140
[ 4134.743008]  [<ffffffff8114e8b2>] SYSC_perf_event_open+0x512/0xf20
[ 4134.743008]  [<ffffffff81151df9>] SyS_perf_event_open+0x9/0x10
[ 4134.743008]  [<ffffffff81a9026a>] entry_SYSCALL_64_fastpath+0x18/0xad
[ 4212.748006] INFO: rcu_sched self-detected stall on CPU[ 4212.749020] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 4212.749023]  0-...: (103687 ticks this GP) idle=699/140000000000001/0 softirq=84552/84553 fqs=25939
[ 4212.749024]
[ 4212.749026] (detected by 1, t=104007 jiffies, g=42991, c=42990, q=2440)
[ 4212.749027] Task dump for CPU 0:
[ 4212.749028] perf            R
[ 4212.749028]   running task    13936  5195   3415 0x00000008
 ffffc90005ed7d30
[ 4212.749034]  ffffffff81095268 ffff8801b73d9040 00000000001d9040 ffff8801a830ead0
[ 4212.749037]  00000000000000eb ffffffff8114454f ffff8801a830ead0 ffff8801b2918400
[ 4212.749039]  ffff8801b2918400 ffff8801afa2e000 ffff8801b2918408Call Trace:
[ 4212.749046]  [<ffffffff81095268>] ? sched_clock_local+0x18/0x80
[ 4212.749049]  [<ffffffff8114454f>] ? perf_install_in_context+0x7f/0x160
[ 4212.749052]  [<ffffffff81a8fb29>] ? _raw_spin_unlock_irq+0x29/0x40
[ 4212.749053]  [<ffffffff8114454f>] ? perf_install_in_context+0x7f/0x160
[ 4212.749055]  [<ffffffff8114b790>] ? __perf_event_enable+0x140/0x140
[ 4212.749056]  [<ffffffff8114e8b2>] ? SYSC_perf_event_open+0x512/0xf20
[ 4212.749058]  [<ffffffff81151df9>] ? SyS_perf_event_open+0x9/0x10
[ 4212.749059]  [<ffffffff81a9026a>] ? entry_SYSCALL_64_fastpath+0x18/0xad
[ 4212.748007]  0-...: (103687 ticks this GP) idle=699/140000000000001/0 softirq=84552/84553 fqs=25967
[ 4212.748007]   (t=104129 jiffies g=42991 c=42990 q=2440)
[ 4212.748007] Task dump for CPU 0:
[ 4212.748007] perf            R  running task    13936  5195   3415 0x00000008
[ 4212.748007]  ffff8801b7203cf8 ffffffff81091cc7 0000000000000000 0000000000000000
[ 4212.748007]  0000000000000086 ffff8801b7203d10 ffffffff81094db4 ffffffff82062180
[ 4212.748007]  ffff8801b7203d40 ffffffff81156f72 ffff8801b73d92c0 ffffffff82062180
[ 4212.748007] Call Trace:
[ 4212.748007]  <IRQ> [ 4212.748007]  [<ffffffff81091cc7>] sched_show_task+0x117/0x1c0
[ 4212.748007]  [<ffffffff81094db4>] dump_cpu_task+0x34/0x40
[ 4212.748007]  [<ffffffff81156f72>] rcu_dump_cpu_stacks+0x88/0xc4
[ 4212.748007]  [<ffffffff810d25c7>] rcu_check_callbacks+0x8a7/0xa00
[ 4212.748007]  [<ffffffff810aae4d>] ? trace_hardirqs_off+0xd/0x10
[ 4212.748007]  [<ffffffff81066860>] ? raise_softirq+0x110/0x180
[ 4212.748007]  [<ffffffff810e8dc0>] ? tick_sched_do_timer+0x30/0x30
[ 4212.748007]  [<ffffffff810d7eca>] update_process_times+0x2a/0x50
[ 4212.748007]  [<ffffffff810e8651>] tick_sched_handle.isra.15+0x31/0x40
[ 4212.748007]  [<ffffffff810e8df8>] tick_sched_timer+0x38/0x70
[ 4212.748007]  [<ffffffff810d8f0f>] __hrtimer_run_queues+0xef/0x510
[ 4212.748007]  [<ffffffff810d96f2>] hrtimer_interrupt+0xb2/0x1d0
[ 4212.748007]  [<ffffffff8104b171>] hpet_interrupt_handler+0x11/0x30
[ 4212.748007]  [<ffffffff810c4c77>] __handle_irq_event_percpu+0x37/0x330
[ 4212.748007]  [<ffffffff810c4f8e>] handle_irq_event_percpu+0x1e/0x50
[ 4212.748007]  [<ffffffff810c4ff4>] handle_irq_event+0x34/0x60
[ 4212.748007]  [<ffffffff810c8217>] handle_edge_irq+0x87/0x140
[ 4212.748007]  [<ffffffff8101e486>] handle_irq+0xa6/0x130
[ 4212.748007]  [<ffffffff8101dafe>] do_IRQ+0x5e/0x120
[ 4212.748007]  [<ffffffff81a90c09>] common_interrupt+0x89/0x89
[ 4212.748007]  <EOI> [ 4212.748007]  [<ffffffff81a8fb29>] ? _raw_spin_unlock_irq+0x29/0x40
[ 4212.748007]  [<ffffffff8114454f>] perf_install_in_context+0x7f/0x160
[ 4212.748007]  [<ffffffff8114b790>] ? __perf_event_enable+0x140/0x140
[ 4212.748007]  [<ffffffff8114e8b2>] SYSC_perf_event_open+0x512/0xf20
[ 4212.748007]  [<ffffffff81151df9>] SyS_perf_event_open+0x9/0x10
[ 4212.748007]  [<ffffffff81a9026a>] entry_SYSCALL_64_fastpath+0x18/0xad

Dumping the blocked task state:

[ 4314.359550] sysrq: SysRq : Show Blocked State
[ 4314.360006]   task                        PC stack   pid father
[ 4314.360006] jbd2/sda1-8     D13728  1370      2 0x00000000
[ 4314.360006]  ffff8801b75d8518 ffff8801b5da8000 ffff8801b75d8500 ffff8801b3e3e7c0
[ 4314.360006]  ffff8801b3e3e200 ffffc9000108fa40 ffffffff81a8922c ffff8801b3e3e200
[ 4314.360006]  ffff8801b3e3e200 ffff8801b75d8518 0000b43c001d9040 ffff8801b3e3e200
[ 4314.360006] Call Trace:
[ 4314.360006]  [<ffffffff81a8922c>] ? __schedule+0x2cc/0xa90
[ 4314.360006]  [<ffffffff81a8a1b0>] ? bit_wait+0x50/0x50
[ 4314.360006]  [<ffffffff81a89a28>] schedule+0x38/0x90
[ 4314.360006]  [<ffffffff81a8e687>] schedule_timeout+0x2f7/0x600
[ 4314.360006]  [<ffffffff8104b27a>] ? read_hpet+0xea/0x100
[ 4314.360006]  [<ffffffff810adbfd>] ? trace_hardirqs_on+0xd/0x10
[ 4314.360006]  [<ffffffff810df841>] ? ktime_get+0x91/0x110
[ 4314.360006]  [<ffffffff8111f06a>] ? __delayacct_blkio_start+0x1a/0x30
[ 4314.360006]  [<ffffffff81a8a1b0>] ? bit_wait+0x50/0x50
[ 4314.360006]  [<ffffffff81a88ef1>] io_schedule_timeout+0xa1/0x110
[ 4314.360006]  [<ffffffff81a8a1c6>] bit_wait_io+0x16/0x50
[ 4314.360006]  [<ffffffff81a89e0f>] __wait_on_bit+0x5f/0x90
[ 4314.360006]  [<ffffffff81a8a1b0>] ? bit_wait+0x50/0x50
[ 4314.360006]  [<ffffffff81a89f6d>] out_of_line_wait_on_bit+0x6d/0x80
[ 4314.360006]  [<ffffffff810a7780>] ? autoremove_wake_function+0x30/0x30
[ 4314.360006]  [<ffffffff811f6330>] __wait_on_buffer+0x40/0x50
[ 4314.360006]  [<ffffffff812a0f6c>] jbd2_journal_commit_transaction+0x193c/0x2180
[ 4314.360006]  [<ffffffff81095268>] ? sched_clock_local+0x18/0x80
[ 4314.360006]  [<ffffffff81a8fae1>] ? _raw_spin_unlock_irqrestore+0x31/0x50
[ 4314.360006]  [<ffffffff810adbfd>] ? trace_hardirqs_on+0xd/0x10
[ 4314.360006]  [<ffffffff812a5cb0>] kjournald2+0xc0/0x270
[ 4314.360006]  [<ffffffff810a7750>] ? prepare_to_wait_event+0x110/0x110
[ 4314.360006]  [<ffffffff812a5bf0>] ? commit_timeout+0x10/0x10
[ 4314.360006]  [<ffffffff8108306e>] kthread+0xee/0x110
[ 4314.360006]  [<ffffffff81082f80>] ? kthread_park+0x60/0x60
[ 4314.360006]  [<ffffffff81a904c7>] ret_from_fork+0x27/0x40
[ 4314.360006] irqbalance      D13560  2290      1 0x00000000
[ 4314.360006]  ffff8801b75d8518 ffff8801b5d81880 ffff8801b75d8500 ffff8801af849e40
[ 4314.360006]  ffff8801af849880 ffffc90003667c88 ffffffff81a8922c 0000000000000006
[ 4314.360006]  00000000ffffffff ffff8801b75d8518 0000d33e810ad9d1 ffff8801af849880
[ 4314.360006] Call Trace:
[ 4314.360006]  [<ffffffff81a8922c>] ? __schedule+0x2cc/0xa90
[ 4314.360006]  [<ffffffff81a89a28>] schedule+0x38/0x90
[ 4314.360006]  [<ffffffff81a89d00>] schedule_preempt_disabled+0x10/0x20
[ 4314.360006]  [<ffffffff81a8b3d3>] mutex_lock_nested+0x173/0x3a0
[ 4314.360006]  [<ffffffff815830ad>] ? online_show+0x1d/0x60
[ 4314.360006]  [<ffffffff815830ad>] ? online_show+0x1d/0x60
[ 4314.360006]  [<ffffffff815830ad>] online_show+0x1d/0x60
[ 4314.360006]  [<ffffffff815834cb>] dev_attr_show+0x1b/0x50
[ 4314.360006]  [<ffffffff8124220c>] ? sysfs_file_ops+0x3c/0x60
[ 4314.360006]  [<ffffffff81242533>] sysfs_kf_seq_show+0xc3/0x1a0
[ 4314.360006]  [<ffffffff81240b81>] kernfs_seq_show+0x21/0x30
[ 4314.360006]  [<ffffffff811e590f>] seq_read+0xff/0x3b0
[ 4314.360006]  [<ffffffff81241566>] kernfs_fop_read+0x126/0x1a0
[ 4314.360006]  [<ffffffff811bc343>] __vfs_read+0x23/0x110
[ 4314.360006]  [<ffffffff8132df6b>] ? security_file_permission+0x9b/0xc0
[ 4314.360006]  [<ffffffff811bc9f4>] ? rw_verify_area+0x44/0xb0
[ 4314.360006]  [<ffffffff811bcaea>] vfs_read+0x8a/0x140
[ 4314.360006]  [<ffffffff811bdf04>] SyS_read+0x44/0xa0
[ 4314.360006]  [<ffffffff81a9026a>] entry_SYSCALL_64_fastpath+0x18/0xad
[ 4314.360006] randomhotplug.s D13376  3371   3370 0x00000000
[ 4314.360006]  ffff8801b75d8518 ffff8801b3e16200 ffff8801b75d8500 ffff8801b29d4f40
[ 4314.360006]  ffff8801b29d4980 ffffc90004ee7c80 ffffffff81a8922c ffffc90004ee7c58
[ 4314.360006]  0000000000000246 ffff8801b75d8518 0000248281a8be7e ffff8801b29d4980
[ 4314.360006] Call Trace:
[ 4314.360006]  [<ffffffff81a8922c>] ? __schedule+0x2cc/0xa90
[ 4314.360006]  [<ffffffff81a89a28>] schedule+0x38/0x90
[ 4314.360006]  [<ffffffff81060eef>] cpu_hotplug_begin+0xaf/0xc0
[ 4314.360006]  [<ffffffff81060e40>] ? __cpuhp_remove_state+0x130/0x130
[ 4314.360006]  [<ffffffff810a7750>] ? prepare_to_wait_event+0x110/0x110
[ 4314.360006]  [<ffffffff81060fcd>] _cpu_up+0x2d/0xc0
[ 4314.360006]  [<ffffffff810610bf>] do_cpu_up+0x5f/0x80
[ 4314.360006]  [<ffffffff810610ee>] cpu_up+0xe/0x10
[ 4314.360006]  [<ffffffff8158a177>] cpu_subsys_online+0x37/0x80
[ 4314.360006]  [<ffffffff815852cc>] device_online+0x5c/0x80
[ 4314.360006]  [<ffffffff8158535a>] online_store+0x6a/0x70
[ 4314.360006]  [<ffffffff81582aa3>] dev_attr_store+0x13/0x20
[ 4314.360006]  [<ffffffff8124226f>] sysfs_kf_write+0x3f/0x50
[ 4314.360006]  [<ffffffff81241bbf>] kernfs_fop_write+0x13f/0x1d0
[ 4314.360006]  [<ffffffff811bbde3>] __vfs_write+0x23/0x120
[ 4314.360006]  [<ffffffff810ccda7>] ? rcu_read_lock_sched_held+0x87/0x90
[ 4314.360006]  [<ffffffff810cd07a>] ? rcu_sync_lockdep_assert+0x2a/0x50
[ 4314.360006]  [<ffffffff811bfe38>] ? __sb_start_write+0x148/0x1f0
[ 4314.360006]  [<ffffffff811bcd1f>] ? vfs_write+0x17f/0x1b0
[ 4314.360006]  [<ffffffff811bcc50>] vfs_write+0xb0/0x1b0
[ 4314.360006]  [<ffffffff811bdfa4>] SyS_write+0x44/0xa0
[ 4314.360006]  [<ffffffff81a9026a>] entry_SYSCALL_64_fastpath+0x18/0xad

... and on arm64:

[  435.030575]   task                        PC stack   pid father
[  435.036581] randomhotplug.s D    0  2264   2263 0x00000000
[  435.042060] Call trace:
[  435.044507] [<ffff200008088fe4>] __switch_to+0x15c/0x220
[  435.049817] [<ffff2000096d0910>] __schedule+0x480/0x10f8
[  435.055118] [<ffff2000096d165c>] schedule+0xd4/0x260
[  435.060078] [<ffff20000810f5ec>] cpu_hotplug_begin+0x134/0x148
[  435.065900] [<ffff2000096cdaa8>] _cpu_down+0xb0/0x358
[  435.070941] [<ffff20000810f644>] do_cpu_down+0x44/0x70
[  435.076068] [<ffff20000810f680>] cpu_down+0x10/0x18
[  435.080942] [<ffff200008c92438>] cpu_subsys_offline+0x40/0x58
[  435.086677] [<ffff200008c86b60>] device_offline+0x140/0x1c8
[  435.092237] [<ffff200008c86dec>] online_store+0x8c/0xe8
[  435.097458] [<ffff200008c80a48>] dev_attr_store+0x38/0x70
[  435.102850] [<ffff20000850d248>] sysfs_kf_write+0x108/0x190
[  435.108411] [<ffff20000850b348>] kernfs_fop_write+0x1c8/0x3c8
[  435.114147] [<ffff2000083dce10>] __vfs_write+0xc8/0x3c8
[  435.119360] [<ffff2000083df224>] vfs_write+0x11c/0x428
[  435.124487] [<ffff2000083e28ac>] SyS_write+0xdc/0x1a8
[  435.129528] [<ffff200008083ef0>] el0_svc_naked+0x24/0x28

... and again on arm64:

[  898.370984] htop            D    0 16851  15240 0x00000000
[  898.376459] Call trace:
[  898.378895] [<ffff200008088fe4>] __switch_to+0x15c/0x220
[  898.384195] [<ffff2000096d0910>] __schedule+0x480/0x10f8
[  898.389496] [<ffff2000096d165c>] schedule+0xd4/0x260
[  898.394450] [<ffff2000096d227c>] schedule_preempt_disabled+0x74/0x110
[  898.400879] [<ffff2000096d4390>] __mutex_lock_killable_slowpath+0x270/0x670
[  898.407828] [<ffff2000096d480c>] mutex_lock_killable+0x7c/0x98
[  898.413649] [<ffff2000084e7b14>] do_io_accounting+0x164/0x660
[  898.419383] [<ffff2000084e8028>] proc_tgid_io_accounting+0x18/0x20
[  898.425551] [<ffff2000084e64b4>] proc_single_show+0xcc/0x148
[  898.431199] [<ffff20000843f3dc>] seq_read+0x26c/0xdb0
[  898.436238] [<ffff2000083dca58>] __vfs_read+0xc8/0x3b8
[  898.441364] [<ffff2000083def00>] vfs_read+0xc8/0x2d0
[  898.446317] [<ffff2000083e2704>] SyS_read+0xdc/0x1a8
[  898.451269] [<ffff200008083ef0>] el0_svc_naked+0x24/0x28
[  898.456568] perf            R  running task        0 17019  10076 0x0000000a
[  898.463609] Call trace:
[  898.466046] [<ffff2000080902b8>] dump_backtrace+0x0/0x3d0
[  898.471433] [<ffff20000809069c>] show_stack+0x14/0x20
[  898.476476] [<ffff200008173550>] sched_show_task+0x188/0x248
[  898.482123] [<ffff200008173704>] show_state_filter+0xf4/0x168
[  898.487858] [<ffff200008bc8418>] sysrq_handle_showstate+0x10/0x20
[  898.493940] [<ffff200008bc9760>] __handle_sysrq+0x218/0x318
[  898.499500] [<ffff200008bc9888>] handle_sysrq+0x28/0x30
[  898.504718] [<ffff200008c1fb78>] pl011_fifo_to_tty+0x110/0x498
[  898.510539] [<ffff200008c21fb0>] pl011_int+0x740/0xbe8
[  898.515666] [<ffff2000081c9004>] __handle_irq_event_percpu+0x154/0x260
[  898.522182] [<ffff2000081c917c>] handle_irq_event_percpu+0x6c/0x110
[  898.528436] [<ffff2000081c92cc>] handle_irq_event+0xac/0x148
[  898.534086] [<ffff2000081d2780>] handle_fasteoi_irq+0x218/0x5e8
[  898.539994] [<ffff2000081c6fa8>] generic_handle_irq+0x48/0x68
[  898.545728] [<ffff2000081c7ca4>] __handle_domain_irq+0x9c/0x158
[  898.551636] [<ffff200008081b20>] gic_handle_irq+0x58/0xb0
[  898.557023] Exception stack(0xffff80034e8c3a90 to 0xffff80034e8c3bc0)
[  898.563452] 3a80:                                   ffff80034e8c001c ffff2000082a66c0
[  898.571270] 3aa0: ffff80034e8c3d10 0000000000000001 0000000000000007 0000000000000000
[  898.579088] 3ac0: 1ffff00069d18003 ffff100069d18784 0000000041b58ab3 ffff80034e8c0000
[  898.586905] 3ae0: 0000000000000880 0000000000000000 ffff8003fff788cc ffff80035476a780
[  898.594723] 3b00: 0000000002da8fa8 0000000000000000 0000000000000000 0000000000000000
[  898.602541] 3b20: 0000000000000000 1ffff00069d18784 ffff80034e8c0000 ffff80034e8c0000
[  898.610359] 3b40: 0000000000000001 0000000000000006 ffff80034a27c000 00000000fffffffa
[  898.618176] 3b60: ffff2000082a6000 ffff10006a0021c3 ffff80034e8c3d10 ffff80034e8c3bc0
[  898.625994] 3b80: ffff2000082a07cc ffff80034e8c3bc0 ffff20000822586c 0000000020000145
[  898.633811] 3ba0: 0000000000000000 0000000000000000 0001000000000000 ffff800350010d88
[  898.641628] [<ffff2000080837b0>] el1_irq+0xb0/0x124
[  898.646496] [<ffff20000822586c>] smp_call_function_single+0xe4/0x3b8
[  898.652839] [<ffff2000082a07cc>] perf_install_in_context+0x15c/0x268
[  898.659187] [<ffff2000082c1558>] SyS_perf_event_open+0x1138/0x1908
[  898.665357] [<ffff200008083ef0>] el0_svc_naked+0x24/0x28
[  898.670657] perf            S    0 17020  17019 0x00000000
[  898.676133] Call trace:
[  898.678569] [<ffff200008088fe4>] __switch_to+0x15c/0x220
[  898.683870] [<ffff2000096d0910>] __schedule+0x480/0x10f8
[  898.689170] [<ffff2000096d165c>] schedule+0xd4/0x260
[  898.694123] [<ffff2000083f4a14>] pipe_wait+0xec/0x180
[  898.699163] [<ffff2000083f5d94>] pipe_read+0x404/0x750
[  898.704289] [<ffff2000083dcb9c>] __vfs_read+0x20c/0x3b8
[  898.709501] [<ffff2000083def00>] vfs_read+0xc8/0x2d0
[  898.714454] [<ffff2000083e2704>] SyS_read+0xdc/0x1a8
[  898.719407] [<ffff200008083ef0>] el0_svc_naked+0x24/0x28

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Perf hotplug lockup in v4.9-rc8
  2016-12-07 13:53 Perf hotplug lockup in v4.9-rc8 Mark Rutland
@ 2016-12-07 14:30 ` Mark Rutland
  2016-12-07 16:39   ` Mark Rutland
  2016-12-07 17:53 ` Mark Rutland
  1 sibling, 1 reply; 17+ messages in thread
From: Mark Rutland @ 2016-12-07 14:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Sebastian Andrzej Siewior, jeremy.linton

On Wed, Dec 07, 2016 at 01:52:17PM +0000, Mark Rutland wrote:
> Hi all
> 
> Jeremy noticed a kernel lockup on arm64 when the perf tool was used in
> parallel with hotplug, which I've reproduced on arm64 and x86(-64) with
> v4.9-rc8. In both cases I'm using defconfig; I've tried enabling lockdep
> but it was silent for arm64 and x86.
> 
> I haven't yet tested earlier kernels and I'm not sure how long this has
> been around for; I'm currently building a v4.8 defconfig to compare
> with. In the meantime, info dump below.

On v4.8 it hangs on x86, but my SSH session survived. Log below; I'll
give v4.7 a go...

[  111.155981] sysrq: SysRq : Show Blocked State
[  111.156971]   task                        PC stack   pid father
[  111.156971] randomhotplug.s D ffff8801b4bafc80 13296  3597   3596 0x00000000
[  111.156971]  ffff8801b4bafc80 ffff8801b4bafc58 0000000000000246 ffffffff8201d540
[  111.156971]  ffff8801a7881880 ffff8801b4bb0000 0000000000000079 0000000000000002
[  111.156971]  0000000000000002 0000000000000000 ffff8801b4bafc98 ffffffff81a752d7
[  111.156971] Call Trace:
[  111.156971]  [<ffffffff81a752d7>] schedule+0x37/0x90
[  111.156971]  [<ffffffff8105fcaf>] cpu_hotplug_begin+0xaf/0xc0
[  111.156971]  [<ffffffff8105fc00>] ? __cpuhp_setup_state+0x210/0x210
[  111.156971]  [<ffffffff810a5440>] ? prepare_to_wait_event+0x100/0x100
[  111.156971]  [<ffffffff8105fd8d>] _cpu_up+0x2d/0xc0
[  111.156971]  [<ffffffff8105fe7f>] do_cpu_up+0x5f/0x80
[  111.156971]  [<ffffffff8105feae>] cpu_up+0xe/0x10
[  111.156971]  [<ffffffff81583357>] cpu_subsys_online+0x37/0x80
[  111.156971]  [<ffffffff8157e43c>] device_online+0x5c/0x80
[  111.156971]  [<ffffffff8157e4ca>] online_store+0x6a/0x70
[  111.156971]  [<ffffffff8157bc33>] dev_attr_store+0x13/0x20
[  111.156971]  [<ffffffff8123f0ff>] sysfs_kf_write+0x3f/0x50
[  111.156971]  [<ffffffff8123ea4f>] kernfs_fop_write+0x13f/0x1d0
[  111.156971]  [<ffffffff811ba0b3>] __vfs_write+0x23/0x120
[  111.156971]  [<ffffffff810a7be7>] ? percpu_down_read+0x57/0x90
[  111.156971]  [<ffffffff811bd875>] ? __sb_start_write+0xc5/0xe0
[  111.156971]  [<ffffffff811bd875>] ? __sb_start_write+0xc5/0xe0
[  111.156971]  [<ffffffff811ba790>] vfs_write+0xb0/0x1b0
[  111.156971]  [<ffffffff811bbae4>] SyS_write+0x44/0xa0
[  111.156971]  [<ffffffff81a7b565>] entry_SYSCALL_64_fastpath+0x18/0xa8

[  134.865008] INFO: rcu_sched self-detected stall on CPU[  134.866024] INFO: rcu_sched detected stalls on CPUs/tasks:
[  134.866028]  1-...: (26000 ticks this GP) idle=163/140000000000001/0 softirq=13624/13624 fqs=6461
[  134.866028]  (detected by 0, t=26002 jiffies, g=13692, c=13691, q=422)
[  134.866031] Task dump for CPU 1:
[  134.866032] perf            R  running task    14392  6398   3677 0x00000008
[  134.866038]  ffff8801a7807c00 ffff8801a7807c00 ffff8801af66e800 ffff8801a7807c08
[  134.866040]  ffff8801a4553df0 ffff8801b4059880 0000000000000001 0000000000000001
[  134.866042]  0000000000000000 0000000000000000 ffff8801b2cac980 0000000000000006
[  134.866045] Call Trace:
[  134.866050]  [<ffffffff81a7ae19>] ? _raw_spin_unlock_irq+0x29/0x40
[  134.866053]  [<ffffffff81141c1f>] ? perf_install_in_context+0x7f/0x160
[  134.866055]  [<ffffffff81148ed0>] ? __perf_event_enable+0x140/0x140
[  134.866056]  [<ffffffff8114bf40>] ? SYSC_perf_event_open+0x510/0xf10
[  134.866058]  [<ffffffff8114f479>] ? SyS_perf_event_open+0x9/0x10
[  134.866059]  [<ffffffff81a7b565>] ? entry_SYSCALL_64_fastpath+0x18/0xa8

[  134.865009]
[  134.865009]  1-...: (26000 ticks this GP) idle=163/140000000000001/0 softirq=13624/13624 fqs=6486
[  134.865009]   (t=26111 jiffies g=13692 c=13691 q=422)
[  134.865009] Task dump for CPU 1:
[  134.865009] perf            R  running task    14392  6398   3677 0x00000008
[  134.865009]  0000000000000e5d ffff8801b7403ca8 ffffffff810900d6 ffffffff8108fff7
[  134.865009]  0000000000000001 0000000000000001 0000000000000086 ffff8801b7403cc0
[  134.865009]  ffffffff81093134 ffffffff82059a80 ffff8801b7403cf0 ffffffff811545aa
[  134.865009] Call Trace:
[  134.865009]  <IRQ>  [<ffffffff810900d6>] sched_show_task+0x156/0x260
[  134.865009]  [<ffffffff8108fff7>] ? sched_show_task+0x77/0x260
[  134.865009]  [<ffffffff81093134>] dump_cpu_task+0x34/0x40
[  134.865009]  [<ffffffff811545aa>] rcu_dump_cpu_stacks+0x88/0xc4
[  134.865009]  [<ffffffff810d02d7>] rcu_check_callbacks+0x8a7/0xa00
[  134.865009]  [<ffffffff8111dd6b>] ? __acct_update_integrals+0x2b/0xb0
[  134.865009]  [<ffffffff810e6b40>] ? tick_sched_do_timer+0x30/0x30
[  134.865009]  [<ffffffff810d5c8a>] update_process_times+0x2a/0x50
[  134.865009]  [<ffffffff810e6390>] tick_sched_handle.isra.15+0x20/0x60
[  134.865009]  [<ffffffff810e6b78>] tick_sched_timer+0x38/0x70
[  134.865009]  [<ffffffff810d6c9f>] __hrtimer_run_queues+0xef/0x510
[  134.865009]  [<ffffffff810d7482>] hrtimer_interrupt+0xb2/0x1d0
[  134.865009]  [<ffffffff8104b161>] hpet_interrupt_handler+0x11/0x30
[  134.865009]  [<ffffffff810c2d37>] __handle_irq_event_percpu+0x37/0x330
[  134.865009]  [<ffffffff810c304e>] handle_irq_event_percpu+0x1e/0x50
[  134.865009]  [<ffffffff810c30b4>] handle_irq_event+0x34/0x60
[  134.865009]  [<ffffffff810c62b7>] handle_edge_irq+0x87/0x140
[  134.865009]  [<ffffffff8101e4b6>] handle_irq+0xa6/0x130
[  134.865009]  [<ffffffff8101db2e>] do_IRQ+0x5e/0x120
[  134.865009]  [<ffffffff81a7bec9>] common_interrupt+0x89/0x89
[  134.865009]  <EOI>  [<ffffffff81a7ae19>] ? _raw_spin_unlock_irq+0x29/0x40
[  134.865009]  [<ffffffff81141c1f>] perf_install_in_context+0x7f/0x160
[  134.865009]  [<ffffffff81148ed0>] ? __perf_event_enable+0x140/0x140
[  134.865009]  [<ffffffff8114bf40>] SYSC_perf_event_open+0x510/0xf10
[  134.865009]  [<ffffffff8114f479>] SyS_perf_event_open+0x9/0x10
[  134.865009]  [<ffffffff81a7b565>] entry_SYSCALL_64_fastpath+0x18/0xa8
[  160.688129] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [perf:6398]
[  160.689009] Modules linked in:
[  160.689009] irq event stamp: 196860054
[  160.689009] hardirqs last  enabled at (196860053): [<ffffffff81a7ae17>] _raw_spin_unlock_irq+0x27/0x40
[  160.689009] hardirqs last disabled at (196860054): [<ffffffff81a7bec4>] common_interrupt+0x84/0x89
[  160.689009] softirqs last  enabled at (196859662): [<ffffffff8106502d>] __do_softirq+0x30d/0x470
[  160.689009] softirqs last disabled at (196859655): [<ffffffff81065407>] irq_exit+0x97/0xa0
[  160.689009] CPU: 1 PID: 6398 Comm: perf Not tainted 4.8.0 #3
[  160.689009] Hardware name: LENOVO 7484A3G/LENOVO, BIOS 5CKT54AUS 09/07/2009
[  160.689009] task: ffff8801b2cac980 task.stack: ffff8801a4550000
[  160.689009] RIP: 0010:[<ffffffff81a7ae19>]  [<ffffffff81a7ae19>] _raw_spin_unlock_irq+0x29/0x40
[  160.689009] RSP: 0018:ffff8801a4553df0  EFLAGS: 00000206
[  160.689009] RAX: ffff8801b2cac980 RBX: ffff8801b4059880 RCX: 0000000000000006
[  160.689009] RDX: 0000000000004d60 RSI: ffff8801b2cad220 RDI: ffff8801b2cac980
[  160.689009] RBP: ffff8801a4553df0 R08: 0000000000000000 R09: 0000000000000000
[  160.689009] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8801a7807c08
[  160.689009] R13: ffff8801af66e800 R14: ffff8801a7807c00 R15: ffff8801a7807c00
[  160.689009] FS:  00007f5cb7301740(0000) GS:ffff8801b7400000(0000) knlGS:0000000000000000
[  160.689009] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  160.689009] CR2: 00007f95f49e8000 CR3: 00000001a7b9a000 CR4: 00000000000406e0
[  160.689009] Stack:
[  160.689009]  ffff8801a4553e40 ffffffff81141c1f 0000000000000000 ffffffff81148ed0
[  160.689009]  ffff8801af66e800 ffff8801fffffffa ffff8801b4059880 ffff8801af66e93c
[  160.689009]  ffff8801a7807c50 0000000000000003 ffff8801a4553f38 ffffffff8114bf40
[  160.689009] Call Trace:
[  160.689009]  [<ffffffff81141c1f>] perf_install_in_context+0x7f/0x160
[  160.689009]  [<ffffffff81148ed0>] ? __perf_event_enable+0x140/0x140
[  160.689009]  [<ffffffff8114bf40>] SYSC_perf_event_open+0x510/0xf10
[  160.689009]  [<ffffffff8114f479>] SyS_perf_event_open+0x9/0x10
[  160.689009]  [<ffffffff81a7b565>] entry_SYSCALL_64_fastpath+0x18/0xa8
[  160.689009] Code: 00 00 55 be 01 00 00 00 48 89 e5 53 48 89 fb 48 8b 55 08 48 8d 7f 18 e8 e6 31 63 ff 48 89 df e8 2e 85 63 ff e8 e9 0d 63 ff fb 5b <65> ff 0d 78 17 59 7e 5d c3 66 66 66 66 66 2e 0f 1f 84 00 00 00

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Perf hotplug lockup in v4.9-rc8
  2016-12-07 14:30 ` Mark Rutland
@ 2016-12-07 16:39   ` Mark Rutland
  0 siblings, 0 replies; 17+ messages in thread
From: Mark Rutland @ 2016-12-07 16:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Sebastian Andrzej Siewior, jeremy.linton

On Wed, Dec 07, 2016 at 02:30:58PM +0000, Mark Rutland wrote:
> On Wed, Dec 07, 2016 at 01:52:17PM +0000, Mark Rutland wrote:
> > Hi all
> > 
> > Jeremy noticed a kernel lockup on arm64 when the perf tool was used in
> > parallel with hotplug, which I've reproduced on arm64 and x86(-64) with
> > v4.9-rc8. In both cases I'm using defconfig; I've tried enabling lockdep
> > but it was silent for arm64 and x86.
> > 
> > I haven't yet tested earlier kernels and I'm not sure how long this has
> > been around for; I'm currently building a v4.8 defconfig to compare
> > with. In the meantime, info dump below.
> 
> On v4.8 it hangs on x86, but my SSH session survived. Log below; I'll
> give v4.7 a go...

It seems the lockup has been around at least since v4.6, so I'll give up
trying to find a point when it didn't exist.

Peter had some ideas over IRC, so I'll give those a go atop v4.9-rc8.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Perf hotplug lockup in v4.9-rc8
  2016-12-07 13:53 Perf hotplug lockup in v4.9-rc8 Mark Rutland
  2016-12-07 14:30 ` Mark Rutland
@ 2016-12-07 17:53 ` Mark Rutland
  2016-12-07 18:34   ` Peter Zijlstra
  1 sibling, 1 reply; 17+ messages in thread
From: Mark Rutland @ 2016-12-07 17:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Sebastian Andrzej Siewior, jeremy.linton

On Wed, Dec 07, 2016 at 01:52:17PM +0000, Mark Rutland wrote:
> Hi all
> 
> Jeremy noticed a kernel lockup on arm64 when the perf tool was used in
> parallel with hotplug, which I've reproduced on arm64 and x86(-64) with
> v4.9-rc8. In both cases I'm using defconfig; I've tried enabling lockdep
> but it was silent for arm64 and x86.

It looks like we're trying to install a task-bound event into a context
where task_cpu(ctx->task) is dead, and thus the cpu_function_call() in
perf_install_in_context() fails. We retry repeatedly.

On !PREEMPT (as with x86 defconfig), we manage to prevent the hotplug
machinery from making progress, and this turns into a livelock.

On PREEMPT (as with arm64 defconfig), I'm somewhat lost.
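[Editorial note: the retry behaviour described above can be sketched as a toy model in plain Python. This is not kernel code; `cpu_function_call`, `install_in_context`, and the surrounding names are simplified stand-ins for the kernel functions discussed in this thread. The point of the sketch is that when the target task's last CPU is offline and the task never migrates, the cross-call fails every time and the caller loops indefinitely.]

```python
# Toy model of the perf_install_in_context() retry loop described above.
# NOT kernel code; all names here are simplified stand-ins.

online_cpus = {0}    # CPU 1 has been hotplugged off
task_cpu = 1         # the blocked target task last ran on CPU 1

def cpu_function_call(cpu):
    """Stand-in for the kernel's cross-call: fails if the CPU is offline."""
    return cpu in online_cpus    # True: the IPI ran; False: think -ENXIO

def install_in_context(max_tries):
    """Bounded stand-in for the kernel's unbounded 'retry' loop."""
    tries = 0
    while tries < max_tries:
        tries += 1
        if cpu_function_call(task_cpu):
            return tries         # installed via IPI on try N
    return None                  # the real loop has no bound: it spins on

# With CPU 1 offline, the install never succeeds however often we retry.
assert install_in_context(1000) is None
```

On !PREEMPT the unbounded version of this loop also never yields the CPU, which matches the livelock observed with the x86 defconfig.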

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Perf hotplug lockup in v4.9-rc8
  2016-12-07 17:53 ` Mark Rutland
@ 2016-12-07 18:34   ` Peter Zijlstra
  2016-12-07 19:56     ` Mark Rutland
  2016-12-09 13:59     ` Peter Zijlstra
  0 siblings, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2016-12-07 18:34 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Sebastian Andrzej Siewior, jeremy.linton

On Wed, Dec 07, 2016 at 05:53:47PM +0000, Mark Rutland wrote:
> On Wed, Dec 07, 2016 at 01:52:17PM +0000, Mark Rutland wrote:
> > Hi all
> > 
> > Jeremy noticed a kernel lockup on arm64 when the perf tool was used in
> > parallel with hotplug, which I've reproduced on arm64 and x86(-64) with
> > v4.9-rc8. In both cases I'm using defconfig; I've tried enabling lockdep
> > but it was silent for arm64 and x86.
> 
> It looks like we're trying to install a task-bound event into a context
> where task_cpu(ctx->task) is dead, and thus the cpu_function_call() in
> perf_install_in_context() fails. We retry repeatedly.
> 
> On !PREEMPT (as with x86 defconfig), we manage to prevent the hotplug
> machinery from making progress, and this turns into a livelock.
> 
> On PREEMPT (as with arm64 defconfig), I'm somewhat lost.

So the problem is that even with PREEMPT we can hit a blocked task
that has a 'dead' cpu.

We'll spin until either the task wakes up or the CPU comes back online;
either can take a very long time.

How exactly your test case triggers this is still a mystery: all it
executes is 'true', and that really shouldn't block much.

In any case, the below cures things, but it is, as the comment says,
horrific.

I've yet to come up with a better patch that doesn't violate scheduler
internals.

---
 kernel/events/core.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6ee1febdf6ff..6faf4b03396e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1203,6 +1203,11 @@ static void put_ctx(struct perf_event_context *ctx)
  *	      perf_event_context::lock
  *	    perf_event::mmap_mutex
  *	    mmap_sem
+ *
+ *    task_struct::pi_lock
+ *      rq::lock
+ *        perf_event_context::lock
+ *
  */
 static struct perf_event_context *
 perf_event_ctx_lock_nested(struct perf_event *event, int nesting)
@@ -2352,6 +2357,28 @@ perf_install_in_context(struct perf_event_context *ctx,
 		return;
 	}
 	raw_spin_unlock_irq(&ctx->lock);
+
+	raw_spin_lock_irq(&task->pi_lock);
+	if (!(task->state == TASK_RUNNING || task->state == TASK_WAKING)) {
+		/*
+		 * XXX horrific hack...
+		 */
+		raw_spin_lock(&ctx->lock);
+		if (task != ctx->task) {
+			raw_spin_unlock(&ctx->lock);
+			raw_spin_unlock_irq(&task->pi_lock);
+			goto again;
+		}
+
+		add_event_to_ctx(event, ctx);
+		raw_spin_unlock(&ctx->lock);
+		raw_spin_unlock_irq(&task->pi_lock);
+		return;
+	}
+	raw_spin_unlock_irq(&task->pi_lock);
+
+	cond_resched();
+
 	/*
 	 * Since !ctx->is_active doesn't mean anything, we must IPI
 	 * unconditionally.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Perf hotplug lockup in v4.9-rc8
  2016-12-07 18:34   ` Peter Zijlstra
@ 2016-12-07 19:56     ` Mark Rutland
  2016-12-09 13:59     ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Mark Rutland @ 2016-12-07 19:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Sebastian Andrzej Siewior, jeremy.linton

On Wed, Dec 07, 2016 at 07:34:55PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 07, 2016 at 05:53:47PM +0000, Mark Rutland wrote:
> > On Wed, Dec 07, 2016 at 01:52:17PM +0000, Mark Rutland wrote:
> > > Hi all
> > > 
> > > Jeremy noticed a kernel lockup on arm64 when the perf tool was used in
> > > parallel with hotplug, which I've reproduced on arm64 and x86(-64) with
> > > v4.9-rc8. In both cases I'm using defconfig; I've tried enabling lockdep
> > > but it was silent for arm64 and x86.
> > 
> > It looks like we're trying to install a task-bound event into a context
> > where task_cpu(ctx->task) is dead, and thus the cpu_function_call() in
> > perf_install_in_context() fails. We retry repeatedly.
> > 
> > On !PREEMPT (as with x86 defconfig), we manage to prevent the hotplug
> > machinery from making progress, and this turns into a livelock.
> > 
> > On PREEMPT (as with arm64 defconfig), I'm somewhat lost.
> 
> So the problem is that even with PREEMPT we can hit a blocked task
> that has a 'dead' cpu.
> 
> We'll spin until either the task wakes up or the CPU comes back online;
> either can take a very long time.
> 
> How exactly your test case triggers this is still a mystery: all it
> executes is 'true', and that really shouldn't block much.

The perf tool forks a helper process, which blocks on a pipe, and once
signalled, execs the target (i.e. true). The main perf process opens
(enable-on-exec) events on that, then writes to the pipe to wake up the
helper.

... so now I see why that makes us see a dead task_cpu(); thanks for the
explanation above!

[...]

> @@ -2352,6 +2357,28 @@ perf_install_in_context(struct perf_event_context *ctx,
>  		return;
>  	}
>  	raw_spin_unlock_irq(&ctx->lock);
> +
> +	raw_spin_lock_irq(&task->pi_lock);
> +	if (!(task->state == TASK_RUNNING || task->state == TASK_WAKING)) {

For a moment I thought there was a remaining race here with the lazy
ctx-switch if the new task was RUNNING on an online CPU, but I guess
we'll retry the cpu_function_call() in that case.

I'll attack this tomorrow when I can think again...

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Perf hotplug lockup in v4.9-rc8
  2016-12-07 18:34   ` Peter Zijlstra
  2016-12-07 19:56     ` Mark Rutland
@ 2016-12-09 13:59     ` Peter Zijlstra
  2016-12-12 11:46       ` Will Deacon
                         ` (2 more replies)
  1 sibling, 3 replies; 17+ messages in thread
From: Peter Zijlstra @ 2016-12-09 13:59 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Sebastian Andrzej Siewior, jeremy.linton,
	Will Deacon

On Wed, Dec 07, 2016 at 07:34:55PM +0100, Peter Zijlstra wrote:

> @@ -2352,6 +2357,28 @@ perf_install_in_context(struct perf_event_context *ctx,
>  		return;
>  	}
>  	raw_spin_unlock_irq(&ctx->lock);
> +
> +	raw_spin_lock_irq(&task->pi_lock);
> +	if (!(task->state == TASK_RUNNING || task->state == TASK_WAKING)) {
> +		/*
> +		 * XXX horrific hack...
> +		 */
> +		raw_spin_lock(&ctx->lock);
> +		if (task != ctx->task) {
> +			raw_spin_unlock(&ctx->lock);
> +			raw_spin_unlock_irq(&task->pi_lock);
> +			goto again;
> +		}
> +
> +		add_event_to_ctx(event, ctx);
> +		raw_spin_unlock(&ctx->lock);
> +		raw_spin_unlock_irq(&task->pi_lock);
> +		return;
> +	}
> +	raw_spin_unlock_irq(&task->pi_lock);
> +
> +	cond_resched();
> +
>  	/*
>  	 * Since !ctx->is_active doesn't mean anything, we must IPI
>  	 * unconditionally.

So while I went back and forth trying to make that less ugly, I figured
there was another problem.

Imagine the cpu_function_call() hitting the 'right' cpu, but not finding
the task current. It will then continue to install the event in the
context. However, that doesn't stop another CPU from pulling the task in
question from our rq and scheduling it elsewhere.

This all led me to the below patch... Now it has a rather large comment,
and while it represents my current thinking on the matter, I'm not at
all sure it's entirely correct. I got my brain in a fair twist while
writing it.

Please think about it carefully.

---
 kernel/events/core.c | 70 +++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 48 insertions(+), 22 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6ee1febdf6ff..7d9ae461c535 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2252,7 +2252,7 @@ static int  __perf_install_in_context(void *info)
 	struct perf_event_context *ctx = event->ctx;
 	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 	struct perf_event_context *task_ctx = cpuctx->task_ctx;
-	bool activate = true;
+	bool reprogram = true;
 	int ret = 0;
 
 	raw_spin_lock(&cpuctx->ctx.lock);
@@ -2260,27 +2260,26 @@ static int  __perf_install_in_context(void *info)
 		raw_spin_lock(&ctx->lock);
 		task_ctx = ctx;
 
-		/* If we're on the wrong CPU, try again */
-		if (task_cpu(ctx->task) != smp_processor_id()) {
-			ret = -ESRCH;
-			goto unlock;
-		}
+		reprogram = (ctx->task == current);
 
 		/*
-		 * If we're on the right CPU, see if the task we target is
-		 * current, if not we don't have to activate the ctx, a future
-		 * context switch will do that for us.
+		 * If the task is running, it must be running on this CPU,
+		 * otherwise we cannot reprogram things.
+		 *
+		 * If it's not running, we don't care, ctx->lock will
+		 * serialize against it becoming runnable.
 		 */
-		if (ctx->task != current)
-			activate = false;
-		else
-			WARN_ON_ONCE(cpuctx->task_ctx && cpuctx->task_ctx != ctx);
+		if (task_curr(ctx->task) && !reprogram) {
+			ret = -ESRCH;
+			goto unlock;
+		}
 
+		WARN_ON_ONCE(reprogram && cpuctx->task_ctx && cpuctx->task_ctx != ctx);
 	} else if (task_ctx) {
 		raw_spin_lock(&task_ctx->lock);
 	}
 
-	if (activate) {
+	if (reprogram) {
 		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
 		add_event_to_ctx(event, ctx);
 		ctx_resched(cpuctx, task_ctx);
@@ -2331,13 +2330,36 @@ perf_install_in_context(struct perf_event_context *ctx,
 	/*
 	 * Installing events is tricky because we cannot rely on ctx->is_active
 	 * to be set in case this is the nr_events 0 -> 1 transition.
+	 *
+	 * Instead we use task_curr(), which tells us if the task is running.
+	 * However, since we use task_curr() outside of rq::lock, we can race
+	 * against the actual state. This means the result can be wrong.
+	 *
+	 * If we get a false positive, we retry, this is harmless.
+	 *
+	 * If we get a false negative, things are complicated. If we are after
+	 * perf_event_context_sched_in() ctx::lock will serialize us, and the
+	 * value must be correct. If we're before, it doesn't matter since
+	 * perf_event_context_sched_in() will program the counter.
+	 *
+	 * However, this hinges on the remote context switch having observed
+	 * our task->perf_event_ctxp[] store, such that it will in fact take
+	 * ctx::lock in perf_event_context_sched_in().
+	 *
+	 * We do this by task_function_call(), if the IPI fails to hit the task
+	 * we know any future context switch of task must see the
+	 * perf_event_ctxp[] store.
 	 */
-again:
+
 	/*
-	 * Cannot use task_function_call() because we need to run on the task's
-	 * CPU regardless of whether its current or not.
+	 * This smp_mb() orders the task->perf_event_ctxp[] store with the
+	 * task_cpu() load, such that if the IPI then does not find the task
+	 * running, a future context switch of that task must observe the
+	 * store.
 	 */
-	if (!cpu_function_call(task_cpu(task), __perf_install_in_context, event))
+	smp_mb();
+again:
+	if (!task_function_call(task, __perf_install_in_context, event))
 		return;
 
 	raw_spin_lock_irq(&ctx->lock);
@@ -2351,12 +2373,16 @@ perf_install_in_context(struct perf_event_context *ctx,
 		raw_spin_unlock_irq(&ctx->lock);
 		return;
 	}
-	raw_spin_unlock_irq(&ctx->lock);
 	/*
-	 * Since !ctx->is_active doesn't mean anything, we must IPI
-	 * unconditionally.
+	 * If the task is not running, ctx->lock will avoid it becoming so,
+	 * thus we can safely install the event.
 	 */
-	goto again;
+	if (task_curr(task)) {
+		raw_spin_unlock_irq(&ctx->lock);
+		goto again;
+	}
+	add_event_to_ctx(event, ctx);
+	raw_spin_unlock_irq(&ctx->lock);
 }
 
 /*

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Perf hotplug lockup in v4.9-rc8
  2016-12-09 13:59     ` Peter Zijlstra
@ 2016-12-12 11:46       ` Will Deacon
  2016-12-12 12:42         ` Peter Zijlstra
  2017-01-11 14:59       ` Mark Rutland
  2017-01-14 12:28       ` [tip:perf/urgent] perf/core: Fix sys_perf_event_open() vs. hotplug tip-bot for Peter Zijlstra
  2 siblings, 1 reply; 17+ messages in thread
From: Will Deacon @ 2016-12-12 11:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mark Rutland, linux-kernel, Ingo Molnar,
	Arnaldo Carvalho de Melo, Thomas Gleixner,
	Sebastian Andrzej Siewior, jeremy.linton

On Fri, Dec 09, 2016 at 02:59:00PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 07, 2016 at 07:34:55PM +0100, Peter Zijlstra wrote:
> 
> > @@ -2352,6 +2357,28 @@ perf_install_in_context(struct perf_event_context *ctx,
> >  		return;
> >  	}
> >  	raw_spin_unlock_irq(&ctx->lock);
> > +
> > +	raw_spin_lock_irq(&task->pi_lock);
> > +	if (!(task->state == TASK_RUNNING || task->state == TASK_WAKING)) {
> > +		/*
> > +		 * XXX horrific hack...
> > +		 */
> > +		raw_spin_lock(&ctx->lock);
> > +		if (task != ctx->task) {
> > +			raw_spin_unlock(&ctx->lock);
> > +			raw_spin_unlock_irq(&task->pi_lock);
> > +			goto again;
> > +		}
> > +
> > +		add_event_to_ctx(event, ctx);
> > +		raw_spin_unlock(&ctx->lock);
> > +		raw_spin_unlock_irq(&task->pi_lock);
> > +		return;
> > +	}
> > +	raw_spin_unlock_irq(&task->pi_lock);
> > +
> > +	cond_resched();
> > +
> >  	/*
> >  	 * Since !ctx->is_active doesn't mean anything, we must IPI
> >  	 * unconditionally.
> 
> So while I went back and forth trying to make that less ugly, I figured
> there was another problem.
> 
> Imagine the cpu_function_call() hitting the 'right' cpu, but not finding
> the task current. It will then continue to install the event in the
> context. However, that doesn't stop another CPU from pulling the task in
> question from our rq and scheduling it elsewhere.
> 
> This all led me to the below patch... Now it has a rather large comment,
> and while it represents my current thinking on the matter, I'm not at
> all sure it's entirely correct. I got my brain in a fair twist while
> writing it.
> 
> Please think about it carefully.
> 
> ---
>  kernel/events/core.c | 70 +++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 48 insertions(+), 22 deletions(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 6ee1febdf6ff..7d9ae461c535 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2252,7 +2252,7 @@ static int  __perf_install_in_context(void *info)
>  	struct perf_event_context *ctx = event->ctx;
>  	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
>  	struct perf_event_context *task_ctx = cpuctx->task_ctx;
> -	bool activate = true;
> +	bool reprogram = true;
>  	int ret = 0;
>  
>  	raw_spin_lock(&cpuctx->ctx.lock);
> @@ -2260,27 +2260,26 @@ static int  __perf_install_in_context(void *info)
>  		raw_spin_lock(&ctx->lock);
>  		task_ctx = ctx;
>  
> -		/* If we're on the wrong CPU, try again */
> -		if (task_cpu(ctx->task) != smp_processor_id()) {
> -			ret = -ESRCH;
> -			goto unlock;
> -		}
> +		reprogram = (ctx->task == current);
>  
>  		/*
> -		 * If we're on the right CPU, see if the task we target is
> -		 * current, if not we don't have to activate the ctx, a future
> -		 * context switch will do that for us.
> +		 * If the task is running, it must be running on this CPU,
> +		 * otherwise we cannot reprogram things.
> +		 *
> +		 * If it's not running, we don't care, ctx->lock will
> +		 * serialize against it becoming runnable.
>  		 */
> -		if (ctx->task != current)
> -			activate = false;
> -		else
> -			WARN_ON_ONCE(cpuctx->task_ctx && cpuctx->task_ctx != ctx);
> +		if (task_curr(ctx->task) && !reprogram) {
> +			ret = -ESRCH;
> +			goto unlock;
> +		}
>  
> +		WARN_ON_ONCE(reprogram && cpuctx->task_ctx && cpuctx->task_ctx != ctx);
>  	} else if (task_ctx) {
>  		raw_spin_lock(&task_ctx->lock);
>  	}
>  
> -	if (activate) {
> +	if (reprogram) {
>  		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>  		add_event_to_ctx(event, ctx);
>  		ctx_resched(cpuctx, task_ctx);
> @@ -2331,13 +2330,36 @@ perf_install_in_context(struct perf_event_context *ctx,
>  	/*
>  	 * Installing events is tricky because we cannot rely on ctx->is_active
>  	 * to be set in case this is the nr_events 0 -> 1 transition.
> +	 *
> +	 * Instead we use task_curr(), which tells us if the task is running.
> +	 * However, since we use task_curr() outside of rq::lock, we can race
> +	 * against the actual state. This means the result can be wrong.
> +	 *
> +	 * If we get a false positive, we retry, this is harmless.
> +	 *
> +	 * If we get a false negative, things are complicated. If we are after
> +	 * perf_event_context_sched_in() ctx::lock will serialize us, and the
> +	 * value must be correct. If we're before, it doesn't matter since
> +	 * perf_event_context_sched_in() will program the counter.
> +	 *
> +	 * However, this hinges on the remote context switch having observed
> +	 * our task->perf_event_ctxp[] store, such that it will in fact take
> +	 * ctx::lock in perf_event_context_sched_in().
> +	 *
> +	 * We do this by task_function_call(), if the IPI fails to hit the task
> +	 * we know any future context switch of task must see the
> +	 * perf_event_ctxp[] store.
>  	 */
> -again:
> +
>  	/*
> -	 * Cannot use task_function_call() because we need to run on the task's
> -	 * CPU regardless of whether its current or not.
> +	 * This smp_mb() orders the task->perf_event_ctxp[] store with the
> +	 * task_cpu() load, such that if the IPI then does not find the task
> +	 * running, a future context switch of that task must observe the
> +	 * store.
>  	 */
> -	if (!cpu_function_call(task_cpu(task), __perf_install_in_context, event))
> +	smp_mb();
> +again:
> +	if (!task_function_call(task, __perf_install_in_context, event))
>  		return;

I'm trying to figure out whether or not the barriers implied by the IPI
are sufficient here, or whether we really need the explicit smp_mb().
Certainly, arch_send_call_function_single_ipi has to order the publishing
of the remote work before the signalling of the interrupt, but the comment
above refers to "the task_cpu() load" and I can't see that after your
diff.

What are you trying to order here?

Will

>  
>  	raw_spin_lock_irq(&ctx->lock);
> @@ -2351,12 +2373,16 @@ perf_install_in_context(struct perf_event_context *ctx,
>  		raw_spin_unlock_irq(&ctx->lock);
>  		return;
>  	}
> -	raw_spin_unlock_irq(&ctx->lock);
>  	/*
> -	 * Since !ctx->is_active doesn't mean anything, we must IPI
> -	 * unconditionally.
> +	 * If the task is not running, ctx->lock will avoid it becoming so,
> +	 * thus we can safely install the event.
>  	 */
> -	goto again;
> +	if (task_curr(task)) {
> +		raw_spin_unlock_irq(&ctx->lock);
> +		goto again;
> +	}
> +	add_event_to_ctx(event, ctx);
> +	raw_spin_unlock_irq(&ctx->lock);
>  }
>  
>  /*

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Perf hotplug lockup in v4.9-rc8
  2016-12-12 11:46       ` Will Deacon
@ 2016-12-12 12:42         ` Peter Zijlstra
  2016-12-22  8:45           ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2016-12-12 12:42 UTC (permalink / raw)
  To: Will Deacon
  Cc: Mark Rutland, linux-kernel, Ingo Molnar,
	Arnaldo Carvalho de Melo, Thomas Gleixner,
	Sebastian Andrzej Siewior, jeremy.linton

On Mon, Dec 12, 2016 at 11:46:40AM +0000, Will Deacon wrote:
> > @@ -2331,13 +2330,36 @@ perf_install_in_context(struct perf_event_context *ctx,
> >  	/*
> >  	 * Installing events is tricky because we cannot rely on ctx->is_active
> >  	 * to be set in case this is the nr_events 0 -> 1 transition.
> > +	 *
> > +	 * Instead we use task_curr(), which tells us if the task is running.
> > +	 * However, since we use task_curr() outside of rq::lock, we can race
> > +	 * against the actual state. This means the result can be wrong.
> > +	 *
> > +	 * If we get a false positive, we retry, this is harmless.
> > +	 *
> > +	 * If we get a false negative, things are complicated. If we are after
> > +	 * perf_event_context_sched_in() ctx::lock will serialize us, and the
> > +	 * value must be correct. If we're before, it doesn't matter since
> > +	 * perf_event_context_sched_in() will program the counter.
> > +	 *
> > +	 * However, this hinges on the remote context switch having observed
> > +	 * our task->perf_event_ctxp[] store, such that it will in fact take
> > +	 * ctx::lock in perf_event_context_sched_in().
> > +	 *
> > +	 * We do this by task_function_call(), if the IPI fails to hit the task
> > +	 * we know any future context switch of task must see the
> > +	 * perf_event_ctxp[] store.
> >  	 */
> > +
> >  	/*
> > +	 * This smp_mb() orders the task->perf_event_ctxp[] store with the
> > +	 * task_cpu() load, such that if the IPI then does not find the task
> > +	 * running, a future context switch of that task must observe the
> > +	 * store.
> >  	 */
> > +	smp_mb();
> > +again:
> > +	if (!task_function_call(task, __perf_install_in_context, event))
> >  		return;
> 
> I'm trying to figure out whether or not the barriers implied by the IPI
> are sufficient here, or whether we really need the explicit smp_mb().
> Certainly, arch_send_call_function_single_ipi has to order the publishing
> of the remote work before the signalling of the interrupt, but the comment
> above refers to "the task_cpu() load" and I can't see that after your
> diff.
> 
> What are you trying to order here?

I suppose something like this:


CPU0		CPU1		CPU2

		(current == t)

t->perf_event_ctxp[] = ctx;
smp_mb();
cpu = task_cpu(t);

		switch(t, n);
				migrate(t, 2);
				switch(p, t);

				ctx = t->perf_event_ctxp[]; // must not be NULL

smp_function_call(cpu, ..);

		generic_exec_single()
		  func();
		    spin_lock(ctx->lock);
		    if (task_curr(t)) // false

		    add_event_to_ctx();
		    spin_unlock(ctx->lock);

				perf_event_context_sched_in();
				  spin_lock(ctx->lock);
				  // sees event


Where, between setting the perf_event_ctxp[] and sending the IPI, the
task moves away and the IPI misses it. The new CPU is in the middle of
scheduling in t and hasn't yet passed through perf_event_sched_in(); but
when it does, it _must_ observe the ctx value we stored.

My thinking was that the IPI itself is not sufficient, since when it
misses the task nothing then guarantees we see the store. However, if
we order the store and the task_cpu() load, then any context
switching/migrating involved with changing that value should ensure we
see our prior store.

Of course, even now writing this, I'm still confused.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Perf hotplug lockup in v4.9-rc8
  2016-12-12 12:42         ` Peter Zijlstra
@ 2016-12-22  8:45           ` Peter Zijlstra
  2016-12-22 14:00             ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2016-12-22  8:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: Mark Rutland, linux-kernel, Ingo Molnar,
	Arnaldo Carvalho de Melo, Thomas Gleixner,
	Sebastian Andrzej Siewior, jeremy.linton, Boqun Feng,
	Paul McKenney

On Mon, Dec 12, 2016 at 01:42:28PM +0100, Peter Zijlstra wrote:
> On Mon, Dec 12, 2016 at 11:46:40AM +0000, Will Deacon wrote:
> > > @@ -2331,13 +2330,36 @@ perf_install_in_context(struct perf_event_context *ctx,
> > >  	/*
> > >  	 * Installing events is tricky because we cannot rely on ctx->is_active
> > >  	 * to be set in case this is the nr_events 0 -> 1 transition.
> > > +	 *
> > > +	 * Instead we use task_curr(), which tells us if the task is running.
> > > +	 * However, since we use task_curr() outside of rq::lock, we can race
> > > +	 * against the actual state. This means the result can be wrong.
> > > +	 *
> > > +	 * If we get a false positive, we retry, this is harmless.
> > > +	 *
> > > +	 * If we get a false negative, things are complicated. If we are after
> > > +	 * perf_event_context_sched_in() ctx::lock will serialize us, and the
> > > +	 * value must be correct. If we're before, it doesn't matter since
> > > +	 * perf_event_context_sched_in() will program the counter.
> > > +	 *
> > > +	 * However, this hinges on the remote context switch having observed
> > > +	 * our task->perf_event_ctxp[] store, such that it will in fact take
> > > +	 * ctx::lock in perf_event_context_sched_in().
> > > +	 *
> > > +	 * We do this by task_function_call(), if the IPI fails to hit the task
> > > +	 * we know any future context switch of task must see the
> > > +	 * perf_event_ctxp[] store.
> > >  	 */
> > > +
> > >  	/*
> > > +	 * This smp_mb() orders the task->perf_event_ctxp[] store with the
> > > +	 * task_cpu() load, such that if the IPI then does not find the task
> > > +	 * running, a future context switch of that task must observe the
> > > +	 * store.
> > >  	 */
> > > +	smp_mb();
> > > +again:
> > > +	if (!task_function_call(task, __perf_install_in_context, event))
> > >  		return;
> > 
> > I'm trying to figure out whether or not the barriers implied by the IPI
> > are sufficient here, or whether we really need the explicit smp_mb().
> > Certainly, arch_send_call_function_single_ipi has to order the publishing
> > of the remote work before the signalling of the interrupt, but the comment
> > above refers to "the task_cpu() load" and I can't see that after your
> > diff.
> > 
> > What are you trying to order here?
> 
> I suppose something like this:
> 
> 
> CPU0		CPU1		CPU2
> 
> 		(current == t)
> 
> t->perf_event_ctxp[] = ctx;
> smp_mb();
> cpu = task_cpu(t);
> 
> 		switch(t, n);
> 				migrate(t, 2);
> 				switch(p, t);
> 
> 				ctx = t->perf_event_ctxp[]; // must not be NULL
> 

So I think I can cast the above into a test like:

  W[x] = 1                W[y] = 1                R[z] = 1
  mb                      mb                      mb
  R[y] = 0                W[z] = 1                R[x] = 0

Where x is the perf_event_ctxp[], y is our task's cpu and z is our task
being placed on the rq of cpu2.

See also commit 8643cda549ca ("sched/core, locking: Document
Program-Order guarantees"). Independent of which CPU initiates the
migration between CPU1 and CPU2, there is ordering between the CPUs.

This would then translate into something like:

  C C-peterz

  {
  }

  P0(int *x, int *y)
  {
	  int r1;

	  WRITE_ONCE(*x, 1);
	  smp_mb();
	  r1 = READ_ONCE(*y);
  }

  P1(int *y, int *z)
  {
	  WRITE_ONCE(*y, 1);
	  smp_mb();
	  WRITE_ONCE(*z, 1);
  }

  P2(int *x, int *z)
  {
	  int r1;
	  int r2;

	  r1 = READ_ONCE(*z);
	  smp_mb();
	  r2 = READ_ONCE(*x);
  }

  exists
  (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)

Which evaluates into:

  Test C-peterz Allowed
  States 7
  0:r1=0; 2:r1=0; 2:r2=0;
  0:r1=0; 2:r1=0; 2:r2=1;
  0:r1=0; 2:r1=1; 2:r2=1;
  0:r1=1; 2:r1=0; 2:r2=0;
  0:r1=1; 2:r1=0; 2:r2=1;
  0:r1=1; 2:r1=1; 2:r2=0;
  0:r1=1; 2:r1=1; 2:r2=1;
  No
  Witnesses
  Positive: 0 Negative: 7
  Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
  Observation C-peterz Never 0 7
  Hash=661589febb9e41b222d8acae1fd64e25

And the strong and weak model agree.


> smp_function_call(cpu, ..);
> 
> 		generic_exec_single()
> 		  func();
> 		    spin_lock(ctx->lock);
> 		    if (task_curr(t)) // false
> 
> 		    add_event_to_ctx();
> 		    spin_unlock(ctx->lock);
> 
> 				perf_event_context_sched_in();
> 				  spin_lock(ctx->lock);
> 				  // sees event
> 
> 
> Where, between setting the perf_event_ctxp[] and sending the IPI, the
> task moves away and the IPI misses it. The new CPU is in the middle of
> scheduling in t and hasn't yet passed through perf_event_sched_in(); but
> when it does, it _must_ observe the ctx value we stored.
> 
> My thinking was that the IPI itself is not sufficient, since when it
> misses the task nothing then guarantees we see the store. However, if
> we order the store and the task_cpu() load, then any context
> switching/migrating involved with changing that value should ensure we
> see our prior store.
> 
> Of course, even now writing this, I'm still confused.

On IRC you said:

: I think it's similar to the "ISA2" litmus test, only the first reads-from edge is an IPI and the second is an Unlock->Lock

If the IPI misses, we cannot use the IPI itself for anything, I'm
afraid; but per the above we don't need to.

: the case I'm more confused by is if CPU2 takes the ctx->lock before CPU1
: I'm guessing that's prevented by the way migration works?

So, same scenario, but CPU2 takes ctx->lock first. In that case it will
not observe our event and will do nothing. CPU1 will then acquire
ctx->lock; this implies ordering against CPU2, which means CPU1 _must_
observe task_curr() && task != current. It too will do nothing, but
we'll loop and try the whole thing again.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Perf hotplug lockup in v4.9-rc8
  2016-12-22  8:45           ` Peter Zijlstra
@ 2016-12-22 14:00             ` Peter Zijlstra
  2016-12-22 16:33               ` Paul E. McKenney
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2016-12-22 14:00 UTC (permalink / raw)
  To: Will Deacon
  Cc: Mark Rutland, linux-kernel, Ingo Molnar,
	Arnaldo Carvalho de Melo, Thomas Gleixner,
	Sebastian Andrzej Siewior, jeremy.linton, Boqun Feng,
	Paul McKenney

On Thu, Dec 22, 2016 at 09:45:09AM +0100, Peter Zijlstra wrote:
> On Mon, Dec 12, 2016 at 01:42:28PM +0100, Peter Zijlstra wrote:

> > > What are you trying to order here?
> > 
> > I suppose something like this:
> > 
> > 
> > CPU0		CPU1		CPU2
> > 
> > 		(current == t)
> > 
> > t->perf_event_ctxp[] = ctx;
> > smp_mb();
> > cpu = task_cpu(t);
> > 
> > 		switch(t, n);
> > 				migrate(t, 2);
> > 				switch(p, t);
> > 
> > 				ctx = t->perf_event_ctxp[]; // must not be NULL
> > 
> 
> So I think I can cast the above into a test like:
> 
>   W[x] = 1                W[y] = 1                R[z] = 1
>   mb                      mb                      mb
>   R[y] = 0                W[z] = 1                R[x] = 0
> 
> Where x is the perf_event_ctxp[], y is our task's cpu and z is our task
> being placed on the rq of cpu2.
> 
> See also commit 8643cda549ca ("sched/core, locking: Document
> Program-Order guarantees"). Independent of which CPU initiates the
> migration between CPU1 and CPU2, there is ordering between the CPUs.

I think that when we assume RCpc locks, the above CPU1 mb ends up being
something like an smp_wmb() (i.e. non-transitive). CPU2 needs to do a
context switch between observing the task on its runqueue and getting to
switching in perf events for the task, which keeps that a full mb.

Now, if only this model would have locks in ;-)

> This would then translate into something like:
> 
>   C C-peterz
> 
>   {
>   }
> 
>   P0(int *x, int *y)
>   {
> 	  int r1;
> 
> 	  WRITE_ONCE(*x, 1);
> 	  smp_mb();
> 	  r1 = READ_ONCE(*y);
>   }
> 
>   P1(int *y, int *z)
>   {
> 	  WRITE_ONCE(*y, 1);
> 	  smp_mb();

And this modified to: smp_wmb()

> 	  WRITE_ONCE(*z, 1);
>   }
> 
>   P2(int *x, int *z)
>   {
> 	  int r1;
> 	  int r2;
> 
> 	  r1 = READ_ONCE(*z);
> 	  smp_mb();
> 	  r2 = READ_ONCE(*x);
>   }
> 
>   exists
>   (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)

Still results in the same outcome.

If, however, we change P2's barrier into an smp_rmb(), it does become
possible; but as said above, there's a context switch in between, which
implies a full barrier, so no worries.

Similarly, the outcome is unchanged if I replace all accesses to z with
smp_store_release() and smp_load_acquire().
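
For concreteness, the release/acquire variant would look something like
the below (a sketch of the modified processes only, not a verified
litmus file; P0 is unchanged, and I've kept the other barriers in
place):

```
  P1(int *y, int *z)
  {
	  WRITE_ONCE(*y, 1);
	  smp_mb();
	  smp_store_release(z, 1);	/* was: WRITE_ONCE(*z, 1) */
  }

  P2(int *x, int *z)
  {
	  int r1;
	  int r2;

	  r1 = smp_load_acquire(z);	/* was: READ_ONCE(*z) */
	  smp_mb();
	  r2 = READ_ONCE(*x);
  }
```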


Of course, it's entirely possible the litmus test doesn't reflect
reality; I still find it somewhat hard to write these things.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Perf hotplug lockup in v4.9-rc8
  2016-12-22 14:00             ` Peter Zijlstra
@ 2016-12-22 16:33               ` Paul E. McKenney
  0 siblings, 0 replies; 17+ messages in thread
From: Paul E. McKenney @ 2016-12-22 16:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Mark Rutland, linux-kernel, Ingo Molnar,
	Arnaldo Carvalho de Melo, Thomas Gleixner,
	Sebastian Andrzej Siewior, jeremy.linton, Boqun Feng

On Thu, Dec 22, 2016 at 03:00:10PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 22, 2016 at 09:45:09AM +0100, Peter Zijlstra wrote:
> > On Mon, Dec 12, 2016 at 01:42:28PM +0100, Peter Zijlstra wrote:
> 
> > > > What are you trying to order here?
> > > 
> > > I suppose something like this:
> > > 
> > > 
> > > CPU0		CPU1		CPU2
> > > 
> > > 		(current == t)
> > > 
> > > t->perf_event_ctxp[] = ctx;
> > > smp_mb();
> > > cpu = task_cpu(t);
> > > 
> > > 		switch(t, n);
> > > 				migrate(t, 2);
> > > 				switch(p, t);
> > > 
> > > 				ctx = t->perf_event_ctxp[]; // must not be NULL
> > > 
> > 
> > So I think I can cast the above into a test like:
> > 
> >   W[x] = 1                W[y] = 1                R[z] = 1
> >   mb                      mb                      mb
> >   R[y] = 0                W[z] = 1                R[x] = 0
> > 
> > Where x is the perf_event_ctxp[], y is our task's cpu and z is our task
> > being placed on the rq of cpu2.
> > 
> > See also commit 8643cda549ca ("sched/core, locking: Document
> > Program-Order guarantees"). Independent of which CPU initiates the
> > migration between CPU1 and CPU2, there is ordering between the CPUs.
> 
> I think that when we assume RCpc locks, the above CPU1 mb ends up being
> something like an smp_wmb() (ie. non transitive). CPU2 needs to do a
> context switch between observing the task on its runqueue and getting to
> switching in perf-events for the task, which keeps that a full mb.
> 
> Now, if only this model would have locks in ;-)

Yeah, we are slow.  ;-)

But you should be able to emulate them with xchg_acquire() and
smp_store_release().

							Thanx, Paul

> > This would then translate into something like:
> > 
> >   C C-peterz
> > 
> >   {
> >   }
> > 
> >   P0(int *x, int *y)
> >   {
> > 	  int r1;
> > 
> > 	  WRITE_ONCE(*x, 1);
> > 	  smp_mb();
> > 	  r1 = READ_ONCE(*y);
> >   }
> > 
> >   P1(int *y, int *z)
> >   {
> > 	  WRITE_ONCE(*y, 1);
> > 	  smp_mb();
> 
> And this modified to: smp_wmb()
> 
> > 	  WRITE_ONCE(*z, 1);
> >   }
> > 
> >   P2(int *x, int *z)
> >   {
> > 	  int r1;
> > 	  int r2;
> > 
> > 	  r1 = READ_ONCE(*z);
> > 	  smp_mb();
> > 	  r2 = READ_ONCE(*x);
> >   }
> > 
> >   exists
> >   (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
> 
> Still results in the same outcome.
> 
> If however we change P2's barrier into a smp_rmb() it does become
> possible, but as said above, there's a context switch in between which
> implies a full barrier so no worries.
> 
> Similar if I replace everything z with smp_store_release() and
> smp_load_acquire().
> 
> 
> Of course, it's entirely possible the litmus test doesn't reflect
> reality; I still find it somewhat hard to write these things.
> 


* Re: Perf hotplug lockup in v4.9-rc8
  2016-12-09 13:59     ` Peter Zijlstra
  2016-12-12 11:46       ` Will Deacon
@ 2017-01-11 14:59       ` Mark Rutland
  2017-01-11 16:03         ` Peter Zijlstra
  2017-01-14 12:28       ` [tip:perf/urgent] perf/core: Fix sys_perf_event_open() vs. hotplug tip-bot for Peter Zijlstra
  2 siblings, 1 reply; 17+ messages in thread
From: Mark Rutland @ 2017-01-11 14:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Sebastian Andrzej Siewior, jeremy.linton,
	Will Deacon

Hi Peter,

Sorry for the delay; this fell into my backlog over the holiday.

On Fri, Dec 09, 2016 at 02:59:00PM +0100, Peter Zijlstra wrote:
> So while I went back and forth trying to make that less ugly, I figured
> there was another problem.
> 
> Imagine the cpu_function_call() hitting the 'right' cpu, but not finding
> the task current. It will then continue to install the event in the
> context. However, that doesn't stop another CPU from pulling the task in
> question from our rq and scheduling it elsewhere.
> 
> This all led me to the below patch. Now it has a rather large comment,
> and while it represents my current thinking on the matter, I'm not at
> all sure it's entirely correct. I got my brain in a fair twist while
> writing it.
> 
> Please think carefully about it.

FWIW, I've given the below a spin on a few systems, and with it applied
my reproducer no longer triggers the issue.

Unfortunately, most of the ordering concerns have gone over my head. :/

> @@ -2331,13 +2330,36 @@ perf_install_in_context(struct perf_event_context *ctx,
>  	/*
>  	 * Installing events is tricky because we cannot rely on ctx->is_active
>  	 * to be set in case this is the nr_events 0 -> 1 transition.
> +	 *
> +	 * Instead we use task_curr(), which tells us if the task is running.
> +	 * However, since we use task_curr() outside of rq::lock, we can race
> +	 * against the actual state. This means the result can be wrong.
> +	 *
> +	 * If we get a false positive, we retry, this is harmless.
> +	 *
> +	 * If we get a false negative, things are complicated. If we are after
> +	 * perf_event_context_sched_in() ctx::lock will serialize us, and the
> +	 * value must be correct. If we're before, it doesn't matter since
> +	 * perf_event_context_sched_in() will program the counter.
> +	 *
> +	 * However, this hinges on the remote context switch having observed
> +	 * our task->perf_event_ctxp[] store, such that it will in fact take
> +	 * ctx::lock in perf_event_context_sched_in().

Sorry if I'm being thick here, but which store are we describing above?
i.e. which function, how does that relate to perf_install_in_context()?

I haven't managed to wrap my head around why this matters. :/

Thanks,
Mark.


* Re: Perf hotplug lockup in v4.9-rc8
  2017-01-11 14:59       ` Mark Rutland
@ 2017-01-11 16:03         ` Peter Zijlstra
  2017-01-11 16:26           ` Mark Rutland
  2017-01-11 19:51           ` Peter Zijlstra
  0 siblings, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2017-01-11 16:03 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Sebastian Andrzej Siewior, jeremy.linton,
	Will Deacon

On Wed, Jan 11, 2017 at 02:59:20PM +0000, Mark Rutland wrote:
> Hi Peter,
> 
> Sorry for the delay; this fell into my backlog over the holiday.
> 
> On Fri, Dec 09, 2016 at 02:59:00PM +0100, Peter Zijlstra wrote:
> > So while I went back and forth trying to make that less ugly, I figured
> > there was another problem.
> > 
> > Imagine the cpu_function_call() hitting the 'right' cpu, but not finding
> > the task current. It will then continue to install the event in the
> > context. However, that doesn't stop another CPU from pulling the task in
> > question from our rq and scheduling it elsewhere.
> > 
> > This all led me to the below patch. Now it has a rather large comment,
> > and while it represents my current thinking on the matter, I'm not at
> > all sure it's entirely correct. I got my brain in a fair twist while
> > writing it.
> > 
> > Please think carefully about it.
> 
> FWIW, I've given the below a spin on a few systems, and with it applied
> my reproducer no longer triggers the issue.
> 
> Unfortunately, most of the ordering concerns have gone over my head. :/
> 
> > @@ -2331,13 +2330,36 @@ perf_install_in_context(struct perf_event_context *ctx,
> >  	/*
> >  	 * Installing events is tricky because we cannot rely on ctx->is_active
> >  	 * to be set in case this is the nr_events 0 -> 1 transition.
> > +	 *
> > +	 * Instead we use task_curr(), which tells us if the task is running.
> > +	 * However, since we use task_curr() outside of rq::lock, we can race
> > +	 * against the actual state. This means the result can be wrong.
> > +	 *
> > +	 * If we get a false positive, we retry, this is harmless.
> > +	 *
> > +	 * If we get a false negative, things are complicated. If we are after
> > +	 * perf_event_context_sched_in() ctx::lock will serialize us, and the
> > +	 * value must be correct. If we're before, it doesn't matter since
> > +	 * perf_event_context_sched_in() will program the counter.
> > +	 *
> > +	 * However, this hinges on the remote context switch having observed
> > +	 * our task->perf_event_ctxp[] store, such that it will in fact take
> > +	 * ctx::lock in perf_event_context_sched_in().
> 
> Sorry if I'm being thick here, but which store are we describing above?
> i.e. which function, how does that relate to perf_install_in_context()?

The only store to perf_event_ctxp[] of interest is the initial one in
find_get_context().

> I haven't managed to wrap my head around why this matters. :/

See the scenario from:

 https://lkml.kernel.org/r/20161212124228.GE3124@twins.programming.kicks-ass.net

It's installing the first event on 't', which concurrently with the
install gets migrated to a third CPU.


CPU0            CPU1            CPU2

                (current == t)

t->perf_event_ctxp[] = ctx;
smp_mb();
cpu = task_cpu(t);

                switch(t, n);
                                migrate(t, 2);
                                switch(p, t);

                                ctx = t->perf_event_ctxp[]; // must not be NULL

smp_function_call(cpu, ..);

                generic_exec_single()
                  func();
                    spin_lock(ctx->lock);
                    if (task_curr(t)) // false

                    add_event_to_ctx();
                    spin_unlock(ctx->lock);

                                perf_event_context_sched_in();
                                  spin_lock(ctx->lock);
                                  // sees event



So it's CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
Because if CPU2's load of that variable were to observe NULL, it would
not try to schedule the ctx and we'd have a task running without its
counter, which would be 'bad'.

As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
first and not see the event yet, then CPU0 must observe task_curr()
and retry. If the install happens first, then we must see the event on
sched-in and all is well.



In any case, I'll try and write a proper Changelog for this...


* Re: Perf hotplug lockup in v4.9-rc8
  2017-01-11 16:03         ` Peter Zijlstra
@ 2017-01-11 16:26           ` Mark Rutland
  2017-01-11 19:51           ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Mark Rutland @ 2017-01-11 16:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Sebastian Andrzej Siewior, jeremy.linton,
	Will Deacon

On Wed, Jan 11, 2017 at 05:03:58PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 11, 2017 at 02:59:20PM +0000, Mark Rutland wrote:
> > On Fri, Dec 09, 2016 at 02:59:00PM +0100, Peter Zijlstra wrote:

> > > +	 * If we get a false negative, things are complicated. If we are after
> > > +	 * perf_event_context_sched_in() ctx::lock will serialize us, and the
> > > +	 * value must be correct. If we're before, it doesn't matter since
> > > +	 * perf_event_context_sched_in() will program the counter.
> > > +	 *
> > > +	 * However, this hinges on the remote context switch having observed
> > > +	 * our task->perf_event_ctxp[] store, such that it will in fact take
> > > +	 * ctx::lock in perf_event_context_sched_in().
> > 
> > Sorry if I'm being thick here, but which store are we describing above?
> > i.e. which function, how does that relate to perf_install_in_context()?
> 
> The only store to perf_event_ctxp[] of interest is the initial one in
> find_get_context().

Ah, I see. I'd missed the rcu_assign_pointer() when looking around for
an assignment.

> > I haven't managed to wrap my head around why this matters. :/
> 
> See the scenario from:
> 
>  https://lkml.kernel.org/r/20161212124228.GE3124@twins.programming.kicks-ass.net
> 
> It's installing the first event on 't', which concurrently with the
> install gets migrated to a third CPU.

I was completely failing to consider that this was the installation of
the first event; I should have read the existing comment. Things make a
lot more sense now.

> CPU0            CPU1            CPU2
> 
>                 (current == t)
> 
> t->perf_event_ctxp[] = ctx;
> smp_mb();
> cpu = task_cpu(t);
> 
>                 switch(t, n);
>                                 migrate(t, 2);
>                                 switch(p, t);
> 
>                                 ctx = t->perf_event_ctxp[]; // must not be NULL
> 
> smp_function_call(cpu, ..);
> 
>                 generic_exec_single()
>                   func();
>                     spin_lock(ctx->lock);
>                     if (task_curr(t)) // false
> 
>                     add_event_to_ctx();
>                     spin_unlock(ctx->lock);
> 
>                                 perf_event_context_sched_in();
>                                   spin_lock(ctx->lock);
>                                   // sees event
> 
> 
> 
> So it's CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
> Because if CPU2's load of that variable were to observe NULL, it would
> not try to schedule the ctx and we'd have a task running without its
> counter, which would be 'bad'.
> 
> As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
> first and not see the event yet, then CPU0 must observe task_curr()
> and retry. If the install happens first, then we must see the event on
> sched-in and all is well.

I think I follow now. Thanks for bearing with me!

> In any case, I'll try and write a proper Changelog for this...

If it's just the commit message and/or comments changing, feel free to
add:

Tested-by: Mark Rutland <mark.rutland@arm.com>

Thanks,
Mark.


* Re: Perf hotplug lockup in v4.9-rc8
  2017-01-11 16:03         ` Peter Zijlstra
  2017-01-11 16:26           ` Mark Rutland
@ 2017-01-11 19:51           ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2017-01-11 19:51 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Sebastian Andrzej Siewior, jeremy.linton,
	Will Deacon

On Wed, Jan 11, 2017 at 05:03:58PM +0100, Peter Zijlstra wrote:
> 
> In any case, I'll try and write a proper Changelog for this...

This is what I came up with; most of it should look familiar, as it's
copy/pasted bits from this thread.

---
Subject: perf: Fix sys_perf_event_open() vs hotplug
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 9 Dec 2016 14:59:00 +0100

There is a problem with installing an event in a task that is 'stuck'
on an offline CPU.

Blocked tasks are not disassociated from offlined CPUs; after all, a
blocked task doesn't run and doesn't require a CPU etc. Only on
wakeup do we amend the situation and place the task on an available
CPU.

If we hit such a task with perf_install_in_context() we'll loop until
either that task wakes up or the CPU comes back online; if the task
waking depends on the event being installed, we're stuck.

While looking into this issue, I also spotted another problem: if we
hit a task with perf_install_in_context() that is in the middle of
being migrated, that is, we observe the old CPU before sending the IPI
but run the IPI (on the old CPU) while the task is already running on
the new CPU, things also go sideways.

Rework things to rely on task_curr() -- outside of rq->lock -- which
is rather tricky. Imagine the following scenario where we're trying to
install the first event into our task 't':


CPU0            CPU1            CPU2

                (current == t)

t->perf_event_ctxp[] = ctx;
smp_mb();
cpu = task_cpu(t);

                switch(t, n);
                                migrate(t, 2);
                                switch(p, t);

                                ctx = t->perf_event_ctxp[]; // must not be NULL

smp_function_call(cpu, ..);

                generic_exec_single()
                  func();
                    spin_lock(ctx->lock);
                    if (task_curr(t)) // false

                    add_event_to_ctx();
                    spin_unlock(ctx->lock);

                                perf_event_context_sched_in();
                                  spin_lock(ctx->lock);
                                  // sees event


So it's CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
Because if CPU2's load of that variable were to observe NULL, it would
not try to schedule the ctx and we'd have a task running without its
counter, which would be 'bad'.

As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
first and not see the event yet, then CPU0 must observe task_curr()
and retry. If the install happens first, then we must see the event on
sched-in and all is well.

I think we can translate the first part (until the 'must not be NULL')
of the scenario to a litmus test like:

  C C-peterz

  {
  }

  P0(int *x, int *y)
  {
          int r1;

          WRITE_ONCE(*x, 1);
          smp_mb();
          r1 = READ_ONCE(*y);
  }

  P1(int *y, int *z)
  {
          WRITE_ONCE(*y, 1);
          smp_store_release(z, 1);
  }

  P2(int *x, int *z)
  {
          int r1;
          int r2;

          r1 = smp_load_acquire(z);
          smp_mb();
          r2 = READ_ONCE(*x);
  }

  exists
  (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)

Where:
  x is perf_event_ctxp[],
  y is our task's CPU, and
  z is our task being placed on the rq of CPU2.

The P0 smp_mb() is the one added by this patch, ordering the store to
perf_event_ctxp[] from find_get_context() and the load of task_cpu()
in task_function_call().

The smp_store_release/smp_load_acquire model the RCpc locking of the
rq->lock and the smp_mb() of P2 is the context switch switching from
whatever CPU2 was running to our task 't'.

This litmus test evaluates into:

  Test C-peterz Allowed
  States 7
  0:r1=0; 2:r1=0; 2:r2=0;
  0:r1=0; 2:r1=0; 2:r2=1;
  0:r1=0; 2:r1=1; 2:r2=1;
  0:r1=1; 2:r1=0; 2:r2=0;
  0:r1=1; 2:r1=0; 2:r2=1;
  0:r1=1; 2:r1=1; 2:r2=0;
  0:r1=1; 2:r1=1; 2:r2=1;
  No
  Witnesses
  Positive: 0 Negative: 7
  Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
  Observation C-peterz Never 0 7
  Hash=e427f41d9146b2a5445101d3e2fcaa34

And the strong and weak model agree.

Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: jeremy.linton@arm.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20161209135900.GU3174@twins.programming.kicks-ass.net
---
 kernel/events/core.c |   70 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 48 insertions(+), 22 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2249,7 +2249,7 @@ static int  __perf_install_in_context(vo
 	struct perf_event_context *ctx = event->ctx;
 	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 	struct perf_event_context *task_ctx = cpuctx->task_ctx;
-	bool activate = true;
+	bool reprogram = true;
 	int ret = 0;
 
 	raw_spin_lock(&cpuctx->ctx.lock);
@@ -2257,27 +2257,26 @@ static int  __perf_install_in_context(vo
 		raw_spin_lock(&ctx->lock);
 		task_ctx = ctx;
 
-		/* If we're on the wrong CPU, try again */
-		if (task_cpu(ctx->task) != smp_processor_id()) {
-			ret = -ESRCH;
-			goto unlock;
-		}
+		reprogram = (ctx->task == current);
 
 		/*
-		 * If we're on the right CPU, see if the task we target is
-		 * current, if not we don't have to activate the ctx, a future
-		 * context switch will do that for us.
+		 * If the task is running, it must be running on this CPU,
+		 * otherwise we cannot reprogram things.
+		 *
+		 * If its not running, we don't care, ctx->lock will
+		 * serialize against it becoming runnable.
 		 */
-		if (ctx->task != current)
-			activate = false;
-		else
-			WARN_ON_ONCE(cpuctx->task_ctx && cpuctx->task_ctx != ctx);
+		if (task_curr(ctx->task) && !reprogram) {
+			ret = -ESRCH;
+			goto unlock;
+		}
 
+		WARN_ON_ONCE(reprogram && cpuctx->task_ctx && cpuctx->task_ctx != ctx);
 	} else if (task_ctx) {
 		raw_spin_lock(&task_ctx->lock);
 	}
 
-	if (activate) {
+	if (reprogram) {
 		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
 		add_event_to_ctx(event, ctx);
 		ctx_resched(cpuctx, task_ctx);
@@ -2328,13 +2327,36 @@ perf_install_in_context(struct perf_even
 	/*
 	 * Installing events is tricky because we cannot rely on ctx->is_active
 	 * to be set in case this is the nr_events 0 -> 1 transition.
+	 *
+	 * Instead we use task_curr(), which tells us if the task is running.
+	 * However, since we use task_curr() outside of rq::lock, we can race
+	 * against the actual state. This means the result can be wrong.
+	 *
+	 * If we get a false positive, we retry, this is harmless.
+	 *
+	 * If we get a false negative, things are complicated. If we are after
+	 * perf_event_context_sched_in() ctx::lock will serialize us, and the
+	 * value must be correct. If we're before, it doesn't matter since
+	 * perf_event_context_sched_in() will program the counter.
+	 *
+	 * However, this hinges on the remote context switch having observed
+	 * our task->perf_event_ctxp[] store, such that it will in fact take
+	 * ctx::lock in perf_event_context_sched_in().
+	 *
+	 * We do this by task_function_call(), if the IPI fails to hit the task
+	 * we know any future context switch of task must see the
+	 * perf_event_ctpx[] store.
 	 */
-again:
+
 	/*
-	 * Cannot use task_function_call() because we need to run on the task's
-	 * CPU regardless of whether its current or not.
+	 * This smp_mb() orders the task->perf_event_ctxp[] store with the
+	 * task_cpu() load, such that if the IPI then does not find the task
+	 * running, a future context switch of that task must observe the
+	 * store.
 	 */
-	if (!cpu_function_call(task_cpu(task), __perf_install_in_context, event))
+	smp_mb();
+again:
+	if (!task_function_call(task, __perf_install_in_context, event))
 		return;
 
 	raw_spin_lock_irq(&ctx->lock);
@@ -2348,12 +2370,16 @@ perf_install_in_context(struct perf_even
 		raw_spin_unlock_irq(&ctx->lock);
 		return;
 	}
-	raw_spin_unlock_irq(&ctx->lock);
 	/*
-	 * Since !ctx->is_active doesn't mean anything, we must IPI
-	 * unconditionally.
+	 * If the task is not running, ctx->lock will avoid it becoming so,
+	 * thus we can safely install the event.
 	 */
-	goto again;
+	if (task_curr(task)) {
+		raw_spin_unlock_irq(&ctx->lock);
+		goto again;
+	}
+	add_event_to_ctx(event, ctx);
+	raw_spin_unlock_irq(&ctx->lock);
 }
 
 /*


* [tip:perf/urgent] perf/core: Fix sys_perf_event_open() vs. hotplug
  2016-12-09 13:59     ` Peter Zijlstra
  2016-12-12 11:46       ` Will Deacon
  2017-01-11 14:59       ` Mark Rutland
@ 2017-01-14 12:28       ` tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 17+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-01-14 12:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, peterz, jolsa, tglx, mingo, acme,
	mark.rutland, will.deacon, eranian, alexander.shishkin, torvalds,
	bigeasy, vincent.weaver, acme

Commit-ID:  63cae12bce9861cec309798d34701cf3da20bc71
Gitweb:     http://git.kernel.org/tip/63cae12bce9861cec309798d34701cf3da20bc71
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 9 Dec 2016 14:59:00 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sat, 14 Jan 2017 10:56:10 +0100

perf/core: Fix sys_perf_event_open() vs. hotplug

There is a problem with installing an event in a task that is 'stuck'
on an offline CPU.

Blocked tasks are not disassociated from offlined CPUs; after all, a
blocked task doesn't run and doesn't require a CPU etc. Only on
wakeup do we amend the situation and place the task on an available
CPU.

If we hit such a task with perf_install_in_context() we'll loop until
either that task wakes up or the CPU comes back online; if the task
waking depends on the event being installed, we're stuck.

While looking into this issue, I also spotted another problem: if we
hit a task with perf_install_in_context() that is in the middle of
being migrated, that is, we observe the old CPU before sending the IPI
but run the IPI (on the old CPU) while the task is already running on
the new CPU, things also go sideways.

Rework things to rely on task_curr() -- outside of rq->lock -- which
is rather tricky. Imagine the following scenario where we're trying to
install the first event into our task 't':

CPU0            CPU1            CPU2

                (current == t)

t->perf_event_ctxp[] = ctx;
smp_mb();
cpu = task_cpu(t);

                switch(t, n);
                                migrate(t, 2);
                                switch(p, t);

                                ctx = t->perf_event_ctxp[]; // must not be NULL

smp_function_call(cpu, ..);

                generic_exec_single()
                  func();
                    spin_lock(ctx->lock);
                    if (task_curr(t)) // false

                    add_event_to_ctx();
                    spin_unlock(ctx->lock);

                                perf_event_context_sched_in();
                                  spin_lock(ctx->lock);
                                  // sees event

So it's CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
Because if CPU2's load of that variable were to observe NULL, it would
not try to schedule the ctx and we'd have a task running without its
counter, which would be 'bad'.

As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
first and not see the event yet, then CPU0 must observe task_curr()
and retry. If the install happens first, then we must see the event on
sched-in and all is well.

I think we can translate the first part (until the 'must not be NULL')
of the scenario to a litmus test like:

  C C-peterz

  {
  }

  P0(int *x, int *y)
  {
          int r1;

          WRITE_ONCE(*x, 1);
          smp_mb();
          r1 = READ_ONCE(*y);
  }

  P1(int *y, int *z)
  {
          WRITE_ONCE(*y, 1);
          smp_store_release(z, 1);
  }

  P2(int *x, int *z)
  {
          int r1;
          int r2;

          r1 = smp_load_acquire(z);
          smp_mb();
          r2 = READ_ONCE(*x);
  }

  exists
  (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)

Where:
  x is perf_event_ctxp[],
  y is our task's CPU, and
  z is our task being placed on the rq of CPU2.

The P0 smp_mb() is the one added by this patch, ordering the store to
perf_event_ctxp[] from find_get_context() and the load of task_cpu()
in task_function_call().

The smp_store_release/smp_load_acquire model the RCpc locking of the
rq->lock and the smp_mb() of P2 is the context switch switching from
whatever CPU2 was running to our task 't'.

This litmus test evaluates into:

  Test C-peterz Allowed
  States 7
  0:r1=0; 2:r1=0; 2:r2=0;
  0:r1=0; 2:r1=0; 2:r2=1;
  0:r1=0; 2:r1=1; 2:r2=1;
  0:r1=1; 2:r1=0; 2:r2=0;
  0:r1=1; 2:r1=0; 2:r2=1;
  0:r1=1; 2:r1=1; 2:r2=0;
  0:r1=1; 2:r1=1; 2:r2=1;
  No
  Witnesses
  Positive: 0 Negative: 7
  Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
  Observation C-peterz Never 0 7
  Hash=e427f41d9146b2a5445101d3e2fcaa34

And the strong and weak model agree.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: jeremy.linton@arm.com
Link: http://lkml.kernel.org/r/20161209135900.GU3174@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/events/core.c | 70 +++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 48 insertions(+), 22 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index ab15509..72ce7d6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2249,7 +2249,7 @@ static int  __perf_install_in_context(void *info)
 	struct perf_event_context *ctx = event->ctx;
 	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 	struct perf_event_context *task_ctx = cpuctx->task_ctx;
-	bool activate = true;
+	bool reprogram = true;
 	int ret = 0;
 
 	raw_spin_lock(&cpuctx->ctx.lock);
@@ -2257,27 +2257,26 @@ static int  __perf_install_in_context(void *info)
 		raw_spin_lock(&ctx->lock);
 		task_ctx = ctx;
 
-		/* If we're on the wrong CPU, try again */
-		if (task_cpu(ctx->task) != smp_processor_id()) {
-			ret = -ESRCH;
-			goto unlock;
-		}
+		reprogram = (ctx->task == current);
 
 		/*
-		 * If we're on the right CPU, see if the task we target is
-		 * current, if not we don't have to activate the ctx, a future
-		 * context switch will do that for us.
+		 * If the task is running, it must be running on this CPU,
+		 * otherwise we cannot reprogram things.
+		 *
+		 * If its not running, we don't care, ctx->lock will
+		 * serialize against it becoming runnable.
 		 */
-		if (ctx->task != current)
-			activate = false;
-		else
-			WARN_ON_ONCE(cpuctx->task_ctx && cpuctx->task_ctx != ctx);
+		if (task_curr(ctx->task) && !reprogram) {
+			ret = -ESRCH;
+			goto unlock;
+		}
 
+		WARN_ON_ONCE(reprogram && cpuctx->task_ctx && cpuctx->task_ctx != ctx);
 	} else if (task_ctx) {
 		raw_spin_lock(&task_ctx->lock);
 	}
 
-	if (activate) {
+	if (reprogram) {
 		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
 		add_event_to_ctx(event, ctx);
 		ctx_resched(cpuctx, task_ctx);
@@ -2328,13 +2327,36 @@ perf_install_in_context(struct perf_event_context *ctx,
 	/*
 	 * Installing events is tricky because we cannot rely on ctx->is_active
 	 * to be set in case this is the nr_events 0 -> 1 transition.
+	 *
+	 * Instead we use task_curr(), which tells us if the task is running.
+	 * However, since we use task_curr() outside of rq::lock, we can race
+	 * against the actual state. This means the result can be wrong.
+	 *
+	 * If we get a false positive, we retry, this is harmless.
+	 *
+	 * If we get a false negative, things are complicated. If we are after
+	 * perf_event_context_sched_in() ctx::lock will serialize us, and the
+	 * value must be correct. If we're before, it doesn't matter since
+	 * perf_event_context_sched_in() will program the counter.
+	 *
+	 * However, this hinges on the remote context switch having observed
+	 * our task->perf_event_ctxp[] store, such that it will in fact take
+	 * ctx::lock in perf_event_context_sched_in().
+	 *
+	 * We do this by task_function_call(), if the IPI fails to hit the task
+	 * we know any future context switch of task must see the
+	 * perf_event_ctpx[] store.
 	 */
-again:
+
 	/*
-	 * Cannot use task_function_call() because we need to run on the task's
-	 * CPU regardless of whether its current or not.
+	 * This smp_mb() orders the task->perf_event_ctxp[] store with the
+	 * task_cpu() load, such that if the IPI then does not find the task
+	 * running, a future context switch of that task must observe the
+	 * store.
 	 */
-	if (!cpu_function_call(task_cpu(task), __perf_install_in_context, event))
+	smp_mb();
+again:
+	if (!task_function_call(task, __perf_install_in_context, event))
 		return;
 
 	raw_spin_lock_irq(&ctx->lock);
@@ -2348,12 +2370,16 @@ again:
 		raw_spin_unlock_irq(&ctx->lock);
 		return;
 	}
-	raw_spin_unlock_irq(&ctx->lock);
 	/*
-	 * Since !ctx->is_active doesn't mean anything, we must IPI
-	 * unconditionally.
+	 * If the task is not running, ctx->lock will avoid it becoming so,
+	 * thus we can safely install the event.
 	 */
-	goto again;
+	if (task_curr(task)) {
+		raw_spin_unlock_irq(&ctx->lock);
+		goto again;
+	}
+	add_event_to_ctx(event, ctx);
+	raw_spin_unlock_irq(&ctx->lock);
 }
 
 /*


end of thread, other threads:[~2017-01-14 12:29 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-07 13:53 Perf hotplug lockup in v4.9-rc8 Mark Rutland
2016-12-07 14:30 ` Mark Rutland
2016-12-07 16:39   ` Mark Rutland
2016-12-07 17:53 ` Mark Rutland
2016-12-07 18:34   ` Peter Zijlstra
2016-12-07 19:56     ` Mark Rutland
2016-12-09 13:59     ` Peter Zijlstra
2016-12-12 11:46       ` Will Deacon
2016-12-12 12:42         ` Peter Zijlstra
2016-12-22  8:45           ` Peter Zijlstra
2016-12-22 14:00             ` Peter Zijlstra
2016-12-22 16:33               ` Paul E. McKenney
2017-01-11 14:59       ` Mark Rutland
2017-01-11 16:03         ` Peter Zijlstra
2017-01-11 16:26           ` Mark Rutland
2017-01-11 19:51           ` Peter Zijlstra
2017-01-14 12:28       ` [tip:perf/urgent] perf/core: Fix sys_perf_event_open() vs. hotplug tip-bot for Peter Zijlstra
