All of lore.kernel.org
 help / color / mirror / Atom feed
* workqueue: race in mod_delayed_work_on?
@ 2016-05-10  8:21 Konstantin Khlebnikov
  2016-05-10 16:28 ` Konstantin Khlebnikov
  0 siblings, 1 reply; 7+ messages in thread
From: Konstantin Khlebnikov @ 2016-05-10  8:21 UTC (permalink / raw)
  To: linux-kernel, Tejun Heo

I've got plenty warnings, bugs and oops around trivial use of mod_delayed_work in drivers/infiniband/core/addr.c

For example:

WARN_ON_ONCE(timer_pending(timer)) in __queue_delayed_work
and
BUG_ON(timer_pending(timer) || !timer->function) in add_timer_on

<4>[471106.614256] ------------[ cut here ]------------
<4>[471106.614271] WARNING: CPU: 2 PID: 0 at kernel/workqueue.c:1428 __queue_delayed_work+0xdd/0x160()
<4>[471106.614274] Modules linked in: tcp_diag inet_diag bridge cls_cgroup sch_htb netconsole configfs 8021q mrp garp stp llc 
intel_powerclamp coretemp kvm_intel kvm crc32_pclmul aesni_intel
ablk_helper cryptd lrw mgag200 psmouse ttm gf128mul serio_raw glue_helper aes_x86_64 drm_kms_helper drm gpio_ich microcode sysimgblt 
sysfillrect syscopyarea lpc_ich i7core_edac ioatdma edac_c
ore mlx4_ib ib_sa ib_mad ib_core ib_addr mlx4_en mlx4_core vxlan udp_tunnel ip6_udp_tunnel tcp_htcp ahci igb libahci dca i2c_algo_bit 
i2c_core ptp pps_core raid10 raid456 async_pq async_xor x
or async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 multipath linear [last unloaded: ipmi_msghandler]
<4>[471106.614339] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.18.25-28 #1
<4>[471106.614344] Hardware name: Supermicro X8DTU/X8DTU, BIOS 2.1c       08/03/2012
<4>[471106.614346]  0000000000000594 ffff880627c43c88 ffffffff816d08e3 0000000000000007
<4>[471106.614348]  0000000000000000 ffff880627c43cc8 ffffffff8106bb2c ffff880627c43cd8
<4>[471106.614350]  0000000000000080 ffff88061b0b2a00 0000000000000080 ffff88061b0b2a00
<4>[471106.614353] Call Trace:
<4>[471106.614354]  <IRQ>  [<ffffffff816d08e3>] dump_stack+0x4e/0x68
<4>[471106.614362]  [<ffffffff8106bb2c>] warn_slowpath_common+0x8c/0xc0
<4>[471106.614364]  [<ffffffff8106bb7a>] warn_slowpath_null+0x1a/0x20
<4>[471106.614367]  [<ffffffff8108235d>] __queue_delayed_work+0xdd/0x160
<4>[471106.614371]  [<ffffffff810829da>] mod_delayed_work_on+0x5a/0x80
<4>[471106.614375]  [<ffffffffa014028a>] set_timeout+0x3a/0x40 [ib_addr]
<4>[471106.614378]  [<ffffffffa01407af>] netevent_callback+0x2f/0x40 [ib_addr]
<4>[471106.614382]  [<ffffffff8108af7d>] notifier_call_chain+0x4d/0x70
<4>[471106.614384]  [<ffffffff8108afba>] __atomic_notifier_call_chain+0x1a/0x30
<4>[471106.614387]  [<ffffffff8108afe6>] atomic_notifier_call_chain+0x16/0x20
<4>[471106.614390]  [<ffffffff815d75fb>] call_netevent_notifiers+0x1b/0x20
<4>[471106.614393]  [<ffffffff815d9a7e>] neigh_update_notify+0x1e/0x40
<4>[471106.614395]  [<ffffffff815daebf>] neigh_timer_handler+0xdf/0x2a0
<4>[471106.614397]  [<ffffffff815dade0>] ? neigh_periodic_work+0x220/0x220
<4>[471106.614401]  [<ffffffff810bf43a>] call_timer_fn+0x3a/0x170
<4>[471106.614403]  [<ffffffff810bf67c>] ? cascade+0x7c/0xa0
<4>[471106.614405]  [<ffffffff810c100f>] run_timer_softirq+0x2df/0x340
<4>[471106.614409]  [<ffffffff810ce944>] ? clockevents_program_event+0x74/0x100
<4>[471106.614411]  [<ffffffff815dade0>] ? neigh_periodic_work+0x220/0x220
<4>[471106.614414]  [<ffffffff8106f7df>] __do_softirq+0xdf/0x2f0
<4>[471106.614417]  [<ffffffff8106fc66>] irq_exit+0x86/0xb0
<4>[471106.614420]  [<ffffffff816db61a>] smp_apic_timer_interrupt+0x4a/0x60
<4>[471106.614423]  [<ffffffff816d96fd>] apic_timer_interrupt+0x6d/0x80
<4>[471106.614424]  <EOI>  [<ffffffff815796a1>] ? cpuidle_enter_state+0x71/0x1d0
<4>[471106.614430]  [<ffffffff8157968f>] ? cpuidle_enter_state+0x5f/0x1d0
<4>[471106.614433]  [<ffffffff815798c7>] cpuidle_enter+0x17/0x20
<4>[471106.614436]  [<ffffffff810a905f>] cpu_startup_entry+0x34f/0x3d0
<4>[471106.614438]  [<ffffffff810461fc>] start_secondary+0x14c/0x160
<4>[471106.614440] ---[ end trace 23e6f46f3ce04a60 ]---
<5>[471106.614443] ---[ now 2016-04-27 06:06:02+03 ]---
<4>[471106.617880] ------------[ cut here ]------------
<2>[471106.617905] kernel BUG at kernel/time/timer.c:963!
<4>[471106.617924] invalid opcode: 0000 [#1] SMP
<4>[471106.617942] Modules linked in: tcp_diag inet_diag bridge cls_cgroup sch_htb netconsole configfs 8021q mrp garp stp llc 
intel_powerclamp coretemp kvm_intel kvm crc32_pclmul aesni_intel ablk_helper cryptd lrw mgag200 psmouse ttm gf128mul serio_raw glue_helper 
aes_x86_64 drm_kms_helper drm gpio_ich microcode sysimgblt sysfillrect syscopyarea lpc_ich i7core_edac ioatdma edac_core mlx4_ib ib_sa 
ib_mad ib_core ib_addr mlx4_en mlx4_core vxlan udp_tunnel ip6_udp_tunnel tcp_htcp ahci igb libahci dca i2c_algo_bit i2c_core ptp pps_core 
raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 multipath linear [last unloaded: 
ipmi_msghandler]
<4>[471106.618265] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G        W      3.18.25-28 #1
<4>[471106.618291] Hardware name: Supermicro X8DTU/X8DTU, BIOS 2.1c       08/03/2012
<4>[471106.618317] task: ffff88061e5a53c0 ti: ffff88061e5d0000 task.ti: ffff88061e5d0000
<4>[471106.618344] RIP: 0010:[<ffffffff810c09c9>]  [<ffffffff810c09c9>] add_timer_on+0xf9/0x100
<4>[471106.618377] RSP: 0018:ffff880627c43c98  EFLAGS: 00010086
<4>[471106.618398] RAX: 000000000000d480 RBX: ffffffffa0143060 RCX: 0000000107028001
<4>[471106.618423] RDX: ffff880627c40000 RSI: 0000000000000002 RDI: ffffffffa0143060
<4>[471106.618449] RBP: ffff880627c43cd8 R08: ffffffff81389c30 R09: ffff880115355600
<4>[471106.618474] R10: 0000000000000216 R11: 0000000000000003 R12: ffff88061e0d0000
<4>[471106.618500] R13: 0000000000000080 R14: ffff88061b0b2a00 R15: 0000000000000001
<4>[471106.618526] FS:  0000000000000000(0000) GS:ffff880627c40000(0000) knlGS:0000000000000000
<4>[471106.618555] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>[471106.618576] CR2: 00007f418817a3d0 CR3: 0000000001c14000 CR4: 00000000000007e0
<4>[471106.618602] Stack:
<4>[471106.618611]  ffff880627c43cd8 0000000000000080 ffff88061b0b2a00 0000000000000002
<4>[471106.618643]  ffff88061b0b2a00 0000000000000080 ffff88061b0b2a00 0000000000000001
<4>[471106.618675]  ffff880627c43d08 ffffffff810822f6 0000000000000001 ffffffffa0143040
<4>[471106.618707] Call Trace:
<4>[471106.618718]  <IRQ>
<4>[471106.618727] <4>[471106.618736]  [<ffffffff810822f6>] __queue_delayed_work+0x76/0x160
<4>[471106.618758]  [<ffffffff810829da>] mod_delayed_work_on+0x5a/0x80
<4>[471106.618782]  [<ffffffffa014028a>] set_timeout+0x3a/0x40 [ib_addr]
<4>[471106.618807]  [<ffffffffa01407af>] netevent_callback+0x2f/0x40 [ib_addr]
<4>[471106.618832]  [<ffffffff8108af7d>] notifier_call_chain+0x4d/0x70
<4>[471106.618856]  [<ffffffff8108afba>] __atomic_notifier_call_chain+0x1a/0x30
<4>[471106.618881]  [<ffffffff8108afe6>] atomic_notifier_call_chain+0x16/0x20
<4>[471106.618906]  [<ffffffff815d75fb>] call_netevent_notifiers+0x1b/0x20
<4>[471106.618930]  [<ffffffff815d9a7e>] neigh_update_notify+0x1e/0x40
<4>[471106.618953]  [<ffffffff815daebf>] neigh_timer_handler+0xdf/0x2a0
<4>[471106.618976]  [<ffffffff815dade0>] ? neigh_periodic_work+0x220/0x220
<4>[471106.619000]  [<ffffffff810bf43a>] call_timer_fn+0x3a/0x170
<4>[471106.619021]  [<ffffffff810bf67c>] ? cascade+0x7c/0xa0
<4>[471106.619041]  [<ffffffff810c100f>] run_timer_softirq+0x2df/0x340
<4>[471106.619064]  [<ffffffff810ce944>] ? clockevents_program_event+0x74/0x100
<4>[471106.619090]  [<ffffffff815dade0>] ? neigh_periodic_work+0x220/0x220
<4>[471106.619115]  [<ffffffff8106f7df>] __do_softirq+0xdf/0x2f0
<4>[471106.619943]  [<ffffffff8106fc66>] irq_exit+0x86/0xb0
<4>[471106.620758]  [<ffffffff816db61a>] smp_apic_timer_interrupt+0x4a/0x60
<4>[471106.621578]  [<ffffffff816d96fd>] apic_timer_interrupt+0x6d/0x80
<4>[471106.622386]  <EOI>
<4>[471106.622396] <4>[471106.623190]  [<ffffffff815796a1>] ? cpuidle_enter_state+0x71/0x1d0
<4>[471106.623997]  [<ffffffff8157968f>] ? cpuidle_enter_state+0x5f/0x1d0
<4>[471106.624783]  [<ffffffff815798c7>] cpuidle_enter+0x17/0x20
<4>[471106.625544]  [<ffffffff810a905f>] cpu_startup_entry+0x34f/0x3d0
<4>[471106.626282]  [<ffffffff810461fc>] start_secondary+0x14c/0x160
<4>[471106.626996] Code: fa 49 83 c5 10 48 89 de ff d1 48 8b 55 c8 4c 89 e8 4c 29 f0 48 8b 4c 02 f0 48 85 c9 75 dc 65 ff 0c 25 a0 b7 00 00 
e9 79 ff ff ff <0f> 0b 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 41 55 41 54 53
<1>[471106.628532] RIP  [<ffffffff810c09c9>] add_timer_on+0xf9/0x100
<4>[471106.629244]  RSP <ffff880627c43c98>
<5>[471106.629946] ---[ now 2016-04-27 06:06:02+03 ]---
<4>[471106.630695] ---[ end trace 23e6f46f3ce04a61 ]---

WARN_ON_ONCE(!work_pending(work)) in set_work_data

<4>[235668.529480] ------------[ cut here ]------------
<4>[235668.529491] WARNING: CPU: 3 PID: 0 at kernel/workqueue.c:597 __queue_work+0x2b6/0x3d0()
<4>[235668.529493] Modules linked in: ip6t_REJECT nf_reject_ipv6 xt_tcpudp ip6table_filter ip6_tables x_tables bridge cls_cgroup sch_htb 
netconsole configfs 8021q mrp garp stp llc kvm_amd kvm
microcode amd64_edac_mod edac_core edac_mce_amd mgag200 psmouse ttm serio_raw sp5100_tco drm_kms_helper drm k10temp i2c_piix4 sysimgblt 
sysfillrect syscopyarea mlx4_ib ib_sa ib_mad ib_core i
b_addr mlx4_en mlx4_core vxlan udp_tunnel ip6_udp_tunnel tcp_htcp raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov igb 
dca i2c_algo_bit i2c_core ptp pps_core ahci libahci
raid6_pq async_tx raid1 raid0 multipath linear [last unloaded: ipmi_msghandler]
<4>[235668.529542] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G W 3.18.25-28 #1
<4>[235668.529544] Hardware name: Supermicro H8DGU/H8DGU, BIOS 1.3 07/12/11
<4>[235668.529546] 0000000000000255 ffff88040fcc3d78 ffffffff816d08e3 0000000000000007
<4>[235668.529548] 0000000000000000 ffff88040fcc3db8 ffffffff8106bb2c ffff88040fcc3da8
<4>[235668.529551] ffff8800df94ad00 ffffffffa00d7040 ffff8800df94ad60 ffff88040f811000
<4>[235668.529553] Call Trace:
<4>[235668.529555] <IRQ> [<ffffffff816d08e3>] dump_stack+0x4e/0x68
<4>[235668.529563] [<ffffffff8106bb2c>] warn_slowpath_common+0x8c/0xc0
<4>[235668.529567] [<ffffffff8106bb7a>] warn_slowpath_null+0x1a/0x20
<4>[235668.529569] [<ffffffff81082146>] __queue_work+0x2b6/0x3d0
<4>[235668.529572] [<ffffffff81082260>] ? __queue_work+0x3d0/0x3d0
<4>[235668.529575] [<ffffffff81082278>] delayed_work_timer_fn+0x18/0x20
<4>[235668.529579] [<ffffffff810bf43a>] call_timer_fn+0x3a/0x170
<4>[235668.529582] [<ffffffff810c0f34>] run_timer_softirq+0x204/0x340
<4>[235668.529586] [<ffffffff810ce944>] ? clockevents_program_event+0x74/0x100
<4>[235668.529588] [<ffffffff81082260>] ? __queue_work+0x3d0/0x3d0
<4>[235668.529591] [<ffffffff8106f7df>] __do_softirq+0xdf/0x2f0
<4>[235668.529593] [<ffffffff8106fc66>] irq_exit+0x86/0xb0
<4>[235668.529597] [<ffffffff816db61a>] smp_apic_timer_interrupt+0x4a/0x60
<4>[235668.529600] [<ffffffff816d96fd>] apic_timer_interrupt+0x6d/0x80
<4>[235668.529601] <EOI> [<ffffffff8101e0a9>] ? sched_clock+0x9/0x10
<4>[235668.529607] [<ffffffff81055246>] ? native_safe_halt+0x6/0x10
<4>[235668.529610] [<ffffffff8101f153>] default_idle+0x23/0x100
<4>[235668.529613] [<ffffffff8101faff>] arch_cpu_idle+0xf/0x20
<4>[235668.529616] [<ffffffff810a8fea>] cpu_startup_entry+0x2da/0x3d0
<4>[235668.529619] [<ffffffff810ce352>] ? clockevents_register_device+0xe2/0x140
<4>[235668.529622] [<ffffffff810461fc>] start_secondary+0x14c/0x160
<4>[235668.529624] ---[ end trace cbf3ac778a6035e6 ]---

-- 
Konstantin

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: workqueue: race in mod_delayed_work_on?
  2016-05-10  8:21 workqueue: race in mod_delayed_work_on? Konstantin Khlebnikov
@ 2016-05-10 16:28 ` Konstantin Khlebnikov
  2016-05-10 16:36   ` Tejun Heo
  0 siblings, 1 reply; 7+ messages in thread
From: Konstantin Khlebnikov @ 2016-05-10 16:28 UTC (permalink / raw)
  To: linux-kernel, Tejun Heo, Sasha Levin

On 10.05.2016 11:21, Konstantin Khlebnikov wrote:
> I've got plenty warnings, bugs and oops around trivial use of mod_delayed_work in drivers/infiniband/core/addr.c

Looks like problem in mod_delayed_work_on was hidden because add_timer is equal to mod_timer
but Sasha accidentally backported 874bbfe600a660cba9c776b3957b1ce393151b76
(workqueue: make sure delayed work run in local cpu) into 3.18.25

I don't see reason why that commit could break delayed work,
most likely it highlighted some other problem.

>
> For example:
>
> WARN_ON_ONCE(timer_pending(timer)) in __queue_delayed_work
> and
> BUG_ON(timer_pending(timer) || !timer->function) in add_timer_on
>
> <4>[471106.614256] ------------[ cut here ]------------
> <4>[471106.614271] WARNING: CPU: 2 PID: 0 at kernel/workqueue.c:1428 __queue_delayed_work+0xdd/0x160()
> <4>[471106.614274] Modules linked in: tcp_diag inet_diag bridge cls_cgroup sch_htb netconsole configfs 8021q mrp garp stp llc
> intel_powerclamp coretemp kvm_intel kvm crc32_pclmul aesni_intel
> ablk_helper cryptd lrw mgag200 psmouse ttm gf128mul serio_raw glue_helper aes_x86_64 drm_kms_helper drm gpio_ich microcode sysimgblt
> sysfillrect syscopyarea lpc_ich i7core_edac ioatdma edac_c
> ore mlx4_ib ib_sa ib_mad ib_core ib_addr mlx4_en mlx4_core vxlan udp_tunnel ip6_udp_tunnel tcp_htcp ahci igb libahci dca i2c_algo_bit
> i2c_core ptp pps_core raid10 raid456 async_pq async_xor x
> or async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 multipath linear [last unloaded: ipmi_msghandler]
> <4>[471106.614339] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.18.25-28 #1
> <4>[471106.614344] Hardware name: Supermicro X8DTU/X8DTU, BIOS 2.1c       08/03/2012
> <4>[471106.614346]  0000000000000594 ffff880627c43c88 ffffffff816d08e3 0000000000000007
> <4>[471106.614348]  0000000000000000 ffff880627c43cc8 ffffffff8106bb2c ffff880627c43cd8
> <4>[471106.614350]  0000000000000080 ffff88061b0b2a00 0000000000000080 ffff88061b0b2a00
> <4>[471106.614353] Call Trace:
> <4>[471106.614354]  <IRQ>  [<ffffffff816d08e3>] dump_stack+0x4e/0x68
> <4>[471106.614362]  [<ffffffff8106bb2c>] warn_slowpath_common+0x8c/0xc0
> <4>[471106.614364]  [<ffffffff8106bb7a>] warn_slowpath_null+0x1a/0x20
> <4>[471106.614367]  [<ffffffff8108235d>] __queue_delayed_work+0xdd/0x160
> <4>[471106.614371]  [<ffffffff810829da>] mod_delayed_work_on+0x5a/0x80
> <4>[471106.614375]  [<ffffffffa014028a>] set_timeout+0x3a/0x40 [ib_addr]
> <4>[471106.614378]  [<ffffffffa01407af>] netevent_callback+0x2f/0x40 [ib_addr]
> <4>[471106.614382]  [<ffffffff8108af7d>] notifier_call_chain+0x4d/0x70
> <4>[471106.614384]  [<ffffffff8108afba>] __atomic_notifier_call_chain+0x1a/0x30
> <4>[471106.614387]  [<ffffffff8108afe6>] atomic_notifier_call_chain+0x16/0x20
> <4>[471106.614390]  [<ffffffff815d75fb>] call_netevent_notifiers+0x1b/0x20
> <4>[471106.614393]  [<ffffffff815d9a7e>] neigh_update_notify+0x1e/0x40
> <4>[471106.614395]  [<ffffffff815daebf>] neigh_timer_handler+0xdf/0x2a0
> <4>[471106.614397]  [<ffffffff815dade0>] ? neigh_periodic_work+0x220/0x220
> <4>[471106.614401]  [<ffffffff810bf43a>] call_timer_fn+0x3a/0x170
> <4>[471106.614403]  [<ffffffff810bf67c>] ? cascade+0x7c/0xa0
> <4>[471106.614405]  [<ffffffff810c100f>] run_timer_softirq+0x2df/0x340
> <4>[471106.614409]  [<ffffffff810ce944>] ? clockevents_program_event+0x74/0x100
> <4>[471106.614411]  [<ffffffff815dade0>] ? neigh_periodic_work+0x220/0x220
> <4>[471106.614414]  [<ffffffff8106f7df>] __do_softirq+0xdf/0x2f0
> <4>[471106.614417]  [<ffffffff8106fc66>] irq_exit+0x86/0xb0
> <4>[471106.614420]  [<ffffffff816db61a>] smp_apic_timer_interrupt+0x4a/0x60
> <4>[471106.614423]  [<ffffffff816d96fd>] apic_timer_interrupt+0x6d/0x80
> <4>[471106.614424]  <EOI>  [<ffffffff815796a1>] ? cpuidle_enter_state+0x71/0x1d0
> <4>[471106.614430]  [<ffffffff8157968f>] ? cpuidle_enter_state+0x5f/0x1d0
> <4>[471106.614433]  [<ffffffff815798c7>] cpuidle_enter+0x17/0x20
> <4>[471106.614436]  [<ffffffff810a905f>] cpu_startup_entry+0x34f/0x3d0
> <4>[471106.614438]  [<ffffffff810461fc>] start_secondary+0x14c/0x160
> <4>[471106.614440] ---[ end trace 23e6f46f3ce04a60 ]---
> <5>[471106.614443] ---[ now 2016-04-27 06:06:02+03 ]---
> <4>[471106.617880] ------------[ cut here ]------------
> <2>[471106.617905] kernel BUG at kernel/time/timer.c:963!
> <4>[471106.617924] invalid opcode: 0000 [#1] SMP
> <4>[471106.617942] Modules linked in: tcp_diag inet_diag bridge cls_cgroup sch_htb netconsole configfs 8021q mrp garp stp llc
> intel_powerclamp coretemp kvm_intel kvm crc32_pclmul aesni_intel ablk_helper cryptd lrw mgag200 psmouse ttm gf128mul serio_raw glue_helper
> aes_x86_64 drm_kms_helper drm gpio_ich microcode sysimgblt sysfillrect syscopyarea lpc_ich i7core_edac ioatdma edac_core mlx4_ib ib_sa
> ib_mad ib_core ib_addr mlx4_en mlx4_core vxlan udp_tunnel ip6_udp_tunnel tcp_htcp ahci igb libahci dca i2c_algo_bit i2c_core ptp pps_core
> raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 multipath linear [last unloaded:
> ipmi_msghandler]
> <4>[471106.618265] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G        W      3.18.25-28 #1
> <4>[471106.618291] Hardware name: Supermicro X8DTU/X8DTU, BIOS 2.1c       08/03/2012
> <4>[471106.618317] task: ffff88061e5a53c0 ti: ffff88061e5d0000 task.ti: ffff88061e5d0000
> <4>[471106.618344] RIP: 0010:[<ffffffff810c09c9>]  [<ffffffff810c09c9>] add_timer_on+0xf9/0x100
> <4>[471106.618377] RSP: 0018:ffff880627c43c98  EFLAGS: 00010086
> <4>[471106.618398] RAX: 000000000000d480 RBX: ffffffffa0143060 RCX: 0000000107028001
> <4>[471106.618423] RDX: ffff880627c40000 RSI: 0000000000000002 RDI: ffffffffa0143060
> <4>[471106.618449] RBP: ffff880627c43cd8 R08: ffffffff81389c30 R09: ffff880115355600
> <4>[471106.618474] R10: 0000000000000216 R11: 0000000000000003 R12: ffff88061e0d0000
> <4>[471106.618500] R13: 0000000000000080 R14: ffff88061b0b2a00 R15: 0000000000000001
> <4>[471106.618526] FS:  0000000000000000(0000) GS:ffff880627c40000(0000) knlGS:0000000000000000
> <4>[471106.618555] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> <4>[471106.618576] CR2: 00007f418817a3d0 CR3: 0000000001c14000 CR4: 00000000000007e0
> <4>[471106.618602] Stack:
> <4>[471106.618611]  ffff880627c43cd8 0000000000000080 ffff88061b0b2a00 0000000000000002
> <4>[471106.618643]  ffff88061b0b2a00 0000000000000080 ffff88061b0b2a00 0000000000000001
> <4>[471106.618675]  ffff880627c43d08 ffffffff810822f6 0000000000000001 ffffffffa0143040
> <4>[471106.618707] Call Trace:
> <4>[471106.618718]  <IRQ>
> <4>[471106.618727] <4>[471106.618736]  [<ffffffff810822f6>] __queue_delayed_work+0x76/0x160
> <4>[471106.618758]  [<ffffffff810829da>] mod_delayed_work_on+0x5a/0x80
> <4>[471106.618782]  [<ffffffffa014028a>] set_timeout+0x3a/0x40 [ib_addr]
> <4>[471106.618807]  [<ffffffffa01407af>] netevent_callback+0x2f/0x40 [ib_addr]
> <4>[471106.618832]  [<ffffffff8108af7d>] notifier_call_chain+0x4d/0x70
> <4>[471106.618856]  [<ffffffff8108afba>] __atomic_notifier_call_chain+0x1a/0x30
> <4>[471106.618881]  [<ffffffff8108afe6>] atomic_notifier_call_chain+0x16/0x20
> <4>[471106.618906]  [<ffffffff815d75fb>] call_netevent_notifiers+0x1b/0x20
> <4>[471106.618930]  [<ffffffff815d9a7e>] neigh_update_notify+0x1e/0x40
> <4>[471106.618953]  [<ffffffff815daebf>] neigh_timer_handler+0xdf/0x2a0
> <4>[471106.618976]  [<ffffffff815dade0>] ? neigh_periodic_work+0x220/0x220
> <4>[471106.619000]  [<ffffffff810bf43a>] call_timer_fn+0x3a/0x170
> <4>[471106.619021]  [<ffffffff810bf67c>] ? cascade+0x7c/0xa0
> <4>[471106.619041]  [<ffffffff810c100f>] run_timer_softirq+0x2df/0x340
> <4>[471106.619064]  [<ffffffff810ce944>] ? clockevents_program_event+0x74/0x100
> <4>[471106.619090]  [<ffffffff815dade0>] ? neigh_periodic_work+0x220/0x220
> <4>[471106.619115]  [<ffffffff8106f7df>] __do_softirq+0xdf/0x2f0
> <4>[471106.619943]  [<ffffffff8106fc66>] irq_exit+0x86/0xb0
> <4>[471106.620758]  [<ffffffff816db61a>] smp_apic_timer_interrupt+0x4a/0x60
> <4>[471106.621578]  [<ffffffff816d96fd>] apic_timer_interrupt+0x6d/0x80
> <4>[471106.622386]  <EOI>
> <4>[471106.622396] <4>[471106.623190]  [<ffffffff815796a1>] ? cpuidle_enter_state+0x71/0x1d0
> <4>[471106.623997]  [<ffffffff8157968f>] ? cpuidle_enter_state+0x5f/0x1d0
> <4>[471106.624783]  [<ffffffff815798c7>] cpuidle_enter+0x17/0x20
> <4>[471106.625544]  [<ffffffff810a905f>] cpu_startup_entry+0x34f/0x3d0
> <4>[471106.626282]  [<ffffffff810461fc>] start_secondary+0x14c/0x160
> <4>[471106.626996] Code: fa 49 83 c5 10 48 89 de ff d1 48 8b 55 c8 4c 89 e8 4c 29 f0 48 8b 4c 02 f0 48 85 c9 75 dc 65 ff 0c 25 a0 b7 00 00
> e9 79 ff ff ff <0f> 0b 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 41 55 41 54 53
> <1>[471106.628532] RIP  [<ffffffff810c09c9>] add_timer_on+0xf9/0x100
> <4>[471106.629244]  RSP <ffff880627c43c98>
> <5>[471106.629946] ---[ now 2016-04-27 06:06:02+03 ]---
> <4>[471106.630695] ---[ end trace 23e6f46f3ce04a61 ]---
>
> WARN_ON_ONCE(!work_pending(work)) in set_work_data
>
> <4>[235668.529480] ------------[ cut here ]------------
> <4>[235668.529491] WARNING: CPU: 3 PID: 0 at kernel/workqueue.c:597 __queue_work+0x2b6/0x3d0()
> <4>[235668.529493] Modules linked in: ip6t_REJECT nf_reject_ipv6 xt_tcpudp ip6table_filter ip6_tables x_tables bridge cls_cgroup sch_htb
> netconsole configfs 8021q mrp garp stp llc kvm_amd kvm
> microcode amd64_edac_mod edac_core edac_mce_amd mgag200 psmouse ttm serio_raw sp5100_tco drm_kms_helper drm k10temp i2c_piix4 sysimgblt
> sysfillrect syscopyarea mlx4_ib ib_sa ib_mad ib_core i
> b_addr mlx4_en mlx4_core vxlan udp_tunnel ip6_udp_tunnel tcp_htcp raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov igb
> dca i2c_algo_bit i2c_core ptp pps_core ahci libahci
> raid6_pq async_tx raid1 raid0 multipath linear [last unloaded: ipmi_msghandler]
> <4>[235668.529542] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G W 3.18.25-28 #1
> <4>[235668.529544] Hardware name: Supermicro H8DGU/H8DGU, BIOS 1.3 07/12/11
> <4>[235668.529546] 0000000000000255 ffff88040fcc3d78 ffffffff816d08e3 0000000000000007
> <4>[235668.529548] 0000000000000000 ffff88040fcc3db8 ffffffff8106bb2c ffff88040fcc3da8
> <4>[235668.529551] ffff8800df94ad00 ffffffffa00d7040 ffff8800df94ad60 ffff88040f811000
> <4>[235668.529553] Call Trace:
> <4>[235668.529555] <IRQ> [<ffffffff816d08e3>] dump_stack+0x4e/0x68
> <4>[235668.529563] [<ffffffff8106bb2c>] warn_slowpath_common+0x8c/0xc0
> <4>[235668.529567] [<ffffffff8106bb7a>] warn_slowpath_null+0x1a/0x20
> <4>[235668.529569] [<ffffffff81082146>] __queue_work+0x2b6/0x3d0
> <4>[235668.529572] [<ffffffff81082260>] ? __queue_work+0x3d0/0x3d0
> <4>[235668.529575] [<ffffffff81082278>] delayed_work_timer_fn+0x18/0x20
> <4>[235668.529579] [<ffffffff810bf43a>] call_timer_fn+0x3a/0x170
> <4>[235668.529582] [<ffffffff810c0f34>] run_timer_softirq+0x204/0x340
> <4>[235668.529586] [<ffffffff810ce944>] ? clockevents_program_event+0x74/0x100
> <4>[235668.529588] [<ffffffff81082260>] ? __queue_work+0x3d0/0x3d0
> <4>[235668.529591] [<ffffffff8106f7df>] __do_softirq+0xdf/0x2f0
> <4>[235668.529593] [<ffffffff8106fc66>] irq_exit+0x86/0xb0
> <4>[235668.529597] [<ffffffff816db61a>] smp_apic_timer_interrupt+0x4a/0x60
> <4>[235668.529600] [<ffffffff816d96fd>] apic_timer_interrupt+0x6d/0x80
> <4>[235668.529601] <EOI> [<ffffffff8101e0a9>] ? sched_clock+0x9/0x10
> <4>[235668.529607] [<ffffffff81055246>] ? native_safe_halt+0x6/0x10
> <4>[235668.529610] [<ffffffff8101f153>] default_idle+0x23/0x100
> <4>[235668.529613] [<ffffffff8101faff>] arch_cpu_idle+0xf/0x20
> <4>[235668.529616] [<ffffffff810a8fea>] cpu_startup_entry+0x2da/0x3d0
> <4>[235668.529619] [<ffffffff810ce352>] ? clockevents_register_device+0xe2/0x140
> <4>[235668.529622] [<ffffffff810461fc>] start_secondary+0x14c/0x160
> <4>[235668.529624] ---[ end trace cbf3ac778a6035e6 ]---
>


-- 
Konstantin

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: workqueue: race in mod_delayed_work_on?
  2016-05-10 16:28 ` Konstantin Khlebnikov
@ 2016-05-10 16:36   ` Tejun Heo
  2016-05-10 17:20     ` Konstantin Khlebnikov
  0 siblings, 1 reply; 7+ messages in thread
From: Tejun Heo @ 2016-05-10 16:36 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: linux-kernel, Sasha Levin

Hello,

On Tue, May 10, 2016 at 07:28:08PM +0300, Konstantin Khlebnikov wrote:
> On 10.05.2016 11:21, Konstantin Khlebnikov wrote:
> >I've got plenty warnings, bugs and oops around trivial use of mod_delayed_work in drivers/infiniband/core/addr.c
> 
> Looks like problem in mod_delayed_work_on was hidden because add_timer is equal to mod_timer

The timer usages are gated behind PENDING bit, so whether add_timer()
is equal to mod_timer() shouldn't matter.

> but Sasha accidentally backported 874bbfe600a660cba9c776b3957b1ce393151b76
> (workqueue: make sure delayed work run in local cpu) into 3.18.25
> 
> I don't see reason why that commit could break delayed work,
> most likely it highlighted some other problem.

What are you running?  Can you reproduce the issue on upstream kernel?

-- 
tejun

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: workqueue: race in mod_delayed_work_on?
  2016-05-10 16:36   ` Tejun Heo
@ 2016-05-10 17:20     ` Konstantin Khlebnikov
  2016-05-12 13:06       ` Konstantin Khlebnikov
  0 siblings, 1 reply; 7+ messages in thread
From: Konstantin Khlebnikov @ 2016-05-10 17:20 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Sasha Levin

On 10.05.2016 19:36, Tejun Heo wrote:
> Hello,
>
> On Tue, May 10, 2016 at 07:28:08PM +0300, Konstantin Khlebnikov wrote:
>> On 10.05.2016 11:21, Konstantin Khlebnikov wrote:
>>> I've got plenty warnings, bugs and oops around trivial use of mod_delayed_work in drivers/infiniband/core/addr.c
>>
>> Looks like problem in mod_delayed_work_on was hidden because add_timer is equal to mod_timer
>
> The timer usages are gated behind PENDING bit, so whether add_timer()
> is equal to mod_timer() shouldn't matter.

Hmm... this looks little bit more complicated than one bit.

>
>> but Sasha accidentally backported 874bbfe600a660cba9c776b3957b1ce393151b76
>> (workqueue: make sure delayed work run in local cpu) into 3.18.25
>>
>> I don't see reason why that commit could break delayed work,
>> most likely it highlighted some other problem.
>
> What are you running?  Can you reproduce the issue on upstream kernel?
>

This is slight patched 3.18.y. Looks like this started when we upgraded kernel to 3.18.25 and
somebody have loaded module ib_addr (ip in infiniband or something) which actually unused
because these machines have no infiniband at all. But this code is poked from ethernet arp
sometimes. So, it crashes somewhere from time to time. I'll try to stresstest this piece.

-- 
Konstantin

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: workqueue: race in mod_delayed_work_on?
  2016-05-10 17:20     ` Konstantin Khlebnikov
@ 2016-05-12 13:06       ` Konstantin Khlebnikov
  2016-05-13 13:49         ` Konstantin Khlebnikov
  0 siblings, 1 reply; 7+ messages in thread
From: Konstantin Khlebnikov @ 2016-05-12 13:06 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Sasha Levin

On 10.05.2016 20:20, Konstantin Khlebnikov wrote:
> On 10.05.2016 19:36, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, May 10, 2016 at 07:28:08PM +0300, Konstantin Khlebnikov wrote:
>>> On 10.05.2016 11:21, Konstantin Khlebnikov wrote:
>>>> I've got plenty warnings, bugs and oops around trivial use of mod_delayed_work in drivers/infiniband/core/addr.c
>>>
>>> Looks like problem in mod_delayed_work_on was hidden because add_timer is equal to mod_timer
>>
>> The timer usages are gated behind PENDING bit, so whether add_timer()
>> is equal to mod_timer() shouldn't matter.
>
> Hmm... this looks little bit more complicated than one bit.

Yep, problem was here - both timer and work can be active at the same time.

So try_to_grab_pending can return success for two concurrent callers:
first get del_timer, second removes work from workqueue. After that
both call add timer and one of them either catch BUG_ON or corrupt timer list.

I see two possible fixes: always remove timer and work in try_to_grab_pending
but this must be carefully synchronized. This will make it slower for sure.
Or always use mod_timer in  __queue_delayed_work() - both callers will modify timer,
but here is no mod_timer_on().

>
>>
>>> but Sasha accidentally backported 874bbfe600a660cba9c776b3957b1ce393151b76
>>> (workqueue: make sure delayed work run in local cpu) into 3.18.25
>>>
>>> I don't see reason why that commit could break delayed work,
>>> most likely it highlighted some other problem.
>>
>> What are you running?  Can you reproduce the issue on upstream kernel?
>>
>
> This is slight patched 3.18.y. Looks like this started when we upgraded kernel to 3.18.25 and
> somebody have loaded module ib_addr (ip in infiniband or something) which actually unused
> because these machines have no infiniband at all. But this code is poked from ethernet arp
> sometimes. So, it crashes somewhere from time to time. I'll try to stresstest this piece.
>




-- 
Konstantin

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: workqueue: race in mod_delayed_work_on?
  2016-05-12 13:06       ` Konstantin Khlebnikov
@ 2016-05-13 13:49         ` Konstantin Khlebnikov
  2016-05-15 16:08           ` Konstantin Khlebnikov
  0 siblings, 1 reply; 7+ messages in thread
From: Konstantin Khlebnikov @ 2016-05-13 13:49 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Sasha Levin

On 12.05.2016 16:06, Konstantin Khlebnikov wrote:
> On 10.05.2016 20:20, Konstantin Khlebnikov wrote:
>> On 10.05.2016 19:36, Tejun Heo wrote:
>>> Hello,
>>>
>>> On Tue, May 10, 2016 at 07:28:08PM +0300, Konstantin Khlebnikov wrote:
>>>> On 10.05.2016 11:21, Konstantin Khlebnikov wrote:
>>>>> I've got plenty warnings, bugs and oops around trivial use of mod_delayed_work in drivers/infiniband/core/addr.c
>>>>
>>>> Looks like problem in mod_delayed_work_on was hidden because add_timer is equal to mod_timer
>>>
>>> The timer usages are gated behind PENDING bit, so whether add_timer()
>>> is equal to mod_timer() shouldn't matter.
>>
>> Hmm... this looks little bit more complicated than one bit.
>
> Yep, problem was here - both timer and work can be active at the same time.

Nope, this is impossible. This will be a bug itself.

>
> So try_to_grab_pending can return success for two coRecentncurrent callers:
> first get del_timer, second removes work from workqueue. After that
> both call add timer and one of them either catch BUG_ON or corrupt timer list.
>
> I see two possible fixes: always remove timer and work in try_to_grab_pending
> but this must be carefully synchronized. This will make it slower for sure.
> Or always use mod_timer in  __queue_delayed_work() - both callers will modify timer,
> but here is no mod_timer_on().
>
>>
>>>
>>>> but Sasha accidentally backported 874bbfe600a660cba9c776b3957b1ce393151b76
>>>> (workqueue: make sure delayed work run in local cpu) into 3.18.25
>>>>
>>>> I don't see reason why that commit could break delayed work,
>>>> most likely it highlighted some other problem.
>>>
>>> What are you running?  Can you reproduce the issue on upstream kernel?
>>>
>>
>> This is slight patched 3.18.y. Looks like this started when we upgraded kernel to 3.18.25 and
>> somebody have loaded module ib_addr (ip in infiniband or something) which actually unused
>> because these machines have no infiniband at all. But this code is poked from ethernet arp
>> sometimes. So, it crashes somewhere from time to time. I'll try to stresstest this piece.
>>
>
>
>
>


-- 
Konstantin

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: workqueue: race in mod_delayed_work_on?
  2016-05-13 13:49         ` Konstantin Khlebnikov
@ 2016-05-15 16:08           ` Konstantin Khlebnikov
  0 siblings, 0 replies; 7+ messages in thread
From: Konstantin Khlebnikov @ 2016-05-15 16:08 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Sasha Levin

On 13.05.2016 16:49, Konstantin Khlebnikov wrote:
> On 12.05.2016 16:06, Konstantin Khlebnikov wrote:
>> On 10.05.2016 20:20, Konstantin Khlebnikov wrote:
>>> On 10.05.2016 19:36, Tejun Heo wrote:
>>>> Hello,
>>>>
>>>> On Tue, May 10, 2016 at 07:28:08PM +0300, Konstantin Khlebnikov wrote:
>>>>> On 10.05.2016 11:21, Konstantin Khlebnikov wrote:
>>>>>> I've got plenty warnings, bugs and oops around trivial use of mod_delayed_work in drivers/infiniband/core/addr.c
>>>>>
>>>>> Looks like problem in mod_delayed_work_on was hidden because add_timer is equal to mod_timer
>>>>
>>>> The timer usages are gated behind PENDING bit, so whether add_timer()
>>>> is equal to mod_timer() shouldn't matter.
>>>
>>> Hmm... this looks little bit more complicated than one bit.
>>
>> Yep, problem was here - both timer and work can be active at the same time.
>
> Nope, this is impossible. This will be a bug itself.
>
>>
>> So try_to_grab_pending can return success for two coRecentncurrent callers:
>> first get del_timer, second removes work from workqueue. After that
>> both call add timer and one of them either catch BUG_ON or corrupt timer list.
>>
>> I see two possible fixes: always remove timer and work in try_to_grab_pending
>> but this must be carefully synchronized. This will make it slower for sure.
>> Or always use mod_timer in  __queue_delayed_work() - both callers will modify timer,
>> but here is no mod_timer_on().
>>
>>>
>>>>
>>>>> but Sasha accidentally backported 874bbfe600a660cba9c776b3957b1ce393151b76
>>>>> (workqueue: make sure delayed work run in local cpu) into 3.18.25
>>>>>
>>>>> I don't see reason why that commit could break delayed work,
>>>>> most likely it highlighted some other problem.

Indeed, my problem was caused by that backported commit because branch 3.18.y
has no your fix for add_timer_on() -- 22b886dd1018093920c4250dee2a9a3cb7cff7b8
("timers: Use proper base migration in add_timer_on()"). Somehow I've missed that.

>>>>
>>>> What are you running?  Can you reproduce the issue on upstream kernel?
>>>>
>>>
>>> This is slight patched 3.18.y. Looks like this started when we upgraded kernel to 3.18.25 and
>>> somebody have loaded module ib_addr (ip in infiniband or something) which actually unused
>>> because these machines have no infiniband at all. But this code is poked from ethernet arp
>>> sometimes. So, it crashes somewhere from time to time. I'll try to stresstest this piece.
>>>
>>
>

-- 
Konstantin

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-05-15 16:09 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-10  8:21 workqueue: race in mod_delayed_work_on? Konstantin Khlebnikov
2016-05-10 16:28 ` Konstantin Khlebnikov
2016-05-10 16:36   ` Tejun Heo
2016-05-10 17:20     ` Konstantin Khlebnikov
2016-05-12 13:06       ` Konstantin Khlebnikov
2016-05-13 13:49         ` Konstantin Khlebnikov
2016-05-15 16:08           ` Konstantin Khlebnikov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.