linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* timer code oops when calling mod_delayed_work
@ 2015-10-29 14:31 Jeff Layton
  2015-10-29 17:58 ` Jeff Layton
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff Layton @ 2015-10-29 14:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Tejun Heo, bfields, Michael Skralivetsky, Chris Worley, Trond Myklebust

Our QA guys hit the following warning and oops today on a relatively
recent upstream kernel. They were running a fairly busy NFSv3 I/O test
against this server:

[   71.074180] NFSD: starting 90-second grace period (net ffffffff81cf3b40)
[  221.146802] Ebtables v2.0 unregistered
[  318.909835] ------------[ cut here ]------------
[  318.909846] WARNING: CPU: 22 PID: 3402 at kernel/workqueue.c:1444 __queue_delayed_work+0xe7/0x190()
[  318.909848] Modules linked in: bridge stp llc intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel kvm ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support joydev mei_me mei sb_edac shpchp ipmi_si lpc_ich i2c_i801 edac_core tpm_tis wmi ipmi_msghandler tpm acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c mlx4_en vxlan ip6_udp_tunnel udp_tunnel ast drm_kms_helper ttm crc32c_intel igb drm mlx4_core ptp nvme pps_core dca i2c_algo_bit fjes [last unloaded: iptable_security]
[  318.909889] CPU: 22 PID: 3402 Comm: nfsd Not tainted 4.3.0-0.rc7.git0.1.fc24.x86_64 #1
[  318.909891] Hardware name: Quanta D51B/D51BP (1G LOM), BIOS S2BP3A05 01/29/2015
[  318.909892]  0000000000000000 00000000319846b5 ffff880322d0f440 ffffffff813a48cf
[  318.909895]  0000000000000000 ffff880322d0f478 ffffffff8109f122 0000000000000400
[  318.909898]  ffff88085f48fc00 ffffffffa03623e0 ffff88085f48fc00 000000000001d4c0
[  318.909901] Call Trace:
[  318.909906]  [<ffffffff813a48cf>] dump_stack+0x44/0x55
[  318.909909]  [<ffffffff8109f122>] warn_slowpath_common+0x82/0xc0
[  318.909913]  [<ffffffff8109f26a>] warn_slowpath_null+0x1a/0x20
[  318.909915]  [<ffffffff810b67a7>] __queue_delayed_work+0xe7/0x190
[  318.909917]  [<ffffffff810b70de>] mod_delayed_work_on+0x5e/0xa0
[  318.909931]  [<ffffffffa0333394>] prune_cache_entries+0x94/0xb0 [nfsd]
[  318.909936]  [<ffffffffa03333be>] nfsd_reply_cache_scan+0xe/0x10 [nfsd]
[  318.909941]  [<ffffffff811b78e2>] shrink_slab.part.42+0x1e2/0x380
[  318.909944]  [<ffffffff811bbfc6>] shrink_zone+0x2b6/0x2c0
[  318.909946]  [<ffffffff811bc145>] do_try_to_free_pages+0x175/0x400
[  318.909949]  [<ffffffff811bc4da>] try_to_free_pages+0x10a/0x150
[  318.909954]  [<ffffffff811ae5b1>] __alloc_pages_nodemask+0x581/0x990
[  318.909957]  [<ffffffff811b19f5>] ? account_page_dirtied+0xd5/0x160
[  318.909960]  [<ffffffff811f7b41>] alloc_pages_current+0x91/0x100
[  318.909963]  [<ffffffff811a45bb>] __page_cache_alloc+0xab/0xe0
[  318.909965]  [<ffffffff811a4c94>] pagecache_get_page+0x84/0x1e0
[  318.909967]  [<ffffffff811a4e16>] grab_cache_page_write_begin+0x26/0x40
[  318.909990]  [<ffffffffa0221cd4>] xfs_vm_write_begin+0x34/0xe0 [xfs]
[  318.910005]  [<ffffffffa0221a39>] ? xfs_vm_write_end+0x29/0x80 [xfs]
[  318.910007]  [<ffffffff811a418e>] generic_perform_write+0xce/0x1c0
[  318.910023]  [<ffffffffa022efb7>] xfs_file_buffered_aio_write+0x127/0x250 [xfs]
[  318.910028]  [<ffffffffa032be60>] ? nfserrno+0x60/0x60 [nfsd]
[  318.910031]  [<ffffffff812035e9>] ? __kmalloc+0x1b9/0x260
[  318.910034]  [<ffffffff81223617>] ? rw_copy_check_uvector+0x57/0x110
[  318.910047]  [<ffffffffa022f1d5>] xfs_file_write_iter+0xf5/0x110 [xfs]
[  318.910050]  [<ffffffff81221e3c>] do_iter_readv_writev+0x6c/0xa0
[  318.910053]  [<ffffffff812229bf>] do_readv_writev+0x18f/0x230
[  318.910065]  [<ffffffffa022f0e0>] ? xfs_file_buffered_aio_write+0x250/0x250 [xfs]
[  318.910068]  [<ffffffff813284bb>] ? security_file_open+0x8b/0x90
[  318.910070]  [<ffffffff8121ffa4>] ? do_dentry_open+0x264/0x2f0
[  318.910084]  [<ffffffffa022ced0>] ? xfs_extent_busy_ag_cmp+0x20/0x20 [xfs]
[  318.910086]  [<ffffffff81222ae9>] vfs_writev+0x39/0x50
[  318.910092]  [<ffffffffa032df4d>] nfsd_vfs_write+0xbd/0x380 [nfsd]
[  318.910097]  [<ffffffffa0330568>] nfsd_write+0xd8/0xf0 [nfsd]
[  318.910103]  [<ffffffffa03367fc>] nfsd3_proc_write+0xbc/0x150 [nfsd]
[  318.910107]  [<ffffffffa0328e68>] nfsd_dispatch+0xb8/0x200 [nfsd]
[  318.910128]  [<ffffffffa02d39dc>] svc_process_common+0x40c/0x650 [sunrpc]
[  318.910137]  [<ffffffffa02d4b14>] svc_process+0xf4/0x180 [sunrpc]
[  318.910142]  [<ffffffffa03288bf>] nfsd+0xef/0x160 [nfsd]
[  318.910146]  [<ffffffffa03287d0>] ? nfsd_destroy+0x60/0x60 [nfsd]
[  318.910149]  [<ffffffff810bd6a8>] kthread+0xd8/0xf0
[  318.910152]  [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
[  318.910155]  [<ffffffff8177f19f>] ret_from_fork+0x3f/0x70
[  318.910158]  [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
[  318.910160] ---[ end trace db375019f3368852 ]---
[  318.910169] ------------[ cut here ]------------
[  318.910201] kernel BUG at kernel/time/timer.c:977!
[  318.910226] invalid opcode: 0000 [#1] SMP 
[  318.910249] Modules linked in: bridge stp llc intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel kvm ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support joydev mei_me mei sb_edac shpchp ipmi_si lpc_ich i2c_i801 edac_core tpm_tis wmi ipmi_msghandler tpm acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c mlx4_en vxlan ip6_udp_tunnel udp_tunnel ast drm_kms_helper ttm crc32c_intel igb drm mlx4_core ptp nvme pps_core dca i2c_algo_bit fjes [last unloaded: iptable_security]
[  318.910588] CPU: 22 PID: 3402 Comm: nfsd Tainted: G        W       4.3.0-0.rc7.git0.1.fc24.x86_64 #1
[  318.910629] Hardware name: Quanta D51B/D51BP (1G LOM), BIOS S2BP3A05 01/29/2015
[  318.910664] task: ffff880322c49e00 ti: ffff880322d0c000 task.ti: ffff880322d0c000
[  318.910698] RIP: 0010:[<ffffffff811076a6>]  [<ffffffff811076a6>] add_timer_on+0xc6/0x110
[  318.910736] RSP: 0018:ffff880322d0f458  EFLAGS: 00010082
[  318.910761] RAX: 0000000000000000 RBX: 0000000000000016 RCX: 0000000100021e85
[  318.910793] RDX: ffffffffa03623c0 RSI: 0000000000000016 RDI: ffffffffa03623e0
[  318.910826] RBP: ffff880322d0f488 R08: 000000000000000a R09: 00000000000005f2
[  318.910856] R10: 000000000000027c R11: 00000000000005f2 R12: ffff88085f48fc00
[  318.910888] R13: ffff88085fd8e080 R14: ffff88085f48fc00 R15: 000000000001d4c0
[  318.910922] FS:  0000000000000000(0000) GS:ffff88085fd80000(0000) knlGS:0000000000000000
[  318.910956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  318.910981] CR2: 00007f9d6ea41000 CR3: 0000000001c0b000 CR4: 00000000001406e0
[  318.911012] Stack:
[  318.911025]  ffff88085f48fc00 0000000000000016 ffff88085f48fc00 ffffffffa03623e0
[  318.911069]  ffff88085f48fc00 000000000001d4c0 ffff880322d0f4c0 ffffffff810b6749
[  318.911111]  000000000001d4c0 ffffffffa03623c0 0000000000000001 ffffffffa03623c0
[  318.911153] Call Trace:
[  318.911170]  [<ffffffff810b6749>] __queue_delayed_work+0x89/0x190
[  318.911201]  [<ffffffff810b70de>] mod_delayed_work_on+0x5e/0xa0
[  318.911236]  [<ffffffffa0333394>] prune_cache_entries+0x94/0xb0 [nfsd]
[  318.911267]  [<ffffffffa03333be>] nfsd_reply_cache_scan+0xe/0x10 [nfsd]
[  318.911299]  [<ffffffff811b78e2>] shrink_slab.part.42+0x1e2/0x380
[  318.911330]  [<ffffffff811bbfc6>] shrink_zone+0x2b6/0x2c0
[  318.911358]  [<ffffffff811bc145>] do_try_to_free_pages+0x175/0x400
[  318.911391]  [<ffffffff811bc4da>] try_to_free_pages+0x10a/0x150
[  318.911422]  [<ffffffff811ae5b1>] __alloc_pages_nodemask+0x581/0x990
[  318.911454]  [<ffffffff811b19f5>] ? account_page_dirtied+0xd5/0x160
[  318.911483]  [<ffffffff811f7b41>] alloc_pages_current+0x91/0x100
[  318.911513]  [<ffffffff811a45bb>] __page_cache_alloc+0xab/0xe0
[  318.911543]  [<ffffffff811a4c94>] pagecache_get_page+0x84/0x1e0
[  318.911573]  [<ffffffff811a4e16>] grab_cache_page_write_begin+0x26/0x40
[  318.911620]  [<ffffffffa0221cd4>] xfs_vm_write_begin+0x34/0xe0 [xfs]
[  318.911661]  [<ffffffffa0221a39>] ? xfs_vm_write_end+0x29/0x80 [xfs]
[  318.912748]  [<ffffffff811a418e>] generic_perform_write+0xce/0x1c0
[  318.913810]  [<ffffffffa022efb7>] xfs_file_buffered_aio_write+0x127/0x250 [xfs]
[  318.914858]  [<ffffffffa032be60>] ? nfserrno+0x60/0x60 [nfsd]
[  318.915715]  [<ffffffff812035e9>] ? __kmalloc+0x1b9/0x260
[  318.916360]  [<ffffffff81223617>] ? rw_copy_check_uvector+0x57/0x110
[  318.917014]  [<ffffffffa022f1d5>] xfs_file_write_iter+0xf5/0x110 [xfs]
[  318.917650]  [<ffffffff81221e3c>] do_iter_readv_writev+0x6c/0xa0
[  318.918312]  [<ffffffff812229bf>] do_readv_writev+0x18f/0x230
[  318.918954]  [<ffffffffa022f0e0>] ? xfs_file_buffered_aio_write+0x250/0x250 [xfs]
[  318.919575]  [<ffffffff813284bb>] ? security_file_open+0x8b/0x90
[  318.920178]  [<ffffffff8121ffa4>] ? do_dentry_open+0x264/0x2f0
[  318.920771]  [<ffffffffa022ced0>] ? xfs_extent_busy_ag_cmp+0x20/0x20 [xfs]
[  318.921354]  [<ffffffff81222ae9>] vfs_writev+0x39/0x50
[  318.921961]  [<ffffffffa032df4d>] nfsd_vfs_write+0xbd/0x380 [nfsd]
[  318.922509]  [<ffffffffa0330568>] nfsd_write+0xd8/0xf0 [nfsd]
[  318.923045]  [<ffffffffa03367fc>] nfsd3_proc_write+0xbc/0x150 [nfsd]
[  318.923574]  [<ffffffffa0328e68>] nfsd_dispatch+0xb8/0x200 [nfsd]
[  318.924111]  [<ffffffffa02d39dc>] svc_process_common+0x40c/0x650 [sunrpc]
[  318.924633]  [<ffffffffa02d4b14>] svc_process+0xf4/0x180 [sunrpc]
[  318.925144]  [<ffffffffa03288bf>] nfsd+0xef/0x160 [nfsd]
[  318.925654]  [<ffffffffa03287d0>] ? nfsd_destroy+0x60/0x60 [nfsd]
[  318.926163]  [<ffffffff810bd6a8>] kthread+0xd8/0xf0
[  318.926668]  [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
[  318.927189]  [<ffffffff8177f19f>] ret_from_fork+0x3f/0x70
[  318.927685]  [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
[  318.928210] Code: 4d 85 c0 75 e4 48 89 de 4c 89 ef e8 55 f9 ff ff 48 8b 34 24 4c 89 ef e8 19 72 67 00 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 48 83 7f 38 00 48 8b 45 08 0f 85 5f ff ff ff 48 89 47 38 
[  318.929315] RIP  [<ffffffff811076a6>] add_timer_on+0xc6/0x110
[  318.930019]  RSP <ffff880322d0f458>

Basically the shrinker kicked in. After we shrank a bunch of cache
entries, we tried to mod_delayed_work on the cache cleaner job so that
it wouldn't run too soon, and that's where the code oopsed.

Is our usage of mod_delayed_work in prune_cache_entries() wrong in some
way? Or is there some breakage in the workqueue or timer code?

-- 
Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: timer code oops when calling mod_delayed_work
  2015-10-29 14:31 timer code oops when calling mod_delayed_work Jeff Layton
@ 2015-10-29 17:58 ` Jeff Layton
  2015-10-31  2:00   ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff Layton @ 2015-10-29 17:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Tejun Heo, bfields, Michael Skralivetsky, Chris Worley, Trond Myklebust

On Thu, 29 Oct 2015 10:31:13 -0400
Jeff Layton <jlayton@poochiereds.net> wrote:

> Our QA guys hit the following warning and oops today on a relatively
> recent upstream kernel. They were running a fairly busy NFSv3 I/O test
> against this server:
> 
> [   71.074180] NFSD: starting 90-second grace period (net ffffffff81cf3b40)
> [  221.146802] Ebtables v2.0 unregistered
> [  318.909835] ------------[ cut here ]------------
> [  318.909846] WARNING: CPU: 22 PID: 3402 at kernel/workqueue.c:1444 __queue_delayed_work+0xe7/0x190()
> [  318.909848] Modules linked in: bridge stp llc intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel kvm ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support joydev mei_me mei sb_edac shpchp ipmi_si lpc_ich i2c_i801 edac_core tpm_tis wmi ipmi_msghandler tpm acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c mlx4_en vxlan ip6_udp_tunnel udp_tunnel ast drm_kms_helper ttm crc32c_intel igb drm mlx4_core ptp nvme pps_core dca i2c_algo_bit fjes [last unloaded: iptable_security]
> [  318.909889] CPU: 22 PID: 3402 Comm: nfsd Not tainted 4.3.0-0.rc7.git0.1.fc24.x86_64 #1
> [  318.909891] Hardware name: Quanta D51B/D51BP (1G LOM), BIOS S2BP3A05 01/29/2015
> [  318.909892]  0000000000000000 00000000319846b5 ffff880322d0f440 ffffffff813a48cf
> [  318.909895]  0000000000000000 ffff880322d0f478 ffffffff8109f122 0000000000000400
> [  318.909898]  ffff88085f48fc00 ffffffffa03623e0 ffff88085f48fc00 000000000001d4c0
> [  318.909901] Call Trace:
> [  318.909906]  [<ffffffff813a48cf>] dump_stack+0x44/0x55
> [  318.909909]  [<ffffffff8109f122>] warn_slowpath_common+0x82/0xc0
> [  318.909913]  [<ffffffff8109f26a>] warn_slowpath_null+0x1a/0x20
> [  318.909915]  [<ffffffff810b67a7>] __queue_delayed_work+0xe7/0x190
> [  318.909917]  [<ffffffff810b70de>] mod_delayed_work_on+0x5e/0xa0
> [  318.909931]  [<ffffffffa0333394>] prune_cache_entries+0x94/0xb0 [nfsd]
> [  318.909936]  [<ffffffffa03333be>] nfsd_reply_cache_scan+0xe/0x10 [nfsd]
> [  318.909941]  [<ffffffff811b78e2>] shrink_slab.part.42+0x1e2/0x380
> [  318.909944]  [<ffffffff811bbfc6>] shrink_zone+0x2b6/0x2c0
> [  318.909946]  [<ffffffff811bc145>] do_try_to_free_pages+0x175/0x400
> [  318.909949]  [<ffffffff811bc4da>] try_to_free_pages+0x10a/0x150
> [  318.909954]  [<ffffffff811ae5b1>] __alloc_pages_nodemask+0x581/0x990
> [  318.909957]  [<ffffffff811b19f5>] ? account_page_dirtied+0xd5/0x160
> [  318.909960]  [<ffffffff811f7b41>] alloc_pages_current+0x91/0x100
> [  318.909963]  [<ffffffff811a45bb>] __page_cache_alloc+0xab/0xe0
> [  318.909965]  [<ffffffff811a4c94>] pagecache_get_page+0x84/0x1e0
> [  318.909967]  [<ffffffff811a4e16>] grab_cache_page_write_begin+0x26/0x40
> [  318.909990]  [<ffffffffa0221cd4>] xfs_vm_write_begin+0x34/0xe0 [xfs]
> [  318.910005]  [<ffffffffa0221a39>] ? xfs_vm_write_end+0x29/0x80 [xfs]
> [  318.910007]  [<ffffffff811a418e>] generic_perform_write+0xce/0x1c0
> [  318.910023]  [<ffffffffa022efb7>] xfs_file_buffered_aio_write+0x127/0x250 [xfs]
> [  318.910028]  [<ffffffffa032be60>] ? nfserrno+0x60/0x60 [nfsd]
> [  318.910031]  [<ffffffff812035e9>] ? __kmalloc+0x1b9/0x260
> [  318.910034]  [<ffffffff81223617>] ? rw_copy_check_uvector+0x57/0x110
> [  318.910047]  [<ffffffffa022f1d5>] xfs_file_write_iter+0xf5/0x110 [xfs]
> [  318.910050]  [<ffffffff81221e3c>] do_iter_readv_writev+0x6c/0xa0
> [  318.910053]  [<ffffffff812229bf>] do_readv_writev+0x18f/0x230
> [  318.910065]  [<ffffffffa022f0e0>] ? xfs_file_buffered_aio_write+0x250/0x250 [xfs]
> [  318.910068]  [<ffffffff813284bb>] ? security_file_open+0x8b/0x90
> [  318.910070]  [<ffffffff8121ffa4>] ? do_dentry_open+0x264/0x2f0
> [  318.910084]  [<ffffffffa022ced0>] ? xfs_extent_busy_ag_cmp+0x20/0x20 [xfs]
> [  318.910086]  [<ffffffff81222ae9>] vfs_writev+0x39/0x50
> [  318.910092]  [<ffffffffa032df4d>] nfsd_vfs_write+0xbd/0x380 [nfsd]
> [  318.910097]  [<ffffffffa0330568>] nfsd_write+0xd8/0xf0 [nfsd]
> [  318.910103]  [<ffffffffa03367fc>] nfsd3_proc_write+0xbc/0x150 [nfsd]
> [  318.910107]  [<ffffffffa0328e68>] nfsd_dispatch+0xb8/0x200 [nfsd]
> [  318.910128]  [<ffffffffa02d39dc>] svc_process_common+0x40c/0x650 [sunrpc]
> [  318.910137]  [<ffffffffa02d4b14>] svc_process+0xf4/0x180 [sunrpc]
> [  318.910142]  [<ffffffffa03288bf>] nfsd+0xef/0x160 [nfsd]
> [  318.910146]  [<ffffffffa03287d0>] ? nfsd_destroy+0x60/0x60 [nfsd]
> [  318.910149]  [<ffffffff810bd6a8>] kthread+0xd8/0xf0
> [  318.910152]  [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
> [  318.910155]  [<ffffffff8177f19f>] ret_from_fork+0x3f/0x70
> [  318.910158]  [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
> [  318.910160] ---[ end trace db375019f3368852 ]---
> [  318.910169] ------------[ cut here ]------------
> [  318.910201] kernel BUG at kernel/time/timer.c:977!
> [  318.910226] invalid opcode: 0000 [#1] SMP 
> [  318.910249] Modules linked in: bridge stp llc intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel kvm ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support joydev mei_me mei sb_edac shpchp ipmi_si lpc_ich i2c_i801 edac_core tpm_tis wmi ipmi_msghandler tpm acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c mlx4_en vxlan ip6_udp_tunnel udp_tunnel ast drm_kms_helper ttm crc32c_intel igb drm mlx4_core ptp nvme pps_core dca i2c_algo_bit fjes [last unloaded: iptable_security]
> [  318.910588] CPU: 22 PID: 3402 Comm: nfsd Tainted: G        W       4.3.0-0.rc7.git0.1.fc24.x86_64 #1
> [  318.910629] Hardware name: Quanta D51B/D51BP (1G LOM), BIOS S2BP3A05 01/29/2015
> [  318.910664] task: ffff880322c49e00 ti: ffff880322d0c000 task.ti: ffff880322d0c000
> [  318.910698] RIP: 0010:[<ffffffff811076a6>]  [<ffffffff811076a6>] add_timer_on+0xc6/0x110
> [  318.910736] RSP: 0018:ffff880322d0f458  EFLAGS: 00010082
> [  318.910761] RAX: 0000000000000000 RBX: 0000000000000016 RCX: 0000000100021e85
> [  318.910793] RDX: ffffffffa03623c0 RSI: 0000000000000016 RDI: ffffffffa03623e0
> [  318.910826] RBP: ffff880322d0f488 R08: 000000000000000a R09: 00000000000005f2
> [  318.910856] R10: 000000000000027c R11: 00000000000005f2 R12: ffff88085f48fc00
> [  318.910888] R13: ffff88085fd8e080 R14: ffff88085f48fc00 R15: 000000000001d4c0
> [  318.910922] FS:  0000000000000000(0000) GS:ffff88085fd80000(0000) knlGS:0000000000000000
> [  318.910956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  318.910981] CR2: 00007f9d6ea41000 CR3: 0000000001c0b000 CR4: 00000000001406e0
> [  318.911012] Stack:
> [  318.911025]  ffff88085f48fc00 0000000000000016 ffff88085f48fc00 ffffffffa03623e0
> [  318.911069]  ffff88085f48fc00 000000000001d4c0 ffff880322d0f4c0 ffffffff810b6749
> [  318.911111]  000000000001d4c0 ffffffffa03623c0 0000000000000001 ffffffffa03623c0
> [  318.911153] Call Trace:
> [  318.911170]  [<ffffffff810b6749>] __queue_delayed_work+0x89/0x190
> [  318.911201]  [<ffffffff810b70de>] mod_delayed_work_on+0x5e/0xa0
> [  318.911236]  [<ffffffffa0333394>] prune_cache_entries+0x94/0xb0 [nfsd]
> [  318.911267]  [<ffffffffa03333be>] nfsd_reply_cache_scan+0xe/0x10 [nfsd]
> [  318.911299]  [<ffffffff811b78e2>] shrink_slab.part.42+0x1e2/0x380
> [  318.911330]  [<ffffffff811bbfc6>] shrink_zone+0x2b6/0x2c0
> [  318.911358]  [<ffffffff811bc145>] do_try_to_free_pages+0x175/0x400
> [  318.911391]  [<ffffffff811bc4da>] try_to_free_pages+0x10a/0x150
> [  318.911422]  [<ffffffff811ae5b1>] __alloc_pages_nodemask+0x581/0x990
> [  318.911454]  [<ffffffff811b19f5>] ? account_page_dirtied+0xd5/0x160
> [  318.911483]  [<ffffffff811f7b41>] alloc_pages_current+0x91/0x100
> [  318.911513]  [<ffffffff811a45bb>] __page_cache_alloc+0xab/0xe0
> [  318.911543]  [<ffffffff811a4c94>] pagecache_get_page+0x84/0x1e0
> [  318.911573]  [<ffffffff811a4e16>] grab_cache_page_write_begin+0x26/0x40
> [  318.911620]  [<ffffffffa0221cd4>] xfs_vm_write_begin+0x34/0xe0 [xfs]
> [  318.911661]  [<ffffffffa0221a39>] ? xfs_vm_write_end+0x29/0x80 [xfs]
> [  318.912748]  [<ffffffff811a418e>] generic_perform_write+0xce/0x1c0
> [  318.913810]  [<ffffffffa022efb7>] xfs_file_buffered_aio_write+0x127/0x250 [xfs]
> [  318.914858]  [<ffffffffa032be60>] ? nfserrno+0x60/0x60 [nfsd]
> [  318.915715]  [<ffffffff812035e9>] ? __kmalloc+0x1b9/0x260
> [  318.916360]  [<ffffffff81223617>] ? rw_copy_check_uvector+0x57/0x110
> [  318.917014]  [<ffffffffa022f1d5>] xfs_file_write_iter+0xf5/0x110 [xfs]
> [  318.917650]  [<ffffffff81221e3c>] do_iter_readv_writev+0x6c/0xa0
> [  318.918312]  [<ffffffff812229bf>] do_readv_writev+0x18f/0x230
> [  318.918954]  [<ffffffffa022f0e0>] ? xfs_file_buffered_aio_write+0x250/0x250 [xfs]
> [  318.919575]  [<ffffffff813284bb>] ? security_file_open+0x8b/0x90
> [  318.920178]  [<ffffffff8121ffa4>] ? do_dentry_open+0x264/0x2f0
> [  318.920771]  [<ffffffffa022ced0>] ? xfs_extent_busy_ag_cmp+0x20/0x20 [xfs]
> [  318.921354]  [<ffffffff81222ae9>] vfs_writev+0x39/0x50
> [  318.921961]  [<ffffffffa032df4d>] nfsd_vfs_write+0xbd/0x380 [nfsd]
> [  318.922509]  [<ffffffffa0330568>] nfsd_write+0xd8/0xf0 [nfsd]
> [  318.923045]  [<ffffffffa03367fc>] nfsd3_proc_write+0xbc/0x150 [nfsd]
> [  318.923574]  [<ffffffffa0328e68>] nfsd_dispatch+0xb8/0x200 [nfsd]
> [  318.924111]  [<ffffffffa02d39dc>] svc_process_common+0x40c/0x650 [sunrpc]
> [  318.924633]  [<ffffffffa02d4b14>] svc_process+0xf4/0x180 [sunrpc]
> [  318.925144]  [<ffffffffa03288bf>] nfsd+0xef/0x160 [nfsd]
> [  318.925654]  [<ffffffffa03287d0>] ? nfsd_destroy+0x60/0x60 [nfsd]
> [  318.926163]  [<ffffffff810bd6a8>] kthread+0xd8/0xf0
> [  318.926668]  [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
> [  318.927189]  [<ffffffff8177f19f>] ret_from_fork+0x3f/0x70
> [  318.927685]  [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
> [  318.928210] Code: 4d 85 c0 75 e4 48 89 de 4c 89 ef e8 55 f9 ff ff 48 8b 34 24 4c 89 ef e8 19 72 67 00 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 48 83 7f 38 00 48 8b 45 08 0f 85 5f ff ff ff 48 89 47 38 
> [  318.929315] RIP  [<ffffffff811076a6>] add_timer_on+0xc6/0x110
> [  318.930019]  RSP <ffff880322d0f458>
> 
> Basically the shrinker kicked in. After we shrank a bunch of cache
> entries, we tried to mod_delayed_work on the cache cleaner job so that
> it wouldn't run too soon, and that's where the code oopsed.
> 
> Is our usage of mod_delayed_work in prune_cache_entries() wrong in some
> way? Or is there some breakage in the workqueue or timer code?
> 

Ok, a bit more analysis...

We did get a vmcore from this. Here is the "cache_cleaner" delayed work:

crash> p cache_cleaner
cache_cleaner = $12 = {
  work = {
    data = {
      counter = 0xfffffffe1
    }, 
    entry = {
      next = 0xffffffffa03623c8 <cache_cleaner+8>, 
      prev = 0xffffffffa03623c8 <cache_cleaner+8>
    }, 
    func = 0xffffffffa03333c0 <cache_cleaner_func>
  }, 
  timer = {
    entry = {
      next = 0x0, 
      pprev = 0xffff88085fd0eaf8
    }, 
    expires = 0x100021e99, 
    function = 0xffffffff810b66a0 <delayed_work_timer_fn>, 
    data = 0xffffffffa03623c0, 
    flags = 0x200014, 
    slack = 0xffffffff, 
    start_pid = 0x0, 
    start_site = 0x0, 
    start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  }, 
  wq = 0xffff88085f48fc00, 
  cpu = 0x14
}

So the PENDING bit is set (lowest bit in data.counter), and timer->entry.pprev
pprev pointer is not NULL (so timer_pending is true). I also see that
there are several nfsd threads running the shrinker at the same time.

There is one potential problem that I see, but I'd appreciate someone
sanity checking me on this. Here is mod_delayed_work_on:

------------------[snip]------------------------
bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
                         struct delayed_work *dwork, unsigned long delay)
{
        unsigned long flags;
        int ret;

        do {
                ret = try_to_grab_pending(&dwork->work, true, &flags);
        } while (unlikely(ret == -EAGAIN));

        if (likely(ret >= 0)) {
                __queue_delayed_work(cpu, wq, dwork, delay);
                local_irq_restore(flags);
        }

        /* -ENOENT from try_to_grab_pending() becomes %true */
        return ret;
}
------------------[snip]------------------------

...and here is the beginning of try_to_grab_pending:

------------------[snip]------------------------
        /* try to steal the timer if it exists */
        if (is_dwork) {
                struct delayed_work *dwork = to_delayed_work(work);

                /*
                 * dwork->timer is irqsafe.  If del_timer() fails, it's
                 * guaranteed that the timer is not queued anywhere and not
                 * running on the local CPU.
                 */
                if (likely(del_timer(&dwork->timer)))
                        return 1;
        }

        /* try to claim PENDING the normal way */
        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))
                return 0;
------------------[snip]------------------------


...so if del_timer returns true, we'll return 1 from
try_to_grab_pending without actually setting the
WORK_STRUCT_PENDING_BIT, and then will end up calling
__queue_delayed_work.

That seems wrong to me -- shouldn't we be ensuring that that bit is set
when returning 1 from try_to_grab_pending to guard against concurrent
queue_delayed_work_on calls?

-- 
Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: timer code oops when calling mod_delayed_work
  2015-10-29 17:58 ` Jeff Layton
@ 2015-10-31  2:00   ` Tejun Heo
  2015-10-31 11:34     ` Jeff Layton
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2015-10-31  2:00 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-kernel, bfields, Michael Skralivetsky, Chris Worley,
	Trond Myklebust, Lai Jiangshan

(cc'ing Lai)

Hello, Jeff.

On Thu, Oct 29, 2015 at 01:58:36PM -0400, Jeff Layton wrote:
> crash> p cache_cleaner
> cache_cleaner = $12 = {
>   work = {
>     data = {
>       counter = 0xfffffffe1

If I'm not mistaken, PENDING, flush color 14, OFFQ and POOL_NONE.

>     }, 
>     entry = {
>       next = 0xffffffffa03623c8 <cache_cleaner+8>, 
>       prev = 0xffffffffa03623c8 <cache_cleaner+8>

Empty entry.

>     }, 
>     func = 0xffffffffa03333c0 <cache_cleaner_func>
>   }, 
>   timer = {
>     entry = {
>       next = 0x0, 
>       pprev = 0xffff88085fd0eaf8
>     }, 
>     expires = 0x100021e99, 
>     function = 0xffffffff810b66a0 <delayed_work_timer_fn>, 
>     data = 0xffffffffa03623c0, 
>     flags = 0x200014, 
>     slack = 0xffffffff, 
>     start_pid = 0x0, 
>     start_site = 0x0, 
>     start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
>   }, 
>   wq = 0xffff88085f48fc00, 
>   cpu = 0x14
> }
> 
> So the PENDING bit is set (lowest bit in data.counter), and timer->entry.pprev
> pprev pointer is not NULL (so timer_pending is true). I also see that
> there are several nfsd threads running the shrinker at the same time.
> 
> There is one potential problem that I see, but I'd appreciate someone
> sanity checking me on this. Here is mod_delayed_work_on:
...
> ...and here is the beginning of try_to_grab_pending:
> 
> ------------------[snip]------------------------
>         /* try to steal the timer if it exists */
>         if (is_dwork) {
>                 struct delayed_work *dwork = to_delayed_work(work);
> 
>                 /*
>                  * dwork->timer is irqsafe.  If del_timer() fails, it's
>                  * guaranteed that the timer is not queued anywhere and not
>                  * running on the local CPU.
>                  */
>                 if (likely(del_timer(&dwork->timer)))
>                         return 1;
>         }
> 
>         /* try to claim PENDING the normal way */
>         if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))
>                 return 0;
> ------------------[snip]------------------------
> 
> 
> ...so if del_timer returns true, we'll return 1 from
> try_to_grab_pending without actually setting the
> WORK_STRUCT_PENDING_BIT, and then will end up calling
> __queue_delayed_work.
> 
> That seems wrong to me -- shouldn't we be ensuring that that bit is set
> when returning 1 from try_to_grab_pending to guard against concurrent
> queue_delayed_work_on calls?

But if try_to_grab_pending() succeeded at stealing dwork->timer, it's
known that the PENDING bit must already be set.  IOW, the bit is
stolen together with the timer.

Heh, this one is tricky.  Yeah, try_to_grab_pending() missing PENDING
would explain the failure but I can't see how it'd leak at the moment.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: timer code oops when calling mod_delayed_work
  2015-10-31  2:00   ` Tejun Heo
@ 2015-10-31 11:34     ` Jeff Layton
  2015-10-31 21:31       ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff Layton @ 2015-10-31 11:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, bfields, Michael Skralivetsky, Chris Worley,
	Trond Myklebust, Lai Jiangshan

On Sat, 31 Oct 2015 11:00:12 +0900
Tejun Heo <tj@kernel.org> wrote:

> (cc'ing Lai)
> 
> Hello, Jeff.
> 
> On Thu, Oct 29, 2015 at 01:58:36PM -0400, Jeff Layton wrote:
> > crash> p cache_cleaner
> > cache_cleaner = $12 = {
> >   work = {
> >     data = {
> >       counter = 0xfffffffe1
> 
> If I'm not mistaken, PENDING, flush color 14, OFFQ and POOL_NONE.
> 
> >     }, 
> >     entry = {
> >       next = 0xffffffffa03623c8 <cache_cleaner+8>, 
> >       prev = 0xffffffffa03623c8 <cache_cleaner+8>
> 
> Empty entry.
> 
> >     }, 
> >     func = 0xffffffffa03333c0 <cache_cleaner_func>
> >   }, 
> >   timer = {
> >     entry = {
> >       next = 0x0, 
> >       pprev = 0xffff88085fd0eaf8
> >     }, 
> >     expires = 0x100021e99, 
> >     function = 0xffffffff810b66a0 <delayed_work_timer_fn>, 
> >     data = 0xffffffffa03623c0, 
> >     flags = 0x200014, 
> >     slack = 0xffffffff, 
> >     start_pid = 0x0, 
> >     start_site = 0x0, 
> >     start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
> >   }, 
> >   wq = 0xffff88085f48fc00, 
> >   cpu = 0x14
> > }
> > 
> > So the PENDING bit is set (lowest bit in data.counter), and timer->entry.pprev
> > pprev pointer is not NULL (so timer_pending is true). I also see that
> > there are several nfsd threads running the shrinker at the same time.
> > 
> > There is one potential problem that I see, but I'd appreciate someone
> > sanity checking me on this. Here is mod_delayed_work_on:
> ...
> > ...and here is the beginning of try_to_grab_pending:
> > 
> > ------------------[snip]------------------------
> >         /* try to steal the timer if it exists */
> >         if (is_dwork) {
> >                 struct delayed_work *dwork = to_delayed_work(work);
> > 
> >                 /*
> >                  * dwork->timer is irqsafe.  If del_timer() fails, it's
> >                  * guaranteed that the timer is not queued anywhere and not
> >                  * running on the local CPU.
> >                  */
> >                 if (likely(del_timer(&dwork->timer)))
> >                         return 1;
> >         }
> > 
> >         /* try to claim PENDING the normal way */
> >         if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))
> >                 return 0;
> > ------------------[snip]------------------------
> > 
> > 
> > ...so if del_timer returns true, we'll return 1 from
> > try_to_grab_pending without actually setting the
> > WORK_STRUCT_PENDING_BIT, and then will end up calling
> > __queue_delayed_work.
> > 
> > That seems wrong to me -- shouldn't we be ensuring that that bit is set
> > when returning 1 from try_to_grab_pending to guard against concurrent
> > queue_delayed_work_on calls?
> 
> But if try_to_grab_pending() succeeded at stealing dwork->timer, it's
> known that the PENDING bit must already be set.  IOW, the bit is
> stolen together with the timer.
> 
> Heh, this one is tricky.  Yeah, try_to_grab_pending() missing PENDING
> would explain the failure but I can't see how it'd leak at the moment.
> 

Thanks Tejun. Yeah, I realized that after sending the response above.

If you successfully delete the timer the timer then the PENDING bit
should already be set. Might be worth throwing in something like this,
just before the return 1:

    WARN_ON(!test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))

...but I doubt it would fire. I think it's likely that the bug is
elsewhere.

The other thing is that we've had this code in place for a couple of
years now, and this is the first time I've seen an oops like this. I
suspect that this may be a recent regression, but I don't know that for
sure.

I have asked Chris and Michael to see if they can bisect it down, but
it may be a bit before they can get that done. Any insight you might
have in the meantime would helpful.
-- 
Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: timer code oops when calling mod_delayed_work
  2015-10-31 11:34     ` Jeff Layton
@ 2015-10-31 21:31       ` Tejun Heo
  2015-10-31 21:54         ` Jeff Layton
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2015-10-31 21:31 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-kernel, bfields, Michael Skralivetsky, Chris Worley,
	Trond Myklebust, Lai Jiangshan

Hello, Jeff.

On Sat, Oct 31, 2015 at 07:34:00AM -0400, Jeff Layton wrote:
> > Heh, this one is tricky.  Yeah, try_to_grab_pending() missing PENDING
> > would explain the failure but I can't see how it'd leak at the moment.
> 
> Thanks Tejun. Yeah, I realized that after sending the response above.
> 
> If you successfully delete the timer the timer then the PENDING bit
> should already be set. Might be worth throwing in something like this,
> just before the return 1:
> 
>     WARN_ON(!test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))
>
> ...but I doubt it would fire. I think it's likely that the bug is
> elsewhere.

I doubt that'd trigger either but it'd still be a good idea to throw
that in just in case.

> The other thing is that we've had this code in place for a couple of
> years now, and this is the first time I've seen an oops like this. I
> suspect that this may be a recent regression, but I don't know that for
> sure.

I can't think of any recent changes which could affect this.  This
part of code has remained unchanged for years at least from the
workqueue side.

> I have asked Chris and Michael to see if they can bisect it down, but
> it may be a bit before they can get that done. Any insight you might
> have in the meantime would helpful.

Yeah, I'd love to find out how reproducible the issue is.  If the
problem is rarely reproducible, it might make sense to try
instrumentation before trying bisection as it *could* be a latent bug
which has been there all along and bisecting to the commit introducing
the code wouldn't help us too much.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: timer code oops when calling mod_delayed_work
  2015-10-31 21:31       ` Tejun Heo
@ 2015-10-31 21:54         ` Jeff Layton
  2015-11-02 19:48           ` Chris Worley
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff Layton @ 2015-10-31 21:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, bfields, Michael Skralivetsky, Chris Worley,
	Trond Myklebust, Lai Jiangshan

On Sat, 31 Oct 2015 17:31:07 -0400
Tejun Heo <tj@kernel.org> wrote:

> Hello, Jeff.
> 
> On Sat, Oct 31, 2015 at 07:34:00AM -0400, Jeff Layton wrote:
> > > Heh, this one is tricky.  Yeah, try_to_grab_pending() missing PENDING
> > > would explain the failure but I can't see how it'd leak at the moment.
> > 
> > Thanks Tejun. Yeah, I realized that after sending the response above.
> > 
> > If you successfully delete the timer the timer then the PENDING bit
> > should already be set. Might be worth throwing in something like this,
> > just before the return 1:
> > 
> >     WARN_ON(!test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))
> >
> > ...but I doubt it would fire. I think it's likely that the bug is
> > elsewhere.
> 
> I doubt that'd trigger either but it'd still be a good idea to throw
> that in just in case.
> 
> > The other thing is that we've had this code in place for a couple of
> > years now, and this is the first time I've seen an oops like this. I
> > suspect that this may be a recent regression, but I don't know that for
> > sure.
> 
> I can't think of any recent changes which could affect this.  This
> part of code has remained unchanged for years at least from the
> workqueue side.
> 
> > I have asked Chris and Michael to see if they can bisect it down, but
> > it may be a bit before they can get that done. Any insight you might
> > have in the meantime would helpful.
> 
> Yeah, I'd love to find out how reproducible the issue is.  If the
> problem is rarely reproducible, it might make sense to try
> instrumentation before trying bisection as it *could* be a latent bug
> which has been there all along and bisecting to the commit introducing
> the code wouldn't help us too much.
> 

It seems fairly reproducible, at least on v4.3-rc7 kernels:

This came about when I asked them to perf test some nfsd patches that I
have queued up. I patched a Fedora 4.3-rc7 kernel and wanted to see
what the perf delta was (with NFSv3, fwiw):

    Patched kernels here: http://koji.fedoraproject.org/koji/taskinfo?taskID=11598089

    Unpatched kernels here: http://koji.fedoraproject.org/koji/buildinfo?buildID=694377

Michael was using the SPEC SFS VDI workload to test, and was able to
get the same panic on both kernels. So it does seem to be reproducible.
It might even be possible to tune the VM to make the shrinker fire more
often, which may help tickle this more.

In any case, I've asked them to try something v4.2-ish and see if it's
reproducible there, and then try v4.1 if it is. I figure anything
earlier is probably not worth testing if it still fails on v4.1. If it
turns out not to be reproducible on those earlier kernels then we can
bisect from there to track it down.

Thanks again!
-- 
Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: timer code oops when calling mod_delayed_work
  2015-10-31 21:54         ` Jeff Layton
@ 2015-11-02 19:48           ` Chris Worley
  2015-11-02 19:56             ` Jeff Layton
  0 siblings, 1 reply; 17+ messages in thread
From: Chris Worley @ 2015-11-02 19:48 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Tejun Heo, linux-kernel, bfields, Michael Skralivetsky,
	Trond Myklebust, Lai Jiangshan

On Sat, Oct 31, 2015 at 11:54 AM, Jeff Layton <jlayton@poochiereds.net> wrote:
> On Sat, 31 Oct 2015 17:31:07 -0400
> Tejun Heo <tj@kernel.org> wrote:
...
>>
>> > I have asked Chris and Michael to see if they can bisect it down, but
>> > it may be a bit before they can get that done. Any insight you might
>> > have in the meantime would helpful.
>>
>> Yeah, I'd love to find out how reproducible the issue is.  If the
>> problem is rarely reproducible, it might make sense to try
>> instrumentation before trying bisection as it *could* be a latent bug
>> which has been there all along and bisecting to the commit introducing
>> the code wouldn't help us too much.
>>
>
> It seems fairly reproducible, at least on v4.3-rc7 kernels:
>
> This came about when I asked them to perf test some nfsd patches that I
> have queued up. I patched a Fedora 4.3-rc7 kernel and wanted to see
> what the perf delta was (with NFSv3, fwiw):
>
>     Patched kernels here: http://koji.fedoraproject.org/koji/taskinfo?taskID=11598089
>
>     Unpatched kernels here: http://koji.fedoraproject.org/koji/buildinfo?buildID=694377
>
> Michael was using the SPEC SFS VDI workload to test, and was able to
> get the same panic on both kernels. So it does seem to be reproducible.
> It might even be possible to tune the VM to make the shrinker fire more
> often, which may help tickle this more.
>
> In any case, I've asked them to try something v4.2-ish and see if it's
> reproducible there, and then try v4.1 if it is. I figure anything
> earlier is probably not worth testing if it still fails on v4.1. If it
> turns out not to be reproducible on those earlier kernels then we can
> bisect from there to track it down.

The trick seems to be the NFS thread count: I initially though this
was SFS/VDI specific, but when I ratcheted up the thread count to what
Michael was using, 256 threads oopses on fio (throughput) benchmarks
too.

In bisecting kernels, it appeared between 4.2.3-200 and 4.2.5-200 (all
the 4.2.4 kernels were bad).

Jeff has a lead on this...

Chris

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: timer code oops when calling mod_delayed_work
  2015-11-02 19:48           ` Chris Worley
@ 2015-11-02 19:56             ` Jeff Layton
  2015-11-03  1:33               ` Jeff Layton
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff Layton @ 2015-11-02 19:56 UTC (permalink / raw)
  To: Chris Worley
  Cc: Tejun Heo, linux-kernel, bfields, Michael Skralivetsky,
	Trond Myklebust, Lai Jiangshan

On Mon, 2 Nov 2015 09:48:41 -1000
Chris Worley <chris.worley@primarydata.com> wrote:

> On Sat, Oct 31, 2015 at 11:54 AM, Jeff Layton <jlayton@poochiereds.net> wrote:
> > On Sat, 31 Oct 2015 17:31:07 -0400
> > Tejun Heo <tj@kernel.org> wrote:
> ...
> >>
> >> > I have asked Chris and Michael to see if they can bisect it down, but
> >> > it may be a bit before they can get that done. Any insight you might
> >> > have in the meantime would helpful.
> >>
> >> Yeah, I'd love to find out how reproducible the issue is.  If the
> >> problem is rarely reproducible, it might make sense to try
> >> instrumentation before trying bisection as it *could* be a latent bug
> >> which has been there all along and bisecting to the commit introducing
> >> the code wouldn't help us too much.
> >>
> >
> > It seems fairly reproducible, at least on v4.3-rc7 kernels:
> >
> > This came about when I asked them to perf test some nfsd patches that I
> > have queued up. I patched a Fedora 4.3-rc7 kernel and wanted to see
> > what the perf delta was (with NFSv3, fwiw):
> >
> >     Patched kernels here: http://koji.fedoraproject.org/koji/taskinfo?taskID=11598089
> >
> >     Unpatched kernels here: http://koji.fedoraproject.org/koji/buildinfo?buildID=694377
> >
> > Michael was using the SPEC SFS VDI workload to test, and was able to
> > get the same panic on both kernels. So it does seem to be reproducible.
> > It might even be possible to tune the VM to make the shrinker fire more
> > often, which may help tickle this more.
> >
> > In any case, I've asked them to try something v4.2-ish and see if it's
> > reproducible there, and then try v4.1 if it is. I figure anything
> > earlier is probably not worth testing if it still fails on v4.1. If it
> > turns out not to be reproducible on those earlier kernels then we can
> > bisect from there to track it down.
> 
> The trick seems to be the NFS thread count: I initially though this
> was SFS/VDI specific, but when I ratcheted up the thread count to what
> Michael was using, 256 threads oopses on fio (throughput) benchmarks
> too.
> 
> In bisecting kernels, it appeared between 4.2.3-200 and 4.2.5-200 (all
> the 4.2.4 kernels were bad).
> 
> Jeff has a lead on this...
> 
> Chris

Thanks Chris. This commit went in between those kernels:

commit a7c571f2e3ff9243ba41c242486f53dbca37d864
Author: Shaohua Li <shli@fb.com>
Date:   Wed Sep 30 09:05:30 2015 -0700

    workqueue: make sure delayed work run in local cpu
    
    commit 874bbfe600a660cba9c776b3957b1ce393151b76 upstream.
    

...and I suspect that it's the culprit. That commit causes this code to
always use add_timer_on, which seems to have different semantics from
add_timer.

I'm going to build a v4.2.5 kernel with that patch reverted to confirm
it, but it seems likely...

-- 
Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: timer code oops when calling mod_delayed_work
  2015-11-02 19:56             ` Jeff Layton
@ 2015-11-03  1:33               ` Jeff Layton
  2015-11-03 17:55                 ` Jeff Layton
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff Layton @ 2015-11-03  1:33 UTC (permalink / raw)
  To: Chris Worley
  Cc: Tejun Heo, linux-kernel, bfields, Michael Skralivetsky,
	Trond Myklebust, Shaohua Li

On Mon, 2 Nov 2015 14:56:33 -0500
Jeff Layton <jlayton@poochiereds.net> wrote:

> On Mon, 2 Nov 2015 09:48:41 -1000
> Chris Worley <chris.worley@primarydata.com> wrote:
> 
> > On Sat, Oct 31, 2015 at 11:54 AM, Jeff Layton <jlayton@poochiereds.net> wrote:
> > > On Sat, 31 Oct 2015 17:31:07 -0400
> > > Tejun Heo <tj@kernel.org> wrote:
> > ...
> > >>
> > >> > I have asked Chris and Michael to see if they can bisect it down, but
> > >> > it may be a bit before they can get that done. Any insight you might
> > >> > have in the meantime would helpful.
> > >>
> > >> Yeah, I'd love to find out how reproducible the issue is.  If the
> > >> problem is rarely reproducible, it might make sense to try
> > >> instrumentation before trying bisection as it *could* be a latent bug
> > >> which has been there all along and bisecting to the commit introducing
> > >> the code wouldn't help us too much.
> > >>
> > >
> > > It seems fairly reproducible, at least on v4.3-rc7 kernels:
> > >
> > > This came about when I asked them to perf test some nfsd patches that I
> > > have queued up. I patched a Fedora 4.3-rc7 kernel and wanted to see
> > > what the perf delta was (with NFSv3, fwiw):
> > >
> > >     Patched kernels here: http://koji.fedoraproject.org/koji/taskinfo?taskID=11598089
> > >
> > >     Unpatched kernels here: http://koji.fedoraproject.org/koji/buildinfo?buildID=694377
> > >
> > > Michael was using the SPEC SFS VDI workload to test, and was able to
> > > get the same panic on both kernels. So it does seem to be reproducible.
> > > It might even be possible to tune the VM to make the shrinker fire more
> > > often, which may help tickle this more.
> > >
> > > In any case, I've asked them to try something v4.2-ish and see if it's
> > > reproducible there, and then try v4.1 if it is. I figure anything
> > > earlier is probably not worth testing if it still fails on v4.1. If it
> > > turns out not to be reproducible on those earlier kernels then we can
> > > bisect from there to track it down.
> > 
> > The trick seems to be the NFS thread count: I initially though this
> > was SFS/VDI specific, but when I ratcheted up the thread count to what
> > Michael was using, 256 threads oopses on fio (throughput) benchmarks
> > too.
> > 
> > In bisecting kernels, it appeared between 4.2.3-200 and 4.2.5-200 (all
> > the 4.2.4 kernels were bad).
> > 
> > Jeff has a lead on this...
> > 
> > Chris
> 
> Thanks Chris. This commit went in between those kernels:
> 
> commit a7c571f2e3ff9243ba41c242486f53dbca37d864
> Author: Shaohua Li <shli@fb.com>
> Date:   Wed Sep 30 09:05:30 2015 -0700
> 
>     workqueue: make sure delayed work run in local cpu
>     
>     commit 874bbfe600a660cba9c776b3957b1ce393151b76 upstream.
>     
> 
> ...and I suspect that it's the culprit. That commit causes this code to
> always use add_timer_on, which seems to have different semantics from
> add_timer.
> 
> I'm going to build a v4.2.5 kernel with that patch reverted to confirm
> it, but it seems likely...
> 

(dropping Lai's address, and adding Shaohua Li's)

Ok, I built a kernel with that patch reverted and that seems to fix the
problem.

Looking at the patch, I guess the main difference is that we're no
longer using add_timer for unbound workqueue tasks. That said, we
should have possession of the PENDING bit before calling add_timer_on.

So...I'm a little stumped. Any thoughts as to where the race could be?
-- 
Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: timer code oops when calling mod_delayed_work
  2015-11-03  1:33               ` Jeff Layton
@ 2015-11-03 17:55                 ` Jeff Layton
  2015-11-03 22:54                   ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff Layton @ 2015-11-03 17:55 UTC (permalink / raw)
  To: Chris Worley
  Cc: Tejun Heo, linux-kernel, bfields, Michael Skralivetsky,
	Trond Myklebust, Shaohua Li

On Mon, 2 Nov 2015 20:33:39 -0500
Jeff Layton <jlayton@poochiereds.net> wrote:

> On Mon, 2 Nov 2015 14:56:33 -0500
> Jeff Layton <jlayton@poochiereds.net> wrote:
> 
> > On Mon, 2 Nov 2015 09:48:41 -1000
> > Chris Worley <chris.worley@primarydata.com> wrote:
> > 
> > > On Sat, Oct 31, 2015 at 11:54 AM, Jeff Layton <jlayton@poochiereds.net> wrote:
> > > > On Sat, 31 Oct 2015 17:31:07 -0400
> > > > Tejun Heo <tj@kernel.org> wrote:
> > > ...
> > > >>
> > > >> > I have asked Chris and Michael to see if they can bisect it down, but
> > > >> > it may be a bit before they can get that done. Any insight you might
> > > >> > have in the meantime would helpful.
> > > >>
> > > >> Yeah, I'd love to find out how reproducible the issue is.  If the
> > > >> problem is rarely reproducible, it might make sense to try
> > > >> instrumentation before trying bisection as it *could* be a latent bug
> > > >> which has been there all along and bisecting to the commit introducing
> > > >> the code wouldn't help us too much.
> > > >>
> > > >
> > > > It seems fairly reproducible, at least on v4.3-rc7 kernels:
> > > >
> > > > This came about when I asked them to perf test some nfsd patches that I
> > > > have queued up. I patched a Fedora 4.3-rc7 kernel and wanted to see
> > > > what the perf delta was (with NFSv3, fwiw):
> > > >
> > > >     Patched kernels here: http://koji.fedoraproject.org/koji/taskinfo?taskID=11598089
> > > >
> > > >     Unpatched kernels here: http://koji.fedoraproject.org/koji/buildinfo?buildID=694377
> > > >
> > > > Michael was using the SPEC SFS VDI workload to test, and was able to
> > > > get the same panic on both kernels. So it does seem to be reproducible.
> > > > It might even be possible to tune the VM to make the shrinker fire more
> > > > often, which may help tickle this more.
> > > >
> > > > In any case, I've asked them to try something v4.2-ish and see if it's
> > > > reproducible there, and then try v4.1 if it is. I figure anything
> > > > earlier is probably not worth testing if it still fails on v4.1. If it
> > > > turns out not to be reproducible on those earlier kernels then we can
> > > > bisect from there to track it down.
> > > 
> > > The trick seems to be the NFS thread count: I initially though this
> > > was SFS/VDI specific, but when I ratcheted up the thread count to what
> > > Michael was using, 256 threads oopses on fio (throughput) benchmarks
> > > too.
> > > 
> > > In bisecting kernels, it appeared between 4.2.3-200 and 4.2.5-200 (all
> > > the 4.2.4 kernels were bad).
> > > 
> > > Jeff has a lead on this...
> > > 
> > > Chris
> > 
> > Thanks Chris. This commit went in between those kernels:
> > 
> > commit a7c571f2e3ff9243ba41c242486f53dbca37d864
> > Author: Shaohua Li <shli@fb.com>
> > Date:   Wed Sep 30 09:05:30 2015 -0700
> > 
> >     workqueue: make sure delayed work run in local cpu
> >     
> >     commit 874bbfe600a660cba9c776b3957b1ce393151b76 upstream.
> >     
> > 
> > ...and I suspect that it's the culprit. That commit causes this code to
> > always use add_timer_on, which seems to have different semantics from
> > add_timer.
> > 
> > I'm going to build a v4.2.5 kernel with that patch reverted to confirm
> > it, but it seems likely...
> > 
> 
> (dropping Lai's address, and adding Shaohua Li's)
> 
> Ok, I built a kernel with that patch reverted and that seems to fix the
> problem.
> 
> Looking at the patch, I guess the main difference is that we're no
> longer using add_timer for unbound workqueue tasks. That said, we
> should have possession of the PENDING bit before calling add_timer_on.
> 
> So...I'm a little stumped. Any thoughts as to where the race could be?

Ok, I've been able to reproduce this with a small module that I wrote
as a reproducer:

[  197.502139] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  197.503073] IP: [<ffffffff8112d67c>] detach_if_pending+0x6c/0x190
[  197.503752] PGD 0 
[  197.504061] Oops: 0002 [#1] SMP 
[  197.504449] Modules linked in: wqthrash(OE-) snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core xfs snd_hwdep snd_seq libcrc32c snd_seq_device ppdev snd_pcm snd_timer snd soundcore parport_pc e1000 i2c_piix4 parport joydev virtio_balloon pvpanic acpi_cpufreq nfsd nfs_acl lockd auth_rpcgss grace sunrpc virtio_blk virtio_console qxl drm_kms_helper ttm virtio_pci virtio_ring serio_raw virtio drm ata_generic pata_acpi
[  197.509358] CPU: 1 PID: 794 Comm: kworker/u8:9 Tainted: G           OE   4.3.0-0.rc7.git2.1.fc24.x86_64 #1
[  197.510767] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  197.511850] Workqueue: wqthrash wqthrash_workfunc [wqthrash]
[  197.513141] task: ffff88020f918000 ti: ffff8802143a8000 task.ti: ffff8802143a8000
[  197.513958] RIP: 0010:[<ffffffff8112d67c>]  [<ffffffff8112d67c>] detach_if_pending+0x6c/0x190
[  197.514927] RSP: 0018:ffff8802143abc68  EFLAGS: 00010086
[  197.515539] RAX: dead000000000200 RBX: ffffffffa022e050 RCX: 000000000000000f
[  197.516552] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
[  197.517374] RBP: ffff8802143abc88 R08: 0000000000000001 R09: 0000000000000001
[  197.518194] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[  197.518992] R13: ffff880216fce2c0 R14: 0000000000000001 R15: 00000000000003e8
[  197.519817] FS:  0000000000000000(0000) GS:ffff880217000000(0000) knlGS:0000000000000000
[  197.520738] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  197.521410] CR2: 0000000000000000 CR3: 00000000daa9b000 CR4: 00000000000006e0
[  197.522234] Stack:
[  197.522584]  ffffffffa022e050 0000000000000000 ffff880216fce2c0 ffff88021688f800
[  197.524102]  ffff8802143abcc0 ffffffff8112db72 0000000000000086 00000000ba809f75
[  197.525594]  ffffffffa022e000 ffff8802143abd00 0000000000000001 ffff8802143abcf0
[  197.527082] Call Trace:
[  197.527581]  [<ffffffff8112db72>] del_timer+0x62/0xa0
[  197.532565]  [<ffffffff810c913e>] try_to_grab_pending+0xce/0x190
[  197.533253]  [<ffffffff810c9242>] mod_delayed_work_on+0x42/0xb0
[  197.533965]  [<ffffffffa022c081>] wqthrash_workfunc+0x61/0x90 [wqthrash]
[  197.534766]  [<ffffffff810c8470>] process_one_work+0x230/0x6a0
[  197.535683]  [<ffffffff810c83d9>] ? process_one_work+0x199/0x6a0
[  197.536378]  [<ffffffff810c892e>] worker_thread+0x4e/0x450
[  197.537004]  [<ffffffff810c88e0>] ? process_one_work+0x6a0/0x6a0
[  197.537724]  [<ffffffff810cf981>] kthread+0x101/0x120
[  197.538662]  [<ffffffff81104fa9>] ? trace_hardirqs_on_caller+0x129/0x1b0
[  197.539926]  [<ffffffff810cf880>] ? kthread_create_on_node+0x250/0x250
[  197.541160]  [<ffffffff81864fef>] ret_from_fork+0x3f/0x70
[  197.542187]  [<ffffffff810cf880>] ? kthread_create_on_node+0x250/0x250
[  197.543416] Code: 3b 89 e4 00 e8 06 64 ff ff 85 c0 74 0d 80 3d 45 61 e4 00 00 0f 84 c4 00 00 00 65 ff 0d de e7 ed 7e 48 8b 03 48 8b 53 08 48 85 c0 <48> 89 02 74 04 48 89 50 08 45 84 f6 74 08 48 c7 43 08 00 00 00 
[  197.547527] RIP  [<ffffffff8112d67c>] detach_if_pending+0x6c/0x190
[  197.548481]  RSP <ffff8802143abc68>
[  197.549220] CR2: 0000000000000000
[  197.549883] ---[ end trace 10dec6677e9e3a68 ]---

Apply this patch, build a wqthrash.ko module, plug it in and watch it
explode. It basically just spawns a bunch of (immediate) workqueue jobs
that attempt to schedule and modify a single piece of delayed_work.

I still don't have a lead on the actual bug, but this does seem to
tickle it fairly easily.

-----------------------8<------------------------------

[PATCH] wqthrash: add module that will thrash the workqueue code

...let's see if we can tickle the races in the delayed_work
infrastructure.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
---
 samples/Makefile            |  3 +-
 samples/wqthrash/Makefile   |  1 +
 samples/wqthrash/wqthrash.c | 94 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 97 insertions(+), 1 deletion(-)
 create mode 100644 samples/wqthrash/Makefile
 create mode 100644 samples/wqthrash/wqthrash.c

diff --git a/samples/Makefile b/samples/Makefile
index f00257bcc5a7..7e92dc91879d 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,5 @@
 # Makefile for Linux samples code
 
 obj-$(CONFIG_SAMPLES)	+= kobject/ kprobes/ trace_events/ livepatch/ \
-			   hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/
+			   hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ \
+			   seccomp/ wqthrash/
diff --git a/samples/wqthrash/Makefile b/samples/wqthrash/Makefile
new file mode 100644
index 000000000000..da4c36a2c7d5
--- /dev/null
+++ b/samples/wqthrash/Makefile
@@ -0,0 +1 @@
+obj-m += wqthrash.o
diff --git a/samples/wqthrash/wqthrash.c b/samples/wqthrash/wqthrash.c
new file mode 100644
index 000000000000..b1c52c20fb8f
--- /dev/null
+++ b/samples/wqthrash/wqthrash.c
@@ -0,0 +1,94 @@
+/*
+ * wqthrash.c -- thrash the delayed workqueue rescheduling code
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor,
+ * Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/workqueue.h>
+#include <linux/delay.h>
+#include <linux/sched.h>
+
+#define	ITERATIONS	(102400)
+#define	DELAY		(HZ)
+#define	THRASHERS	(256)
+
+static void
+dummy_work(struct work_struct *unused)
+{
+	schedule_timeout_uninterruptible(1);
+}
+
+static DECLARE_DELAYED_WORK(wqthrash_delayed_work, dummy_work);
+
+static void
+wqthrash_workfunc(struct work_struct *unused)
+{
+	int	i;
+
+	for (i = 0; i < ITERATIONS; ++i) {
+		schedule_delayed_work(&wqthrash_delayed_work, DELAY);
+		cond_resched();
+		mod_delayed_work(system_wq, &wqthrash_delayed_work, DELAY);
+		cond_resched();
+	}
+}
+
+static struct workqueue_struct	*wq;
+static struct work_struct	*tw;
+
+static int
+wqthrash_init(void)
+{
+	int			i;
+
+	wq = alloc_workqueue("wqthrash", WQ_UNBOUND, 0);
+	if (!wq)
+		return -ENOMEM;
+
+	tw = kcalloc(THRASHERS, sizeof(*tw), GFP_KERNEL);
+	if (!tw) {
+		destroy_workqueue(wq);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < THRASHERS; ++i) {
+		INIT_WORK(&tw[i], wqthrash_workfunc);
+		queue_work(wq, &tw[i]);
+	}
+	return 0;
+}
+
+static void
+wqthrash_exit(void)
+{
+	int			i;
+
+	for (i = 0; i < THRASHERS; ++i)
+		flush_work(&tw[i]);
+
+	kfree(tw);
+	destroy_workqueue(wq);
+	cancel_delayed_work_sync(&wqthrash_delayed_work);
+}
+
+module_init(wqthrash_init);
+module_exit(wqthrash_exit);
+MODULE_LICENSE("GPL");
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: timer code oops when calling mod_delayed_work
  2015-11-03 17:55                 ` Jeff Layton
@ 2015-11-03 22:54                   ` Tejun Heo
  2015-11-04  0:06                     ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2015-11-03 22:54 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Chris Worley, linux-kernel, bfields, Michael Skralivetsky,
	Trond Myklebust, Shaohua Li

Hello, Jeff.

On Tue, Nov 03, 2015 at 12:55:04PM -0500, Jeff Layton wrote:
> > Ok, I built a kernel with that patch reverted and that seems to fix the
> > problem.
> > 
> > Looking at the patch, I guess the main difference is that we're no
> > longer using add_timer for unbound workqueue tasks. That said, we
> > should have possession of the PENDING bit before calling add_timer_on.
> > 
> > So...I'm a little stumped. Any thoughts as to where the race could be?
> 
> Ok, I've been able to reproduce this with a small module that I wrote
> as a reproducer:

I think I know what's going on.  Looks like a bug in timer code when
timers are queued on a different cpu than before.  Will experiment
with it a bit and report back.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: timer code oops when calling mod_delayed_work
  2015-11-03 22:54                   ` Tejun Heo
@ 2015-11-04  0:06                     ` Tejun Heo
  2015-11-04 11:48                       ` Jeff Layton
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2015-11-04  0:06 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Chris Worley, linux-kernel, bfields, Michael Skralivetsky,
	Trond Myklebust, Shaohua Li

Hello, Jeff.

Can you please verify whether the following patch fixes the issue?

Thanks.

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 84190f0..566a282 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -970,12 +970,21 @@ EXPORT_SYMBOL(add_timer);
  */
 void add_timer_on(struct timer_list *timer, int cpu)
 {
-	struct tvec_base *base = per_cpu_ptr(&tvec_bases, cpu);
+	struct tvec_base *new_base = per_cpu_ptr(&tvec_bases, cpu);
+	struct tvec_base *base;
 	unsigned long flags;
 
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(timer_pending(timer) || !timer->function);
-	spin_lock_irqsave(&base->lock, flags);
+
+	base = lock_timer_base(timer, &flags);
+	if (base != new_base) {
+		timer->flags |= TIMER_MIGRATING;
+		spin_unlock(&base->lock);
+		base = new_base;
+		spin_lock(&base->lock);
+	}
+
 	timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
 	debug_activate(timer, timer->expires);
 	internal_add_timer(base, timer);

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: timer code oops when calling mod_delayed_work
  2015-11-04  0:06                     ` Tejun Heo
@ 2015-11-04 11:48                       ` Jeff Layton
  2015-11-04 17:15                         ` [PATCH] timer: add_timer_on() should perform proper migration Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff Layton @ 2015-11-04 11:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Chris Worley, linux-kernel, bfields, Michael Skralivetsky,
	Trond Myklebust, Shaohua Li

On Tue, 3 Nov 2015 19:06:58 -0500
Tejun Heo <tj@kernel.org> wrote:

> Hello, Jeff.
> 
> Can you please verify whether the following patch fixes the issue?
> 
> Thanks.
> 
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 84190f0..566a282 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -970,12 +970,21 @@ EXPORT_SYMBOL(add_timer);
>   */
>  void add_timer_on(struct timer_list *timer, int cpu)
>  {
> -	struct tvec_base *base = per_cpu_ptr(&tvec_bases, cpu);
> +	struct tvec_base *new_base = per_cpu_ptr(&tvec_bases, cpu);
> +	struct tvec_base *base;
>  	unsigned long flags;
>  
>  	timer_stats_timer_set_start_info(timer);
>  	BUG_ON(timer_pending(timer) || !timer->function);
> -	spin_lock_irqsave(&base->lock, flags);
> +
> +	base = lock_timer_base(timer, &flags);
> +	if (base != new_base) {
> +		timer->flags |= TIMER_MIGRATING;
> +		spin_unlock(&base->lock);
> +		base = new_base;
> +		spin_lock(&base->lock);
> +	}
> +
>  	timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
>  	debug_activate(timer, timer->expires);
>  	internal_add_timer(base, timer);

Yes, thanks! That fixes my reproducer and Chris also said he was unable
to reproduce the bug after applying that patch. You can add this if you
like:

   Tested-by: Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] timer: add_timer_on() should perform proper migration
  2015-11-04 11:48                       ` Jeff Layton
@ 2015-11-04 17:15                         ` Tejun Heo
  2015-11-04 19:27                           ` [tip:timers/urgent] timers: Use proper base migration in add_timer_on() tip-bot for Tejun Heo
  2015-11-04 19:35                           ` [PATCH] timer: add_timer_on() should perform proper migration Thomas Gleixner
  0 siblings, 2 replies; 17+ messages in thread
From: Tejun Heo @ 2015-11-04 17:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Chris Worley, linux-kernel, bfields, Michael Skralivetsky,
	Trond Myklebust, Shaohua Li, Jeff Layton, kernel-team

Regardless of the previous CPU a timer was on, add_timer_on()
currently simply sets timer->flags to the new CPU.  As the caller must
be seeing the timer as idle, this is locally fine, but the timer
leaving the old base while unlocked can lead to race conditions as
follows.

Let's say timer was on cpu 0.

  cpu 0					cpu 1
  -----------------------------------------------------------------------------
  del_timer(timer) succeeds
					del_timer(timer)
					  lock_timer_base(timer) locks cpu_0_base
  add_timer_on(timer, 1)
    spin_lock(&cpu_1_base->lock)
    timer->flags set to cpu_1_base
    operates on @timer			  operates on @timer

This triggered with mod_delayed_work_on() which contains
"if (del_timer()) add_timer_on()" sequence eventually leading to the
following oops.

  BUG: unable to handle kernel NULL pointer dereference at           (null)
  IP: [<ffffffff810ca6e9>] detach_if_pending+0x69/0x1a0
  ...
  Workqueue: wqthrash wqthrash_workfunc [wqthrash]
  task: ffff8800172ca680 ti: ffff8800172d0000 task.ti: ffff8800172d0000
  RIP: 0010:[<ffffffff810ca6e9>]  [<ffffffff810ca6e9>] detach_if_pending+0x69/0x1a0
  ...
  Call Trace:
   [<ffffffff810cb0b4>] del_timer+0x44/0x60
   [<ffffffff8106e836>] try_to_grab_pending+0xb6/0x160
   [<ffffffff8106e913>] mod_delayed_work_on+0x33/0x80
   [<ffffffffa0000081>] wqthrash_workfunc+0x61/0x90 [wqthrash]
   [<ffffffff8106dba8>] process_one_work+0x1e8/0x650
   [<ffffffff8106e05e>] worker_thread+0x4e/0x450
   [<ffffffff810746af>] kthread+0xef/0x110
   [<ffffffff8185980f>] ret_from_fork+0x3f/0x70

Fix it by updating add_timer_on() to perform proper migration as
__mod_timer() does.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Jeff Layton <jlayton@poochiereds.net>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/g/20151029103113.2f893924@tlielax.poochiereds.net
---
 kernel/time/timer.c |   22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -970,13 +970,29 @@ EXPORT_SYMBOL(add_timer);
  */
 void add_timer_on(struct timer_list *timer, int cpu)
 {
-	struct tvec_base *base = per_cpu_ptr(&tvec_bases, cpu);
+	struct tvec_base *new_base = per_cpu_ptr(&tvec_bases, cpu);
+	struct tvec_base *base;
 	unsigned long flags;
 
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(timer_pending(timer) || !timer->function);
-	spin_lock_irqsave(&base->lock, flags);
-	timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
+
+	/*
+	 * If @timer was on a different CPU, it should be migrated with the
+	 * old base locked to prevent other operations proceeding with the
+	 * wrong base locked.  See lock_timer_base().
+	 */
+	base = lock_timer_base(timer, &flags);
+	if (base != new_base) {
+		timer->flags |= TIMER_MIGRATING;
+
+		spin_unlock(&base->lock);
+		base = new_base;
+		spin_lock(&base->lock);
+		WRITE_ONCE(timer->flags,
+			   (timer->flags & ~TIMER_BASEMASK) | cpu);
+	}
+
 	debug_activate(timer, timer->expires);
 	internal_add_timer(base, timer);
 	spin_unlock_irqrestore(&base->lock, flags);

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [tip:timers/urgent] timers: Use proper base migration in add_timer_on()
  2015-11-04 17:15                         ` [PATCH] timer: add_timer_on() should perform proper migration Tejun Heo
@ 2015-11-04 19:27                           ` tip-bot for Tejun Heo
  2015-11-04 19:35                           ` [PATCH] timer: add_timer_on() should perform proper migration Thomas Gleixner
  1 sibling, 0 replies; 17+ messages in thread
From: tip-bot for Tejun Heo @ 2015-11-04 19:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, linux-kernel, chris.worley, michael.skralivetsky, tj,
	jlayton, mingo, trond.myklebust, shli, hpa

Commit-ID:  22b886dd1018093920c4250dee2a9a3cb7cff7b8
Gitweb:     http://git.kernel.org/tip/22b886dd1018093920c4250dee2a9a3cb7cff7b8
Author:     Tejun Heo <tj@kernel.org>
AuthorDate: Wed, 4 Nov 2015 12:15:33 -0500
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 4 Nov 2015 20:23:19 +0100

timers: Use proper base migration in add_timer_on()

Regardless of the previous CPU a timer was on, add_timer_on()
currently simply sets timer->flags to the new CPU.  As the caller must
be seeing the timer as idle, this is locally fine, but the timer
leaving the old base while unlocked can lead to race conditions as
follows.

Let's say timer was on cpu 0.

  cpu 0					cpu 1
  -----------------------------------------------------------------------------
  del_timer(timer) succeeds
					del_timer(timer)
					  lock_timer_base(timer) locks cpu_0_base
  add_timer_on(timer, 1)
    spin_lock(&cpu_1_base->lock)
    timer->flags set to cpu_1_base
    operates on @timer			  operates on @timer

This triggered with mod_delayed_work_on() which contains
"if (del_timer()) add_timer_on()" sequence eventually leading to the
following oops.

  BUG: unable to handle kernel NULL pointer dereference at           (null)
  IP: [<ffffffff810ca6e9>] detach_if_pending+0x69/0x1a0
  ...
  Workqueue: wqthrash wqthrash_workfunc [wqthrash]
  task: ffff8800172ca680 ti: ffff8800172d0000 task.ti: ffff8800172d0000
  RIP: 0010:[<ffffffff810ca6e9>]  [<ffffffff810ca6e9>] detach_if_pending+0x69/0x1a0
  ...
  Call Trace:
   [<ffffffff810cb0b4>] del_timer+0x44/0x60
   [<ffffffff8106e836>] try_to_grab_pending+0xb6/0x160
   [<ffffffff8106e913>] mod_delayed_work_on+0x33/0x80
   [<ffffffffa0000081>] wqthrash_workfunc+0x61/0x90 [wqthrash]
   [<ffffffff8106dba8>] process_one_work+0x1e8/0x650
   [<ffffffff8106e05e>] worker_thread+0x4e/0x450
   [<ffffffff810746af>] kthread+0xef/0x110
   [<ffffffff8185980f>] ret_from_fork+0x3f/0x70

Fix it by updating add_timer_on() to perform proper migration as
__mod_timer() does.

Reported-and-tested-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Worley <chris.worley@primarydata.com>
Cc: bfields@fieldses.org
Cc: Michael Skralivetsky <michael.skralivetsky@primarydata.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: kernel-team@fb.com
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20151029103113.2f893924@tlielax.poochiereds.net
Link: http://lkml.kernel.org/r/20151104171533.GI5749@mtj.duckdns.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/timer.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 74591ba..bbc5d11 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -977,13 +977,29 @@ EXPORT_SYMBOL(add_timer);
  */
 void add_timer_on(struct timer_list *timer, int cpu)
 {
-	struct tvec_base *base = per_cpu_ptr(&tvec_bases, cpu);
+	struct tvec_base *new_base = per_cpu_ptr(&tvec_bases, cpu);
+	struct tvec_base *base;
 	unsigned long flags;
 
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(timer_pending(timer) || !timer->function);
-	spin_lock_irqsave(&base->lock, flags);
-	timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
+
+	/*
+	 * If @timer was on a different CPU, it should be migrated with the
+	 * old base locked to prevent other operations proceeding with the
+	 * wrong base locked.  See lock_timer_base().
+	 */
+	base = lock_timer_base(timer, &flags);
+	if (base != new_base) {
+		timer->flags |= TIMER_MIGRATING;
+
+		spin_unlock(&base->lock);
+		base = new_base;
+		spin_lock(&base->lock);
+		WRITE_ONCE(timer->flags,
+			   (timer->flags & ~TIMER_BASEMASK) | cpu);
+	}
+
 	debug_activate(timer, timer->expires);
 	internal_add_timer(base, timer);
 	spin_unlock_irqrestore(&base->lock, flags);

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] timer: add_timer_on() should perform proper migration
  2015-11-04 17:15                         ` [PATCH] timer: add_timer_on() should perform proper migration Tejun Heo
  2015-11-04 19:27                           ` [tip:timers/urgent] timers: Use proper base migration in add_timer_on() tip-bot for Tejun Heo
@ 2015-11-04 19:35                           ` Thomas Gleixner
  2015-11-04 19:43                             ` Tejun Heo
  1 sibling, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2015-11-04 19:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Chris Worley, linux-kernel, bfields, Michael Skralivetsky,
	Trond Myklebust, Shaohua Li, Jeff Layton, kernel-team

Tejun,

On Wed, 4 Nov 2015, Tejun Heo wrote:

> Regardless of the previous CPU a timer was on, add_timer_on()
> currently simply sets timer->flags to the new CPU.  As the caller must
> be seeing the timer as idle, this is locally fine, but the timer
> leaving the old base while unlocked can lead to race conditions as
> follows.

nice detective work. This has been there forever. I really wonder why
nobody ever triggered this before.

@stable: 

The patch does only apply to kernels >= 4.2. Backport for older
kernels is below.

Thanks,

	tglx

----------->

--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -966,13 +966,26 @@ EXPORT_SYMBOL(add_timer);
  */
 void add_timer_on(struct timer_list *timer, int cpu)
 {
-	struct tvec_base *base = per_cpu(tvec_bases, cpu);
+	struct tvec_base *new_base = per_cpu(tvec_bases, cpu);
+	struct tvec_base *base;
 	unsigned long flags;
 
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(timer_pending(timer) || !timer->function);
-	spin_lock_irqsave(&base->lock, flags);
-	timer_set_base(timer, base);
+
+	/*
+	 * If @timer was on a different CPU, it should be migrated with the
+	 * old base locked to prevent other operations proceeding with the
+	 * wrong base locked.  See lock_timer_base().
+	 */
+	base = lock_timer_base(timer, &flags);
+	if (base != new_base) {
+		timer_set_base(timer, NULL);
+		spin_unlock(&base->lock);
+		base = new_base;
+		spin_lock(&base->lock);
+		timer_set_base(timer, base);
+	}
 	debug_activate(timer, timer->expires);
 	internal_add_timer(base, timer);
 	spin_unlock_irqrestore(&base->lock, flags);



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] timer: add_timer_on() should perform proper migration
  2015-11-04 19:35                           ` [PATCH] timer: add_timer_on() should perform proper migration Thomas Gleixner
@ 2015-11-04 19:43                             ` Tejun Heo
  0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2015-11-04 19:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Chris Worley, linux-kernel, bfields, Michael Skralivetsky,
	Trond Myklebust, Shaohua Li, Jeff Layton, kernel-team

Hello, Thomas.

On Wed, Nov 04, 2015 at 08:35:00PM +0100, Thomas Gleixner wrote:
> nice detective work. This has been there forever. I really wonder why
> nobody ever triggered this before.

I guess there weren't enough concurrent del/add_timer_on() combos.
Workqueue recently switched to always use add_timer_on() from its
timer function making that a lot more frequent for heavy
mod_delayed_work_on() users and that's when it started surfacing.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2015-11-04 19:43 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-29 14:31 timer code oops when calling mod_delayed_work Jeff Layton
2015-10-29 17:58 ` Jeff Layton
2015-10-31  2:00   ` Tejun Heo
2015-10-31 11:34     ` Jeff Layton
2015-10-31 21:31       ` Tejun Heo
2015-10-31 21:54         ` Jeff Layton
2015-11-02 19:48           ` Chris Worley
2015-11-02 19:56             ` Jeff Layton
2015-11-03  1:33               ` Jeff Layton
2015-11-03 17:55                 ` Jeff Layton
2015-11-03 22:54                   ` Tejun Heo
2015-11-04  0:06                     ` Tejun Heo
2015-11-04 11:48                       ` Jeff Layton
2015-11-04 17:15                         ` [PATCH] timer: add_timer_on() should perform proper migration Tejun Heo
2015-11-04 19:27                           ` [tip:timers/urgent] timers: Use proper base migration in add_timer_on() tip-bot for Tejun Heo
2015-11-04 19:35                           ` [PATCH] timer: add_timer_on() should perform proper migration Thomas Gleixner
2015-11-04 19:43                             ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).