From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757381AbbJ2R6s (ORCPT ); Thu, 29 Oct 2015 13:58:48 -0400 Received: from mail-qg0-f45.google.com ([209.85.192.45]:35297 "EHLO mail-qg0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753160AbbJ2R6p (ORCPT ); Thu, 29 Oct 2015 13:58:45 -0400 Date: Thu, 29 Oct 2015 13:58:36 -0400 From: Jeff Layton To: linux-kernel@vger.kernel.org Cc: Tejun Heo , bfields@fieldses.org, Michael Skralivetsky , Chris Worley , Trond Myklebust Subject: Re: timer code oops when calling mod_delayed_work Message-ID: <20151029135836.02ad9000@synchrony.poochiereds.net> In-Reply-To: <20151029103113.2f893924@tlielax.poochiereds.net> References: <20151029103113.2f893924@tlielax.poochiereds.net> X-Mailer: Claws Mail 3.12.0 (GTK+ 2.24.28; x86_64-redhat-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 29 Oct 2015 10:31:13 -0400 Jeff Layton wrote: > Our QA guys hit the following warning and oops today on a relatively > recent upstream kernel. They were running a fairly busy NFSv3 I/O test > against this server: > > [ 71.074180] NFSD: starting 90-second grace period (net ffffffff81cf3b40) > [ 221.146802] Ebtables v2.0 unregistered > [ 318.909835] ------------[ cut here ]------------ > [ 318.909846] WARNING: CPU: 22 PID: 3402 at kernel/workqueue.c:1444 __queue_delayed_work+0xe7/0x190() > [ 318.909848] Modules linked in: bridge stp llc intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel kvm ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support joydev mei_me mei sb_edac shpchp ipmi_si lpc_ich i2c_i801 edac_core tpm_tis wmi ipmi_msghandler tpm acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c mlx4_en vxlan ip6_udp_tunnel udp_tunnel ast drm_kms_helper ttm crc32c_intel igb drm mlx4_core ptp nvme pps_core dca i2c_algo_bit fjes [last unloaded: iptable_security] > [ 318.909889] CPU: 22 PID: 3402 Comm: nfsd Not tainted 4.3.0-0.rc7.git0.1.fc24.x86_64 #1 > [ 318.909891] Hardware name: Quanta D51B/D51BP (1G LOM), BIOS S2BP3A05 01/29/2015 > [ 318.909892] 0000000000000000 00000000319846b5 ffff880322d0f440 ffffffff813a48cf > [ 318.909895] 0000000000000000 ffff880322d0f478 ffffffff8109f122 0000000000000400 > [ 318.909898] ffff88085f48fc00 ffffffffa03623e0 ffff88085f48fc00 000000000001d4c0 > [ 318.909901] Call Trace: > [ 318.909906] [] dump_stack+0x44/0x55 > [ 318.909909] [] warn_slowpath_common+0x82/0xc0 > [ 318.909913] [] warn_slowpath_null+0x1a/0x20 > [ 318.909915] [] __queue_delayed_work+0xe7/0x190 > [ 318.909917] [] mod_delayed_work_on+0x5e/0xa0 > [ 318.909931] [] prune_cache_entries+0x94/0xb0 [nfsd] > [ 318.909936] [] nfsd_reply_cache_scan+0xe/0x10 [nfsd] > [ 318.909941] [] shrink_slab.part.42+0x1e2/0x380 > [ 318.909944] [] shrink_zone+0x2b6/0x2c0 > [ 318.909946] [] do_try_to_free_pages+0x175/0x400 > [ 318.909949] [] try_to_free_pages+0x10a/0x150 > [ 318.909954] [] __alloc_pages_nodemask+0x581/0x990 > [ 318.909957] [] ? account_page_dirtied+0xd5/0x160 > [ 318.909960] [] alloc_pages_current+0x91/0x100 > [ 318.909963] [] __page_cache_alloc+0xab/0xe0 > [ 318.909965] [] pagecache_get_page+0x84/0x1e0 > [ 318.909967] [] grab_cache_page_write_begin+0x26/0x40 > [ 318.909990] [] xfs_vm_write_begin+0x34/0xe0 [xfs] > [ 318.910005] [] ? xfs_vm_write_end+0x29/0x80 [xfs] > [ 318.910007] [] generic_perform_write+0xce/0x1c0 > [ 318.910023] [] xfs_file_buffered_aio_write+0x127/0x250 [xfs] > [ 318.910028] [] ? nfserrno+0x60/0x60 [nfsd] > [ 318.910031] [] ? __kmalloc+0x1b9/0x260 > [ 318.910034] [] ? rw_copy_check_uvector+0x57/0x110 > [ 318.910047] [] xfs_file_write_iter+0xf5/0x110 [xfs] > [ 318.910050] [] do_iter_readv_writev+0x6c/0xa0 > [ 318.910053] [] do_readv_writev+0x18f/0x230 > [ 318.910065] [] ? xfs_file_buffered_aio_write+0x250/0x250 [xfs] > [ 318.910068] [] ? security_file_open+0x8b/0x90 > [ 318.910070] [] ? do_dentry_open+0x264/0x2f0 > [ 318.910084] [] ? xfs_extent_busy_ag_cmp+0x20/0x20 [xfs] > [ 318.910086] [] vfs_writev+0x39/0x50 > [ 318.910092] [] nfsd_vfs_write+0xbd/0x380 [nfsd] > [ 318.910097] [] nfsd_write+0xd8/0xf0 [nfsd] > [ 318.910103] [] nfsd3_proc_write+0xbc/0x150 [nfsd] > [ 318.910107] [] nfsd_dispatch+0xb8/0x200 [nfsd] > [ 318.910128] [] svc_process_common+0x40c/0x650 [sunrpc] > [ 318.910137] [] svc_process+0xf4/0x180 [sunrpc] > [ 318.910142] [] nfsd+0xef/0x160 [nfsd] > [ 318.910146] [] ? nfsd_destroy+0x60/0x60 [nfsd] > [ 318.910149] [] kthread+0xd8/0xf0 > [ 318.910152] [] ? kthread_worker_fn+0x160/0x160 > [ 318.910155] [] ret_from_fork+0x3f/0x70 > [ 318.910158] [] ? kthread_worker_fn+0x160/0x160 > [ 318.910160] ---[ end trace db375019f3368852 ]--- > [ 318.910169] ------------[ cut here ]------------ > [ 318.910201] kernel BUG at kernel/time/timer.c:977! > [ 318.910226] invalid opcode: 0000 [#1] SMP > [ 318.910249] Modules linked in: bridge stp llc intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel kvm ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support joydev mei_me mei sb_edac shpchp ipmi_si lpc_ich i2c_i801 edac_core tpm_tis wmi ipmi_msghandler tpm acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c mlx4_en vxlan ip6_udp_tunnel udp_tunnel ast drm_kms_helper ttm crc32c_intel igb drm mlx4_core ptp nvme pps_core dca i2c_algo_bit fjes [last unloaded: iptable_security] > [ 318.910588] CPU: 22 PID: 3402 Comm: nfsd Tainted: G W 4.3.0-0.rc7.git0.1.fc24.x86_64 #1 > [ 318.910629] Hardware name: Quanta D51B/D51BP (1G LOM), BIOS S2BP3A05 01/29/2015 > [ 318.910664] task: ffff880322c49e00 ti: ffff880322d0c000 task.ti: ffff880322d0c000 > [ 318.910698] RIP: 0010:[] [] add_timer_on+0xc6/0x110 > [ 318.910736] RSP: 0018:ffff880322d0f458 EFLAGS: 00010082 > [ 318.910761] RAX: 0000000000000000 RBX: 0000000000000016 RCX: 0000000100021e85 > [ 318.910793] RDX: ffffffffa03623c0 RSI: 0000000000000016 RDI: ffffffffa03623e0 > [ 318.910826] RBP: ffff880322d0f488 R08: 000000000000000a R09: 00000000000005f2 > [ 318.910856] R10: 000000000000027c R11: 00000000000005f2 R12: ffff88085f48fc00 > [ 318.910888] R13: ffff88085fd8e080 R14: ffff88085f48fc00 R15: 000000000001d4c0 > [ 318.910922] FS: 0000000000000000(0000) GS:ffff88085fd80000(0000) knlGS:0000000000000000 > [ 318.910956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 318.910981] CR2: 00007f9d6ea41000 CR3: 0000000001c0b000 CR4: 00000000001406e0 > [ 318.911012] Stack: > [ 318.911025] ffff88085f48fc00 0000000000000016 ffff88085f48fc00 ffffffffa03623e0 > [ 318.911069] ffff88085f48fc00 000000000001d4c0 ffff880322d0f4c0 ffffffff810b6749 > [ 318.911111] 000000000001d4c0 ffffffffa03623c0 0000000000000001 ffffffffa03623c0 > [ 318.911153] Call Trace: > [ 318.911170] [] __queue_delayed_work+0x89/0x190 > [ 318.911201] [] mod_delayed_work_on+0x5e/0xa0 > [ 318.911236] [] prune_cache_entries+0x94/0xb0 [nfsd] > [ 318.911267] [] nfsd_reply_cache_scan+0xe/0x10 [nfsd] > [ 318.911299] [] shrink_slab.part.42+0x1e2/0x380 > [ 318.911330] [] shrink_zone+0x2b6/0x2c0 > [ 318.911358] [] do_try_to_free_pages+0x175/0x400 > [ 318.911391] [] try_to_free_pages+0x10a/0x150 > [ 318.911422] [] __alloc_pages_nodemask+0x581/0x990 > [ 318.911454] [] ? account_page_dirtied+0xd5/0x160 > [ 318.911483] [] alloc_pages_current+0x91/0x100 > [ 318.911513] [] __page_cache_alloc+0xab/0xe0 > [ 318.911543] [] pagecache_get_page+0x84/0x1e0 > [ 318.911573] [] grab_cache_page_write_begin+0x26/0x40 > [ 318.911620] [] xfs_vm_write_begin+0x34/0xe0 [xfs] > [ 318.911661] [] ? xfs_vm_write_end+0x29/0x80 [xfs] > [ 318.912748] [] generic_perform_write+0xce/0x1c0 > [ 318.913810] [] xfs_file_buffered_aio_write+0x127/0x250 [xfs] > [ 318.914858] [] ? nfserrno+0x60/0x60 [nfsd] > [ 318.915715] [] ? __kmalloc+0x1b9/0x260 > [ 318.916360] [] ? rw_copy_check_uvector+0x57/0x110 > [ 318.917014] [] xfs_file_write_iter+0xf5/0x110 [xfs] > [ 318.917650] [] do_iter_readv_writev+0x6c/0xa0 > [ 318.918312] [] do_readv_writev+0x18f/0x230 > [ 318.918954] [] ? xfs_file_buffered_aio_write+0x250/0x250 [xfs] > [ 318.919575] [] ? security_file_open+0x8b/0x90 > [ 318.920178] [] ? do_dentry_open+0x264/0x2f0 > [ 318.920771] [] ? xfs_extent_busy_ag_cmp+0x20/0x20 [xfs] > [ 318.921354] [] vfs_writev+0x39/0x50 > [ 318.921961] [] nfsd_vfs_write+0xbd/0x380 [nfsd] > [ 318.922509] [] nfsd_write+0xd8/0xf0 [nfsd] > [ 318.923045] [] nfsd3_proc_write+0xbc/0x150 [nfsd] > [ 318.923574] [] nfsd_dispatch+0xb8/0x200 [nfsd] > [ 318.924111] [] svc_process_common+0x40c/0x650 [sunrpc] > [ 318.924633] [] svc_process+0xf4/0x180 [sunrpc] > [ 318.925144] [] nfsd+0xef/0x160 [nfsd] > [ 318.925654] [] ? nfsd_destroy+0x60/0x60 [nfsd] > [ 318.926163] [] kthread+0xd8/0xf0 > [ 318.926668] [] ? kthread_worker_fn+0x160/0x160 > [ 318.927189] [] ret_from_fork+0x3f/0x70 > [ 318.927685] [] ? kthread_worker_fn+0x160/0x160 > [ 318.928210] Code: 4d 85 c0 75 e4 48 89 de 4c 89 ef e8 55 f9 ff ff 48 8b 34 24 4c 89 ef e8 19 72 67 00 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 48 83 7f 38 00 48 8b 45 08 0f 85 5f ff ff ff 48 89 47 38 > [ 318.929315] RIP [] add_timer_on+0xc6/0x110 > [ 318.930019] RSP > > Basically the shrinker kicked in. After we shrank a bunch of cache > entries, we tried to mod_delayed_work on the cache cleaner job so that > it wouldn't run too soon, and that's where the code oopsed. > > Is our usage of mod_delayed_work in prune_cache_entries() wrong in some > way? Or is there some breakage in the workqueue or timer code? > Ok, a bit more analysis... We did get a vmcore from this. Here is the "cache_cleaner" delayed work: crash> p cache_cleaner cache_cleaner = $12 = { work = { data = { counter = 0xfffffffe1 }, entry = { next = 0xffffffffa03623c8 , prev = 0xffffffffa03623c8 }, func = 0xffffffffa03333c0 }, timer = { entry = { next = 0x0, pprev = 0xffff88085fd0eaf8 }, expires = 0x100021e99, function = 0xffffffff810b66a0 , data = 0xffffffffa03623c0, flags = 0x200014, slack = 0xffffffff, start_pid = 0x0, start_site = 0x0, start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000" }, wq = 0xffff88085f48fc00, cpu = 0x14 } So the PENDING bit is set (lowest bit in data.counter), and timer->entry.pprev pprev pointer is not NULL (so timer_pending is true). I also see that there are several nfsd threads running the shrinker at the same time. There is one potential problem that I see, but I'd appreciate someone sanity checking me on this. Here is mod_delayed_work_on: ------------------[snip]------------------------ bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq, struct delayed_work *dwork, unsigned long delay) { unsigned long flags; int ret; do { ret = try_to_grab_pending(&dwork->work, true, &flags); } while (unlikely(ret == -EAGAIN)); if (likely(ret >= 0)) { __queue_delayed_work(cpu, wq, dwork, delay); local_irq_restore(flags); } /* -ENOENT from try_to_grab_pending() becomes %true */ return ret; } ------------------[snip]------------------------ ...and here is the beginning of try_to_grab_pending: ------------------[snip]------------------------ /* try to steal the timer if it exists */ if (is_dwork) { struct delayed_work *dwork = to_delayed_work(work); /* * dwork->timer is irqsafe. If del_timer() fails, it's * guaranteed that the timer is not queued anywhere and not * running on the local CPU. */ if (likely(del_timer(&dwork->timer))) return 1; } /* try to claim PENDING the normal way */ if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) return 0; ------------------[snip]------------------------ ...so if del_timer returns true, we'll return 1 from try_to_grab_pending without actually setting the WORK_STRUCT_PENDING_BIT, and then will end up calling __queue_delayed_work. That seems wrong to me -- shouldn't we be ensuring that that bit is set when returning 1 from try_to_grab_pending to guard against concurrent queue_delayed_work_on calls? -- Jeff Layton