From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1161976AbaKNVbe (ORCPT ); Fri, 14 Nov 2014 16:31:34 -0500 Received: from mx1.redhat.com ([209.132.183.28]:38778 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754335AbaKNVbd (ORCPT ); Fri, 14 Nov 2014 16:31:33 -0500 Date: Fri, 14 Nov 2014 16:31:24 -0500 From: Dave Jones To: Linux Kernel Cc: Linus Torvalds Subject: frequent lockups in 3.18rc4 Message-ID: <20141114213124.GB3344@redhat.com> Mail-Followup-To: Dave Jones , Linux Kernel , Linus Torvalds MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org I'm not sure how long this goes back (3.17 was fine afair) but I'm seeing these several times a day lately.. NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c129:25570] irq event stamp: 74224 hardirqs last enabled at (74223): [] restore_args+0x0/0x30 hardirqs last disabled at (74224): [] apic_timer_interrupt+0x6a/0x80 softirqs last enabled at (74222): [] __do_softirq+0x26a/0x6f0 softirqs last disabled at (74209): [] irq_exit+0x13d/0x170 CPU: 3 PID: 25570 Comm: trinity-c129 Not tainted 3.18.0-rc4+ #83 [loadavg: 198.04 186.66 181.58 24/442 26708] task: ffff880213442f00 ti: ffff8801ea714000 task.ti: ffff8801ea714000 RIP: 0010:[] [] generic_exec_single+0xea/0x1d0 RSP: 0018:ffff8801ea717a08 EFLAGS: 00000202 RAX: ffff880213442f00 RBX: ffffffff9c875664 RCX: 0000000000000006 RDX: 0000000000001370 RSI: ffff880213443790 RDI: ffff880213442f00 RBP: ffff8801ea717a68 R08: ffff880242b56690 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801ea717978 R13: ffff880213442f00 R14: ffff8801ea714000 R15: ffff880213442f00 FS: 00007f240994e700(0000) GS:ffff880244600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000004 CR3: 000000019a017000 CR4: 00000000001407e0 DR0: 00007fb3367e0000 DR1: 00007f82542ab000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Stack: ffffffff9ce4c620 0000000000000000 ffffffff9c048b20 ffff8801ea717b18 0000000000000003 0000000052e0da3d ffffffff9cc7ef3c 0000000000000002 ffffffff9c048b20 ffff8801ea717b18 0000000000000001 0000000000000003 Call Trace: [] ? leave_mm+0x210/0x210 [] ? leave_mm+0x210/0x210 [] smp_call_function_single+0x66/0x110 [] ? leave_mm+0x210/0x210 [] smp_call_function_many+0x2f1/0x390 [] flush_tlb_mm_range+0xe0/0x370 [] tlb_flush_mmu_tlbonly+0x42/0x50 [] tlb_finish_mmu+0x45/0x50 [] zap_page_range_single+0x119/0x170 [] unmap_mapping_range+0x140/0x1b0 [] shmem_fallocate+0x43d/0x540 [] ? preempt_count_sub+0xab/0x100 [] ? prepare_to_wait+0x27/0x80 [] ? __sb_start_write+0x103/0x1d0 [] do_fallocate+0x12a/0x1c0 [] SyS_madvise+0x3d3/0x890 [] ? context_tracking_user_exit+0x52/0x260 [] ? syscall_trace_enter_phase2+0x10d/0x3d0 [] tracesys_phase2+0xd4/0xd9 Code: 63 c7 48 89 de 48 89 df 48 c7 c2 c0 50 1d 00 48 03 14 c5 40 b9 f2 9c e8 d5 ea 2b 00 84 c0 74 0b e9 bc 00 00 00 0f 1f 40 00 f3 90 43 18 01 75 f8 31 c0 48 8b 4d c8 65 48 33 0c 25 28 00 00 00 Kernel panic - not syncing: softlockup: hung tasks I've got a local hack to dump loadavg on traces, and as you can see in that example, the machine was really busy, but we were at least making progress before the trace spewed, and the machine rebooted. (I have reboot-on-lockup sysctl set, without it, the machine just wedges indefinitely shortly after the spew). The trace doesn't really enlighten me as to what we should be doing to prevent this though. ideas? I can try to bisect it, but it takes hours before it happens, so it might take days to complete, and the next few weeks are complicated timewise.. Dave