From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757034AbaKTPZw (ORCPT );
	Thu, 20 Nov 2014 10:25:52 -0500
Received: from mx1.redhat.com ([209.132.183.28]:60246 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1756195AbaKTPZu (ORCPT );
	Thu, 20 Nov 2014 10:25:50 -0500
Date: Thu, 20 Nov 2014 10:25:09 -0500
From: Dave Jones
To: Andy Lutomirski
Cc: Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra
Subject: Re: frequent lockups in 3.18rc4
Message-ID: <20141120152509.GA5412@redhat.com>
Mail-Followup-To: Dave Jones, Andy Lutomirski, Linus Torvalds, Don Zickus,
	Thomas Gleixner, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra
References: <20141118023930.GA2871@redhat.com>
	<20141118145234.GA7487@redhat.com>
	<20141118215540.GD35311@redhat.com>
	<20141119021902.GA14216@redhat.com>
	<20141119145902.GA13387@redhat.com>
	<546D0530.8040800@mit.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <546D0530.8040800@mit.edu>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Nov 19, 2014 at 01:01:36PM -0800, Andy Lutomirski wrote:

 > TIF_NOHZ is not the same thing as NOHZ. Can you try a kernel with
 > CONFIG_CONTEXT_TRACKING=n? Doing that may involve fiddling with RCU
 > settings a bit. The normal no HZ idle stuff has nothing to do with
 > TIF_NOHZ, and you either have TIF_NOHZ set or you have some kind of
 > thread_info corruption going on here.

Disabling CONTEXT_TRACKING didn't change the problem.
Unfortunately the full trace didn't make it over usb-serial this time. Grr.

Here's what came over serial..

NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [trinity-c35:11634]
CPU: 2 PID: 11634 Comm: trinity-c35 Not tainted 3.18.0-rc5+ #94 [loadavg: 164.79 157.30 155.90 37/409 11893]
task: ffff88014e0d96f0 ti: ffff880220eb4000 task.ti: ffff880220eb4000
RIP: 0010:[] [] copy_user_enhanced_fast_string+0x5/0x10
RSP: 0018:ffff880220eb7ef0 EFLAGS: 00010283
RAX: ffff880220eb4000 RBX: ffffffff887dac64 RCX: 0000000000006a18
RDX: 000000000000e02f RSI: 00007f766f466620 RDI: ffff88016f6a7617
RBP: ffff880220eb7f78 R08: 8000000000000063 R09: 0000000000000004
R10: 0000000000000010 R11: 0000000000000000 R12: ffffffff880bf50d
R13: 0000000000000001 R14: ffff880220eb4000 R15: 0000000000000001
FS: 00007f766f459740(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f766f461000 CR3: 000000018b00e000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 ffffffff882f4225 ffff880183db5a00 0000000001743440 00007f766f0fb000
 fffffffffffffeff 0000000000000000 0000000000008d79 00007f766f45f000
 ffffffff8837adae 00ff880220eb7f38 000000003203f1ac 0000000000000001
Call Trace:
 [] ? SyS_add_key+0xd5/0x240
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] system_call_fastpath+0x12/0x17
Code: 48 ff c6 48 ff c7 ff c9 75 f2 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 1f 00 c3 0f 1f 80 00 00 00 00 0f 1f 00 89 d1 a4 31 c0 0f 1f 00 c3 90 90 90 0f 1f 00 83 fa 08 0f 82 95 00
sending NMI to other CPUs:

Here's a crappy phonecam pic of the screen.
http://codemonkey.org.uk/junk/IMG_4311.jpg

There's a bit of trace missing between the above and what was on
the screen, so we missed some CPUs.

	Dave