From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751894AbaLETbO (ORCPT ); Fri, 5 Dec 2014 14:31:14 -0500
Received: from mail-qg0-f49.google.com ([209.85.192.49]:35168 "EHLO
	mail-qg0-f49.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751739AbaLETbM (ORCPT ); Fri, 5 Dec 2014 14:31:12 -0500
MIME-Version: 1.0
In-Reply-To: <20141205184808.GA2753@redhat.com>
References: <547ccf74.a5198c0a.25de.26d9@mx.google.com>
	<20141201230339.GA20487@ret.masoncoding.com>
	<1417529606.3924.26.camel@maggy.simpson.net>
	<1417540493.21136.3@mail.thefacebook.com>
	<20141203184111.GA32005@redhat.com>
	<20141205171501.GA1320@redhat.com>
	<20141205184808.GA2753@redhat.com>
Date: Fri, 5 Dec 2014 11:31:11 -0800
X-Google-Sender-Auth: xZzv54qy3pdVIxMF1x3ulpMSvlo
Message-ID:
Subject: Re: frequent lockups in 3.18rc4
From: Linus Torvalds
To: Dave Jones , Linus Torvalds , Chris Mason , Mike Galbraith ,
	Ingo Molnar , Peter Zijlstra , =?UTF-8?Q?D=C3=A2niel_Fraga?= ,
	Sasha Levin , "Paul E. McKenney" , Linux Kernel Mailing List
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Dec 5, 2014 at 10:48 AM, Dave Jones wrote:
>
> In the meantime, I rebooted into the same kernel, and ran trinity
> solely doing the lsetxattr syscalls.

Any particular reason for the lsetxattr guess? Just the last call
chain? I don't recognize it from the other traces, but maybe I just
didn't notice.

> The load was a bit lower, so I
> cranked up the number of child processes to 512, and then this
> happened..

Ugh. "dump_trace()" being broken and looping forever? I don't actually
believe it, because this isn't even on the exception stack (well, the
NMI dumper is, but that one worked fine - this is the "nested" dumping
of just the allocation call chain)

Smells like more random callchains to me. Unless this one is repeatable.
Limiting trinity to just lsetxattr is interesting. Did it make things
fail faster?

Linus

> [ 1611.747053] WARNING: CPU: 0 PID: 14810 at kernel/watchdog.c:265 watchdog_overflow_callback+0xd5/0x120()
> [ 1611.747083] Watchdog detected hard LOCKUP on cpu 0
> [ 1611.747389] CPU: 0 PID: 14810 Comm: trinity-c304 Not tainted 3.16.0+ #114
> [ 1611.747544] Call Trace:
> [ removed NMI perf event stack trace ]
> [ 1611.753861] [] is_module_text_address+0x17/0x50
> [ 1611.754734] [] __kernel_text_address+0x58/0x80
> [ 1611.755575] [] print_context_stack+0x8f/0x100
> [ 1611.756410] [] dump_trace+0x140/0x370
> [ 1611.758895] [] save_stack_trace+0x2b/0x50
> [ 1611.759720] [] set_track+0x70/0x140
> [ 1611.760541] [] alloc_debug_processing+0x92/0x118
> [ 1611.761366] [] __slab_alloc+0x45f/0x56f
> [ 1611.765539] [] kmem_cache_alloc+0x1f6/0x270
> [ 1611.767183] [] getname_flags+0x4f/0x1a0
> [ 1611.768004] [] user_path_at_empty+0x45/0xc0
> [ 1611.772129] [] user_path_at+0x11/0x20
> [ 1611.772959] [] SyS_lsetxattr+0x4b/0xf0
> [ 1611.773783] [] system_call_fastpath+0x16/0x1b