From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932664AbaKSVs4 (ORCPT );
	Wed, 19 Nov 2014 16:48:56 -0500
Received: from mx1.redhat.com ([209.132.183.28]:55213 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1756174AbaKSVsW (ORCPT );
	Wed, 19 Nov 2014 16:48:22 -0500
Date: Wed, 19 Nov 2014 16:47:43 -0500
From: Dave Jones
To: Andy Lutomirski
Cc: Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra
Subject: Re: frequent lockups in 3.18rc4
Message-ID: <20141119214743.GA18883@redhat.com>
Mail-Followup-To: Dave Jones, Andy Lutomirski, Linus Torvalds, Don Zickus,
	Thomas Gleixner, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra
References: <20141118023930.GA2871@redhat.com>
	<20141118145234.GA7487@redhat.com>
	<20141118215540.GD35311@redhat.com>
	<20141119021902.GA14216@redhat.com>
	<20141119145902.GA13387@redhat.com>
	<546D0530.8040800@mit.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <546D0530.8040800@mit.edu>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Nov 19, 2014 at 01:01:36PM -0800, Andy Lutomirski wrote:

> TIF_NOHZ is not the same thing as NOHZ. Can you try a kernel with
> CONFIG_CONTEXT_TRACKING=n? Doing that may involve fiddling with RCU
> settings a bit. The normal no HZ idle stuff has nothing to do with
> TIF_NOHZ, and you either have TIF_NOHZ set or you have some kind of
> thread_info corruption going on here.

I'll try that next.

> > RSP: 0018:ffff880192d2fee8 EFLAGS: 00000246
> > RAX: 0000000000000000 RBX: 0000000100000046 RCX: 000000336ee35b47
> >                                   ^^^^^^^^^
>
> That is a strange coincidence. Where did 0x46 | (1<<32) come from?
> That's a sensible interrupts-disabled flags value with the high part set
> to 0x1.
> Those high bits are undefined, but they ought to all be zero.

This box is usually pretty solid, but it's been in service as a 24/7
fuzzing box for over a year now, so it's not outside the realm of
possibility that this could all be a hardware fault if some memory has
gone bad or the like. Unless we find something obvious in the next few
days, I'll try running memtest over the weekend (though I've seen
situations where that doesn't stress hardware enough to manifest a
problem, so it might not be entirely conclusive unless it actually finds
a fault).

I wish I had a second identical box to see if it would be reproducible.

> > [] perf_read+0x226/0x370
> > [] ? security_file_permission+0x87/0xa0
> > [] vfs_read+0x9f/0x180
> > [] SyS_read+0x58/0xd0
> > [] tracesys_phase2+0xd4/0xd9
>
> Riddle me this: what are we doing in tracesys_phase2? This is a full
> slow-path syscall. TIF_NOHZ doesn't cause that, I think. I'd love to
> see the value of ti->flags here.

Is trinity using ptrace? That's one of the few syscalls we actually
blacklist (mostly because it requires some more thinking: just passing
it crap can get the fuzzer into a confused state where it thinks child
processes are dead, when they aren't, etc). So it shouldn't be calling
ptrace ever.

	Dave
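
For anyone following along, the RBX coincidence Andy points out can be
checked mechanically. The sketch below is not from the thread: it assumes
only the architectural x86 flag bit positions (IF is bit 9, ZF is bit 6,
and so on), and the helper name decode_flags is made up for illustration.
It splits a 64-bit register value into its low and high halves and lists
which flag bits are set in the low half.

```python
# Architectural x86 flag bit positions. Bit 1 is reserved and always
# reads as 1, so it is deliberately left out of the table.
FLAG_BITS = {
    0: "CF", 2: "PF", 4: "AF", 6: "ZF", 7: "SF",
    8: "TF", 9: "IF", 10: "DF", 11: "OF",
}

IF_MASK = 1 << 9  # interrupt-enable flag


def decode_flags(value):
    """Hypothetical helper: interpret a 64-bit value as a saved flags word."""
    low = value & 0xFFFFFFFF   # the part that holds the defined flag bits
    high = value >> 32         # expected to be zero for a real flags value
    names = [name for bit, name in sorted(FLAG_BITS.items())
             if low & (1 << bit)]
    return {
        "low": low,
        "high": high,
        "set": names,
        "interrupts_enabled": bool(low & IF_MASK),
    }


if __name__ == "__main__":
    rbx = 0x0000000100000046     # RBX from the register dump
    eflags = 0x00000246          # EFLAGS from the same dump
    for label, value in (("RBX", rbx), ("EFLAGS", eflags)):
        info = decode_flags(value)
        print(label, hex(info["low"]), "high:", hex(info["high"]),
              info["set"], "IF set:", info["interrupts_enabled"])
```

Read this way, the low half of RBX differs from the reported EFLAGS only
in the IF bit, which is why it looks like a saved interrupts-disabled
flags word; the stray 1 in the high half is the part that has no obvious
explanation.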