From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751868AbcBYUcX (ORCPT ); Thu, 25 Feb 2016 15:32:23 -0500
Received: from mail-pf0-f170.google.com ([209.85.192.170]:32949 "EHLO
        mail-pf0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1750754AbcBYUcU (ORCPT );
        Thu, 25 Feb 2016 15:32:20 -0500
Subject: Re: BUG: unable to handle kernel paging request from pty_write [was: Linux 4.4.2]
To: Linus Torvalds
References: <20160217203730.GA14820@kroah.com> <56CED373.9060603@suse.cz> <56CF4A83.3040408@hurleysoftware.com>
Cc: Jiri Slaby , Greg KH , Linux Kernel Mailing List , Andrew Morton , stable , lwn@lwn.net, Steven Rostedt
From: Peter Hurley
Message-ID: <56CF64C9.8050705@hurleysoftware.com>
Date: Thu, 25 Feb 2016 12:32:09 -0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.5.1
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 02/25/2016 11:09 AM, Linus Torvalds wrote:
> On Thu, Feb 25, 2016 at 10:40 AM, Peter Hurley wrote:
>>
>> The crash itself is in try_to_wake_up() (again, assuming the stacktrace is
>> valid).
>
> No, the crash seems to be off in la-la-land

I meant the last-known-good address is try_to_wake_up(); in the same way
that RIP @ 0 crashes, but no one says the crash is @ NULL.

>, judging by the oops:
>
>    IP: [] 0xffff88023fd40000
>
> which isn't kernel code at all. It is close to, but not at, the percpu
> area you point out.

Assuming ffff88023fdc0000 is percpu start for cpu 7 then I'm pretty sure
ffff88023fd40000 is percpu start for cpu 6.

Either way, RIP is almost certainly in the percpu block.

> But yes, the call trace looks accurate and makes sense, we have:
>
>   tty_flip_buffer_push ->
>     (queue_work is inline) ->
>     queue_work_on ->
>     __queue_work ->
>     insert_work ->
>     (wake_up_worker is inlined)
>     wake_up_process -> try_to_wake_up ->
>     *insane non-code address*
>
> but I cannot for the life of me see how we get to an insane address.
> It smells like stack corruption when returning from try_to_wake_up()
> or something like that.
>
> Hmm. Actually, try_to_wake_up() will do several indirect calls
> (task_waking and select_task_rq, and it_func_ptr->fn for tracing), but
> then I'd expect to see try_to_wake_up itself in the stack trace.
> Of course, when you jump to la-la-land, crazy things can happen. And
> that offending IP is at a page boundary, so it might have run some
> random code on the previous page.
>
> Quite frankly, neither ->task_waking() nor ->select_task_rq() look
> very likely.

Agreed, the sched_class indirections do not seem likely.

> But the tracepoint stuff is actually fairly dynamic, and
> does things like
>
>     it_func_ptr = rcu_dereference_sched((tp)->funcs);
>
> to get the function pointer information, so if there is some race in
> there, anything can happen.
>
> Jiri, were you messing around with tracing when this happened? Or
> maybe shutting down CPU's? There was a RCU locking problem with CPU
> shutdown, maybe this is one of the symptoms. The fix for that is
> recent, and not in 4.4.2.
>
> Adding Steven Rostedt to the cc. Steven, does that look like a possible case?
>
> Linus
>