From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752973AbcBCKmr (ORCPT ); Wed, 3 Feb 2016 05:42:47 -0500
Received: from www.linutronix.de ([62.245.132.108]:56904 "EHLO
	Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750722AbcBCKmo (ORCPT );
	Wed, 3 Feb 2016 05:42:44 -0500
Date: Wed, 3 Feb 2016 11:41:26 +0100 (CET)
From: Thomas Gleixner
To: Jiri Slaby
cc: Petr Mladek, Jan Kara, Ben Hutchings, Tejun Heo, Sasha Levin,
	Shaohua Li, LKML, stable@vger.kernel.org, Daniel Bilik
Subject: Re: Crashes with 874bbfe600a6 in 3.18.25
In-Reply-To: <56B1C9E4.4020400@suse.cz>
Message-ID:
References: <20160120211926.GJ10810@quack.suse.cz>
	<20160120213901.GA755895@devbig084.prn1.facebook.com>
	<20160121095234.GN10810@quack.suse.cz>
	<56A1817C.10300@oracle.com>
	<20160122160903.GH32380@htj.duckdns.org>
	<1453515623.3734.156.camel@decadent.org.uk>
	<20160126093400.GV24938@quack.suse.cz>
	<20160126111438.GA731@pathway.suse.cz>
	<56B1C9E4.4020400@suse.cz>
User-Agent: Alpine 2.11 (DEB 23 2013-08-11)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Linutronix-Spam-Score: -1.0
X-Linutronix-Spam-Level: -
X-Linutronix-Spam-Status: No , -1.0 points, 5.0 required, ALL_TRUSTED=-1,SHORTCIRCUIT=-0.0001
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 3 Feb 2016, Jiri Slaby wrote:
> On 01/26/2016, 02:09 PM, Thomas Gleixner wrote:
> What happens in later kernels, when the cpu is offlined before the
> delayed_work timer ticks?
> In stable 3.12, with the patch, this scenario
> results in an oops:
>  #5 [ffff8c03fdd63d80] page_fault at ffffffff81523a88
>     [exception RIP: __queue_work+121]
>     RIP: ffffffff81071989  RSP: ffff8c03fdd63e30  RFLAGS: 00010086
>     RAX: ffff88048b96bc00  RBX: ffff8c03e9bcc800  RCX: ffff880473820478
>     RDX: 0000000000000400  RSI: 0000000000000004  RDI: ffff880473820458
>     RBP: 0000000000000000   R8: ffff8c03fdd71f40   R9: ffff8c03ea4c4002
>     R10: 0000000000000000  R11: 0000000000000005  R12: ffff880473820458
>     R13: 00000000000000a8  R14: 000000000000e328  R15: 00000000000000a8
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #6 [ffff8c03fdd63e68] call_timer_fn at ffffffff81065611
>  #7 [ffff8c03fdd63e98] run_timer_softirq at ffffffff810663b7
>  #8 [ffff8c03fdd63f00] __do_softirq at ffffffff8105e2c5
>  #9 [ffff8c03fdd63f68] call_softirq at ffffffff8152cf9c
> #10 [ffff8c03fdd63f80] do_softirq at ffffffff81004665
> #11 [ffff8c03fdd63fa0] smp_apic_timer_interrupt at ffffffff8152d835
> #12 [ffff8c03fdd63fb0] apic_timer_interrupt at ffffffff8152c2dd
>
> The CPU was 168, and that one was offlined in the meantime. So
> __queue_work fails at:
>   if (!(wq->flags & WQ_UNBOUND))
>     pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
>   else
>     pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
>     ^^^                           ^^^^ NODE is -1
>       \ pwq is NULL
>
>   if (last_pool && last_pool != pwq->pool) { <--- BOOM

I don't see how that works on later kernels. If cpu_to_node() returns -1 we
access outside of the array bounds....

Thanks,

	tglx
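
To make the out-of-bounds concern concrete, here is a small userspace model of a node-indexed lookup. It is only a sketch: every model_* name and MODEL_* constant is hypothetical, the table sizes are arbitrary, and none of it is the kernel's workqueue code or a proposed fix. It assumes only what is said above, namely that the CPU-to-node mapping can yield -1 for an offlined CPU, and shows a lookup that refuses such an index instead of reading in front of the array.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only; none of these names are the kernel's definitions. */
#define MODEL_NR_CPUS   4
#define MODEL_NR_NODES  2
#define MODEL_NO_NODE   (-1)

struct model_pwq {
	int pool_id;
};

/* One entry per node, standing in for a node-indexed pwq table. */
static struct model_pwq model_pwqs[MODEL_NR_NODES] = { { 10 }, { 11 } };

/* CPU -> node map; CPU 3 stands in for an offlined CPU with no node. */
static int model_cpu_node[MODEL_NR_CPUS] = { 0, 0, 1, MODEL_NO_NODE };

static int model_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	return model_cpu_node[cpu];
}

/*
 * Checked lookup: returns NULL rather than indexing the table with -1.
 * An unchecked model_pwqs[node] with node == -1 would read memory that
 * lies before the start of the array.
 */
static struct model_pwq *model_pwq_by_node(int node)
{
	if (node < 0 || node >= MODEL_NR_NODES)
		return NULL;
	return &model_pwqs[node];
}

/* Returns the pool id for the CPU, or -1 when no valid pwq exists. */
static int model_pool_for_cpu(int cpu)
{
	struct model_pwq *pwq = model_pwq_by_node(model_cpu_to_node(cpu));

	return pwq ? pwq->pool_id : -1;
}
```

In this model the caller has to handle the NULL case explicitly before dereferencing pwq. Whether a real fix should do that or pick a fallback CPU or node instead is left open here.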