From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752973AbcBCKmr (ORCPT <rfc822;w@1wt.eu>);
	Wed, 3 Feb 2016 05:42:47 -0500
Received: from www.linutronix.de ([62.245.132.108]:56904 "EHLO
	Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750722AbcBCKmo (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Wed, 3 Feb 2016 05:42:44 -0500
Date: Wed, 3 Feb 2016 11:41:26 +0100 (CET)
From: Thomas Gleixner <tglx@linutronix.de>
To: Jiri Slaby <jslaby@suse.cz>
cc: Petr Mladek <pmladek@suse.com>, Jan Kara <jack@suse.cz>,
        Ben Hutchings <ben@decadent.org.uk>, Tejun Heo <tj@kernel.org>,
        Sasha Levin <sasha.levin@oracle.com>, Shaohua Li <shli@fb.com>,
        LKML <linux-kernel@vger.kernel.org>, stable@vger.kernel.org,
        Daniel Bilik <daniel.bilik@neosystem.cz>
Subject: Re: Crashes with 874bbfe600a6 in 3.18.25
In-Reply-To: <56B1C9E4.4020400@suse.cz>
Message-ID: <alpine.DEB.2.11.1602031133230.25254@nanos>
References: <20160120211926.GJ10810@quack.suse.cz> <20160120213901.GA755895@devbig084.prn1.facebook.com> <20160121095234.GN10810@quack.suse.cz> <56A1817C.10300@oracle.com> <20160122160903.GH32380@htj.duckdns.org> <1453515623.3734.156.camel@decadent.org.uk>
 <alpine.DEB.2.11.1601231710210.3886@nanos> <20160126093400.GV24938@quack.suse.cz> <20160126111438.GA731@pathway.suse.cz> <alpine.DEB.2.11.1601261352010.3886@nanos> <56B1C9E4.4020400@suse.cz>
User-Agent: Alpine 2.11 (DEB 23 2013-08-11)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Linutronix-Spam-Score: -1.0
X-Linutronix-Spam-Level: -
X-Linutronix-Spam-Status: No , -1.0 points, 5.0 required,  ALL_TRUSTED=-1,SHORTCIRCUIT=-0.0001
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 3 Feb 2016, Jiri Slaby wrote:
> On 01/26/2016, 02:09 PM, Thomas Gleixner wrote:
> What happens in later kernels, when the cpu is offlined before the
> delayed_work timer ticks? In stable 3.12, with the patch, this scenario
> results in an oops:
>  #5 [ffff8c03fdd63d80] page_fault at ffffffff81523a88
>     [exception RIP: __queue_work+121]
>     RIP: ffffffff81071989  RSP: ffff8c03fdd63e30  RFLAGS: 00010086
>     RAX: ffff88048b96bc00  RBX: ffff8c03e9bcc800  RCX: ffff880473820478
>     RDX: 0000000000000400  RSI: 0000000000000004  RDI: ffff880473820458
>     RBP: 0000000000000000   R8: ffff8c03fdd71f40   R9: ffff8c03ea4c4002
>     R10: 0000000000000000  R11: 0000000000000005  R12: ffff880473820458
>     R13: 00000000000000a8  R14: 000000000000e328  R15: 00000000000000a8
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #6 [ffff8c03fdd63e68] call_timer_fn at ffffffff81065611
>  #7 [ffff8c03fdd63e98] run_timer_softirq at ffffffff810663b7
>  #8 [ffff8c03fdd63f00] __do_softirq at ffffffff8105e2c5
>  #9 [ffff8c03fdd63f68] call_softirq at ffffffff8152cf9c
> #10 [ffff8c03fdd63f80] do_softirq at ffffffff81004665
> #11 [ffff8c03fdd63fa0] smp_apic_timer_interrupt at ffffffff8152d835
> #12 [ffff8c03fdd63fb0] apic_timer_interrupt at ffffffff8152c2dd
> 
> The CPU was 168, and that one was offlined in the meantime. So
> __queue_work fails at:
>   if (!(wq->flags & WQ_UNBOUND))
>     pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
>   else
>     pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
>     ^^^                           ^^^^ NODE is -1
>       \ pwq is NULL
> 
>   if (last_pool && last_pool != pwq->pool) { <--- BOOM

I don't see how that works on later kernels. If cpu_to_node() returns -1, we
index wq->numa_pwq_tbl[] outside of the array bounds...
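
To illustrate the concern, here is a minimal user-space sketch of a guarded
lookup. It is not the kernel code: the struct layouts, MAX_NODES, the function
name pwq_by_node_checked() and the choice to fall back to a default pwq when
the node is -1 are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)
#define MAX_NODES 4 /* hypothetical table size for this sketch */

/* Simplified stand-ins for the kernel structures (assumed layout). */
struct pool_workqueue {
	int id;
};

struct workqueue {
	struct pool_workqueue *dfl_pwq;                 /* fallback pwq */
	struct pool_workqueue *numa_pwq_tbl[MAX_NODES]; /* indexed by node */
};

/*
 * Guarded lookup: an offlined CPU can map to NUMA_NO_NODE (-1), and
 * using that value as an array index would read before the start of
 * the table. Check for it first and fall back to the default pwq
 * (the fallback policy is an assumption of this sketch).
 */
static struct pool_workqueue *pwq_by_node_checked(struct workqueue *wq, int node)
{
	if (node == NUMA_NO_NODE)
		return wq->dfl_pwq;

	/* Any other value must be a valid index into the table. */
	assert(node >= 0 && node < MAX_NODES);
	return wq->numa_pwq_tbl[node];
}
```

With a check like this, a lookup for node -1 yields the fallback pwq instead
of whatever happens to sit in front of the table, so the later pwq->pool
dereference does not operate on garbage or NULL.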

Thanks,

	tglx