Subject: Re: NMI watchdog triggering during load_balance
From: Mike Galbraith
To: David Ahern
Cc: Peter Zijlstra, Ingo Molnar, LKML
Date: Fri, 06 Mar 2015 05:52:39 +0100
Message-ID: <1425617559.16821.36.camel@gmx.de>
In-Reply-To: <54F92788.6010007@oracle.com>
References: <54F92788.6010007@oracle.com>

On Thu, 2015-03-05 at 21:05 -0700, David Ahern wrote:
> Hi Peter/Mike/Ingo:
>
> I've been banging my head against this wall for a week now and hoping
> you or someone could shed some light on the problem.
>
> On larger systems (256 to 1024 cpus) there are several use cases (e.g.,
> http://www.cs.virginia.edu/stream/) that regularly trigger the NMI
> watchdog with the stack trace:
>
> Call Trace:
>  @ [000000000045d3d0] double_rq_lock+0x4c/0x68
>  @ [00000000004699c4] load_balance+0x278/0x740
>  @ [00000000008a7b88] __schedule+0x378/0x8e4
>  @ [00000000008a852c] schedule+0x68/0x78
>  @ [000000000042c82c] cpu_idle+0x14c/0x18c
>  @ [00000000008a3a50] after_lock_tlb+0x1b4/0x1cc
>
> Capturing data for all CPUs, I tend to see load_balance related stack
> traces on 700-800 cpus, with a few hundred blocked on
> _raw_spin_trylock_bh.
>
> I originally thought it was a deadlock in the rq locking, but if I bump
> the watchdog timeout the system eventually recovers (after 10-30+
> seconds of unresponsiveness), so it does not seem likely to be a
> deadlock.
>
> This particular system has 1024 cpus:
> # lscpu
> Architecture:          sparc64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Big Endian
> CPU(s):                1024
> On-line CPU(s) list:   0-1023
> Thread(s) per core:    8
> Core(s) per socket:    4
> Socket(s):             32
> NUMA node(s):          4
> NUMA node0 CPU(s):     0-255
> NUMA node1 CPU(s):     256-511
> NUMA node2 CPU(s):     512-767
> NUMA node3 CPU(s):     768-1023
>
> and there are 4 scheduling domains. An example of the domain debug
> output (condensed for the email):
>
> CPU970 attaching sched-domain:
>  domain 0: span 968-975 level SIBLING
>   groups: 8 single CPU groups
>   domain 1: span 968-975 level MC
>    groups: 1 group with 8 cpus
>    domain 2: span 768-1023 level CPU
>     groups: 4 groups with 256 cpus per group

Wow, that topology is horrid.  I'm not surprised that your box is
writhing in agony.  Can you twiddle that?

	-Mike
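
[For readers following the trace: load_balance() pulling tasks ends up taking
two runqueue locks at once via double_rq_lock(), which acquires them in a fixed
(address) order so concurrent balancers cannot deadlock; with 256-cpu groups,
hundreds of idle CPUs end up serializing on the same handful of lock pairs.
The user-space program below is only an illustrative sketch of that ordering
and contention pattern, not kernel code; struct names, counters, and thread
counts are invented for the demo.]

/*
 * Sketch of double_rq_lock()-style ordered locking, in user space.
 * Build with:  cc -O2 -pthread demo.c -o demo
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

struct fake_rq {                       /* stand-in for a per-CPU runqueue */
	pthread_mutex_t lock;
	long nr_running;
};

static struct fake_rq rq_a = { PTHREAD_MUTEX_INITIALIZER, 0 };
static struct fake_rq rq_b = { PTHREAD_MUTEX_INITIALIZER, 0 };

/* Take both locks in address order so two threads locking the
 * same pair in opposite argument order cannot ABBA-deadlock. */
static void double_lock(struct fake_rq *rq1, struct fake_rq *rq2)
{
	if (rq1 == rq2) {
		pthread_mutex_lock(&rq1->lock);
		return;
	}
	if ((uintptr_t)rq1 < (uintptr_t)rq2) {
		pthread_mutex_lock(&rq1->lock);
		pthread_mutex_lock(&rq2->lock);
	} else {
		pthread_mutex_lock(&rq2->lock);
		pthread_mutex_lock(&rq1->lock);
	}
}

static void double_unlock(struct fake_rq *rq1, struct fake_rq *rq2)
{
	pthread_mutex_unlock(&rq1->lock);
	if (rq1 != rq2)
		pthread_mutex_unlock(&rq2->lock);
}

/* Each thread plays the role of an idle CPU repeatedly "balancing"
 * between the same two runqueues; they all serialize on one lock pair. */
static void *balancer(void *arg)
{
	(void)arg;
	for (int i = 0; i < 100000; i++) {
		double_lock(&rq_a, &rq_b);
		rq_a.nr_running++;     /* pretend to pull a task over */
		rq_b.nr_running--;
		double_unlock(&rq_a, &rq_b);
	}
	return NULL;
}

int main(void)
{
	pthread_t t[8];                /* imagine 256 of these per group */

	for (int i = 0; i < 8; i++)
		pthread_create(&t[i], NULL, balancer, NULL);
	for (int i = 0; i < 8; i++)
		pthread_join(t[i], NULL);

	printf("done: rq_a=%ld rq_b=%ld\n", rq_a.nr_running, rq_b.nr_running);
	return 0;
}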