Subject: Re: NMI watchdog triggering during load_balance
From: Mike Galbraith
To: David Ahern
Cc: Peter Zijlstra, Ingo Molnar, LKML
Date: Fri, 06 Mar 2015 05:52:39 +0100
Message-ID: <1425617559.16821.36.camel@gmx.de>
In-Reply-To: <54F92788.6010007@oracle.com>
References: <54F92788.6010007@oracle.com>

On Thu, 2015-03-05 at 21:05 -0700, David Ahern wrote:
> Hi Peter/Mike/Ingo:
>
> I've been banging my head against this wall for a week now and hoping
> you or someone could shed some light on the problem.
>
> On larger systems (256 to 1024 cpus) there are several use cases (e.g.,
> http://www.cs.virginia.edu/stream/) that regularly trigger the NMI
> watchdog with the stack trace:
>
> Call Trace:
>  @ [000000000045d3d0] double_rq_lock+0x4c/0x68
>  @ [00000000004699c4] load_balance+0x278/0x740
>  @ [00000000008a7b88] __schedule+0x378/0x8e4
>  @ [00000000008a852c] schedule+0x68/0x78
>  @ [000000000042c82c] cpu_idle+0x14c/0x18c
>  @ [00000000008a3a50] after_lock_tlb+0x1b4/0x1cc
>
> Capturing data for all CPUs, I tend to see load_balance related stack
> traces on 700-800 cpus, with a few hundred blocked on
> _raw_spin_trylock_bh.
>
> I originally thought it was a deadlock in the rq locking, but if I bump
> the watchdog timeout the system eventually recovers (after 10-30+
> seconds of unresponsiveness), so it does not seem likely to be a
> deadlock.
>
> This particular system has 1024 cpus:
> # lscpu
> Architecture:          sparc64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Big Endian
> CPU(s):                1024
> On-line CPU(s) list:   0-1023
> Thread(s) per core:    8
> Core(s) per socket:    4
> Socket(s):             32
> NUMA node(s):          4
> NUMA node0 CPU(s):     0-255
> NUMA node1 CPU(s):     256-511
> NUMA node2 CPU(s):     512-767
> NUMA node3 CPU(s):     768-1023
>
> and there are 4 scheduling domains. An example of the domain debug
> output (condensed for the email):
>
> CPU970 attaching sched-domain:
>  domain 0: span 968-975 level SIBLING
>   groups: 8 single CPU groups
>   domain 1: span 968-975 level MC
>    groups: 1 group with 8 cpus
>    domain 2: span 768-1023 level CPU
>     groups: 4 groups with 256 cpus per group

Wow, that topology is horrid.  I'm not surprised that your box is
writhing in agony.  Can you twiddle that?

	-Mike
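
[For readers following the trace: load_balance() pulling tasks ends up taking
two runqueue locks at once via double_rq_lock(), which acquires them in a fixed
(address) order so concurrent balancers cannot deadlock; with 256-cpu groups,
hundreds of idle CPUs end up serializing on the same handful of lock pairs.
The user-space program below is only an illustrative sketch of that ordering
and contention pattern, not kernel code; struct names, counters, and thread
counts are invented for the demo.]

/*
 * Sketch of double_rq_lock()-style ordered locking, in user space.
 * Build with:  cc -O2 -pthread demo.c -o demo
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

struct fake_rq {                       /* stand-in for a per-CPU runqueue */
	pthread_mutex_t lock;
	long nr_running;
};

static struct fake_rq rq_a = { PTHREAD_MUTEX_INITIALIZER, 0 };
static struct fake_rq rq_b = { PTHREAD_MUTEX_INITIALIZER, 0 };

/* Take both locks in address order so two threads locking the
 * same pair in opposite argument order cannot ABBA-deadlock. */
static void double_lock(struct fake_rq *rq1, struct fake_rq *rq2)
{
	if (rq1 == rq2) {
		pthread_mutex_lock(&rq1->lock);
		return;
	}
	if ((uintptr_t)rq1 < (uintptr_t)rq2) {
		pthread_mutex_lock(&rq1->lock);
		pthread_mutex_lock(&rq2->lock);
	} else {
		pthread_mutex_lock(&rq2->lock);
		pthread_mutex_lock(&rq1->lock);
	}
}

static void double_unlock(struct fake_rq *rq1, struct fake_rq *rq2)
{
	pthread_mutex_unlock(&rq1->lock);
	if (rq1 != rq2)
		pthread_mutex_unlock(&rq2->lock);
}

/* Each thread plays the role of an idle CPU repeatedly "balancing"
 * between the same two runqueues; they all serialize on one lock pair. */
static void *balancer(void *arg)
{
	(void)arg;
	for (int i = 0; i < 100000; i++) {
		double_lock(&rq_a, &rq_b);
		rq_a.nr_running++;     /* pretend to pull a task over */
		rq_b.nr_running--;
		double_unlock(&rq_a, &rq_b);
	}
	return NULL;
}

int main(void)
{
	pthread_t t[8];                /* imagine 256 of these per group */

	for (int i = 0; i < 8; i++)
		pthread_create(&t[i], NULL, balancer, NULL);
	for (int i = 0; i < 8; i++)
		pthread_join(t[i], NULL);

	printf("done: rq_a=%ld rq_b=%ld\n", rq_a.nr_running, rq_b.nr_running);
	return 0;
}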