linux-kernel.vger.kernel.org archive mirror
* NMI watchdog triggering during load_balance
@ 2015-03-06  4:05 David Ahern
  2015-03-06  4:52 ` Mike Galbraith
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: David Ahern @ 2015-03-06  4:05 UTC (permalink / raw)
  To: Peter Zijlstra, Mike Galbraith, Ingo Molnar, LKML

Hi Peter/Mike/Ingo:

I've been banging my head against this wall for a week now and am hoping 
you or someone could shed some light on the problem.

On larger systems (256 to 1024 cpus) there are several use cases (e.g., 
http://www.cs.virginia.edu/stream/) that regularly trigger the NMI 
watchdog with the stack trace:

Call Trace:
@  [000000000045d3d0] double_rq_lock+0x4c/0x68
@  [00000000004699c4] load_balance+0x278/0x740
@  [00000000008a7b88] __schedule+0x378/0x8e4
@  [00000000008a852c] schedule+0x68/0x78
@  [000000000042c82c] cpu_idle+0x14c/0x18c
@  [00000000008a3a50] after_lock_tlb+0x1b4/0x1cc

Capturing data for all CPUs I tend to see load_balance related stack 
traces on 700-800 cpus, with a few hundred blocked on _raw_spin_trylock_bh.

I originally thought it was a deadlock in the rq locking, but if I bump 
the watchdog timeout the system eventually recovers (after 10-30+ 
seconds of unresponsiveness) so it does not seem likely to be a deadlock.

This particular system has 1024 cpus:
# lscpu
Architecture:          sparc64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Big Endian
CPU(s):                1024
On-line CPU(s) list:   0-1023
Thread(s) per core:    8
Core(s) per socket:    4
Socket(s):             32
NUMA node(s):          4
NUMA node0 CPU(s):     0-255
NUMA node1 CPU(s):     256-511
NUMA node2 CPU(s):     512-767
NUMA node3 CPU(s):     768-1023

and there are 4 scheduling domains. An example of the domain debug 
output (condensed for the email):

CPU970 attaching sched-domain:
  domain 0: span 968-975 level SIBLING
   groups: 8 single CPU groups
   domain 1: span 968-975 level MC
    groups: 1 group with 8 cpus
    domain 2: span 768-1023 level CPU
     groups: 32 groups with 8 cpus per group
     domain 3: span 0-1023 level NODE
      groups: 4 groups with 256 cpus per group


On an idle system (20 or so non-kernel threads such as mingetty, udev, 
...) perf top shows the task scheduler is consuming significant time:


    PerfTop:  136580 irqs/sec  kernel:99.9%  exact:  0.0% [1000Hz cycles],  (all, 1024 CPUs)
-----------------------------------------------------------------------

     20.22%  [kernel]  [k] find_busiest_group
     16.00%  [kernel]  [k] find_next_bit
      6.37%  [kernel]  [k] ktime_get_update_offsets
      5.70%  [kernel]  [k] ktime_get
...


This is a 2.6.39 kernel (yes, a relatively old one); 3.8 shows similar 
symptoms. 3.18 is much better.

 From what I can tell, load balancing is happening non-stop and there is 
heavy contention on the run queue locks. I instrumented the rq locking 
and under load (e.g., the stream test) regularly see a single rq lock 
held continuously for 2-3 seconds (e.g., at the end of the stream run, 
which has 1024 threads, as the process is terminating).
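
(By "instrumented" I mean something along the lines of the sketch below -- 
a hypothetical illustration rather than my actual patch; lock_acquired_ns 
is a field invented here for the example, not something in struct rq:)

static inline void rq_lock_timed(struct rq *rq)
{
	raw_spin_lock(&rq->lock);
	rq->lock_acquired_ns = sched_clock();
}

static inline void rq_unlock_timed(struct rq *rq)
{
	u64 held = sched_clock() - rq->lock_acquired_ns;

	/* complain about anything held longer than a second */
	if (held > NSEC_PER_SEC)
		trace_printk("cpu%d rq lock held %llu ns\n",
			     cpu_of(rq), held);
	raw_spin_unlock(&rq->lock);
}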

I have been staring at and instrumenting the scheduling code for days. 
It seems like the balancing of domains is regularly lining up on all or 
almost all CPUs, and the NODE domain seems to cause the most damage 
since it scans all cpus (i.e., in rebalance_domains() each domain pass 
triggers a call to load_balance on all cpus at the same time). Just in 
random snapshots during a stream test I have seen one pass through 
rebalance_domains take > 17 seconds (custom tracepoints to tag start and 
end).
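
For reference, the shape of the loop I am describing, condensed from 
2.6.39's kernel/sched_fair.c from memory (serialization, clamping and 
error handling elided, so treat it as a sketch rather than the exact code):

static void rebalance_domains(int cpu, enum cpu_idle_type idle)
{
	struct rq *rq = cpu_rq(cpu);
	struct sched_domain *sd;
	unsigned long interval;
	int balance = 1;

	/* walk every level that spans this cpu: SIBLING, MC, CPU, NODE */
	for_each_domain(cpu, sd) {
		if (!(sd->flags & SD_LOAD_BALANCE))
			continue;

		/* per-domain interval, stretched when the cpu is busy */
		interval = sd->balance_interval;
		if (idle != CPU_IDLE)
			interval *= sd->busy_factor;
		interval = msecs_to_jiffies(interval);

		if (time_after_eq(jiffies, sd->last_balance + interval)) {
			if (load_balance(cpu, rq, sd, idle, &balance))
				idle = CPU_NOT_IDLE;
			sd->last_balance = jiffies;
		}

		/* a lower level can tell us not to bother going higher */
		if (!balance)
			break;
	}
}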

Since each domain is a superset of the lower one, each pass through 
load_balance regularly repeats the processing of the previous domain 
(e.g., the NODE domain repeats the cpus in the CPU domain). Multiply 
that across 1024 cpus and it seems like a lot of duplication.

Does that make sense or am I off in the weeds?

Thanks,
David

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NMI watchdog triggering during load_balance
  2015-03-06  4:05 NMI watchdog triggering during load_balance David Ahern
@ 2015-03-06  4:52 ` Mike Galbraith
  2015-03-06 15:01   ` David Ahern
  2015-03-06  8:51 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Mike Galbraith @ 2015-03-06  4:52 UTC (permalink / raw)
  To: David Ahern; +Cc: Peter Zijlstra, Ingo Molnar, LKML

On Thu, 2015-03-05 at 21:05 -0700, David Ahern wrote:
> Hi Peter/Mike/Ingo:
> 
> I've been banging my head against this wall for a week now and am hoping you 
> or someone could shed some light on the problem.
> 
> On larger systems (256 to 1024 cpus) there are several use cases (e.g., 
> http://www.cs.virginia.edu/stream/) that regularly trigger the NMI 
> watchdog with the stack trace:
> 
> Call Trace:
> @  [000000000045d3d0] double_rq_lock+0x4c/0x68
> @  [00000000004699c4] load_balance+0x278/0x740
> @  [00000000008a7b88] __schedule+0x378/0x8e4
> @  [00000000008a852c] schedule+0x68/0x78
> @  [000000000042c82c] cpu_idle+0x14c/0x18c
> @  [00000000008a3a50] after_lock_tlb+0x1b4/0x1cc
> 
> Capturing data for all CPUs I tend to see load_balance related stack 
> traces on 700-800 cpus, with a few hundred blocked on _raw_spin_trylock_bh.
> 
> I originally thought it was a deadlock in the rq locking, but if I bump 
> the watchdog timeout the system eventually recovers (after 10-30+ 
> seconds of unresponsiveness) so it does not seem likely to be a deadlock.
> 
> This particular system has 1024 cpus:
> # lscpu
> Architecture:          sparc64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Big Endian
> CPU(s):                1024
> On-line CPU(s) list:   0-1023
> Thread(s) per core:    8
> Core(s) per socket:    4
> Socket(s):             32
> NUMA node(s):          4
> NUMA node0 CPU(s):     0-255
> NUMA node1 CPU(s):     256-511
> NUMA node2 CPU(s):     512-767
> NUMA node3 CPU(s):     768-1023
> 
> and there are 4 scheduling domains. An example of the domain debug 
> output (condensed for the email):
> 
> CPU970 attaching sched-domain:
>   domain 0: span 968-975 level SIBLING
>    groups: 8 single CPU groups
>    domain 1: span 968-975 level MC
>     groups: 1 group with 8 cpus
>     domain 2: span 768-1023 level CPU
>      groups: 4 groups with 256 cpus per group

Wow, that topology is horrid.  I'm not surprised that your box is
writhing in agony.  Can you twiddle that?

	-Mike


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NMI watchdog triggering during load_balance
  2015-03-06  4:05 NMI watchdog triggering during load_balance David Ahern
  2015-03-06  4:52 ` Mike Galbraith
@ 2015-03-06  8:51 ` Peter Zijlstra
  2015-03-06 15:03   ` David Ahern
  2015-03-06  9:07 ` Peter Zijlstra
  2015-03-06  9:12 ` Peter Zijlstra
  3 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2015-03-06  8:51 UTC (permalink / raw)
  To: David Ahern; +Cc: Mike Galbraith, Ingo Molnar, LKML

On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:
> Hi Peter/Mike/Ingo:
> 
> Does that make sense or am I off in the weeds?

How much of your story pertains to 3.18? I'm not particularly interested
in anything much older than that.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NMI watchdog triggering during load_balance
  2015-03-06  4:05 NMI watchdog triggering during load_balance David Ahern
  2015-03-06  4:52 ` Mike Galbraith
  2015-03-06  8:51 ` Peter Zijlstra
@ 2015-03-06  9:07 ` Peter Zijlstra
  2015-03-06 15:10   ` David Ahern
  2015-03-06  9:12 ` Peter Zijlstra
  3 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2015-03-06  9:07 UTC (permalink / raw)
  To: David Ahern; +Cc: Mike Galbraith, Ingo Molnar, LKML

On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:
> Since each domain is a superset of the lower one, each pass through
> load_balance regularly repeats the processing of the previous domain (e.g.,
> the NODE domain repeats the cpus in the CPU domain). Multiply that
> across 1024 cpus and it seems like a lot of duplication.

It is, _but_ each domain has an interval, bigger domains _should_ load
balance at a bigger interval (iow lower frequency), and all this is
lockless data gathering, so reusing stuff from the previous round could
be quite stale indeed.
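
To illustrate the interval part, the tail of load_balance() looks roughly 
like this (condensed from memory; the active-balance and pinned-task cases 
are trimmed): a level that keeps finding nothing to do backs off toward 
max_interval, while a level that found an imbalance snaps back to 
min_interval.

	/* we found an imbalance and tried to move something */
	sd->balance_interval = sd->min_interval;
	...

out_balanced:
	sd->nr_balance_failed = 0;

	/* nothing to do at this level: back off, up to the per-domain max */
	if (sd->balance_interval < sd->max_interval)
		sd->balance_interval *= 2;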

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NMI watchdog triggering during load_balance
  2015-03-06  4:05 NMI watchdog triggering during load_balance David Ahern
                   ` (2 preceding siblings ...)
  2015-03-06  9:07 ` Peter Zijlstra
@ 2015-03-06  9:12 ` Peter Zijlstra
  2015-03-06 15:12   ` David Ahern
  3 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2015-03-06  9:12 UTC (permalink / raw)
  To: David Ahern; +Cc: Mike Galbraith, Ingo Molnar, LKML

On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:
> Socket(s):             32
> NUMA node(s):          4

Urgh, with 32 'cpus' per socket, you still do _8_ sockets per node, for
a total of 256 cpus per node.

That's painful. I don't suppose you can really change the hardware, but
that's a 'curious' choice.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NMI watchdog triggering during load_balance
  2015-03-06  4:52 ` Mike Galbraith
@ 2015-03-06 15:01   ` David Ahern
  2015-03-06 18:11     ` Mike Galbraith
  0 siblings, 1 reply; 14+ messages in thread
From: David Ahern @ 2015-03-06 15:01 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, Ingo Molnar, LKML

On 3/5/15 9:52 PM, Mike Galbraith wrote:
>> CPU970 attaching sched-domain:
>>    domain 0: span 968-975 level SIBLING
>>     groups: 8 single CPU groups
>>     domain 1: span 968-975 level MC
>>      groups: 1 group with 8 cpus
>>      domain 2: span 768-1023 level CPU
>>       groups: 4 groups with 256 cpus per group
>
> Wow, that topology is horrid.  I'm not surprised that your box is
> writhing in agony.  Can you twiddle that?
>

twiddle that how?

The system has 4 physical cpus (sockets). Each cpu has 32 cores with 8 
threads per core and each cpu has 4 memory controllers.

If I disable SCHED_MC and CGROUPS_SCHED (group scheduling) there is a 
noticeable improvement -- watchdog does not trigger and I do not get the 
rq locks held for 2-3 seconds. But there is still fairly high cpu usage 
for an idle system. Perhaps I should leave SCHED_MC on and disable 
SCHED_SMT; I'll try that today.
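
For reference, roughly the .config knobs involved, assuming the usual 
symbol names from a 2.6.39-era tree (by "CGROUPS_SCHED" I mean 
CONFIG_CGROUP_SCHED and the FAIR_GROUP_SCHED option under it):

# what was tested above:
# CONFIG_SCHED_MC is not set
# CONFIG_CGROUP_SCHED is not set

# what I'll try today:
CONFIG_SCHED_MC=y
# CONFIG_SCHED_SMT is not set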

Thanks,
David

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NMI watchdog triggering during load_balance
  2015-03-06  8:51 ` Peter Zijlstra
@ 2015-03-06 15:03   ` David Ahern
  0 siblings, 0 replies; 14+ messages in thread
From: David Ahern @ 2015-03-06 15:03 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, Ingo Molnar, LKML

On 3/6/15 1:51 AM, Peter Zijlstra wrote:
> On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:
>> Hi Peter/Mike/Ingo:
>>
>> Does that make sense or am I off in the weeds?
>
> How much of your story pertains to 3.18? I'm not particularly interested
> in anything much older than that.
>

No. All of the data in the opening email are from 2.6.39. Each kernel 
(2.6.39, 3.8 and 3.18) has a different performance problem. I will look 
at 3.18 in depth soon, but from what I can see the fundamental concepts 
of the load balancing have not changed (e.g., my tracepoints from 2.6.39 
still apply to 3.18).

David

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NMI watchdog triggering during load_balance
  2015-03-06  9:07 ` Peter Zijlstra
@ 2015-03-06 15:10   ` David Ahern
  0 siblings, 0 replies; 14+ messages in thread
From: David Ahern @ 2015-03-06 15:10 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, Ingo Molnar, LKML

On 3/6/15 2:07 AM, Peter Zijlstra wrote:
> On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:
>> Since each domain is a superset of the lower one, each pass through
>> load_balance regularly repeats the processing of the previous domain (e.g.,
>> the NODE domain repeats the cpus in the CPU domain). Multiply that
>> across 1024 cpus and it seems like a lot of duplication.
>
> It is, _but_ each domain has an interval, bigger domains _should_ load
> balance at a bigger interval (iow lower frequency), and all this is
> lockless data gathering, so reusing stuff from the previous round could
> be quite stale indeed.
>

Yes, and I have twiddled the intervals. The defaults for min_interval and 
max_interval (msec):

       min_interval  max_interval
SMT         1              2
MC          1              4
CPU         1              4
NODE        8             32

Increasing those values (e.g., moving NODE to 50 and 100) drops idle-time 
cpu usage but does not solve the fundamental problem -- under load the 
balancing of domains seems to be lining up and the system comes to a 
halt in a load-balancing frenzy.

David

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NMI watchdog triggering during load_balance
  2015-03-06  9:12 ` Peter Zijlstra
@ 2015-03-06 15:12   ` David Ahern
  0 siblings, 0 replies; 14+ messages in thread
From: David Ahern @ 2015-03-06 15:12 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, Ingo Molnar, LKML

On 3/6/15 2:12 AM, Peter Zijlstra wrote:
> On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:
>> Socket(s):             32
>> NUMA node(s):          4
>
> Urgh, with 32 'cpus' per socket, you still do _8_ sockets per node, for
> a total of 256 cpus per node.

Per the response to Mike, the system has 4 physical cpus. Each cpu has 
32 cores with 8 threads per core and 4 memory controllers (one mcu per 8 
cores). Yes there are 256 logical cpus per node.

David

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NMI watchdog triggering during load_balance
  2015-03-06 15:01   ` David Ahern
@ 2015-03-06 18:11     ` Mike Galbraith
  2015-03-06 18:37       ` David Ahern
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Galbraith @ 2015-03-06 18:11 UTC (permalink / raw)
  To: David Ahern; +Cc: Peter Zijlstra, Ingo Molnar, LKML

On Fri, 2015-03-06 at 08:01 -0700, David Ahern wrote:
> On 3/5/15 9:52 PM, Mike Galbraith wrote:
> >> CPU970 attaching sched-domain:
> >>    domain 0: span 968-975 level SIBLING
> >>     groups: 8 single CPU groups
> >>     domain 1: span 968-975 level MC
> >>      groups: 1 group with 8 cpus
> >>      domain 2: span 768-1023 level CPU
> >>       groups: 4 groups with 256 cpus per group
> >
> > Wow, that topology is horrid.  I'm not surprised that your box is
> > writhing in agony.  Can you twiddle that?
> >
> 
> twiddle that how?

That was the question, _do_ you have any control, because that topology
is toxic.  I guess your reply means 'nope'.

> The system has 4 physical cpus (sockets). Each cpu has 32 cores with 8 
> threads per core and each cpu has 4 memory controllers.

Thank god I've never met one of these, looks like the box from hell :)

> If I disable SCHED_MC and CGROUPS_SCHED (group scheduling) there is a 
> noticeable improvement -- watchdog does not trigger and I do not get the 
> rq locks held for 2-3 seconds. But there is still fairly high cpu usage 
> for an idle system. Perhaps I should leave SCHED_MC on and disable 
> SCHED_SMT; I'll try that today.

Well, if you disable SMT, your troubles _should_ shrink radically, as
your box does. You should probably look at why you have CPU domains.
You don't ever want to see that on a NUMA box.

	-Mike


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NMI watchdog triggering during load_balance
  2015-03-06 18:11     ` Mike Galbraith
@ 2015-03-06 18:37       ` David Ahern
  2015-03-06 19:29         ` Mike Galbraith
  2015-03-07  9:36         ` Peter Zijlstra
  0 siblings, 2 replies; 14+ messages in thread
From: David Ahern @ 2015-03-06 18:37 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, Ingo Molnar, LKML

On 3/6/15 11:11 AM, Mike Galbraith wrote:
> That was the question, _do_ you have any control, because that topology
> is toxic.  I guess your reply means 'nope'.
>
>> The system has 4 physical cpus (sockets). Each cpu has 32 cores with 8
>> threads per core and each cpu has 4 memory controllers.
>
> Thank god I've never met one of these, looks like the box from hell :)
>
>> If I disable SCHED_MC and CGROUPS_SCHED (group scheduling) there is a
>> noticeable improvement -- watchdog does not trigger and I do not get the
>> rq locks held for 2-3 seconds. But there is still fairly high cpu usage
>> for an idle system. Perhaps I should leave SCHED_MC on and disable
>> SCHED_SMT; I'll try that today.
>
> Well, if you disable SMT, your troubles _should_ shrink radically, as
> your box does. You should probably look at why you have CPU domains.
> You don't ever want to see that on a NUMA box.

In responding earlier today I realized that the topology is all wrong, as 
you were pointing out. There should be 16 NUMA domains (4 memory 
controllers per socket and 4 sockets). There should be 8 sibling cores. 
I will look into why that is not getting set up properly and what we can 
do about fixing it.

--

But, I do not understand how the wrong topology is causing the NMI 
watchdog to trigger. In the end there are still N domains, M groups per 
domain and P cpus per group. Doesn't the balancing walk over all of them 
irrespective of physical topology?

Here's another data point that jelled this morning while explaining the 
problem to someone: the NMI watchdog trips on a mass exit:

TPC: <_raw_spin_trylock_bh+0x38/0x100>
g0: 7fffffffffffffff g1: 00000000000000ff g2: 0000000000070f8c g3: fffe403b97891c98
g4: fffe803b963eda00 g5: 000000010036c000 g6: fffe803b84108000 g7: 0000000000000093
o0: 0000000000000fe0 o1: 0000000000000fe0 o2: ffffff0000000000 o3: 0000000000200200
o4: 0000000000a98080 o5: 0000000000000000 sp: fffe803b8410ada1 ret_pc: 00000000006800dc
RPC: <cpumask_next_and+0x44/0x6c>
l0: 0000000000e9b114 l1: 0000000000000001 l2: 0000000000000001 l3: 0000000000000005
l4: 0000000000002000 l5: fffe803b8410b990 l6: 0000000000000004 l7: 0000000000f267b0
i0: 0000000100b10700 i1: 00000000ffffffff i2: 0000000101324d80 i3: fffe803b8410b6c0
i4: 0000000000000038 i5: 0000000000000498 i6: fffe803b8410ae51 i7: 000000000045dc30
I7: <double_rq_lock+0x4c/0x68>
Call Trace:
  [000000000045dc30] double_rq_lock+0x4c/0x68
  [000000000046a23c] load_balance+0x278/0x740
  [00000000008aa178] __schedule+0x378/0x8e4
  [00000000008aab1c] schedule+0x68/0x78
  [00000000004718ac] do_exit+0x798/0x7c0
  [000000000047195c] do_group_exit+0x88/0xc0
  [0000000000481148] get_signal_to_deliver+0x3ec/0x4c8
  [000000000042cbc0] do_signal+0x70/0x5e4
  [000000000042d14c] do_notify_resume+0x18/0x50
  [00000000004049c4] __handle_signal+0xc/0x2c


For example, the stream program has 1024 threads (1 for each CPU). If you 
ctrl-c the program or wait for it to terminate, that's when it trips. Other 
workloads that routinely trip it are make -j N, N some number (e.g., on 
a 256 cpu system 'make -j 128'); 10 seconds later decide to stop that 
build, ctrl-c ... boom with the above stack trace.

Code wise ... and this is still present in 3.18 and 3.20:

schedule()
- __schedule()
   + irqs disabled: raw_spin_lock_irq(&rq->lock);

      pick_next_task
      - idle_balance()

   + irqs enabled:
     different task: context_switch(rq, prev, next)
                     --> finish_lock_switch eventually
     same task: raw_spin_unlock_irq(&rq->lock)


For 2.6.39 it's the invocation of idle_balance which is triggering load 
balancing with IRQs disabled. That's when the NMI watchdog trips.
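
For concreteness, the 2.6.39 path condensed from kernel/sched.c (from 
memory, so take the details with a grain of salt) -- there idle_balance() 
is called directly from schedule() rather than from inside 
pick_next_task_fair() as in 3.18, but either way it runs under rq->lock 
with IRQs off:

asmlinkage void __sched schedule(void)
{
	...
	raw_spin_lock_irq(&rq->lock);
	...
	pre_schedule(rq, prev);

	if (unlikely(!rq->nr_running))
		idle_balance(cpu, rq);	/* load_balance() runs here, IRQs still off */

	put_prev_task(rq, prev);
	next = pick_next_task(rq);
	...
}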

I'll pound on 3.18 and see if I can reproduce something similar there.

David

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NMI watchdog triggering during load_balance
  2015-03-06 18:37       ` David Ahern
@ 2015-03-06 19:29         ` Mike Galbraith
  2015-03-10  3:06           ` David Ahern
  2015-03-07  9:36         ` Peter Zijlstra
  1 sibling, 1 reply; 14+ messages in thread
From: Mike Galbraith @ 2015-03-06 19:29 UTC (permalink / raw)
  To: David Ahern; +Cc: Peter Zijlstra, Ingo Molnar, LKML

On Fri, 2015-03-06 at 11:37 -0700, David Ahern wrote:

> But, I do not understand how the wrong topology is causing the NMI 
> watchdog to trigger. In the end there are still N domains, M groups per 
> domain and P cpus per group. Doesn't the balancing walk over all of them 
> irrespective of physical topology?

You have this size extra large CPU domain that you shouldn't have,
massive collisions therein ensue.

	-Mike


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NMI watchdog triggering during load_balance
  2015-03-06 18:37       ` David Ahern
  2015-03-06 19:29         ` Mike Galbraith
@ 2015-03-07  9:36         ` Peter Zijlstra
  1 sibling, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2015-03-07  9:36 UTC (permalink / raw)
  To: David Ahern; +Cc: Mike Galbraith, Ingo Molnar, LKML

On Fri, Mar 06, 2015 at 11:37:11AM -0700, David Ahern wrote:
> On 3/6/15 11:11 AM, Mike Galbraith wrote:
> In responding earlier today I realized that the topology is all wrong as you
> were pointing out. There should be 16 NUMA domains (4 memory controllers per
> socket and 4 sockets). There should be 8 sibling cores. I will look into why
> that is not getting setup properly and what we can do about fixing it.

So we changed the numa topology setup a while back; see commit
cb83b629bae0 ("sched/numa: Rewrite the CONFIG_NUMA sched domain
support").

> But, I do not understand how the wrong topology is causing the NMI watchdog
> to trigger. In the end there are still N domains, M groups per domain and P
> cpus per group. Doesn't the balancing walk over all of them irrespective of
> physical topology?

Not quite; so for regular load balancing only the first CPU in the
domain will iterate up.

So if you have 4 'nodes' only 4 CPUs will iterate the entire machine,
not all 1024.
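
The gate for that in 3.18 is should_we_balance() in kernel/sched/fair.c; 
quoting it roughly from memory, so double-check against your tree:

static int should_we_balance(struct lb_env *env)
{
	struct sched_group *sg = env->sd->groups;
	struct cpumask *sg_cpus, *sg_mask;
	int cpu, balance_cpu = -1;

	/* newly idle CPUs are always allowed to pull */
	if (env->idle == CPU_NEWLY_IDLE)
		return 1;

	sg_cpus = sched_group_cpus(sg);
	sg_mask = sched_group_mask(sg);

	/* look for the first idle cpu of the first group */
	for_each_cpu_and(cpu, sg_cpus, env->cpus) {
		if (!cpumask_test_cpu(cpu, sg_mask) || !idle_cpu(cpu))
			continue;
		balance_cpu = cpu;
		break;
	}

	if (balance_cpu == -1)
		balance_cpu = group_balance_cpu(sg);

	/*
	 * Only the first idle cpu (or, failing that, the group's balance
	 * cpu) gets to balance at this level and above.
	 */
	return balance_cpu == env->dst_cpu;
}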



> Call Trace:
>  [000000000045dc30] double_rq_lock+0x4c/0x68
>  [000000000046a23c] load_balance+0x278/0x740
>  [00000000008aa178] __schedule+0x378/0x8e4
>  [00000000008aab1c] schedule+0x68/0x78
>  [00000000004718ac] do_exit+0x798/0x7c0
>  [000000000047195c] do_group_exit+0x88/0xc0
>  [0000000000481148] get_signal_to_deliver+0x3ec/0x4c8
>  [000000000042cbc0] do_signal+0x70/0x5e4
>  [000000000042d14c] do_notify_resume+0x18/0x50
>  [00000000004049c4] __handle_signal+0xc/0x2c
> 
> 
> For example, the stream program has 1024 threads (1 for each CPU). If you
> ctrl-c the program or wait for it to terminate, that's when it trips. Other
> workloads that routinely trip it are make -j N, N some number (e.g., on a
> 256 cpu system 'make -j 128'); 10 seconds later decide to stop that build,
> ctrl-c ... boom with the above stack trace.
> 
> Code wise ... and this is still present in 3.18 and 3.20:
> 
> schedule()
> - __schedule()
>   + irqs disabled: raw_spin_lock_irq(&rq->lock);
> 
>      pick_next_task
>      - idle_balance()

> For 2.6.39 it's the invocation of idle_balance which is triggering load
> balancing with IRQs disabled. That's when the NMI watchdog trips.

So for idle_balance() look at SD_BALANCE_NEWIDLE, only domains with that
set will get iterated.
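
Roughly the relevant part of idle_balance() in 3.18 (from memory; the 
avg_idle / max_newidle_lb_cost cutoffs are elided) -- the point being 
that the flag gates the load_balance() call:

	rcu_read_lock();
	for_each_domain(this_cpu, sd) {
		int continue_balancing = 1;

		if (!(sd->flags & SD_LOAD_BALANCE))
			continue;

		if (sd->flags & SD_BALANCE_NEWIDLE)
			pulled_task = load_balance(this_cpu, this_rq,
						   sd, CPU_NEWLY_IDLE,
						   &continue_balancing);

		/* stop once something was pulled or work showed up */
		if (pulled_task || this_rq->nr_running > 0)
			break;
	}
	rcu_read_unlock();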

I suppose you could try something like the below on 3.18.

It will disable SD_BALANCE_NEWIDLE on all 'distant' nodes; but first
check how your fixed numa topology looks and whether you trigger that
case at all.

---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 17141da77c6e..7fce683928fe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6268,6 +6268,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
+				       SD_BALANCE_NEWIDLE |
 				       SD_WAKE_AFFINE);
 		}
 


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: NMI watchdog triggering during load_balance
  2015-03-06 19:29         ` Mike Galbraith
@ 2015-03-10  3:06           ` David Ahern
  0 siblings, 0 replies; 14+ messages in thread
From: David Ahern @ 2015-03-10  3:06 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, Ingo Molnar, LKML

On 3/6/15 12:29 PM, Mike Galbraith wrote:
> On Fri, 2015-03-06 at 11:37 -0700, David Ahern wrote:
>
>> But, I do not understand how the wrong topology is causing the NMI
>> watchdog to trigger. In the end there are still N domains, M groups per
>> domain and P cpus per group. Doesn't the balancing walk over all of them
>> irrespective of physical topology?
>
> You have this size extra large CPU domain that you shouldn't have,
> massive collisions therein ensue.
>

I was able to get the socket/cores/threads issue resolved, so the 
topology is correct. But I still need to check out a few things. Thanks, 
Mike and Peter, for the suggestions.

David

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2015-03-10  3:07 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
2015-03-06  4:05 NMI watchdog triggering during load_balance David Ahern
2015-03-06  4:52 ` Mike Galbraith
2015-03-06 15:01   ` David Ahern
2015-03-06 18:11     ` Mike Galbraith
2015-03-06 18:37       ` David Ahern
2015-03-06 19:29         ` Mike Galbraith
2015-03-10  3:06           ` David Ahern
2015-03-07  9:36         ` Peter Zijlstra
2015-03-06  8:51 ` Peter Zijlstra
2015-03-06 15:03   ` David Ahern
2015-03-06  9:07 ` Peter Zijlstra
2015-03-06 15:10   ` David Ahern
2015-03-06  9:12 ` Peter Zijlstra
2015-03-06 15:12   ` David Ahern
