* [RFC][PATCH 00/26] sched/numa
@ 2012-03-16 14:40 Peter Zijlstra
From: Peter Zijlstra @ 2012-03-16 14:40 UTC
  To: Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Dan Smith, Bharata B Rao, Lee Schermerhorn,
	Andrea Arcangeli, Rik van Riel, Johannes Weiner
  Cc: linux-kernel, linux-mm


Hi All,

While the current scheduler has knowledge of the machine topology, including
NUMA (although there's room for improvement there as well [1]), it is
completely insensitive to which nodes a task's memory actually is on.

Current upstream task memory allocation prefers to use the node the task is
currently running on (unless explicitly told otherwise, see
mbind()/set_mempolicy()), and with the scheduler free to move the task about at
will, the task's memory can end up being spread all over the machine's nodes.

While the scheduler does a reasonable job of keeping short running tasks on a
single node (by means of simply not doing the cross-node migration very often),
it completely blows for long-running processes with a large memory footprint.

This patch-set aims at improving this situation. It does so by assigning a
preferred, or home, node to every process/thread_group. Memory allocation is
then directed by this preference instead of the node the task might actually be
running on momentarily. The load-balancer is also modified to prefer running
the task on its home-node, although not at the cost of letting CPUs go idle or
at the cost of execution fairness.
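
To make the allocation preference concrete, here is a toy userspace model in
Python. It is not kernel code. The Task fields and the allocation_node()
helper are hypothetical names, and the rule that an explicit
mbind()/set_mempolicy()-style policy still wins over the home node is an
assumption of this sketch, not something stated above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    running_node: int                  # node of the CPU the task is on right now
    home_node: Optional[int] = None    # preferred node; None models "not assigned"
    policy_node: Optional[int] = None  # explicit mbind()/set_mempolicy()-style override


def allocation_node(task: Task) -> int:
    """Pick the node a new page would come from in this toy model."""
    if task.policy_node is not None:   # assumption: an explicit policy still wins
        return task.policy_node
    if task.home_node is not None:     # with the patch-set: follow the home node
        return task.home_node
    return task.running_node           # current upstream default: the local node


t = Task(running_node=0, home_node=0)
t.running_node = 1                     # the scheduler moved the task to node 1
print(allocation_node(t))              # prints 0: memory stays on the home node
```

Without a home node the same task would start allocating from node 1 as soon
as it is moved, which is how memory ends up spread over the machine.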

On top of this a new NUMA balancer is introduced, which can change a process's
home-node the hard way. This heavy process migration is driven by two factors:
either tasks are running away from their home-node, or memory is being
allocated away from the home-node. In either case, it tries to move processes
around to make the 'problem' go away.

The home-node migration handles both cpu and memory (anonymous only for now) in
an integrated fashion. The memory migration uses migrate-on-fault to avoid
doing a lot of work from the actual numa balancer kernel thread and only
migrates the active memory.
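
The migrate-on-fault idea can be illustrated with a small Python model. This
is only a sketch under simplifying assumptions: the AddressSpace class and its
methods are hypothetical, pages are plain integers, and a "fault" is just a
call to touch(). It shows why only the active memory gets moved: changing the
home node merely flags pages, and the copy happens when a flagged page is next
touched.

```python
class AddressSpace:
    """Toy model: page number -> node currently holding that page."""

    def __init__(self, pages):
        self.node_of = dict(pages)
        self.home_node = None
        self.lazy = set()      # pages flagged for a placement check on next touch
        self.migrated = 0

    def set_home_node(self, node):
        # Cheap part, done by the balancer: only flag pages, copy nothing.
        self.home_node = node
        self.lazy = set(self.node_of)

    def touch(self, page):
        # Expensive part, paid by the faulting task: migrate if misplaced.
        if page in self.lazy:
            self.lazy.discard(page)
            if self.node_of[page] != self.home_node:
                self.node_of[page] = self.home_node
                self.migrated += 1
        return self.node_of[page]


mm = AddressSpace({0: 0, 1: 0, 2: 1, 3: 0})
mm.set_home_node(1)
for page in (0, 2, 0):         # the active working set; pages 1 and 3 stay cold
    mm.touch(page)
print(mm.node_of, mm.migrated)  # {0: 1, 1: 0, 2: 1, 3: 0} 1
```

Pages 1 and 3 are never touched, so they stay on node 0 and cost nothing to
leave behind.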

For processes that have more tasks than would fit on a node and which want to
split their activity in a useful fashion, the patch-set introduces two new
syscalls: sys_numa_tbind()/sys_numa_mbind(). These syscalls can be used to
create {thread}x{vma} groups which are then scheduled as a unit instead of the
entire process.

That said, it's still early days and there are lots of improvements to make.

On to the actual patches...

The first two are generic cleanups:

  [01/26] mm, mpol: Re-implement check_*_range() using walk_page_range()
  [02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT

The second set is a rework of Lee Schermerhorn's Migrate-on-Fault patches [2]:

  [03/26] mm, mpol: add MPOL_MF_LAZY ...
  [04/26] mm, mpol: add MPOL_MF_NOOP
  [05/26] mm, mpol: Check for misplaced page
  [06/26] mm: Migrate misplaced page
  [07/26] mm: Handle misplaced anon pages
  [08/26] mm, mpol: Simplify do_mbind()

The third set implements the basic numa balancing:

  [09/26] sched, mm: Introduce tsk_home_node()
  [10/26] mm, mpol: Make mempolicy home-node aware
  [11/26] mm, mpol: Lazy migrate a process/vma
  [12/26] sched, mm: sched_{fork,exec} node assignment
  [13/26] sched: Implement home-node awareness
  [14/26] sched, numa: Numa balancer
  [15/26] sched, numa: Implement hotplug hooks
  [16/26] sched, numa: Abstract the numa_entity

The next three patches are a band-aid; Lai Jiangshan (and Paul McKenney) are
doing a proper implementation. The reverts are me being lazy about
forward-porting my call_srcu() implementation.

  [17/26] srcu: revert1
  [18/26] srcu: revert2
  [19/26] srcu: Implement call_srcu()

The last bits implement the new syscalls:

  [20/26] mm, mpol: Introduce vma_dup_policy()
  [21/26] mm, mpol: Introduce vma_put_policy()
  [22/26] mm, mpol: Split and explose some mempolicy functions
  [23/26] sched, numa: Introduce sys_numa_{t,m}bind()
  [24/26] mm, mpol: Implement numa_group RSS accounting
  [25/26] sched, numa: Only migrate long-running entities
  [26/26] sched, numa: A few debug bits


And a few numbers...

These are from my WSM-EP (2 nodes, 6 cores/node, 2 threads/core), running 48
stream benchmarks [3] (modified to use ~230MB and run long).

Without these patches it degrades into 50-50 local/remote memory accesses:

 Performance counter stats for 'sleep 2':

       259,668,750 r01b7@500b:u 		[100.00%]
       262,170,142 r01b7@200b:u                                                

       2.010446121 seconds time elapsed

With the patches there's a significant improvement in locality:

 Performance counter stats for 'sleep 2':

       496,860,345 r01b7@500b:u 		[100.00%]
        78,292,565 r01b7@200b:u                                                

       2.010707488 seconds time elapsed

(the perf events are a bit magical and not supported in an actual perf
 release -- but the first one is L3 misses to local dram, the second is
 L3 misses to remote dram)

If you look at those numbers you can also see that the sum is greater in the
second case. This means that we can service L3 misses at a higher rate, which
translates into a performance gain.
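
The arithmetic behind that reading can be checked with a few lines of Python.
The counter values are copied from the two perf runs above; the summarize()
helper is just a name used for this sketch.

```python
def summarize(local, remote, seconds):
    """Return (local fraction, L3 misses served per second) for one perf run."""
    total = local + remote
    return local / total, total / seconds


# Counter values copied from the two perf runs above.
before = summarize(259_668_750, 262_170_142, 2.010446121)
after = summarize(496_860_345, 78_292_565, 2.010707488)

print(f"local share before: {before[0]:.1%}")
print(f"local share after:  {after[0]:.1%}")
print(f"miss service rate gain: {after[1] / before[1] - 1:.1%}")
```

The local share goes from roughly 50% to roughly 86%, and the machine services
about 10% more L3 misses per second.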

These numbers also show that while there's a marked improvement, there's still
some gain to be had. The current numa balancer is still somewhat fickle.

 ~ Peter


[1] - http://marc.info/?l=linux-kernel&m=130218515520540
      now that we have SD_OVERLAP it should be fairly easy to do.

[2] - http://markmail.org/message/mdwbcitql5ka4uws

[3] - https://asc.llnl.gov/computing_resources/purple/archive/benchmarks/memory/stream.tar 

