From: Ingo Molnar <mingo@kernel.org>
To: Christoph Lameter <cl@linux.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Paul Turner <pjt@google.com>,
	Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
	Rik van Riel <riel@redhat.com>, Mel Gorman <mgorman@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH 5/8] sched, numa, mm: Add adaptive NUMA affinity support
Date: Tue, 13 Nov 2012 09:19:35 +0100
Message-ID: <20121113081935.GB21386@gmail.com>
In-Reply-To: <0000013af7130ad7-95edbaf9-d31d-4258-8fc0-013d152246a2-000000@email.amazonses.com>


* Christoph Lameter <cl@linux.com> wrote:

> > Using this, we can construct two per-task node-vectors, 
> > 'S_i' and 'P_i' reflecting the amount of shared and 
> > privately used pages of this task respectively. Pages for 
> > which two consecutive 'hits' are of the same cpu are assumed 
> > private and the others are shared.
> 
> The classification is per task? [...]

Yes, exactly - access patterns are fundamentally and physically 
per task, as a task can execute only on a single CPU at a time. 
(I say 'task' instead of 'thread' or 'process' because the new 
code makes no distinction between threads and processes.)

The new code maps out inter-task relationships, statistically. 

So we are basically able to (statistically) approximate which 
task relates to which other task in the system, based on their 
memory access patterns alone - using a very compact metric, 
matching scheduler rules and a (lazy) memory placement machinery 
on the VM side.
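
As a purely illustrative sketch of what such a compact per-task 
metric could look like - the type and function names below are 
invented for this mail, they are not the identifiers used in the 
patch set:

  #define NR_NODES 8                      /* assumption: small example box */

  struct numa_task_stats {
          unsigned long shared[NR_NODES]; /* the 'S_i' vector */
          unsigned long priv[NR_NODES];   /* the 'P_i' vector */
  };

  /*
   * Called on a (lazy) NUMA hinting fault: 'page_last_cpu' is the cpu
   * that last touched the page (kept in the page flags by patch 4/8),
   * 'cpu' and 'node' are where the current fault happened.
   */
  static void account_numa_fault(struct numa_task_stats *ns, int node,
                                 int cpu, int page_last_cpu)
  {
          /*
           * Two consecutive 'hits' on the page from the same cpu:
           * assume the page is private to this task, otherwise
           * count it as shared.
           */
          if (page_last_cpu == cpu)
                  ns->priv[node]++;
          else
                  ns->shared[node]++;
  }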

Say we have the following 10-task workload, where the scheduler 
is able to figure out these relationships:

  { A, B, C, D } dominantly share memory X with each other
  { E, F, G, H } dominantly share memory Y with each other
  { I }          uses memory privately
  { J }          uses memory privately

and the scheduler rules then try to converge these groups of 
tasks towards an ideal placement.

[ The 'role' and grouping of tasks are not static but sampled and 
  average-based - so if a worker thread changes its role, the 
  scheduler will adapt its placement accordingly. ]

[ A 'private task' is basically a special case of memory 
  sharing: a task that only shares memory with itself. Its 
  placement and spreading are easy. ]
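
To make the 'converge these groups' part a bit more concrete, a 
minimal sketch of how a per-task node preference could be derived 
from such vectors (again invented names, building on the 
numa_task_stats sketch above; the 2:1 shared-vs-private weighting 
is just one possible choice, not what the patches do):

  /* Sketch: the node this task's accesses are centered on. */
  static int task_preferred_node(struct numa_task_stats *ns)
  {
          unsigned long best_score = 0;
          int node, best_node = -1;

          for (node = 0; node < NR_NODES; node++) {
                  unsigned long score = 2*ns->shared[node] + ns->priv[node];

                  if (score > best_score) {
                          best_score = score;
                          best_node = node;
                  }
          }
          return best_node;
  }

With something like that, A-D all end up preferring the node that 
holds X and E-H the node that holds Y, while for I and J no shared 
component dominates, so they can be placed purely on capacity 
grounds.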

> [...] But most tasks have memory areas that are private and 
> other areas where shared accesses occur. Can that be per 
> memory area? [...]

Do you mean per vma, and/or per mm?

How would that work? Consider a concrete workload along the lines 
of the above example:

 - 12x CPU-intense threads and a large piece of shared memory

 - 2x 4 threads use two large shared memory areas to calculate 
   (one area for each group of threads)

 - the 4 remaining processes aggregate and sort the results from
   the 8 threads, in their own dominantly 'private' working set.

How would a per-vma or per-mm classification describe that 
properly? The whole workload might live within just a single large 
vma inside a JVM, or it might be implemented using processes and 
anonymous shared memory.

If you look at this from a 'per-task access pattern and 
inter-task working set relationship' perspective then the 
resolution and the optimization are natural: the 2x 4 threads 
should be grouped together, modulo capacity constraints, while the 
remaining 4 'private memory' threads should be spread out over 
the remaining capacity of the system.

What matters is how tasks relate to each other as they perform 
processing, not which APIs the workload uses to create tasks and 
memory areas.
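
In placement terms that boils down to something like the following 
sketch (the function and the trivial load/capacity arrays are 
again invented for illustration - the real thing is of course 
integrated with the load balancer):

  /*
   * Tasks of a sharing group go to the group's preferred node while it
   * has spare capacity; 'private' tasks (group_node == -1) and group
   * overflow take the least loaded node, which spreads them out.
   */
  static int place_task(int group_node, int nr_running[], int capacity[],
                        int nr_nodes)
  {
          int node, least = 0;

          if (group_node >= 0 && nr_running[group_node] < capacity[group_node])
                  return group_node;

          for (node = 1; node < nr_nodes; node++)
                  if (nr_running[node] < nr_running[least])
                          least = node;

          return least;
  }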

The main constraint from a placement optimization complexity POV 
is task->CPU placement: for NUMA workloads the main challenge - 
and 80% of the code and much of the real meat of the feature - 
is to categorize and place tasks properly.

There might be much discussion about PROT_NONE and memory 
migration details, but that is mostly because the VM code is 5 
times larger than the scheduler code - and hence there are 5 times 
more VM hackers than scheduler hackers ;-)
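
To illustrate what the 'lazy' part of that machinery means, here 
is a small user-space analogy of the PROT_NONE hinting fault 
trick - purely a conceptual demo written for this mail, not the 
kernel code path:

  #include <signal.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  static char *region;
  static long page_size;

  static void hint_fault(int sig, siginfo_t *si, void *ctx)
  {
          (void)sig; (void)si; (void)ctx;
          /*
           * The kernel would record here which task/cpu touched the
           * page (feeding the shared/private vectors) and possibly
           * queue the page for migration; the demo just restores
           * access and lets the faulting instruction retry.
           */
          mprotect(region, page_size, PROT_READ | PROT_WRITE);
  }

  int main(void)
  {
          struct sigaction sa;

          page_size = sysconf(_SC_PAGESIZE);
          region = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          region[0] = 1;                  /* populate the page */

          memset(&sa, 0, sizeof(sa));
          sa.sa_flags = SA_SIGINFO;
          sa.sa_sigaction = hint_fault;
          sigemptyset(&sa.sa_mask);
          sigaction(SIGSEGV, &sa, NULL);

          /* take the access rights away, like the lazy sampling does: */
          mprotect(region, page_size, PROT_NONE);

          region[0] = 2;                  /* hinting fault, then retry */
          printf("after the hint fault: %d\n", region[0]);
          return 0;
  }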

In reality the main complexity of this problem [the placement 
optimization portion of it] is dominantly a CPU/task scheduler 
feature, and IMO rather fundamentally so: it's not an 
implementation choice but, in essence, derives from the von 
Neumann model of computing.

And that is why IMO the task-based access pattern metric 
implementation is such a good fit in practice as well - and that 
is why other approaches have struggled to get a hold of the NUMA 
problem.

Thanks,

	Ingo

