From: Andrea Arcangeli <aarcange@redhat.com>
To: Mike Galbraith <efault@gmx.de>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Peter Zijlstra <pzijlstr@redhat.com>, Ingo Molnar <mingo@elte.hu>,
	Mel Gorman <mel@csn.ul.ie>, Hugh Dickins <hughd@google.com>,
	Rik van Riel <riel@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Hillf Danton <dhillf@gmail.com>,
	Andrew Jones <drjones@redhat.com>, Dan Smith <danms@us.ibm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Paul Turner <pjt@google.com>, Christoph Lameter <cl@linux.com>,
	Suresh Siddha <suresh.b.siddha@intel.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Lai Jiangshan <laijs@cn.fujitsu.com>,
	Bharata B Rao <bharata.rao@gmail.com>,
	Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
	Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>,
	Alex Shi <alex.shi@intel.com>,
	Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	Don Morris <don.morris@hp.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>
Subject: Re: [PATCH 18/33] autonuma: teach CFS about autonuma affinity
Date: Sat, 6 Oct 2012 14:34:32 +0200
Message-ID: <20121006123432.GS6793@redhat.com>
In-Reply-To: <1349491194.6984.175.camel@marge.simpson.net>

Hi Mike,

On Sat, Oct 06, 2012 at 04:39:54AM +0200, Mike Galbraith wrote:
> On Fri, 2012-10-05 at 13:54 +0200, Andrea Arcangeli wrote: 
> > On Fri, Oct 05, 2012 at 08:41:25AM +0200, Mike Galbraith wrote:
> > > On Thu, 2012-10-04 at 01:51 +0200, Andrea Arcangeli wrote: 
> > > > The CFS scheduler is still in charge of all scheduling decisions. At
> > > > times, however, AutoNUMA balancing will override them.
> > > > 
> > > > Generally, we'll just rely on the CFS scheduler to keep doing its
> > > > thing, while preferring the task's AutoNUMA affine node when deciding
> > > > to move a task to a different runqueue or when waking it up.
> > > 
> > > Why does AutoNuma fiddle with wakeup decisions _within_ a node?
> > > 
> > > pgbench intensely disliked me recently depriving it of migration routes
> > > in select_idle_sibling(), so AutoNuma saying NAK seems unlikely to make
> > > it or ilk any happier.
> > 
> > Preferring doesn't mean NAK. It means "search affine first" if there's
> > not, go the usual route like if autonuma was not there.
> 
> I'll rephrase.  We're searching a processor.  What does that have to do
> with NUMA?  I saw you turning want_affine off (and wonder what that's
> gonna do to fluctuating vs for more or less static loads), and get that.

I think you just found a mistake.

Disabling wake_affine when the wakeup CPU was on a remote node (the
only case in which it was turned off) meant sd_affine couldn't be
turned on, so for certain wakeups select_idle_sibling wouldn't run
(rendering some of my logic in select_idle_sibling pointless).

So I'm reverting this hunk:

@@ -2708,7 +2722,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
                return prev_cpu;
 
        if (sd_flag & SD_BALANCE_WAKE) {
-               if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+               if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) &&
+                   task_autonuma_cpu(p, cpu))
                        want_affine = 1;
                new_cpu = prev_cpu;
        }
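
To spell out what the revert changes, here is a minimal user-space
sketch (not kernel code). The struct and both function names are
hypothetical; the two booleans only stand in for the two checks in
the hunk, and I'm assuming task_autonuma_cpu(p, cpu) is true when the
waking CPU sits in the task's AutoNUMA affine node.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two conditions tested in the hunk. */
struct toy_wakeup {
	/* models cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) */
	bool waker_cpu_allowed;
	/* models task_autonuma_cpu(p, cpu); assumed semantics, see above */
	bool waker_cpu_in_affine_node;
};

/* With the hunk applied: a waker on a remote node never gets want_affine. */
static bool toy_want_affine_with_hunk(const struct toy_wakeup *w)
{
	return w->waker_cpu_allowed && w->waker_cpu_in_affine_node;
}

/* With the hunk reverted: only the allowed-CPUs mask decides. */
static bool toy_want_affine_reverted(const struct toy_wakeup *w)
{
	return w->waker_cpu_allowed;
}
```

For a waker that is allowed but on a remote node, the first function
returns false and the second returns true, which is the case where
select_idle_sibling was being skipped.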


Another optimization I noticed: I should record idle_target = true
if target == cpu && idle_cpu(cpu) but task_autonuma_cpu fails, so
that we pick the target if it is idle and there's no idle CPU in the
affine node.
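
Roughly, the idle_target idea could look like the following toy
user-space model. This is only a sketch under my own assumptions, not
the select_idle_sibling code: struct toy_topology, toy_select_idle_cpu
and the flat CPU scan are all hypothetical, and the final "any idle
CPU" fallback merely stands in for the usual non-AutoNUMA route.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 8

/* Toy topology; every name here is hypothetical, not a kernel symbol. */
struct toy_topology {
	int  cpu_node[TOY_NR_CPUS];  /* NUMA node of each CPU */
	bool cpu_idle[TOY_NR_CPUS];  /* true if the CPU is idle */
};

/*
 * Pick a wakeup CPU for a task whose affine node is affine_nid.
 * "target" is the CPU the scheduler would try first.
 */
static int toy_select_idle_cpu(const struct toy_topology *t,
			       int target, int affine_nid)
{
	bool idle_target = false;
	int cpu;

	assert(target >= 0 && target < TOY_NR_CPUS);

	if (t->cpu_idle[target]) {
		/* Idle and in the affine node: nothing better to find. */
		if (t->cpu_node[target] == affine_nid)
			return target;
		/* Idle but not affine: remember it as a fallback. */
		idle_target = true;
	}

	/* Search the affine node first. */
	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (t->cpu_idle[cpu] && t->cpu_node[cpu] == affine_nid)
			return cpu;

	/* No idle CPU in the affine node: take the idle target if any. */
	if (idle_target)
		return target;

	/* Otherwise behave as if AutoNUMA was not there (simplified). */
	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (t->cpu_idle[cpu])
			return cpu;

	return target;
}
```

With CPUs 0-3 on node 0 and CPUs 4-7 on node 1, an idle target on
node 0 is only used when no CPU on the affine node 1 is idle.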

> I measured the 1 in 1:N pgbench very much preferring mobility.  The N,
> dunno, but I don't imagine a large benefit for making them sticky
> either.  Hohum, numbers will tell the tale.

Mobility on non-NUMA is an entirely different matter from mobility
across NUMA nodes. Keep in mind there are tons of CPUs within a node
too, so intra-node mobility may be enough. But I don't know exactly
what the mobility requirements of pgbench are, so I can't tell for
sure, and I fully agree we should collect numbers.

The availability of NUMA systems has increased a lot lately, so
hopefully more people will be able to test it and provide feedback.

Overall, getting the intra-node convergence wrong is more concerning
than not being optimal in the 1:N load. Getting the former wrong means
we risk delaying convergence (and having to fix it up later with
autonuma balancing events). The latter is just about maxing out all
memory channels and all idle HT cores, in a MADV_INTERLEAVE-like
behavior, and mitigating the spurious page migrations (which will
still happen occasionally, and we need them to keep happening slowly
to avoid ending up using a single memory channel). But the latter is
a less deterministic case: it's harder to be faster than upstream
unless upstream does all allocations in one thread first and then
starts the other threads computing on the memory later. The 1:N case
has no perfect solution anyway, unless we just detect it and hammer
it with MADV_INTERLEAVE. But I tried to avoid hard classifications
and radical changes in behavior, and I try to do something that
always works no matter the load we throw at it. So I'm usually more
concerned about optimizing for the former case, which has a perfect
solution possible.

> > If there are multiple threads their affinity will vary slighly and the
> > task_selected_nid will distribute (and if it doesn't distribute the
> > idle load balancing will still work perfectly as upstream).
> > 
> > If there's just one thread, so really 1:N, it doesn't matter in which
> > CPU of the 4 nodes we put it if it's the memory split is 25/25/25/25.
> 
> It should matter when load is not static.  Just as select_idle_sibling()
> is not a great idea once you're ramped up, retained stickiness should
> hurt dynamic responsiveness.  But never mind, that's just me pondering
> the up/down sides of stickiness.

Actually I'm going to test removing the above hunk.

> > In short in those 1:N scenarios, it's usually better to just stick to
> > the last node it run on, and it does with AutoNUMA. This is why it's
> > better to have 1 task_selected_nid instead of 4. There may be level 3
> > caches for the node too and that will preserve them too.
> 
> My point was that there is no correct node to prefer, so wondered if
> AutoNuma could possibly recognize that, and not do what can only be the
> wrong thing.  It needs to only tag things it is really sure about.

You know sched/fair.c so much better than me, so you decide. AutoNUMA
is just an ideal hacking base that converges and works well, and we
can build on that. It's very easy to modify and experiment
with. All contributions are welcome ;).

I'm adding new ideas to it as I write this, in an experimental branch
(I just reached new records of convergence vs autonuma27, by accounting
in real time for the page migrations in mm_autonuma without having to
boost the numa hinting page fault rate).

Thanks!
Andrea
