Re: newidle balancing in NUMA domain?

From: Nick Piggin <npiggin@suse.de>
To: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: newidle balancing in NUMA domain?
Date: Mon, 23 Nov 2009 13:27:31 +0100
Message-ID: <20091123122731.GE2287@wotan.suse.de>
In-Reply-To: <20091123120849.GB32009@elte.hu>

On Mon, Nov 23, 2009 at 01:08:49PM +0100, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > On Mon, Nov 23, 2009 at 12:45:50PM +0100, Ingo Molnar wrote:
> > Well to be fair, the *decision* is to use a longer-term weight for the 
> > runqueue to reduce balancing (seeing as we naturally do far more 
> > balancing on conditions means that we tend to look at our instant 
> > runqueue weight when it is 0).
> 
> Well, the problem with that is that it uses a potentially outdated piece 
> of metric - and that can become visible if balancing events are rare 
> enough.

It shouldn't, it happens on the local CPU, at each scheduler tick.

> I.e. we do need a time scale (rate of balancing) to be able to do this 
> correctly on a statistical level - which pretty much brings in 'rate 
> limit' kind of logic.
> 
> We are better off observing reality precisely and then saying "dont do 
> this action" instead of fuzzing our metrics [or using fuzzy metrics 
> conditionally - which is really the same] and hoping that in the end it 
> will be as if we didnt do certain decisions.

We do though. We take the instant value and the weighted value and
take the min or max (depending on source or destination).

And we decide to do that because the instantaneous fluctuations
in runqueue load can be far, far outside even a normal short-term
operating condition.
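
To make the above concrete, here is a minimal sketch of the idea, not the
actual kernel code. The names (rq_sketch, tick_update_load, LOAD_IDX_MAX)
are made up for illustration, and the decay formula is an assumption
modelled on the per-tick cpu_load[] averaging. The point is only that the
weighted value is refreshed on the local CPU at every tick, and that the
source side takes the min while the destination side takes the max, so a
momentary spike or dip alone does not justify a migration.

```c
#include <assert.h>

#define LOAD_IDX_MAX 5	/* hypothetical number of averaging horizons */

/* Illustrative stand-in for a runqueue; not the kernel's struct rq. */
struct rq_sketch {
	unsigned long instant_load;		/* runqueue weight right now */
	unsigned long cpu_load[LOAD_IDX_MAX];	/* decayed averages; higher index decays slower */
};

/* Assumed to run once per scheduler tick on the local CPU. */
static void tick_update_load(struct rq_sketch *rq)
{
	unsigned long scale = 1;
	int i;

	for (i = 0; i < LOAD_IDX_MAX; i++, scale <<= 1) {
		unsigned long old = rq->cpu_load[i];
		unsigned long cur = rq->instant_load;

		/* Round up when rising so the average can reach the new value. */
		if (cur > old)
			cur += scale - 1;
		/* scale == 1 << i, so this is a weighted mean of old and cur. */
		rq->cpu_load[i] = (old * (scale - 1) + cur) >> i;
	}
}

/* Load of a CPU we might pull from: be conservative, take the lower value. */
static unsigned long source_load(const struct rq_sketch *rq, int idx)
{
	assert(idx >= 0 && idx < LOAD_IDX_MAX);
	return rq->cpu_load[idx] < rq->instant_load ?
		rq->cpu_load[idx] : rq->instant_load;
}

/* Load of a CPU we might push to: be conservative, take the higher value. */
static unsigned long target_load(const struct rq_sketch *rq, int idx)
{
	assert(idx >= 0 && idx < LOAD_IDX_MAX);
	return rq->cpu_load[idx] > rq->instant_load ?
		rq->cpu_load[idx] : rq->instant_load;
}
```

With this shape, a CPU that has just gone idle still looks loaded as a
destination until its weighted value decays over a few ticks, which is the
damping I am talking about.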

I don't know how you would otherwise propose to damp those fluctuations
that don't actually require any balancing. Rate limiting doesn't get
rid of those at all, it just makes balancing a bit less frequent. But
on the NUMA domain, memory affinity has been destroyed whether a task is
moved once or 100 times.

> (I hope i explained my point clearly enough.)
> 
> No argument that it could be done cleaner - the duality right now of
> both having the fuzzy stats and the rate limiting should be decided 
> one way or another.

Well, I would say please keep domain balancing behaviour somewhat
close to how it was with the O(1) scheduler, at least until CFS is
more sorted out. There is no need for a knee-jerk reaction because
BFS is better at something.

We can certainly look at making improvements and take cues from
BFS to use in sched domains, but it is so easy to introduce regressions,
and regressions versus previous kernels are much more important
than slowdowns versus an out-of-tree patch. So while CFS is still
going through troubles I think it is much better to slow down on
sched-domains changes.

Thanks,