Re: [PATCH 3/3] sched: limit cpu search and rotate search window for scalability

From: Peter Zijlstra <peterz@infradead.org>
To: Subhra Mazumdar <subhra.mazumdar@oracle.com>
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com,
	daniel.lezcano@linaro.org, steven.sistare@oracle.com,
	dhaval.giani@oracle.com, rohit.k.jain@oracle.com
Subject: Re: [PATCH 3/3] sched: limit cpu search and rotate search window for scalability
Date: Wed, 25 Apr 2018 17:36:00 +0200
Message-ID: <20180425153600.GA4043@hirez.programming.kicks-ass.net>
In-Reply-To: <d7efc4d3-7c50-752e-a1ae-a164d991dbcc@oracle.com>

On Tue, Apr 24, 2018 at 05:10:34PM -0700, Subhra Mazumdar wrote:
> On 04/24/2018 05:53 AM, Peter Zijlstra wrote:

> > Why do you need to put a max on? Why isn't the proportional thing
> > working as is? (is the average no good because of big variance or what)

> Firstly the choosing of 512 seems arbitrary.

It is; it is a crude attempt to deal with big variance. The comment says
as much.

> Secondly the logic here is that the enqueuing cpu should search up to
> time it can get work itself.  Why is that the optimal amount to
> search?

1/512-th of the time in fact, per the above random number, but yes.
Because searching for longer than we're expecting to be idle for is
clearly bad; at that point we're inhibiting doing useful work.
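
For illustration, the proportional limit described above can be sketched
in plain userspace C roughly as follows. This is a minimal sketch, not
the scheduler code itself: the function name scan_budget(), the floor of
4 CPUs and the clamp to the domain size are assumptions made for the
example; only the 1/512 divisor comes from the discussion.

```c
#include <stdint.h>

#define IDLE_FRACTION 512  /* the arbitrary divisor discussed above */
#define MIN_SCAN      4    /* assumed floor on the number of CPUs scanned */

/*
 * Hypothetical helper: how many CPUs may a wakeup scan, given the
 * expected idle time of the waking CPU, the average cost of scanning
 * one CPU, and the number of CPUs in the domain (span_weight)?
 */
static unsigned int scan_budget(uint64_t avg_idle_ns,
                                uint64_t avg_scan_cost_ns,
                                unsigned int span_weight)
{
    /* Only a 1/512-th slice of the expected idle time is spent searching. */
    uint64_t idle_budget = avg_idle_ns / IDLE_FRACTION;
    /* +1 keeps the division below well defined when the cost is 0. */
    uint64_t cost = avg_scan_cost_ns + 1;
    uint64_t span_avg = (uint64_t)span_weight * idle_budget;
    uint64_t nr;

    /* Scale the scan length with the budget, but never below the floor. */
    if (span_avg > MIN_SCAN * cost)
        nr = span_avg / cost;
    else
        nr = MIN_SCAN;

    /* There is no point scanning more CPUs than the domain contains. */
    if (nr > span_weight)
        nr = span_weight;

    return (unsigned int)nr;
}
```

With a small avg_idle the result collapses to the floor no matter how
large the domain is, which is the behaviour being questioned here.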

But while thinking about all this, I think I've spotted a few more
issues, aside from the variance:

Firstly, while avg_idle estimates the average duration for _when_ we go
idle, it doesn't give a good measure when we do not in fact go idle. So
when we get wakeups while fully busy, avg_idle is a poor measure.

Secondly, the number of wakeups performed is also important. If we have
a lot of wakeups, we need to look at aggregate wakeup time over a
period. Not just single wakeup time.

And thirdly, we're sharing the idle duration with newidle balance.
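
One possible reading of the second point, as a userspace C sketch: sum
the search cost of all wakeups within a period and compare that total
against the idle time seen in the same period, instead of judging each
wakeup alone. The struct and function names below are hypothetical and
do not correspond to anything in the scheduler; how the period is
delimited and reset is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU accounting for one measurement period. */
struct wake_window {
    uint64_t idle_ns;      /* idle time observed in this period */
    uint64_t scan_ns;      /* search time summed over all wakeups */
    unsigned int wakeups;  /* number of wakeups in this period */
};

/* Record the search cost of one completed wakeup. */
static void wake_window_account(struct wake_window *w, uint64_t scan_cost_ns)
{
    w->scan_ns += scan_cost_ns;
    w->wakeups++;
}

/*
 * Allow another search only while the aggregate search time, including
 * the next one, still fits inside the idle time of the period.
 */
static bool wake_window_can_scan(const struct wake_window *w,
                                 uint64_t next_scan_cost_ns)
{
    return w->scan_ns + next_scan_cost_ns <= w->idle_ns;
}
```

A single 3000ns search looks cheap against 10000ns of idle time, but
after three such wakeups in the same period a fourth no longer fits.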

And I think the 512 is a result of me not having recognised these
additional issues when looking at the traces; I saw variance and left it
at that.

This leaves me thinking we need a better estimator for wakeups. Because
if there really is significant idle time, not looking for idle CPUs to
run on is bad. Placing that upper limit, especially such a low one, is
just an indication of failure.

I'll see if I can come up with something.