Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

From: Parth Shah <parth@linux.ibm.com>
To: Patrick Bellasi <patrick.bellasi@arm.com>,
	Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Subhra Mazumdar <subhra.mazumdar@oracle.com>,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	tglx@linutronix.de, steven.sistare@oracle.com,
	dhaval.giani@oracle.com, daniel.lezcano@linaro.org,
	vincent.guittot@linaro.org, viresh.kumar@linaro.org,
	tim.c.chen@linux.intel.com, mgorman@techsingularity.net
Subject: Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
Date: Fri, 6 Sep 2019 18:15:28 +0530
Message-ID: <3bb17e15-5492-b78c-20a8-5989519f20e2@linux.ibm.com>
In-Reply-To: <87d0ge3n85.fsf@arm.com>

On 9/5/19 6:37 PM, Patrick Bellasi wrote:
> 
> On Thu, Sep 05, 2019 at 12:46:37 +0100, Valentin Schneider wrote...
> 
>> On 05/09/2019 12:18, Patrick Bellasi wrote:
>>>> There's a few things wrong there; I really feel that if we call it nice,
>>>> it should be like nice. Otherwise we should call it latency-bias and not
>>>> have the association with nice to confuse people.
>>>>
>>>> Secondly; the default should be in the middle of the range. Naturally
>>>> this would be a signed range like nice [-(x+1),x] for some x. but if you
>>>> want [0,1024], then the default really should be 512, but personally I
>>>> like 0 better as a default, in which case we need negative numbers.
>>>>
>>>> This is important because we want to be able to bias towards less
>>>> importance to (tail) latency as well as more importance to (tail)
>>>> latency.
>>>>
>>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>>>
>>> Right, we have this dualism to deal with and current mainline behaviour
>>> is somehow in the middle.
>>>
>>> BTW, the FB requirement is the same we have in Android.
>>> We want some CFS tasks to have very small latency and a low chance
>>> to be preempted by the wake-up of less-important "background" tasks.
>>>
>>> I'm not totally against the usage of a signed range, but I'm thinking
>>> that since we are introducing a new (non POSIX) concept we can get the
>>> chance to make it more human friendly.
>>>
>>> Given the two extremes above, would it not be much simpler and more intuitive to
>>> have 0 implementing the FB/Android (no latency) case and 1024 the
>>> (max latency) Oracle case?
>>>
>>
>> For something like latency-<whatever>, I don't see the point of having
>> such a wide range. The nice range is probably more than enough - and before
>> even bothering about the range, we should probably agree on what the range
>> should represent.
>>
>> If it's niceness, I read it as: positive latency-nice value means we're
>> nice to latency, means we reduce it. So the further up you go, the more you
>> restrict your wakeup scan. I think it's quite easy to map that into the
>> code: current behaviour at 0, with a decreasing scan mask size as we go
>> towards +19. I don't think anyone needs 512 steps to tune this.
>>
>> I don't know what logic we'd follow for negative values though. Maybe
>> latency-nice -20 means always going through the slowpath, but what of the
>> intermediate values?
> 
> Yep, I think so far we are all converging towards the idea of using a
> signed range. Regarding the range itself, yes: 1024 looks very
> oversized, but +-20 is still something which leaves room for a bit of
> flexibility and it also better matches the idea that we don't want to
> "enumerate behaviours" but just expose a knob. To map certain "bias" we
> could benefit from a slightly larger range.
> 
>> AFAICT this RFC only looks at wakeups, but I guess latency-nice can be
> 
> For the wakeup path there is also the TurboSched proposal by Parth:
> 
>    Message-ID: <20190725070857.6639-1-parth@linux.ibm.com> 
>    https://lore.kernel.org/lkml/20190725070857.6639-1-parth@linux.ibm.com/
> 
> we should keep in mind.
> 
>> applied elsewhere (e.g. load-balance, something like task_hot() and its
>> use of sysctl_sched_migration_cost).
> 
> For LB can you come up with some better description of what usages you
> see could benefit from a "per task" or "per task-group" latency niceness?
> 

I guess there is a use case in thermal throttling.
If a task is heating up a core, then in ideal scenarios POWER systems throttle
down to the rated frequency.
In such a case, if the task is latency sensitive (min latency-nice), we can move the
task around the chip to heat up the chip uniformly, which lets us gain more performance
from a sustained higher frequency.
For this, we would need help from the active load balancer and a latency-nice
classification on a per-task and/or per-group basis.

Hopefully, this might be useful for other architectures as well, right?
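
To make the thermal idea above a bit more concrete, here is a rough userspace
sketch of the decision the active load balancer might take. Nothing in it is
existing kernel code: struct core_state, struct task_hint, thermal_pick_core()
and the hot_threshold_mc parameter are all hypothetical names, and the [-20, 19]
bounds simply assume the nice-like signed range discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed bounds: the nice-like signed range discussed in this thread. */
#define LATENCY_NICE_MIN (-20)
#define LATENCY_NICE_MAX 19

/* Hypothetical per-core view; temperature in millidegrees Celsius. */
struct core_state {
	int temp_mc;
};

/* Hypothetical per-task view: its latency-nice value and current core. */
struct task_hint {
	int latency_nice;
	int cpu;
};

/*
 * Return the index of the core the task should be moved to, or -1 if it
 * should stay where it is. Only tasks at the most latency-sensitive value
 * are rotated, and only once their current core reaches the threshold.
 */
static int thermal_pick_core(const struct task_hint *t,
			     const struct core_state *cores, size_t nr_cores,
			     int hot_threshold_mc)
{
	size_t i;
	int best = -1;

	assert(t->cpu >= 0 && (size_t)t->cpu < nr_cores);

	/* Tasks that are not latency sensitive are left to normal balancing. */
	if (t->latency_nice != LATENCY_NICE_MIN)
		return -1;

	/* The current core is still cool enough, so no move is needed. */
	if (cores[t->cpu].temp_mc < hot_threshold_mc)
		return -1;

	/* Pick the coolest core that is strictly cooler than the current one. */
	for (i = 0; i < nr_cores; i++) {
		if ((int)i == t->cpu)
			continue;
		if (cores[i].temp_mc >= cores[t->cpu].temp_mc)
			continue;
		if (best < 0 || cores[i].temp_mc < cores[best].temp_mc)
			best = (int)i;
	}
	return best;
}
```

A real implementation would of course also have to weigh the migration cost
and cache hotness against the frequency gain; the sketch only shows where a
per-task latency-nice value would enter the decision.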

> Best,
> Patrick
> 

Thanks,
Parth