From: shrikanth hegde <sshegde@linux.vnet.ibm.com>
To: Benjamin Segall <bsegall@google.com>
Cc: mingo@redhat.com, peterz@infradead.org,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	tglx@linutronix.de, srikar@linux.vnet.ibm.com,
	arjan@linux.intel.com, svaidy@linux.ibm.com,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] sched/fair: Interleave cfs bandwidth timers for improved single thread performance at low utilization
Date: Wed, 15 Feb 2023 16:31:29 +0530
Message-ID: <cd37483e-bf11-ec74-c240-74935bb44809@linux.vnet.ibm.com>
In-Reply-To: <xm268rh06i97.fsf@google.com>

>>
>>              6.2.rc5                           with patch
>>         1CG    power   2CG    power   | 1CG  power     2CG        power
>> 1Core   218     44     315      46    | 219    45    277(+12%)    47(-2%)
>>         219     43     315      45    | 219    44    244(+22%)    48(-6%)
>>                                       |
>> 2Core   108     48     158      52    | 109    50    114(+26%)    59(-13%)
>>         109     49     157      52    | 109    49    136(+13%)    56(-7%)
>>                                       |
>> 4Core    60     59      89      65    |  62    58     72(+19%)    68(-5%)
>>          61     61      90      65    |  62    60     68(+24%)    73(-12%)
>>                                       |
>> 8Core    33     77      48      83    |  33    77     37(+23%)    91(-10%)
>>          33     77      48      84    |  33    77     38(+21%)    90(-7%)
>>
>> There is no benefit at higher utilization of 50% or more. There is no
>> degradation also.
>>
>> This is RFC PATCH V2, where the code has been shifted from hrtimer to
>> sched. This patch sets an initial value as multiple of period/10.
>> Here timers can still align if the time started the cgroup is within the
>> period/10 interval. On a real life workload, time gives sufficient
>> randomness. There can be a better interleaving by being more
>> deterministic. For example, when there are 2 cgroups, they should
>> have initial value of 0/50ms or 10/60ms so on. When there are 3 cgroups,
>> 0/3/6ms or 1/4/7ms etc. That is more complicated as it has to account
>> for cgroup addition/deletion and accuracy w.r.t to period/quota.
>> If that approach is better here, then will come up with that patch.
> 
> This does seem vaguely reasonable, though the power argument of
> consolidating wakeups and such is something that we intentionally do in
> other situations.
> 
Thank you, Benjamin, for taking the time to review this.

> How reasonable do you think it is to just say (and what do the
> equivalent numbers look like on your particular benchmark) "put some
> variance on your period config if you want variance"?

Run-to-run variance is expected with this patch, because the interleaving
depends on the time at which the period timer is started, taken up to the
last period/10 boundary.
That is how I understood your comment about variance; please correct me if
I have misread it.
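
To make the dependence on start time concrete, here is a minimal user-space
sketch of the expiry computation. It is only a model under stated
assumptions: the timer is freshly initialised (expires = 0), times are plain
nanosecond integers, and the names model_start() and forward_now() are made
up for this illustration. forward_now() only approximates what
hrtimer_forward_now() does and is not the kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model, not kernel code. All times are in ns. */
#define NSEC_PER_MSEC 1000000LL

/*
 * Approximation of hrtimer_forward_now(): advance the expiry in whole
 * periods until it lies in the future.
 */
static int64_t forward_now(int64_t expires, int64_t now, int64_t period)
{
	int64_t delta = now - expires;

	if (delta < 0)
		return expires;		/* already in the future */
	if (delta >= period)
		expires += (delta / period) * period;
	return expires + period;
}

/*
 * Model of the patched start_cfs_bandwidth(): when the timer is stale by
 * a full period or more, first move the expiry to the last period/10
 * boundary at or before now, then forward it by one period.
 */
static int64_t model_start(int64_t expires, int64_t now, int64_t period)
{
	int64_t incr = period / 10;
	int64_t delta = now - expires;

	assert(incr > 0);		/* avoid dividing by zero below */
	if (delta >= period)
		expires += (delta / incr) * incr;
	return forward_now(expires, now, period);
}
```

With a 100 ms period in this model, cgroups started at 1003 ms and 1047 ms
get first expiries of 1100 ms and 1140 ms, so they are interleaved. A cgroup
started at 1008 ms lands on 1100 ms again, which is the case where timers
still align. Calling forward_now() alone, as the unpatched code does, gives
1100 ms for all three.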

>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
>> ---
>>  kernel/sched/fair.c | 17 ++++++++++++++---
>>  1 file changed, 14 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ff4dbbae3b10..7b69c329e05d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5939,14 +5939,25 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>
>>  void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>>  {
>> -	lockdep_assert_held(&cfs_b->lock);
>> +	struct hrtimer *period_timer = &cfs_b->period_timer;
>> +	s64 incr = ktime_to_ns(cfs_b->period) / 10;
>> +	ktime_t delta;
>> +	u64 orun = 1;
>>
>> +	lockdep_assert_held(&cfs_b->lock);
>>  	if (cfs_b->period_active)
>>  		return;
>>
>>  	cfs_b->period_active = 1;
>> -	hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>> -	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>> +	delta = ktime_sub(period_timer->base->get_time(),
>> +			hrtimer_get_expires(period_timer));
>> +	if (unlikely(delta >= cfs_b->period)) {
> 
> Probably could have a short comment here that's something like "forward
> the hrtimer by period / 10 to reduce synchronized wakeups"
> 
Sure, I will add that comment in the next version of this patch.

>> +		orun = ktime_divns(delta, incr);
>> +		hrtimer_add_expires_ns(period_timer, incr * orun);
>> +	}
>> +
>> +	hrtimer_forward_now(period_timer, cfs_b->period);
>> +	hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED);
>>  }
>>
>>  static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> --
>> 2.31.1