From: Abel Wu <wuyun.abel@bytedance.com>
To: Josh Don <joshdon@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Abel Wu <wuyun.abel@bytedance.com>
Subject: Re: [RFC v2 2/2] sched/fair: introduce sched-idle balance
Date: Wed, 13 Apr 2022 01:56:10 +0800	[thread overview]
Message-ID: <801da029-6bbe-2a0c-7de0-afffc3d5de02@bytedance.com> (raw)
In-Reply-To: <CABk29NvE=Fmgo4xqQLfy-K8j0hNS-+ppGdYt37yDUnRJiqjZ5w@mail.gmail.com>

Hi Josh,

On 4/12/22 9:59 AM, Josh Don wrote:
> Hi Abel,
> 
>>
>> +static inline bool cfs_rq_busy(struct rq *rq)
>> +{
>> +       return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running == 1;
>> +}
>> +
>> +static inline bool need_pull_cfs_task(struct rq *rq)
>> +{
>> +       return rq->cfs.h_nr_running == rq->cfs.idle_h_nr_running;
>> +}
> 
> Note that this will also return true if there are 0 tasks, which I
> don't think is the semantics you intend for its use in
> rebalance_domains() below.

I intended it to cover idle balance as well. My v1 patchset
deliberately skipped idle balance because of the high cpu wakeup
latency, but after some benchmarking that exclusion seems unnecessary.

> 
>>   /*
>>    * Use locality-friendly rq->overloaded to cache the status of the rq
>>    * to minimize the heavy cost on LLC shared data.
>> @@ -7837,6 +7867,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>          if (kthread_is_per_cpu(p))
>>                  return 0;
>>
>> +       if (unlikely(task_h_idle(p))) {
>> +               /*
>> +                * Disregard hierarchically idle tasks during sched-idle
>> +                * load balancing.
>> +                */
>> +               if (env->idle == CPU_SCHED_IDLE)
>> +                       return 0;
>> +       } else if (!static_branch_unlikely(&sched_asym_cpucapacity)) {
>> +               /*
>> +                * It's not gonna help if stacking non-idle tasks on one
>> +                * cpu while leaving some idle.
>> +                */
>> +               if (cfs_rq_busy(env->src_rq) && !need_pull_cfs_task(env->dst_rq))
>> +                       return 0;
> 
> These checks don't involve the task at all, so this kind of check
> should be pushed into the more general load balance function. But, I'm
> not totally clear on the motivation here. If we have cpu A with 1
> non-idle task and 100 idle tasks, and cpu B with 1 non-idle task, we
> should definitely try to load balance some of the idle tasks from A to
> B. idle tasks _do_ get time to run (although little), and this can add
> up and cause antagonism to the non-idle task if there are a lot of
> idle threads.

CPU_SCHED_IDLE means the balance was triggered by sched_idle_balance(),
which pulls a non-idle task to an unoccupied cpu from one of the
overloaded cpus, so idle tasks are not the target there and should be
skipped.

The second check handles a different case: if we have cpu A with 1
non-idle task and 100 idle tasks, and cpu B with >=1 non-idle task, we
don't migrate the last non-idle task on A to B.

> 
>>
>>   /*
>> + * The sched-idle balancing tries to make full use of cpu capacity
>> + * for non-idle tasks by pulling them for the unoccupied cpus from
>> + * the overloaded ones.
>> + *
>> + * Return 1 if pulled successfully, 0 otherwise.
>> + */
>> +static int sched_idle_balance(struct rq *dst_rq)
>> +{
>> +       struct sched_domain *sd;
>> +       struct task_struct *p = NULL;
>> +       int dst_cpu = cpu_of(dst_rq), cpu;
>> +
>> +       sd = rcu_dereference(per_cpu(sd_llc, dst_cpu));
>> +       if (unlikely(!sd))
>> +               return 0;
>> +
>> +       if (!atomic_read(&sd->shared->nr_overloaded))
>> +               return 0;
>> +
>> +       for_each_cpu_wrap(cpu, sdo_mask(sd->shared), dst_cpu + 1) {
>> +               struct rq *rq = cpu_rq(cpu);
>> +               struct rq_flags rf;
>> +               struct lb_env env;
>> +
>> +               if (cpu == dst_cpu || !cfs_rq_overloaded(rq) ||
>> +                   READ_ONCE(rq->sched_idle_balance))
>> +                       continue;
>> +
>> +               WRITE_ONCE(rq->sched_idle_balance, 1);
>> +               rq_lock_irqsave(rq, &rf);
>> +
>> +               env = (struct lb_env) {
>> +                       .sd             = sd,
>> +                       .dst_cpu        = dst_cpu,
>> +                       .dst_rq         = dst_rq,
>> +                       .src_cpu        = cpu,
>> +                       .src_rq         = rq,
>> +                       .idle           = CPU_SCHED_IDLE, /* non-idle only */
>> +                       .flags          = LBF_DST_PINNED, /* pin dst_cpu */
>> +               };
>> +
>> +               update_rq_clock(rq);
>> +               p = detach_one_task(&env);
>> +               if (p)
>> +                       update_overload_status(rq);
>> +
>> +               rq_unlock(rq, &rf);
>> +               WRITE_ONCE(rq->sched_idle_balance, 0);
>> +
>> +               if (p) {
>> +                       attach_one_task(dst_rq, p);
>> +                       local_irq_restore(rf.flags);
>> +                       return 1;
>> +               }
>> +
>> +               local_irq_restore(rf.flags);
>> +       }
>> +
>> +       return 0;
>> +}
> 
> I think this could probably be integrated with the load balancing
> function. Your goal is ignore idle tasks for the purpose of pulling
> from a remote rq. And I think the above isn't exactly what you want
> anyway; detach_tasks/detach_one_task  are just going to iterate the
> task list in order. You want to actually look for the non-idle tasks
> explicitly.

I have tried a simpler version like the one below (which makes
sched_idle_balance() unnecessary):

@@ -10338,6 +10343,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
  	int continue_balancing = 1;
  	int cpu = rq->cpu;
  	int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
+	int prev_busy = busy;
  	unsigned long interval;
  	struct sched_domain *sd;
  	/* Earliest time when we have to do rebalance again */
@@ -10394,6 +10400,9 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
  			next_balance = sd->last_balance + interval;
  			update_next_balance = 1;
  		}
+
+		if (!prev_busy && !need_pull_cfs_task(rq))
+			break;
  	}
  	if (need_decay) {
  		/*

But the benchmark results are not as good as those of the RFC v2
patchset. I will dig deeper into this, thanks.

> 
>> @@ -10996,9 +11119,9 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>
>>                  if (sd->flags & SD_BALANCE_NEWIDLE) {
>>
>> -                       pulled_task = load_balance(this_cpu, this_rq,
>> -                                                  sd, CPU_NEWLY_IDLE,
>> -                                                  &continue_balancing);
>> +                       pulled_task |= load_balance(this_cpu, this_rq,
>> +                                                   sd, CPU_NEWLY_IDLE,
>> +                                                   &continue_balancing);
> 
> Why |= ?

This is because I changed the behavior of newidle balance a bit. The
vanilla kernel quits newidle balance once it gets any task to run on
this rq, no matter whether that task is non-idle or not. After this
patch, if there are overloaded cpus in this LLC, we try harder and keep
balancing until we get a non-idle task, which means balancing may
continue even when the cpu is already sched_idle.

Thanks & BR,
Abel

