From: Abel Wu <wuyun.abel@bytedance.com>
To: Josh Don <joshdon@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Abel Wu <wuyun.abel@bytedance.com>
Subject: Re: [RFC v2 2/2] sched/fair: introduce sched-idle balance
Date: Wed, 13 Apr 2022 01:56:10 +0800	[thread overview]
Message-ID: <801da029-6bbe-2a0c-7de0-afffc3d5de02@bytedance.com> (raw)
In-Reply-To: <CABk29NvE=Fmgo4xqQLfy-K8j0hNS-+ppGdYt37yDUnRJiqjZ5w@mail.gmail.com>

Hi Josh,

On 4/12/22 9:59 AM, Josh Don wrote:
> Hi Abel,
> 
>>
>> +static inline bool cfs_rq_busy(struct rq *rq)
>> +{
>> +       return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running == 1;
>> +}
>> +
>> +static inline bool need_pull_cfs_task(struct rq *rq)
>> +{
>> +       return rq->cfs.h_nr_running == rq->cfs.idle_h_nr_running;
>> +}
> 
> Note that this will also return true if there are 0 tasks, which I
> don't think is the semantics you intend for its use in
> rebalance_domains() below.

I intended it to cover idle balance as well. My v1 patchset
deliberately skipped idle balance because of the high cpu wakeup
latency, but after some benchmarking that exclusion seems unnecessary.

> 
>>   /*
>>    * Use locality-friendly rq->overloaded to cache the status of the rq
>>    * to minimize the heavy cost on LLC shared data.
>> @@ -7837,6 +7867,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>          if (kthread_is_per_cpu(p))
>>                  return 0;
>>
>> +       if (unlikely(task_h_idle(p))) {
>> +               /*
>> +                * Disregard hierarchically idle tasks during sched-idle
>> +                * load balancing.
>> +                */
>> +               if (env->idle == CPU_SCHED_IDLE)
>> +                       return 0;
>> +       } else if (!static_branch_unlikely(&sched_asym_cpucapacity)) {
>> +               /*
>> +                * It's not gonna help if stacking non-idle tasks on one
>> +                * cpu while leaving some idle.
>> +                */
>> +               if (cfs_rq_busy(env->src_rq) && !need_pull_cfs_task(env->dst_rq))
>> +                       return 0;
> 
> These checks don't involve the task at all, so this kind of check
> should be pushed into the more general load balance function. But, I'm
> not totally clear on the motivation here. If we have cpu A with 1
> non-idle task and 100 idle tasks, and cpu B with 1 non-idle task, we
> should definitely try to load balance some of the idle tasks from A to
> B. idle tasks _do_ get time to run (although little), and this can add
> up and cause antagonism to the non-idle task if there are a lot of
> idle threads.

CPU_SCHED_IDLE means the balance was triggered by sched_idle_balance(),
which pulls a non-idle task to an unoccupied cpu from one of the
overloaded cpus, so idle tasks are not the target there and should be
skipped.

The second check handles a different case: if we have cpu A with 1
non-idle task and 100 idle tasks, and cpu B with >=1 non-idle task, we
don't migrate the last non-idle task on A to B.

> 
>>
>>   /*
>> + * The sched-idle balancing tries to make full use of cpu capacity
>> + * for non-idle tasks by pulling them for the unoccupied cpus from
>> + * the overloaded ones.
>> + *
>> + * Return 1 if pulled successfully, 0 otherwise.
>> + */
>> +static int sched_idle_balance(struct rq *dst_rq)
>> +{
>> +       struct sched_domain *sd;
>> +       struct task_struct *p = NULL;
>> +       int dst_cpu = cpu_of(dst_rq), cpu;
>> +
>> +       sd = rcu_dereference(per_cpu(sd_llc, dst_cpu));
>> +       if (unlikely(!sd))
>> +               return 0;
>> +
>> +       if (!atomic_read(&sd->shared->nr_overloaded))
>> +               return 0;
>> +
>> +       for_each_cpu_wrap(cpu, sdo_mask(sd->shared), dst_cpu + 1) {
>> +               struct rq *rq = cpu_rq(cpu);
>> +               struct rq_flags rf;
>> +               struct lb_env env;
>> +
>> +               if (cpu == dst_cpu || !cfs_rq_overloaded(rq) ||
>> +                   READ_ONCE(rq->sched_idle_balance))
>> +                       continue;
>> +
>> +               WRITE_ONCE(rq->sched_idle_balance, 1);
>> +               rq_lock_irqsave(rq, &rf);
>> +
>> +               env = (struct lb_env) {
>> +                       .sd             = sd,
>> +                       .dst_cpu        = dst_cpu,
>> +                       .dst_rq         = dst_rq,
>> +                       .src_cpu        = cpu,
>> +                       .src_rq         = rq,
>> +                       .idle           = CPU_SCHED_IDLE, /* non-idle only */
>> +                       .flags          = LBF_DST_PINNED, /* pin dst_cpu */
>> +               };
>> +
>> +               update_rq_clock(rq);
>> +               p = detach_one_task(&env);
>> +               if (p)
>> +                       update_overload_status(rq);
>> +
>> +               rq_unlock(rq, &rf);
>> +               WRITE_ONCE(rq->sched_idle_balance, 0);
>> +
>> +               if (p) {
>> +                       attach_one_task(dst_rq, p);
>> +                       local_irq_restore(rf.flags);
>> +                       return 1;
>> +               }
>> +
>> +               local_irq_restore(rf.flags);
>> +       }
>> +
>> +       return 0;
>> +}
> 
> I think this could probably be integrated with the load balancing
> function. Your goal is ignore idle tasks for the purpose of pulling
> from a remote rq. And I think the above isn't exactly what you want
> anyway; detach_tasks/detach_one_task  are just going to iterate the
> task list in order. You want to actually look for the non-idle tasks
> explicitly.

I have tried a simpler version like the one below (which makes
sched_idle_balance() unnecessary):

@@ -10338,6 +10343,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
  	int continue_balancing = 1;
  	int cpu = rq->cpu;
  	int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
+	int prev_busy = busy;
  	unsigned long interval;
  	struct sched_domain *sd;
  	/* Earliest time when we have to do rebalance again */
@@ -10394,6 +10400,9 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
  			next_balance = sd->last_balance + interval;
  			update_next_balance = 1;
  		}
+
+		if (!prev_busy && !need_pull_cfs_task(rq))
+			break;
  	}
  	if (need_decay) {
  		/*

But the benchmark results are not as good as those of the RFC v2
patchset. I will dig deeper into this, thanks.

> 
>> @@ -10996,9 +11119,9 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>
>>                  if (sd->flags & SD_BALANCE_NEWIDLE) {
>>
>> -                       pulled_task = load_balance(this_cpu, this_rq,
>> -                                                  sd, CPU_NEWLY_IDLE,
>> -                                                  &continue_balancing);
>> +                       pulled_task |= load_balance(this_cpu, this_rq,
>> +                                                   sd, CPU_NEWLY_IDLE,
>> +                                                   &continue_balancing);
> 
> Why |= ?

This is because I changed the behavior of newidle balance a bit. The
vanilla kernel quits newidle balance once it gets any task to run on
this rq, no matter whether that task is non-idle or not. After this
patch, if there are overloaded cpus in this LLC, we try harder and keep
balancing until we get a non-idle task, which means balancing may
continue even when the cpu is already sched_idle.

Thanks & BR,
Abel

