From: Abel Wu <wuyun.abel@bytedance.com>
To: Josh Don <joshdon@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Abel Wu <wuyun.abel@bytedance.com>, Chen Yu <yu.c.chen@intel.com>
Subject: Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS
Date: Tue, 10 May 2022 16:03:56 +0800
Message-ID: <2db9ab21-e1ce-694a-f509-5600f1190d75@bytedance.com>
In-Reply-To: <CABk29Ns3KBwLXBSwiSe7Pv2YE9iMg+A1kPpPESWG=KNJu9dz0w@mail.gmail.com>

Hi Josh,

On 5/10/22 9:14 AM, Josh Don wrote:
> Hi Abel,
> 
> Overall this looks good, just a couple of comments.
> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d4bd299d67ab..79b4ff24faee 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6323,7 +6323,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
>>   static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
>>   {
>>          struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>> -       int i, cpu, idle_cpu = -1, nr = INT_MAX;
>> +       struct sched_domain_shared *sds = sd->shared;
>> +       int nr, nro, weight = sd->span_weight;
>> +       int i, cpu, idle_cpu = -1;
>>          struct rq *this_rq = this_rq();
>>          int this = smp_processor_id();
>>          struct sched_domain *this_sd;
>> @@ -6333,7 +6335,23 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>          if (!this_sd)
>>                  return -1;
>>
>> +       nro = atomic_read(&sds->nr_overloaded_cpus);
>> +       if (nro == weight)
>> +               goto out;
> 
> This assumes that the sd we're operating on here is the LLC domain
> (true for current use). Perhaps to catch future bugs from changing
> this assumption, we could WARN_ON_ONCE(nro > weight).

The @sds comes from sd->shared, so I don't think the condition will
break once we operate on domains at other levels. But a quick check
for sds != NULL may be needed then, since a domain can have no sds
attached.
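
Something like the following might cover both concerns (just a sketch
of the idea, untested):

        /*
         * Guard against domains without shared state attached, and
         * catch a mis-accounted counter early as Josh suggested.
         * Without an sds, nro stays 0 and the filtering below is
         * simply skipped.
         */
        nro = 0;
        if (sds) {
                nro = atomic_read(&sds->nr_overloaded_cpus);
                WARN_ON_ONCE(nro > weight);
                if (nro == weight)
                        goto out;
        }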

> 
>> +
>> +       nr = min_t(int, weight, p->nr_cpus_allowed);
>> +
>> +       /*
>> +        * It's unlikely to find an idle cpu if the system is under
>> +        * heavy pressure, so skip searching to save a few cycles
>> +        * and relieve cache traffic.
>> +        */
>> +       if (weight - nro < (nr >> 4) && !has_idle_core)
>> +               return -1;
> 
> nit: nr / 16 is easier to read and the compiler will do the shifting for you.

Agreed.

> 
> Was < intentional vs <= ? With <= you'll be able to skip the search in
> the case where both sides evaluate to 0 (can happen frequently if we
> have no idle cpus, and a task with a small affinity mask).

It's intentional: the idea is to unconditionally pass when there are
fewer than 16 cpus to search, where scalability is not an issue. But
I made a mistake: (weight - nro) can't actually be 0 here due to the
nro == weight check above, so it's not appropriate to use "<".
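
Combining that with your nr / 16 nit, the check would become something
like (rough sketch, not final):

        /*
         * With fewer than 16 candidate cpus, nr / 16 is 0 while
         * (weight - nro) is at least 1, so the search always
         * proceeds; larger LLCs bail out when nearly all of their
         * cpus are overloaded.
         */
        if (weight - nro <= nr / 16 && !has_idle_core)
                return -1;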

BTW, I think Chen Yu's proposal[1] of limiting the search depth is a
better and more reasonable idea, and he is doing some benchmarking on
a combination of our work.

[1] https://lore.kernel.org/lkml/20220428182442.659294-1-yu.c.chen@intel.com/

> 
> This will also get a bit confused in the case where the task has many
> cpus allowed, but almost all of them on a different LLC than the one
> we're considering here. Apart from caching the per-LLC
> nr_cpus_allowed, we could instead use cpumask_weight(cpus) below (and
> only do this in the !has_idle_core case to reduce calls to
> cpumask_weight()).

Yes, the task might have many of its allowed cpus on another LLC; the
idea is to use @nr as a worst-case bound. And with Chen's work, I
think we can get rid of nr_cpus_allowed.
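
For reference, I read your suggestion as roughly the following
(hypothetical, untested):

        cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

        /*
         * Bound the search by the cpus actually eligible in this
         * LLC instead of p->nr_cpus_allowed, and only pay for the
         * cpumask_weight() call when there is no idle core.
         */
        if (!has_idle_core)
                nr = cpumask_weight(cpus);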

> 
>> +
>>          cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> +       if (nro > 1)
>> +               cpumask_andnot(cpus, cpus, sdo_mask(sds));
> 
> Just
> if (nro)
> ?

I think it's just not worth touching sdo_mask(sds), which causes heavy
cache traffic, when it only contains one cpu.

> 
>> @@ -6392,6 +6407,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>
>>                  update_avg(&this_sd->avg_scan_cost, time);
>>          }
>> +out:
>> +       if (has_idle_core)
>> +               WRITE_ONCE(sds->has_idle_cores, 0);
> 
> nit: use set_idle_cores() instead (or, if you really want to avoid the
> extra sds dereference, add a __set_idle_cores(sds, val) helper you can
> call directly.)

OK, will do.
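
Presumably something like this (sketch only, mirroring the existing
helper):

        /* Write-side helper for when the sds pointer is already in hand. */
        static inline void __set_idle_cores(struct sched_domain_shared *sds,
                                            int val)
        {
                WRITE_ONCE(sds->has_idle_cores, val);
        }

        static inline void set_idle_cores(int cpu, int val)
        {
                struct sched_domain_shared *sds;

                sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
                if (sds)
                        __set_idle_cores(sds, val);
        }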

> 
>> @@ -7904,6 +7922,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
>>                          continue;
>>
>>                  detach_task(p, env);
>> +               update_overloaded_rq(env->src_rq);
>>
>>                  /*
>>                   * Right now, this is only the second place where
>> @@ -8047,6 +8066,9 @@ static int detach_tasks(struct lb_env *env)
>>                  list_move(&p->se.group_node, tasks);
>>          }
>>
>> +       if (detached)
>> +               update_overloaded_rq(env->src_rq);
>> +
> 
> Thinking about this more, I don't see an issue with moving the
> update_overloaded_rq() calls to enqueue/dequeue_task, rather than here
> in the attach/detach_task paths. Overloaded state only changes when we
> pass the boundary of 2 runnable non-idle tasks, so thrashing of the
> overloaded mask is a lot less worrisome than if it were updated on the
> boundary of 1 runnable task. The attach/detach_task paths run as part
> of load balancing, which can be on a millisecond time scale.

It's really hard to say which one is better; I think it's
workload-specific. On our cloud servers it's common for a long-running
workload to co-exist with a short-running one, which could flip the
status frequently.
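
For context, the status in question is the per-rq overloaded state,
which only changes when an rq crosses the two-runnable-non-idle-tasks
boundary; roughly (my paraphrase, not the exact patch code):

        /* An rq is overloaded once it has >1 runnable non-idle task. */
        static inline bool rq_overloaded(struct rq *rq)
        {
                return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
        }

so a workload oscillating around that boundary toggles the shared mask
on every enqueue/dequeue.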

Thanks & BR,
Abel
