From: Brendan Jackman <brendan.jackman@arm.com>
To: Josef Bacik <josef@toxicpanda.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>,
	Joel Fernandes <joelaf@google.com>,
	Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Juri Lelli <Juri.Lelli@arm.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Patrick Bellasi <patrick.bellasi@arm.com>,
	Chris Redpath <Chris.Redpath@arm.com>
Subject: Re: wake_wide mechanism clarification
Date: Thu, 03 Aug 2017 16:05:11 +0100	[thread overview]
Message-ID: <87wp6kr8t4.fsf@arm.com> (raw)
In-Reply-To: <20170803131537.GB17196@destiny>


On Thu, Aug 03 2017 at 13:15, Josef Bacik wrote:
> On Thu, Aug 03, 2017 at 11:53:19AM +0100, Brendan Jackman wrote:
>>
>> Hi,
>>
>> On Fri, Jun 30 2017 at 17:55, Josef Bacik wrote:
>> > On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote:
>> >> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>> >> > On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>> >> >
>> >> > > It makes sense that we multiply the slave's flips by a factor because
>> >> > > it's low, but I still don't get why the factor is chosen to be
>> >> > > llc_size instead of something else for the multiplication with slave
>> >> > > (slave * factor).
>> >>
>> >> > Yeah I don't know why llc_size was chosen...
>> >>
>> >> static void update_top_cache_domain(int cpu)
>> >> {
>> >>         struct sched_domain_shared *sds = NULL;
>> >>         struct sched_domain *sd;
>> >>         int id = cpu;
>> >>         int size = 1;
>> >>
>> >>         sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>> >>         if (sd) {
>> >>                 id = cpumask_first(sched_domain_span(sd));
>> >>                 size = cpumask_weight(sched_domain_span(sd));
>> >>                 sds = sd->shared;
>> >>         }
>> >>
>> >>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>> >>         per_cpu(sd_llc_size, cpu) = size;
>> >>
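(For context: the sd_llc_size set up above is the "factor" that
wake_wide() compares the flip counts against. Roughly - paraphrasing
kernel/sched/fair.c of this era rather than quoting it verbatim:

    static int wake_wide(struct task_struct *p)
    {
            unsigned int master = current->wakee_flips;
            unsigned int slave = p->wakee_flips;
            int factor = this_cpu_read(sd_llc_size);

            /* Call whichever task has more flips the "master". */
            if (master < slave)
                    swap(master, slave);

            /*
             * Only treat the wakeup as "wide" if the smaller flip count
             * is at least llc_size and the larger one exceeds it by at
             * least another factor of llc_size.
             */
            if (slave < factor || master < slave * factor)
                    return 0;
            return 1;
    }

So the affine fast path is only skipped once the flip counts exceed the
size of the LLC domain, which is exactly where the "why llc_size?"
question comes from.)
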
>> >> The goal of wake wide was to approximate when pulling would be a futile
>> >> consolidation effort and counterproductive to scaling. 'course with
>> >> ever increasing socket size, any 1:N waker is ever more likely to run
>> >> out of CPU for its one and only self (slamming into scaling wall)
>> >> before it needs to turn its minions loose to conquer the world.
>> >>
>> >> Something else to consider: network interrupt waking multiple workers
>> >> at high frequency. If the waking CPU is idle, do you really want to
>> >> place a worker directly in front of a tattoo artist, or is it better
>> >> off nearly anywhere but there?
>> >>
>> >> If the box is virtual, with no topology exposed (or real but ancient)
>> >> to let select_idle_sibling() come to the rescue, two workers can even
>> >> get tattooed simultaneously (see sync wakeup).
>> >>
>> >
>> > Heuristics are hard, news at 11.  I think messing with wake_wide() itself is too
>> > big of a hammer; we probably need a middle ground.  I'm messing with it right
>> > now so it's too early to say for sure, but I _suspect_ the bigger latencies we
>> > see are not because we overload the CPU we're trying to pull to, but because
>> > when we fail to do the wake_affine() we only look at siblings of the affine_sd
>> > instead of doing the full "find the idlest cpu in the land!" thing.
>>
>> This is the problem I've been hitting lately. My use case is 1 task per
>> CPU on ARM big.LITTLE (asymmetric CPU capacities): each task does X
>> amount of work and then calls pthread_barrier_wait (i.e. sleeps until
>> the last task finishes its X and hits the barrier). On big.LITTLE, the
>> tasks which get a "big" CPU finish faster, and those CPUs then pull
>> over the tasks that are still running:
>>
>>      v CPU v           ->time->
>>
>>                     -------------
>>    0  (big)         11111  /333
>>                     -------------
>>    1  (big)         22222   /444|
>>                     -------------
>>    2  (LITTLE)      333333/
>>                     -------------
>>    3  (LITTLE)      444444/
>>                     -------------
>>
>> Now when task 4 hits the barrier (at |) and wakes the others up, there
>> are 4 tasks with prev_cpu=<big> and 0 tasks with
>> prev_cpu=<little>. Assuming those wakeups happen on CPU 1 (where task 4
>> was running), want_affine means that, regardless of what wake_affine()
>> decides, we only look in that CPU's sd_llc (CPUs 0 and 1), so tasks are
>> unnecessarily coscheduled on the bigs until the next load balance,
>> something like this:
>>
>>      v CPU v           ->time->
>>
>>                     ------------------------
>>    0  (big)         11111  /333  31313\33333
>>                     ------------------------
>>    1  (big)         22222   /444|424\4444444
>>                     ------------------------
>>    2  (LITTLE)      333333/          \222222
>>                     ------------------------
>>    3  (LITTLE)      444444/            \1111
>>                     ------------------------
>>                                  ^^^
>>                            underutilization
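
(For concreteness, the workload is essentially the following - a
stand-alone approximation rather than the actual test harness I'm
running, and the iteration counts are made up:

    #include <pthread.h>

    #define NTASKS 4    /* one task per CPU */

    static pthread_barrier_t barrier;

    static void *task(void *arg)
    {
            volatile unsigned long sink = 0;
            int round, i;

            for (round = 0; round < 100; round++) {
                    /* "X amount of work": just burn some CPU */
                    for (i = 0; i < 100000000; i++)
                            sink += i;
                    /* sleep until the last task finishes its X */
                    pthread_barrier_wait(&barrier);
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t threads[NTASKS];
            int i;

            pthread_barrier_init(&barrier, NULL, NTASKS);
            for (i = 0; i < NTASKS; i++)
                    pthread_create(&threads[i], NULL, task, NULL);
            for (i = 0; i < NTASKS; i++)
                    pthread_join(threads[i], NULL);
            return 0;
    }

The interesting part is purely where the scheduler places the tasks when
pthread_barrier_wait() wakes everybody up.)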
>>
>> > I _think_
>> > the answer is to make select_idle_sibling() try less hard to find something
>> > workable and only use obviously idle CPUs in the affine sd, and fall back to
>> > the full load-balance-esque search.
>>
>> So this idea of allowing select_idle_sibling to fail, and falling back
>> to the slow path, would help me too, I think.
>
> Unfortunately this statement of mine was wrong; I had it in my head that we
> would fall back to a find-the-idlest-CPU search if we failed to wake
> affine, but we just do select_idle_sibling() and expect the load balancer to
> move things around as needed.

Ah yes, when wake_affine() returns false, we still do
select_idle_sibling() (just in prev_cpu's sd_llc instead of
smp_processor_id()'s), and that is the problem my workload hits. I
thought you were suggesting changing the flow so that
select_idle_sibling() could say "I didn't find any idle siblings - go to
the find_idlest_group path".
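
(For reference, my understanding of the current wakeup path, heavily
paraphrased rather than quoted, with the domain walk that finds
affine_sd elided:

    static int select_task_rq_fair(struct task_struct *p, int prev_cpu,
                                   int sd_flag, int wake_flags)
    {
            int cpu = smp_processor_id();
            int new_cpu = prev_cpu;
            struct sched_domain *affine_sd;  /* lowest domain spanning
                                                both cpu and prev_cpu */
            int want_affine = !wake_wide(p) &&
                              cpumask_test_cpu(cpu, &p->cpus_allowed);

            /* wake_affine() only chooses *which* LLC we will search... */
            if (want_affine && cpu != prev_cpu &&
                wake_affine(affine_sd, p, prev_cpu, wake_flags & WF_SYNC))
                    new_cpu = cpu;

            /*
             * ...because either way we fall through to scanning
             * new_cpu's LLC; on this path there is no fallback to the
             * find_idlest_group()/find_idlest_cpu() slow path.
             */
            return select_idle_sibling(p, prev_cpu, new_cpu);
    }
)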

>> This is also why I was playing with your
>> don't-affine-recently-balanced-tasks patch[1], which also helps my case
>> since it prevents want_affine for tasks 3 and 4 (which were recently
>> moved by an active balance).
>>
>> [1] https://marc.info/?l=linux-kernel&m=150003849602535&w=2
>>     (also linked elsewhere in this thread)
>>
>
> Would you try Peter's sched/experimental branch and see how that affects your
> workload?  I'm still messing with my patches and I may drop this one as it now
> appears to be too aggressive with the new set of patches.  Thanks,

Sure, I'll take a look at those, thanks. I guess the idea of caching
values at load-balance time and then using them at wakeup [2] is a
lighter-handed way of achieving the same thing as last_balance_ts? It
won't solve my problem directly, since we'll still only look in sd_llc,
but I think it could be the basis for a way to say "go down the
find_idlest_group path for these tasks" at the beginning of
select_task_rq_fair.

[2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/experimental&id=5b4ed509027a5b6f495e6fe871cae850d5762bef
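
Something like this entirely hypothetical check is what I have in mind
(names invented, just to illustrate the shape of it):

    /* At the top of select_task_rq_fair(): */
    if (sd_flag & SD_BALANCE_WAKE) {
            record_wakee(p);
            /*
             * Hypothetical: if the load balancer recently had to move
             * this task (based on stats cached at LB time, e.g. the
             * values from [2] or a last_balance_ts-style timestamp),
             * skip the affine fast path entirely and find an idle CPU
             * the slow, thorough way.
             */
            if (task_recently_balanced(p))
                    return find_idlest_cpu_slowpath(p, prev_cpu, sd_flag);

            want_affine = !wake_wide(p) &&
                          cpumask_test_cpu(cpu, &p->cpus_allowed);
    }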

Thanks,
Brendan
