linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Steven Sistare <steven.sistare@oracle.com>
To: Valentin Schneider <valentin.schneider@arm.com>,
	mingo@redhat.com, peterz@infradead.org
Cc: subhra.mazumdar@oracle.com, dhaval.giani@oracle.com,
	rohit.k.jain@oracle.com, daniel.m.jordan@oracle.com,
	pavel.tatashin@microsoft.com, matt@codeblueprint.co.uk,
	umgwanakikbuti@gmail.com, riel@redhat.com, jbacik@fb.com,
	juri.lelli@redhat.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 07/10] sched/fair: Provide can_migrate_task_llc
Date: Wed, 31 Oct 2018 11:43:36 -0400	[thread overview]
Message-ID: <7c503b58-6370-df65-51c3-2591bb1cf621@oracle.com> (raw)
In-Reply-To: <b7d91b1d-5da1-5d70-f266-93b108379b64@arm.com>

On 10/29/2018 3:34 PM, Valentin Schneider wrote:
> On 26/10/2018 19:28, Steven Sistare wrote:
>> On 10/26/2018 2:04 PM, Valentin Schneider wrote:
> [...]
>>>
>>> I was thinking that perhaps we could have scenarios where some rq's
>>> keep stealing tasks off of each other and we end up circulating tasks 
>>> between CPUs. Now, that would only happen if we had a handful of tasks
>>> with a very tiny period, and I'm not familiar with (real) such hyperactive
>>> workloads similar to those generated by hackbench where that could happen.
>>
>> That will not happen with the current code, as it only steals if nr_running > 1.
>> The src loses a task, the dst gains it and has nr_running == 1, so it will not
>> be re-stolen.
> 
> That's indeed fine, I was thinking of something like this:
> 
> Suppose you have 2 rq's sharing a workload of 3 tasks. You get one rq with
> nr_running == 1 (r_1) and one rq with nr_running == 2 (r_2).
> 
> As soon as the task on r_1 ends/blocks, we'll go through idle balancing and
> can potentially steal the non-running task from r_2. Sometime later the task
> that was running on r_1 wakes up, and we end up with r_1->nr_running == 2
> and r_2->nr_running == 1.
> 
> IOW we've swapped their role in that example, and the whole thing can
> repeat.
> 
> The shorter the period of those tasks, the more we'll migrate them
> between rq's, hence why I wonder if we shouldn't have some sort of
> throttling.

Stealing is still the right move in this scenario.  Idle cycles become useful
cycles.  The only cost is the CPU time to dequeue from a remote rq and
enqueue on the local rq.  Earlier we discussed skipping try_steal() if avg_idle
is very small, on the order of 10 usec.  I think that type of throttling would 
cover your scenario.  I will add it in my next version.

>> If we modify the code to handle misfits, we may steal when src nr_running == 1,
>> but a fast CPU will only steal the lone task from a slow one, never fast from fast, 
>> and never slow from fast, so no tug of war.
>>
>>> In short, I wonder if we should have task_hot() in there. Drawing a
>>> parallel with load_balance(), even if load-balancing is happening between
>>> rqs of the same LLC, we do go check task_hot(). Have you already experimented
>>> with adding a task_hot() check in here?
>>
>> I tried task_hot, to see if L1/L2 cache warmth matters much on L1/L2/L3 systems, 
>> and it reduced steals and overall performance.
> 
> Mmm so task_hot() mainly implements two mechanisms - the CACHE_HOT_BUDDY
> sched feature and the exec_start threshold.
> 
> The first one should be sidestepped in the stealing case since we won't
> pass (if env->dst_rq->nr_running), that leaves us with the threshold.
> 
> We might want to sidestep it when we are doing balancing within an LLC
> domain (env->sd->flags & SD_SHARE_PKG_RESOURCES) - or use a lower threshold
> in such cases.
> 
> In any case, I think it would make sense to add some LLC conditions to
> task_hot() so that
> - regular load_balance() can also benefit from them

This is probably a good idea (lower threshold for task_hot within LLC).
I would rather see it done as a separate patch, with a separate performance
evaluation, as it will affect all workloads, even those that do not steal.
A load balancing migration when !task_hot() may be performed even when
the dst CPU already has a task to run, so the migration may or may not
improve utilization.  By contrast, a newly idle CPU that does not find
work goes idle and definitely wastes cycles.  Note how
migrate_degrades_locality() chooses migration regardless of preferred node
when the dst is idle:

        /* Leaving a core idle is often worse than degrading locality. */
        if (env->idle != CPU_NOT_IDLE)
                return -1;

I apply the same principle in can_migrate_task_llc().

> - task stealing has at least some sort of throttling
> 
> 
> On a sidenote, I find it a bit odd that the exec_start threshold depends on
> sysctl_sched_migration_cost, which to me is more about idle_balance() cost
> than "how long does it take for a previously run task to go cache cold".

Agreed, but these are all magic numbers anyway :)

>>> I've run some iterations of hackbench (hackbench 2 process 100000) to
>>> investigate this task bouncing, but I didn't really see any of it. That was
>>> just a 4+4 big.LITTLE system though, I'll try to get numbers on a system
>>> with more CPUs.
>>>
>>> ----->8-----
>>>
>>> activations: # of task activations (task starts running)
>>> cpu_migrations: # of activations where cpu != prev_cpu
>>> % stats are percentiles
>>>
>>> - STEAL:
>>>
>>>   | stat  | cpu_migrations | activations |
>>>   |-------+----------------+-------------|
>>>   | count |    2005.000000 | 2005.000000 |
>>>   | mean  |      16.244888 |  290.608479 |
>>>   | std   |      38.963138 |  253.003528 |
>>>   | min   |       0.000000 |    3.000000 |
>>>   | 50%   |       3.000000 |  239.000000 |
>>>   | 75%   |       8.000000 |  436.000000 |
>>>   | 90%   |      45.000000 |  626.000000 |
>>>   | 99%   |     188.960000 | 1073.000000 |
>>>   | max   |     369.000000 | 1417.000000 |
>>>
>>> - NO_STEAL:
>>>
>>>   | stat  | cpu_migrations | activations |
>>>   |-------+----------------+-------------|
>>>   | count |    2005.000000 | 2005.000000 |
>>>   | mean  |      15.260848 |  297.860848 |
>>>   | std   |      46.331890 |  253.210813 |
>>>   | min   |       0.000000 |    3.000000 |
>>>   | 50%   |       3.000000 |  252.000000 |
>>>   | 75%   |       7.000000 |  444.000000 |
>>>   | 90%   |      32.600000 |  643.600000 |
>>>   | 99%   |     214.880000 | 1127.520000 |
>>>   | max   |     467.000000 | 1547.000000 |
>>>
>>> ----->8-----
>>>
>>> Otherwise, my only other concern at the moment is that since stealing
>>> doesn't care about load, we could steal a task that would cause a big
>>> imbalance, which wouldn't have happened with a call to load_balance().
>>>
>>> I don't think this can be triggered with a symmetrical workload like
>>> hackbench, so I'll go explore something else.
>>
>> The dst is about to go idle with zero load, so stealing can only improve the
>> instantaneous balance between src and dst.  For longer term average load, we
>> still rely on periodic load_balance to make adjustments.
> 
> Right, so my line of thinking was that by not doing a load_balance() and
> taking a shortcut (stealing a task), we may end up just postponing a
> load_balance() to after we've stolen a task. I guess in those cases
> there's no magic trick to be found and we just have to deal with it.

In the current code I call idle_balance/load_balance first and then try_steal.
If idle_balance fails because of cost, then it has effectively postponed itself,
independently of stealing.  The next successful call to load_balance will
correct any imbalance caused by stealing.

> And then there's some of the logic like we have in update_sd_pick_busiest()
> where we e.g. try to prevent misfit tasks from running on LITTLEs, but
> then if such tasks are waiting to be run and a LITTLE frees itself up,
> I *think* it's okay to steal it.

Should be OK to steal.  If a BIG subsequently goes idle, load_balance will move
the task to the BIG, or the BIG may steal it when we support misfit stealing.

Questions for you, Valentin:

  - Should misfit stealing be a separate patch, after my series?  I prefer that,
    so we get stealing in peoples hands as soon as possible.  I think separating 
    it is OK because stealing should not cause any regression for misfits, as 
    my code still calls idle_balance/load_balance, which handles misfits.

  - Who should implement misfit stealing -- you, me, someone else?  I have no 
    preference.

- Steve

  reply	other threads:[~2018-10-31 15:44 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-22 14:59 [PATCH 00/10] steal tasks to improve CPU utilization Steve Sistare
2018-10-22 14:59 ` [PATCH 01/10] sched: Provide sparsemask, a reduced contention bitmap Steve Sistare
2018-10-22 14:59 ` [PATCH 02/10] sched/topology: Provide hooks to allocate data shared per LLC Steve Sistare
2018-10-22 14:59 ` [PATCH 03/10] sched/topology: Provide cfs_overload_cpus bitmap Steve Sistare
2018-10-22 14:59 ` [PATCH 04/10] sched/fair: Dynamically update cfs_overload_cpus Steve Sistare
2018-10-22 16:56   ` Peter Zijlstra
2018-10-22 18:43     ` Steven Sistare
2018-10-22 14:59 ` [PATCH 05/10] sched/fair: Hoist idle_stamp up from idle_balance Steve Sistare
2018-10-25 13:47   ` Valentin Schneider
2018-10-25 14:04     ` Steven Sistare
2018-10-22 14:59 ` [PATCH 06/10] sched/fair: Generalize the detach_task interface Steve Sistare
2018-10-22 14:59 ` [PATCH 07/10] sched/fair: Provide can_migrate_task_llc Steve Sistare
2018-10-26 18:04   ` Valentin Schneider
2018-10-26 18:28     ` Steven Sistare
2018-10-29 19:34       ` Valentin Schneider
2018-10-31 15:43         ` Steven Sistare [this message]
2018-10-31 18:48           ` Valentin Schneider
2018-10-31 19:14         ` Peter Zijlstra
2018-11-01 11:16           ` Valentin Schneider
2018-10-22 14:59 ` [PATCH 08/10] sched/fair: Steal work from an overloaded CPU when CPU goes idle Steve Sistare
2018-10-25 13:48   ` Valentin Schneider
2018-10-25 14:07     ` Steven Sistare
2018-10-22 14:59 ` [PATCH 09/10] sched/fair: disable stealing if too many NUMA nodes Steve Sistare
2018-10-22 17:06   ` Peter Zijlstra
2018-10-22 18:47     ` Steven Sistare
2018-10-22 19:21       ` Steven Sistare
2018-10-22 22:05         ` Peter Zijlstra
2018-10-23 13:18   ` Steven Sistare
2018-10-22 14:59 ` [PATCH 10/10] sched/fair: Provide idle search schedstats Steve Sistare
2018-10-22 17:04 ` [PATCH 00/10] steal tasks to improve CPU utilization Peter Zijlstra
2018-10-22 19:07   ` Steven Sistare
2018-10-22 22:09     ` Peter Zijlstra
2018-10-24 15:34     ` Valentin Schneider
2018-10-24 19:27       ` Steven Sistare
2018-10-25 11:31         ` Valentin Schneider
2018-10-25 12:21           ` Steven Sistare
2018-10-25  7:50 ` Vincent Guittot
2018-10-25 11:28   ` Steven Sistare
2018-10-25 12:43     ` Vincent Guittot
2018-10-25 14:19       ` Steven Sistare
2018-10-31 19:35 ` Steven Sistare
2018-11-01 11:56 ` Steven Sistare
2018-11-02 23:39 ` Subhra Mazumdar
2018-11-05 20:08   ` Steven Sistare
2019-01-04 13:37 ` Shijith Thotton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7c503b58-6370-df65-51c3-2591bb1cf621@oracle.com \
    --to=steven.sistare@oracle.com \
    --cc=daniel.m.jordan@oracle.com \
    --cc=dhaval.giani@oracle.com \
    --cc=jbacik@fb.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=matt@codeblueprint.co.uk \
    --cc=mingo@redhat.com \
    --cc=pavel.tatashin@microsoft.com \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=rohit.k.jain@oracle.com \
    --cc=subhra.mazumdar@oracle.com \
    --cc=umgwanakikbuti@gmail.com \
    --cc=valentin.schneider@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).