Re: [PATCH 07/10] sched/fair: Provide can_migrate_task_llc

From: Steven Sistare <steven.sistare@oracle.com>
To: Valentin Schneider <valentin.schneider@arm.com>,
	mingo@redhat.com, peterz@infradead.org
Cc: subhra.mazumdar@oracle.com, dhaval.giani@oracle.com,
	rohit.k.jain@oracle.com, daniel.m.jordan@oracle.com,
	pavel.tatashin@microsoft.com, matt@codeblueprint.co.uk,
	umgwanakikbuti@gmail.com, riel@redhat.com, jbacik@fb.com,
	juri.lelli@redhat.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 07/10] sched/fair: Provide can_migrate_task_llc
Date: Wed, 31 Oct 2018 11:43:36 -0400	[thread overview]
Message-ID: <7c503b58-6370-df65-51c3-2591bb1cf621@oracle.com> (raw)
In-Reply-To: <b7d91b1d-5da1-5d70-f266-93b108379b64@arm.com>

On 10/29/2018 3:34 PM, Valentin Schneider wrote:
> On 26/10/2018 19:28, Steven Sistare wrote:
>> On 10/26/2018 2:04 PM, Valentin Schneider wrote:
> [...]
>>>
>>> I was thinking that perhaps we could have scenarios where some rq's
>>> keep stealing tasks off of each other and we end up circulating tasks 
>>> between CPUs. Now, that would only happen if we had a handful of tasks
>>> with a very tiny period, and I'm not familiar with (real) such hyperactive
>>> workloads similar to those generated by hackbench where that could happen.
>>
>> That will not happen with the current code, as it only steals if nr_running > 1.
>> The src loses a task, the dst gains it and has nr_running == 1, so it will not
>> be re-stolen.
> 
> That's indeed fine, I was thinking of something like this:
> 
> Suppose you have 2 rq's sharing a workload of 3 tasks. You get one rq with
> nr_running == 1 (r_1) and one rq with nr_running == 2 (r_2).
> 
> As soon as the task on r_1 ends/blocks, we'll go through idle balancing and
> can potentially steal the non-running task from r_2. Sometime later the task
> that was running on r_1 wakes up, and we end up with r_1->nr_running == 2
> and r_2->nr_running == 1.
> 
> IOW we've swapped their role in that example, and the whole thing can
> repeat.
> 
> The shorter the period of those tasks, the more we'll migrate them
> between rq's, hence why I wonder if we shouldn't have some sort of
> throttling.

Stealing is still the right move in this scenario.  Idle cycles become useful
cycles.  The only cost is the CPU time to dequeue from a remote rq and
enqueue on the local rq.  Earlier we discussed skipping try_steal() if avg_idle
is very small, on the order of 10 usec.  I think that type of throttling would 
cover your scenario.  I will add it in my next version.

>> If we modify the code to handle misfits, we may steal when src nr_running == 1,
>> but a fast CPU will only steal the lone task from a slow one, never fast from fast, 
>> and never slow from fast, so no tug of war.
>>
>>> In short, I wonder if we should have task_hot() in there. Drawing a
>>> parallel with load_balance(), even if load-balancing is happening between
>>> rqs of the same LLC, we do go check task_hot(). Have you already experimented
>>> with adding a task_hot() check in here?
>>
>> I tried task_hot, to see if L1/L2 cache warmth matters much on L1/L2/L3 systems, 
>> and it reduced steals and overall performance.
> 
> Mmm so task_hot() mainly implements two mechanisms - the CACHE_HOT_BUDDY
> sched feature and the exec_start threshold.
> 
> The first one should be sidestepped in the stealing case since we won't
> pass (if env->dst_rq->nr_running), that leaves us with the threshold.
> 
> We might want to sidestep it when we are doing balancing within an LLC
> domain (env->sd->flags & SD_SHARE_PKG_RESOURCES) - or use a lower threshold
> in such cases.
> 
> In any case, I think it would make sense to add some LLC conditions to
> task_hot() so that
> - regular load_balance() can also benefit from them

This is probably a good idea (lower threshold for task_hot within LLC).
I would rather see it done as a separate patch, with a separate performance
evaluation, as it will affect all workloads, even those that do not steal.
A load balancing migration when !task_hot() may be performed even when
the dst CPU already has a task to run, so the migration may or may not
improve utilization.  By contrast, a newly idle CPU that does not find
work goes idle and definitely wastes cycles.  Note how
migrate_degrades_locality() chooses migration regardless of preferred node
when the dst is idle:

        /* Leaving a core idle is often worse than degrading locality. */
        if (env->idle != CPU_NOT_IDLE)
                return -1;

I apply the same principle in can_migrate_task_llc().

> - task stealing has at least some sort of throttling
> 
> 
> On a sidenote, I find it a bit odd that the exec_start threshold depends on
> sysctl_sched_migration_cost, which to me is more about idle_balance() cost
> than "how long does it take for a previously run task to go cache cold".

Agreed, but these are all magic numbers anyway :)

>>> I've run some iterations of hackbench (hackbench 2 process 100000) to
>>> investigate this task bouncing, but I didn't really see any of it. That was
>>> just a 4+4 big.LITTLE system though, I'll try to get numbers on a system
>>> with more CPUs.
>>>
>>> ----->8-----
>>>
>>> activations: # of task activations (task starts running)
>>> cpu_migrations: # of activations where cpu != prev_cpu
>>> % stats are percentiles
>>>
>>> - STEAL:
>>>
>>>   | stat  | cpu_migrations | activations |
>>>   |-------+----------------+-------------|
>>>   | count |    2005.000000 | 2005.000000 |
>>>   | mean  |      16.244888 |  290.608479 |
>>>   | std   |      38.963138 |  253.003528 |
>>>   | min   |       0.000000 |    3.000000 |
>>>   | 50%   |       3.000000 |  239.000000 |
>>>   | 75%   |       8.000000 |  436.000000 |
>>>   | 90%   |      45.000000 |  626.000000 |
>>>   | 99%   |     188.960000 | 1073.000000 |
>>>   | max   |     369.000000 | 1417.000000 |
>>>
>>> - NO_STEAL:
>>>
>>>   | stat  | cpu_migrations | activations |
>>>   |-------+----------------+-------------|
>>>   | count |    2005.000000 | 2005.000000 |
>>>   | mean  |      15.260848 |  297.860848 |
>>>   | std   |      46.331890 |  253.210813 |
>>>   | min   |       0.000000 |    3.000000 |
>>>   | 50%   |       3.000000 |  252.000000 |
>>>   | 75%   |       7.000000 |  444.000000 |
>>>   | 90%   |      32.600000 |  643.600000 |
>>>   | 99%   |     214.880000 | 1127.520000 |
>>>   | max   |     467.000000 | 1547.000000 |
>>>
>>> ----->8-----
>>>
>>> Otherwise, my only other concern at the moment is that since stealing
>>> doesn't care about load, we could steal a task that would cause a big
>>> imbalance, which wouldn't have happened with a call to load_balance().
>>>
>>> I don't think this can be triggered with a symmetrical workload like
>>> hackbench, so I'll go explore something else.
>>
>> The dst is about to go idle with zero load, so stealing can only improve the
>> instantaneous balance between src and dst.  For longer term average load, we
>> still rely on periodic load_balance to make adjustments.
> 
> Right, so my line of thinking was that by not doing a load_balance() and
> taking a shortcut (stealing a task), we may end up just postponing a
> load_balance() to after we've stolen a task. I guess in those cases
> there's no magic trick to be found and we just have to deal with it.

In the current code I call idle_balance/load_balance first and then try_steal.
If idle_balance fails because of cost, then it has effectively postponed itself,
independently of stealing.  The next successful call to load_balance will
correct any imbalance caused by stealing.

> And then there's some of the logic like we have in update_sd_pick_busiest()
> where we e.g. try to prevent misfit tasks from running on LITTLEs, but
> then if such tasks are waiting to be run and a LITTLE frees itself up,
> I *think* it's okay to steal it.

Should be OK to steal.  If a BIG subsequently goes idle, load_balance will move
the task to the BIG, or the BIG may steal it when we support misfit stealing.

Questions for you, Valentin:

  - Should misfit stealing be a separate patch, after my series?  I prefer that,
    so we get stealing in peoples hands as soon as possible.  I think separating 
    it is OK because stealing should not cause any regression for misfits, as 
    my code still calls idle_balance/load_balance, which handles misfits.

  - Who should implement misfit stealing -- you, me, someone else?  I have no 
    preference.

- Steve