From: Valentin Schneider
To: Steven Sistare, mingo@redhat.com, peterz@infradead.org
Cc: subhra.mazumdar@oracle.com, dhaval.giani@oracle.com, rohit.k.jain@oracle.com,
    daniel.m.jordan@oracle.com, pavel.tatashin@microsoft.com, matt@codeblueprint.co.uk,
    umgwanakikbuti@gmail.com, riel@redhat.com, jbacik@fb.com, juri.lelli@redhat.com,
    linux-kernel@vger.kernel.org
Subject: Re: [PATCH 07/10] sched/fair: Provide can_migrate_task_llc
Date: Mon, 29 Oct 2018 19:34:50 +0000

On 26/10/2018 19:28, Steven Sistare wrote:
> On 10/26/2018 2:04 PM, Valentin Schneider wrote:
[...]
>>
>> I was thinking that perhaps we could have scenarios where some rq's
>> keep stealing tasks off of each other and we end up circulating tasks
>> between CPUs. Now, that would only happen if we had a handful of tasks
>> with a very tiny period, and I'm not familiar with any (real) hyperactive
>> workloads similar to those generated by hackbench where that could happen.
>
> That will not happen with the current code, as it only steals if nr_running > 1.
> The src loses a task, the dst gains it and has nr_running == 1, so it will not
> be re-stolen.
>

That's indeed fine; I was thinking of something like this:

Suppose you have 2 rq's sharing a workload of 3 tasks. You get one rq with
nr_running == 1 (r_1) and one rq with nr_running == 2 (r_2). As soon as the
task on r_1 ends/blocks, we'll go through idle balancing and can potentially
steal the non-running task from r_2.
Sometime later the task that was running on r_1 wakes up, and we end up with
r_1->nr_running == 2 and r_2->nr_running == 1. IOW we've swapped their roles
in that example, and the whole thing can repeat. The shorter the period of
those tasks, the more we'll migrate them between rq's, which is why I wonder
if we shouldn't have some sort of throttling.

> If we modify the code to handle misfits, we may steal when src nr_running == 1,
> but a fast CPU will only steal the lone task from a slow one, never fast from fast,
> and never slow from fast, so no tug of war.
>
>> In short, I wonder if we should have task_hot() in there. Drawing a
>> parallel with load_balance(), even if load-balancing is happening between
>> rqs of the same LLC, we do go check task_hot(). Have you already experimented
>> with adding a task_hot() check in here?
>
> I tried task_hot, to see if L1/L2 cache warmth matters much on L1/L2/L3 systems,
> and it reduced steals and overall performance.
>

Mmm, so task_hot() mainly implements two mechanisms - the CACHE_HOT_BUDDY
sched feature and the exec_start threshold. The first one should be
sidestepped in the stealing case, since we won't pass the
"if (env->dst_rq->nr_running)" check, which leaves us with the threshold.
We might want to sidestep that as well when we are doing balancing within
an LLC domain (env->sd->flags & SD_SHARE_PKG_RESOURCES) - or use a lower
threshold in such cases.

In any case, I think it would make sense to add some LLC conditions to
task_hot() (rough sketch at the bottom of this mail) so that

- regular load_balance() can also benefit from them
- task stealing has at least some sort of throttling

On a sidenote, I find it a bit odd that the exec_start threshold depends on
sysctl_sched_migration_cost, which to me is more about idle_balance() cost
than "how long does it take for a previously run task to go cache cold".

>> I've run some iterations of hackbench (hackbench 2 process 100000) to
>> investigate this task bouncing, but I didn't really see any of it. That was
>> just a 4+4 big.LITTLE system though, I'll try to get numbers on a system
>> with more CPUs.
>>
>> ----->8-----
>>
>> activations: # of task activations (task starts running)
>> cpu_migrations: # of activations where cpu != prev_cpu
>> % stats are percentiles
>>
>> - STEAL:
>>
>> | stat  | cpu_migrations | activations |
>> |-------+----------------+-------------|
>> | count | 2005.000000    | 2005.000000 |
>> | mean  | 16.244888      | 290.608479  |
>> | std   | 38.963138      | 253.003528  |
>> | min   | 0.000000       | 3.000000    |
>> | 50%   | 3.000000       | 239.000000  |
>> | 75%   | 8.000000       | 436.000000  |
>> | 90%   | 45.000000      | 626.000000  |
>> | 99%   | 188.960000     | 1073.000000 |
>> | max   | 369.000000     | 1417.000000 |
>>
>> - NO_STEAL:
>>
>> | stat  | cpu_migrations | activations |
>> |-------+----------------+-------------|
>> | count | 2005.000000    | 2005.000000 |
>> | mean  | 15.260848      | 297.860848  |
>> | std   | 46.331890      | 253.210813  |
>> | min   | 0.000000       | 3.000000    |
>> | 50%   | 3.000000       | 252.000000  |
>> | 75%   | 7.000000       | 444.000000  |
>> | 90%   | 32.600000      | 643.600000  |
>> | 99%   | 214.880000     | 1127.520000 |
>> | max   | 467.000000     | 1547.000000 |
>>
>> ----->8-----
>>
>> Otherwise, my only other concern at the moment is that since stealing
>> doesn't care about load, we could steal a task that would cause a big
>> imbalance, which wouldn't have happened with a call to load_balance().
>>
>> I don't think this can be triggered with a symmetrical workload like
>> hackbench, so I'll go explore something else.
>
> The dst is about to go idle with zero load, so stealing can only improve the
> instantaneous balance between src and dst. For longer term average load, we
> still rely on periodic load_balance to make adjustments.
>

Right, so my line of thinking was that by not doing a load_balance() and
taking a shortcut (stealing a task), we may end up just postponing a
load_balance() to after we've stolen a task. I guess in those cases there's
no magic trick to be found and we just have to deal with it.

And then there's some of the logic we have in update_sd_pick_busiest() where
we e.g. try to prevent misfit tasks from running on LITTLEs, but then if such
tasks are waiting to be run and a LITTLE frees itself up, I *think* it's okay
to steal them.

> All good questions, keep them coming.
>
> - Steve
>
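
For reference, here's the very rough shape of the task_hot() tweak I have in
mind. This is only a sketch, not against any particular tree: the existing
early bail-outs and the CACHE_HOT_BUDDY check are elided, and
sysctl_sched_llc_migration_cost is a made-up knob (it doesn't exist today),
just to show where a lower intra-LLC threshold would plug in:

----->8-----

/*
 * Sketch only: keep the exec_start-based hotness check, but use a cheaper
 * (hypothetical) threshold when src and dst share a cache, so that both
 * load_balance() within an LLC and task stealing get some throttling
 * without paying the full cross-LLC migration cost.
 */
static int task_hot(struct task_struct *p, struct lb_env *env)
{
	u64 cost = sysctl_sched_migration_cost;
	s64 delta;

	/* ... fair_sched_class / SCHED_IDLE / CACHE_HOT_BUDDY checks as today ... */

	/* src and dst share a cache: use a lower hotness threshold */
	if (env->sd->flags & SD_SHARE_PKG_RESOURCES)
		cost = sysctl_sched_llc_migration_cost;

	delta = rq_clock_task(env->src_rq) - p->se.exec_start;

	return delta < (s64)cost;
}

----->8-----

The steal path would of course need something playing the role of env->sd
(i.e. the LLC domain) to make use of this, but that's the general idea.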