Subject: Re: [PATCH 07/10] sched/fair: Provide can_migrate_task_llc
To: Valentin Schneider, mingo@redhat.com, peterz@infradead.org
Cc: subhra.mazumdar@oracle.com, dhaval.giani@oracle.com, rohit.k.jain@oracle.com,
    daniel.m.jordan@oracle.com, pavel.tatashin@microsoft.com, matt@codeblueprint.co.uk,
    umgwanakikbuti@gmail.com, riel@redhat.com, jbacik@fb.com, juri.lelli@redhat.com,
    linux-kernel@vger.kernel.org
References: <1540220381-424433-1-git-send-email-steven.sistare@oracle.com>
 <1540220381-424433-8-git-send-email-steven.sistare@oracle.com>
From: Steven Sistare
Organization: Oracle Corporation
Message-ID: <7c503b58-6370-df65-51c3-2591bb1cf621@oracle.com>
Date: Wed, 31 Oct 2018 11:43:36 -0400

On 10/29/2018 3:34 PM, Valentin Schneider wrote:
> On 26/10/2018 19:28, Steven Sistare wrote:
>> On 10/26/2018 2:04 PM, Valentin Schneider wrote:
> [...]
>>>
>>> I was thinking that perhaps we could have scenarios where some rq's
>>> keep stealing tasks off of each other and we end up circulating tasks
>>> between CPUs. Now, that would only happen if we had a handful of tasks
>>> with a very tiny period, and I'm not familiar with (real) such hyperactive
>>> workloads similar to those generated by hackbench where that could happen.
>>
>> That will not happen with the current code, as it only steals if nr_running > 1.
>> The src loses a task, the dst gains it and has nr_running == 1, so it will not
>> be re-stolen.
>
> That's indeed fine, I was thinking of something like this:
>
> Suppose you have 2 rq's sharing a workload of 3 tasks. You get one rq with
> nr_running == 1 (r_1) and one rq with nr_running == 2 (r_2).
>
> As soon as the task on r_1 ends/blocks, we'll go through idle balancing and
> can potentially steal the non-running task from r_2. Sometime later the task
> that was running on r_1 wakes up, and we end up with r_1->nr_running == 2
> and r_2->nr_running == 1.
>
> IOW we've swapped their role in that example, and the whole thing can
> repeat.
>
> The shorter the period of those tasks, the more we'll migrate them
> between rq's, hence why I wonder if we shouldn't have some sort of
> throttling.

Stealing is still the right move in this scenario. Idle cycles become useful
cycles. The only cost is the CPU time to dequeue from a remote rq and enqueue
on the local rq. Earlier we discussed skipping try_steal() if avg_idle is very
small, on the order of 10 usec. I think that type of throttling would cover
your scenario. I will add it in my next version.
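For concreteness, a minimal sketch of the throttle I have in mind; the cutoff
value and the names around the try_steal() call are placeholders, not final
code:

	/* Placeholder cutoff: ~10 usec of expected idle time, in ns. */
	#define STEAL_IDLE_MIN_NS	10000

	/*
	 * Only attempt to steal if this CPU expects to stay idle long
	 * enough to amortize the remote dequeue + local enqueue.
	 * rq->avg_idle is the same estimate that idle_balance() consults.
	 */
	if (!pulled_task && this_rq->avg_idle >= STEAL_IDLE_MIN_NS)
		pulled_task = try_steal(this_rq, rf);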
>> If we modify the code to handle misfits, we may steal when src nr_running == 1,
>> but a fast CPU will only steal the lone task from a slow one, never fast from fast,
>> and never slow from fast, so no tug of war.
>>
>>> In short, I wonder if we should have task_hot() in there. Drawing a
>>> parallel with load_balance(), even if load-balancing is happening between
>>> rqs of the same LLC, we do go check task_hot(). Have you already experimented
>>> with adding a task_hot() check in here?
>>
>> I tried task_hot, to see if L1/L2 cache warmth matters much on L1/L2/L3 systems,
>> and it reduced steals and overall performance.
>
> Mmm so task_hot() mainly implements two mechanisms - the CACHE_HOT_BUDDY
> sched feature and the exec_start threshold.
>
> The first one should be sidestepped in the stealing case since we won't
> pass (if env->dst_rq->nr_running), that leaves us with the threshold.
>
> We might want to sidestep it when we are doing balancing within an LLC
> domain (env->sd->flags & SD_SHARE_PKG_RESOURCES) - or use a lower threshold
> in such cases.
>
> In any case, I think it would make sense to add some LLC conditions to
> task_hot() so that
> - regular load_balance() can also benefit from them

This is probably a good idea (lower threshold for task_hot within LLC). I would
rather see it done as a separate patch, with a separate performance evaluation,
as it will affect all workloads, even those that do not steal.

A load balancing migration when !task_hot() may be performed even when the dst
CPU already has a task to run, so the migration may or may not improve
utilization. By contrast, a newly idle CPU that does not find work goes idle
and definitely wastes cycles. Note how migrate_degrades_locality() chooses
migration regardless of preferred node when the dst is idle:

	/* Leaving a core idle is often worse than degrading locality. */
	if (env->idle != CPU_NOT_IDLE)
		return -1;

I apply the same principle in can_migrate_task_llc().
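For reference, a condensed sketch of the idea behind that check (simplified;
the patch itself is the authoritative version). Because the destination CPU is
newly idle, cache warmth is deliberately ignored and only hard constraints are
checked:

	static bool can_migrate_task_llc(struct task_struct *p, struct rq *rq,
					 struct rq *dst_rq)
	{
		int dst_cpu = dst_rq->cpu;

		/* Respect CFS bandwidth throttling on either side. */
		if (throttled_lb_pair(task_group(p), cpu_of(rq), dst_cpu))
			return false;

		/* Respect the task's CPU affinity. */
		if (!cpumask_test_cpu(dst_cpu, &p->cpus_allowed))
			return false;

		/* Cannot migrate a task that is currently running. */
		if (task_running(rq, p))
			return false;

		/* No task_hot(): leaving dst idle costs more than lost warmth. */
		return true;
	}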
> - task stealing has at least some sort of throttling
>
>
> On a sidenote, I find it a bit odd that the exec_start threshold depends on
> sysctl_sched_migration_cost, which to me is more about idle_balance() cost
> than "how long does it take for a previously run task to go cache cold".

Agreed, but these are all magic numbers anyway :)

>>> I've run some iterations of hackbench (hackbench 2 process 100000) to
>>> investigate this task bouncing, but I didn't really see any of it. That was
>>> just a 4+4 big.LITTLE system though, I'll try to get numbers on a system
>>> with more CPUs.
>>>
>>> ----->8-----
>>>
>>> activations: # of task activations (task starts running)
>>> cpu_migrations: # of activations where cpu != prev_cpu
>>> % stats are percentiles
>>>
>>> - STEAL:
>>>
>>> | stat  | cpu_migrations | activations |
>>> |-------+----------------+-------------|
>>> | count |    2005.000000 | 2005.000000 |
>>> | mean  |      16.244888 |  290.608479 |
>>> | std   |      38.963138 |  253.003528 |
>>> | min   |       0.000000 |    3.000000 |
>>> | 50%   |       3.000000 |  239.000000 |
>>> | 75%   |       8.000000 |  436.000000 |
>>> | 90%   |      45.000000 |  626.000000 |
>>> | 99%   |     188.960000 | 1073.000000 |
>>> | max   |     369.000000 | 1417.000000 |
>>>
>>> - NO_STEAL:
>>>
>>> | stat  | cpu_migrations | activations |
>>> |-------+----------------+-------------|
>>> | count |    2005.000000 | 2005.000000 |
>>> | mean  |      15.260848 |  297.860848 |
>>> | std   |      46.331890 |  253.210813 |
>>> | min   |       0.000000 |    3.000000 |
>>> | 50%   |       3.000000 |  252.000000 |
>>> | 75%   |       7.000000 |  444.000000 |
>>> | 90%   |      32.600000 |  643.600000 |
>>> | 99%   |     214.880000 | 1127.520000 |
>>> | max   |     467.000000 | 1547.000000 |
>>>
>>> ----->8-----
>>>
>>> Otherwise, my only other concern at the moment is that since stealing
>>> doesn't care about load, we could steal a task that would cause a big
>>> imbalance, which wouldn't have happened with a call to load_balance().
>>>
>>> I don't think this can be triggered with a symmetrical workload like
>>> hackbench, so I'll go explore something else.
>>
>> The dst is about to go idle with zero load, so stealing can only improve the
>> instantaneous balance between src and dst. For longer term average load, we
>> still rely on periodic load_balance to make adjustments.
>
> Right, so my line of thinking was that by not doing a load_balance() and
> taking a shortcut (stealing a task), we may end up just postponing a
> load_balance() to after we've stolen a task. I guess in those cases
> there's no magic trick to be found and we just have to deal with it.

In the current code I call idle_balance/load_balance first and then try_steal.
If idle_balance fails because of cost, then it has effectively postponed
itself, independently of stealing. The next successful call to load_balance
will correct any imbalance caused by stealing.

> And then there's some of the logic like we have in update_sd_pick_busiest()
> where we e.g. try to prevent misfit tasks from running on LITTLEs, but
> then if such tasks are waiting to be run and a LITTLE frees itself up,
> I *think* it's okay to steal it.

Should be OK to steal. If a BIG subsequently goes idle, load_balance will move
the task to the BIG, or the BIG may steal it when we support misfit stealing.

Questions for you, Valentin:
  - Should misfit stealing be a separate patch, after my series? I prefer
    that, so we get stealing into people's hands as soon as possible. I think
    separating it is OK because stealing should not cause any regression for
    misfits, as my code still calls idle_balance/load_balance, which handles
    misfits.
  - Who should implement misfit stealing -- you, me, someone else? I have no
    preference.

- Steve