Date: Wed, 24 Oct 2018 12:03:23 +0200
From: Peter Zijlstra
To: Srikar Dronamraju
Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Yi Wang,
 zhong.weidong@zte.com.cn, Yi Liu, Frederic Weisbecker, Thomas Gleixner
Subject: Re: [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs
Message-ID: <20181024100323.GO3109@worktop.c.hoisthospitality.com>
In-Reply-To: <1540350169-18581-1-git-send-email-srikar@linux.vnet.ibm.com>

On Wed, Oct 24, 2018 at 08:32:49AM +0530, Srikar Dronamraju wrote:
> Load balancer and NUMA balancer are not supposed to work on isolcpus.
> 
> Currently when setting sched affinity, there are no checks to see if the
> requested cpumask has CPUs from both isolcpus and housekeeping CPUs.
> 
> If a user passes a mix of isolcpus and housekeeping CPUs, then the
> NUMA balancer can pick an isolcpu to schedule on.
> With this change, if a combination of isolcpus and housekeeping CPUs is
> provided, then we restrict ourselves to housekeeping CPUs.

I'm still not liking this much. This adds more special cases for
isolcpus.

Also, I don't believe in correcting silly users; give 'em rope and show
them how to tie the knot.

Where does the numa balancer pick the 'wrong' CPU?

task_numa_migrate() checks to see if the task is currently part of an
SD_NUMA domain, otherwise it doesn't do anything. This means your
housekeeping mask spans multiple nodes to begin with, right?

But after that we seem to ignore the sched domains entirely;
task_numa_find_cpu() only tests cpus_allowed.
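FWIW, the setup the changelog talks about is nothing more than a task
whose affinity mask mixes housekeeping CPUs with isolcpus. A minimal
userspace sketch of that -- untested, and the CPU numbers (0-3
housekeeping, 4-7 isolated) are invented, adjust to whatever isolcpus=
the box actually boots with:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        cpu_set_t set;
        int cpu;

        CPU_ZERO(&set);
        for (cpu = 0; cpu <= 7; cpu++)  /* 0-3 housekeeping, 4-7 isolcpus= */
                CPU_SET(cpu, &set);

        if (sched_setaffinity(0, sizeof(set), &set)) {  /* 0 == this task */
                perror("sched_setaffinity");
                return 1;
        }

        for (;;) {      /* watch where the task ends up running */
                printf("running on CPU %d\n", sched_getcpu());
                sleep(1);
        }
}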
It appears to me the for_each_online_node() iteration in
task_numa_migrate() needs an additional test to see if the selected
node has any CPUs in the relevant sched_domain _at_all_.

A little something like the below -- except we also need to do
something about cpu_active_mask. Not been near a compiler.

---
 kernel/sched/fair.c | 36 ++++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee271bb661cc..287ef7f0203b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1497,6 +1497,8 @@ struct task_numa_env {
         struct task_struct *best_task;
         long best_imp;
         int best_cpu;
+
+        cpumask_var_t cpus;
 };
 
 static void task_numa_assign(struct task_numa_env *env,
@@ -1704,7 +1706,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
          */
         maymove = !load_too_imbalanced(src_load, dst_load, env);
 
-        for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
+        for_each_cpu_and(cpu, cpumask_of_node(env->dst_nid), env->cpus) {
                 /* Skip this CPU if the source task cannot migrate */
                 if (!cpumask_test_cpu(cpu, &env->p->cpus_allowed))
                         continue;
@@ -1734,6 +1736,9 @@ static int task_numa_migrate(struct task_struct *p)
         int nid, ret, dist;
         long taskimp, groupimp;
 
+        if (!alloc_cpumask_var(&env.cpus, GFP_KERNEL))
+                return -ENOMEM;
+
         /*
          * Pick the lowest SD_NUMA domain, as that would have the smallest
          * imbalance and would be the first to start moving tasks about.
@@ -1744,20 +1749,23 @@ static int task_numa_migrate(struct task_struct *p)
          */
         rcu_read_lock();
         sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
-        if (sd)
-                env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
-        rcu_read_unlock();
-
         /*
          * Cpusets can break the scheduler domain tree into smaller
          * balance domains, some of which do not cross NUMA boundaries.
          * Tasks that are "trapped" in such domains cannot be migrated
          * elsewhere, so there is no point in (re)trying.
          */
-        if (unlikely(!sd)) {
+        if (!sd) {
                 sched_setnuma(p, task_node(p));
-                return -EINVAL;
+                rcu_read_unlock();
+                ret = -EINVAL;
+                goto out;
         }
 
+        env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+        while (sd->parent)
+                sd = sd->parent;
+        cpumask_copy(env.cpus, sched_domain_span(sd));
+        rcu_read_unlock();
 
         env.dst_nid = p->numa_preferred_nid;
         dist = env.dist = node_distance(env.src_nid, env.dst_nid);
@@ -1783,6 +1791,9 @@ static int task_numa_migrate(struct task_struct *p)
                         if (nid == env.src_nid || nid == p->numa_preferred_nid)
                                 continue;
 
+                        if (!cpumask_intersects(cpumask_of_node(nid), env.cpus))
+                                continue;
+
                         dist = node_distance(env.src_nid, env.dst_nid);
                         if (sched_numa_topology_type == NUMA_BACKPLANE &&
                                                 dist != env.dist) {
@@ -1822,8 +1833,10 @@ static int task_numa_migrate(struct task_struct *p)
         }
 
         /* No better CPU than the current one was found. */
-        if (env.best_cpu == -1)
-                return -EAGAIN;
+        if (env.best_cpu == -1) {
+                ret = -EAGAIN;
+                goto out;
+        }
 
         best_rq = cpu_rq(env.best_cpu);
         if (env.best_task == NULL) {
@@ -1831,7 +1844,7 @@ static int task_numa_migrate(struct task_struct *p)
                 WRITE_ONCE(best_rq->numa_migrate_on, 0);
                 if (ret != 0)
                         trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
-                return ret;
+                goto out;
         }
 
         ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu);
@@ -1840,6 +1853,9 @@ static int task_numa_migrate(struct task_struct *p)
         WRITE_ONCE(best_rq->numa_migrate_on, 0);
         if (ret != 0)
                 trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
         put_task_struct(env.best_task);
+
+out:
+        free_cpumask_var(env.cpus);
         return ret;
 }
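For illustration only: the node filter added above is just a cpumask
intersection test. A stand-alone userspace mock of the same idea (the
per-node masks and the domain span below are invented, not read from
the machine):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

#define NR_FAKE_NODES 2

int main(void)
{
        cpu_set_t node_mask[NR_FAKE_NODES], domain_span, tmp;
        int node, cpu;

        /* pretend node 0 holds CPUs 0-3 and node 1 holds CPUs 4-7 */
        for (node = 0; node < NR_FAKE_NODES; node++) {
                CPU_ZERO(&node_mask[node]);
                for (cpu = node * 4; cpu < node * 4 + 4; cpu++)
                        CPU_SET(cpu, &node_mask[node]);
        }

        /* pretend the task's NUMA sched_domain only spans CPUs 0-3 */
        CPU_ZERO(&domain_span);
        for (cpu = 0; cpu < 4; cpu++)
                CPU_SET(cpu, &domain_span);

        /* the userspace equivalent of the cpumask_intersects() test */
        for (node = 0; node < NR_FAKE_NODES; node++) {
                CPU_AND(&tmp, &node_mask[node], &domain_span);
                if (!CPU_COUNT(&tmp)) {
                        printf("node %d: no CPUs in the domain span, skip\n", node);
                        continue;
                }
                printf("node %d: candidate for task_numa_find_cpu()\n", node);
        }

        return 0;
}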