From mboxrd@z Thu Jan 1 00:00:00 1970
From: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Date: Wed, 03 Sep 2014 14:41:31 +0530
To: Vincent Guittot
CC: Peter Zijlstra, Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
 Rik van Riel, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
 "linaro-kernel@lists.linaro.org", Daniel Lezcano, Dietmar Eggemann
Subject: Re: [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
Message-ID: <5406DB43.1030506@linux.vnet.ibm.com>
References: <1409051215-16788-1-git-send-email-vincent.guittot@linaro.org>
 <1409051215-16788-9-git-send-email-vincent.guittot@linaro.org>
 <54020F00.2030807@linux.vnet.ibm.com>

On 09/01/2014 02:15 PM, Vincent Guittot wrote:
> On 30 August 2014 19:50, Preeti U Murthy wrote:
>> Hi Vincent,
>>> index 18db43e..60ae1ce 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>                 return true;
>>>         }
>>>
>>> +       /*
>>> +        * The group capacity is reduced probably because of activity from other
>>> +        * sched class or interrupts which use part of the available capacity
>>> +        */
>>> +       if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
>>> +                               env->sd->imbalance_pct))
>>
>> Wouldn't the check on avg_load let us know if we are packing more tasks
>> in this group than its capacity? Isn't that the metric we are more
>> interested in?
>
> With this patch, we don't want to pack, but rather prefer to spread the
> task to a CPU other than the one which handles the interrupts, if the
> latter uses a significant part of the CPU capacity.
>
>>
>>> +               return true;
>>> +
>>>         return false;
>>>  }
>>>
>>> @@ -6534,13 +6542,23 @@ static int need_active_balance(struct lb_env *env)
>>>         struct sched_domain *sd = env->sd;
>>>
>>>         if (env->idle == CPU_NEWLY_IDLE) {
>>> +               int src_cpu = env->src_cpu;
>>>
>>>                 /*
>>>                  * ASYM_PACKING needs to force migrate tasks from busy but
>>>                  * higher numbered CPUs in order to pack all tasks in the
>>>                  * lowest numbered CPUs.
>>>                  */
>>> -               if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
>>> +               if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
>>> +                       return 1;
>>> +
>>> +               /*
>>> +                * If the CPUs share their cache and the src_cpu's capacity is
>>> +                * reduced because of other sched_class or IRQs, we trig an
>>> +                * active balance to move the task
>>> +                */
>>> +               if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
>>> +                               sd->imbalance_pct))
>>>                         return 1;
>>>         }
>>>
>>> @@ -6643,6 +6661,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>>
>>>         schedstat_add(sd, lb_imbalance[idle], env.imbalance);
>>>
>>> +       env.src_cpu = busiest->cpu;
>>> +
>>>         ld_moved = 0;
>>>         if (busiest->nr_running > 1) {
>>>                 /*
>>> @@ -6652,7 +6672,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>>                  * correctly treated as an imbalance.
>>>                  */
>>>                 env.flags |= LBF_ALL_PINNED;
>>> -               env.src_cpu = busiest->cpu;
>>>                 env.src_rq = busiest;
>>>                 env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
>>>
>>> @@ -7359,10 +7378,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>>
>>>  /*
>>>   * Current heuristic for kicking the idle load balancer in the presence
>>> - * of an idle cpu is the system.
>>> + * of an idle cpu in the system.
>>>   * - This rq has more than one task.
>>> - * - At any scheduler domain level, this cpu's scheduler group has multiple
>>> - *   busy cpu's exceeding the group's capacity.
>>> + * - This rq has at least one CFS task and the capacity of the CPU is
>>> + *   significantly reduced because of RT tasks or IRQs.
>>> + * - At parent of LLC scheduler domain level, this cpu's scheduler group has
>>> + *   multiple busy cpu.
>>>   * - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
>>>   *   domain span are idle.
>>>   */
>>> @@ -7372,9 +7393,10 @@ static inline int nohz_kick_needed(struct rq *rq)
>>>         struct sched_domain *sd;
>>>         struct sched_group_capacity *sgc;
>>>         int nr_busy, cpu = rq->cpu;
>>> +       bool kick = false;
>>>
>>>         if (unlikely(rq->idle_balance))
>>> -               return 0;
>>> +               return false;
>>>
>>>         /*
>>>          * We may be recently in ticked or tickless idle mode. At the first
>>> @@ -7388,38 +7410,45 @@ static inline int nohz_kick_needed(struct rq *rq)
>>>          * balancing.
>>>          */
>>>         if (likely(!atomic_read(&nohz.nr_cpus)))
>>> -               return 0;
>>> +               return false;
>>>
>>>         if (time_before(now, nohz.next_balance))
>>> -               return 0;
>>> +               return false;
>>>
>>>         if (rq->nr_running >= 2)
>>
>> Will this check ^^ not catch those cases which this patch is targeting?
>
> This patch is not about how many tasks are running, but about whether
> the capacity of the CPU is reduced because of side activity such as
> interrupts, which is only visible in the capacity value (with the
> IRQ_TIME_ACCOUNTING config) but not in nr_running.
> Even if the capacity is reduced because of RT tasks, nothing ensures
> that the RT task will be running when the tick fires.
>
> Regards,
> Vincent
>>
>> Regards
>> Preeti U Murthy
>>
>>> -               goto need_kick;
>>> +               return true;
>>>
>>>         rcu_read_lock();
>>>         sd = rcu_dereference(per_cpu(sd_busy, cpu));
>>> -
>>>         if (sd) {
>>>                 sgc = sd->groups->sgc;
>>>                 nr_busy = atomic_read(&sgc->nr_busy_cpus);
>>>
>>> -               if (nr_busy > 1)
>>> -                       goto need_kick_unlock;
>>> +               if (nr_busy > 1) {
>>> +                       kick = true;
>>> +                       goto unlock;
>>> +               }
>>> +
>>>         }
>>>
>>> -       sd = rcu_dereference(per_cpu(sd_asym, cpu));
>>> +       sd = rcu_dereference(rq->sd);
>>> +       if (sd) {
>>> +               if ((rq->cfs.h_nr_running >= 1) &&
>>> +                               ((rq->cpu_capacity * sd->imbalance_pct) <
>>> +                               (rq->cpu_capacity_orig * 100))) {

Ok, I understand your explanation above. But I was wondering whether you
would need to add this rq->cfs.h_nr_running >= 1 check to the above two
cases as well.

I have actually raised this concern, about whether we should be using
rq->nr_running or cfs_rq->nr_running while we do load balancing, in my
reply to your patch 3. While all our load measurements are about the
cfs_rq load, we use rq->nr_running, which may include tasks from other
sched classes as well, and we divide the load by it to get the average
load per task during load balancing, which is wrong, isn't it?

Similarly, in nohz_kick_needed(), we trigger load balancing based on
rq->nr_running again. In this patch too, even if you know that the cpu
is dominated by tasks that do not belong to the cfs class, you would be
triggering a futile load balance if there are no fair tasks to move.

Regards
Preeti U Murthy
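
To make the arithmetic of the capacity check discussed above concrete,
here is a minimal, self-contained sketch in plain user-space C. It is
not the kernel code itself: it only models the predicate
capacity_orig * 100 > capacity * imbalance_pct that the patch adds in
update_sd_pick_busiest() and need_active_balance(); the value 125 for
imbalance_pct and the 1024 full-capacity scale are assumptions of
typical defaults.

#include <stdio.h>
#include <stdbool.h>

/*
 * Model of the predicate: true once the capacity left for CFS has
 * dropped below 100/imbalance_pct of the original capacity, i.e. once
 * RT tasks or IRQs (visible with IRQ_TIME_ACCOUNTING) consume a
 * significant share of the CPU.
 */
static bool capacity_reduced(unsigned long capacity_orig,
                             unsigned long capacity,
                             unsigned int imbalance_pct)
{
        return capacity_orig * 100 > capacity * imbalance_pct;
}

int main(void)
{
        unsigned long orig = 1024;      /* full-capacity scale (assumed) */
        unsigned int pct = 125;         /* typical imbalance_pct (assumed) */

        /* Threshold is orig * 100 / 125 = 819. */
        printf("capacity 900 reduced? %d\n", capacity_reduced(orig, 900, pct)); /* 0 */
        printf("capacity 700 reduced? %d\n", capacity_reduced(orig, 700, pct)); /* 1 */
        return 0;
}

With imbalance_pct = 125, the check fires once more than roughly 20% of
the CPU's original capacity is eaten by non-CFS activity.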
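
As a way to make the concern about futile kicks concrete, here is a toy,
self-contained sketch of what gating all the kick conditions on the
presence of CFS tasks could look like. It is not the kernel
implementation: struct rq_model, its fields, and nohz_kick_needed_sketch()
are hypothetical stand-ins for the relevant rq fields, used for
illustration only.

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical stand-in for the rq fields referenced in the thread. */
struct rq_model {
        unsigned int nr_running;          /* tasks of all sched classes */
        unsigned int cfs_h_nr_running;    /* CFS tasks only */
        unsigned long cpu_capacity;       /* capacity left for CFS */
        unsigned long cpu_capacity_orig;  /* original capacity */
};

static bool nohz_kick_needed_sketch(const struct rq_model *rq,
                                    unsigned int imbalance_pct)
{
        /* No fair task: a kick could not move anything useful. */
        if (rq->cfs_h_nr_running < 1)
                return false;

        /* More than one runnable task: spreading may help. */
        if (rq->nr_running >= 2)
                return true;

        /* CFS capacity significantly reduced by RT tasks or IRQs. */
        return rq->cpu_capacity * imbalance_pct <
               rq->cpu_capacity_orig * 100;
}

int main(void)
{
        /* CPU dominated by RT/IRQ work but with no CFS task: no kick. */
        struct rq_model rt_only = {
                .nr_running = 2, .cfs_h_nr_running = 0,
                .cpu_capacity = 400, .cpu_capacity_orig = 1024,
        };
        /* One CFS task squeezed by heavy IRQ activity: kick. */
        struct rq_model squeezed = {
                .nr_running = 1, .cfs_h_nr_running = 1,
                .cpu_capacity = 400, .cpu_capacity_orig = 1024,
        };

        printf("rt_only kicks?  %d\n", nohz_kick_needed_sketch(&rt_only, 125));  /* 0 */
        printf("squeezed kicks? %d\n", nohz_kick_needed_sketch(&squeezed, 125)); /* 1 */
        return 0;
}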