From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vincent Guittot
Date: Fri, 25 Oct 2019 18:51:01 +0200
Subject: Re: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"
To: Doug Smythies
Cc: linux-kernel, "open list:THERMAL", Doug Smythies, Peter Zijlstra,
    Ingo Molnar, Linus Torvalds, Thomas Gleixner, Sargun Dhillon, Tejun Heo,
    Xie XiuQi, xiezhipeng1@huawei.com, Srinivas Pandruvada, Rik van Riel
In-Reply-To: <1572018904-5234-1-git-send-email-dsmythies@telus.net>
References: <1572018904-5234-1-git-send-email-dsmythies@telus.net>
Content-Type: text/plain; charset="UTF-8"
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Doug,
On Fri, 25 Oct 2019 at 17:55, Doug Smythies wrote:
>
> This reverts commit 039ae8bcf7a5f4476f4487e6bf816885fb3fb617,
> which, in turn, was a re-apply of
> commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
> after it was reverted via
> commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")
>
> For an idle system, the cfs_rq_is_decayed function components can underflow
> to 0 and incorrectly return TRUE, when the item should not be deleted from
> the list.

The patch from Rik solves the problem of cfs_rq_is_decayed() wrongly
returning true:
https://lore.kernel.org/lkml/20190906191237.27006-6-riel@surriel.com/

>
> Signed-off-by: Doug Smythies
> Cc: Vincent Guittot
> Cc: Peter Zijlstra (Intel)
> Cc: Ingo Molnar
> Cc: Linus Torvalds
> Cc: Peter Zijlstra
> Cc: Thomas Gleixner
> Cc: sargun@sargun.me
> Cc: tj@kernel.org
> Cc: xiexiuqi@huawei.com
> Cc: xiezhipeng1@huawei.com
> Cc: Srinivas Pandruvada
>
> ---
> Note 1: Both this reversion, and just deleting the cfs_rq_is_decayed function
> and its call while leaving the other changes, have been tested. I do not know
> which solution is better (i.e. for the "list_for_each_entry_rcu" part of it).
>
> Note 2: Previous controversy over this patch was based on heavy workloads,
> but this is based on minimal or no workload, or "idle".
> "Idle" on my test server, with no GUI and many services disabled,
> tends to mean more "idle" than most systems.
>
> Note 3: While this supporting data only involves the intel_pstate CPU
> frequency scaling driver as a casualty, it is beyond my capabilities
> to determine what other tasks that should be running might be omitted.
>
> Use case example 1:
> System idle: the intel_pstate CPU frequency scaling driver:
> Mode: active, non-HWP, powersave governor.
> Expected behaviour: there is never a duration (time between calls to
> the driver, per CPU) longer than 4 seconds (the watchdog time, I think).
> Actual behaviour: there are very long gaps between calls to the driver:
>
> Kernel: 5.4-rc2  CPU: 7
> duration: 327.17 seconds (this is one of many hundreds of examples)
> mperf: 44023326
> aperf: 20716861
> tsc: 1.11604E+12
> load: 0
> CPU frequency: 1.6053 GHz (average over this 327 second sample period)
> old pstate: 16 (the lowest for my processor)
> new pstate: 16
>
> Kernel: 5.4-rc2 + reversion (either method)
> After several hours of testing, maximum durations were never more
> than 4 seconds (well, plus some jitter).
> reversion method: max = 4.07908 seconds
> CPU: 7
> mperf: 492578
> aperf: 231813 (56,829 per second average is consistent with other tests)
> tsc: 13914264074
> load: 0
> CPU frequency: 1.6052 GHz
> old pstate: 16 (the lowest for my processor)
> new pstate: 16
>
> On average, the non-reverted kernel executes the driver 25% less
> than the reverted kernel during idle.
>
> O.K., so who cares, since the requested pstate doesn't change?
> First, one wonders if the math could overflow.
> (Although 7180ddd suggests maybe it won't.)
> Second, the sample is largely dominated by obsolete information.
> Third, this can be problematic, and potentially wastes energy,
> for the busy-to-idle transition.
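
As a rough cross-check of the numbers above (just arithmetic on the reported
counters, assuming the usual APERF/MPERF behaviour where both counters advance
only while the CPU is in C0, MPERF at the TSC rate):

    tsc / duration         = 1.11604e12 / 327.17 s            ~= 3.41 GHz  (TSC rate)
    TSC rate * aperf/mperf = 3.41 GHz * 20716861 / 44023326   ~= 1.61 GHz

which agrees with the reported 1.6053 GHz average, so the per-sample frequency
arithmetic looks sound; the concern is the 327 second gap between driver
callbacks rather than the value computed from it.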
>
> Use case example 2:
> The busy-to-idle transition:
>
> Typically, the pstate request response to a busy-to-idle transition
> is very slow because the duration suddenly goes from, typically,
> 10 milliseconds to much, much longer, up to 4 seconds. Transition
> times to the system being fully idle, with all requested pstates
> at minimum, take around 8 seconds with this reversion,
> and, potentially, a very, very long time (over 100 seconds has been
> measured) without it.
>
> Again, so who cares, if the processor is in a deep idle state anyway,
> not consuming much energy? O.K., but what if it is in an idle state
> where energy consumption is a function of the requested pstate?
> For example, for my processor (i7-2600K) in idle state 1, the processor
> package energy can be over double what it should be for many tens of
> seconds.
>
> Experiment method:
>
> enable only idle state 1
> Dountil stopped:
>     apply a 100% load (all CPUs)
>     after a while (about 50 seconds), remove the load
>     allow a short transient delay (1 second)
>     measure the processor package joules used over the next 149 seconds
> Enduntil
>
> Kernel 5.4-rc2 + reversion (this method):
> Average processor package power: 9.148 watts (128 samples, >7 hours)
> Minimum: 9.02 watts
> Maximum: 9.29 watts
> Note: an outlier data point group was removed, as it was assumed the
> computer had something to do and wasn't actually "idle".
>
> Kernel 5.4-rc2:
> Average processor package power: 9.969 watts (150 samples, >8 hours),
> or 9% more energy for the idle phases of the workload.
> Minimum: 9.15 watts
> Maximum: 13.79 watts (51% more power)
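
For what it's worth, the package-energy side of such a measurement can be read
from user space through the powercap sysfs interface. A minimal sketch,
assuming the intel_rapl driver is loaded and the package domain shows up as
intel-rapl:0 (the path and the 149 second default window are illustrative, not
necessarily what was used above):

/* rapl_watts.c - average package power over an interval via powercap sysfs.
 * Build: cc -O2 -o rapl_watts rapl_watts.c
 * Assumes /sys/class/powercap/intel-rapl:0/energy_uj (package 0) exists.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static long long read_energy_uj(void)
{
	FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
	long long uj;

	if (!f || fscanf(f, "%lld", &uj) != 1) {
		perror("energy_uj");
		exit(1);
	}
	fclose(f);
	return uj;
}

int main(int argc, char **argv)
{
	int seconds = argc > 1 ? atoi(argv[1]) : 149;	/* sample window */
	long long before, after;

	before = read_energy_uj();
	sleep(seconds);
	after = read_energy_uj();

	/* energy_uj is a wrapping counter; just discard wrapped samples */
	if (after < before) {
		fprintf(stderr, "energy counter wrapped, discard sample\n");
		return 1;
	}
	printf("%.3f joules over %d s = %.3f watts average\n",
	       (after - before) / 1e6, seconds,
	       (after - before) / 1e6 / seconds);
	return 0;
}

Running it (as root) right after removing the load and the 1 second settling
delay gives the average package power for the window directly.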
>
> ---
>  kernel/sched/fair.c | 43 +++++++++----------------------------------
>  1 file changed, 9 insertions(+), 34 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 83ab35e..51625b8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -381,10 +381,9 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq)
>  	SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
>  }
>
> -/* Iterate thr' all leaf cfs_rq's on a runqueue */
> -#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)			\
> -	list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list,	\
> -				 leaf_cfs_rq_list)
> +/* Iterate through all cfs_rq's on a runqueue in bottom-up order */
> +#define for_each_leaf_cfs_rq(rq, cfs_rq)	\
> +	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
>
>  /* Do the two (enqueued) entities belong to the same group ? */
>  static inline struct cfs_rq *
> @@ -481,8 +480,8 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq)
>  {
>  }
>
> -#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)	\
> -	for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos)
> +#define for_each_leaf_cfs_rq(rq, cfs_rq)	\
> +	for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
>
>  static inline struct sched_entity *parent_entity(struct sched_entity *se)
>  {
> @@ -7502,27 +7501,10 @@ static inline void update_blocked_load_status(struct rq *rq, bool has_blocked) {
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>
> -static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> -{
> -	if (cfs_rq->load.weight)
> -		return false;
> -
> -	if (cfs_rq->avg.load_sum)
> -		return false;
> -
> -	if (cfs_rq->avg.util_sum)
> -		return false;
> -
> -	if (cfs_rq->avg.runnable_load_sum)
> -		return false;
> -
> -	return true;
> -}
> -
>  static void update_blocked_averages(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
> -	struct cfs_rq *cfs_rq, *pos;
> +	struct cfs_rq *cfs_rq;
>  	const struct sched_class *curr_class;
>  	struct rq_flags rf;
>  	bool done = true;
> @@ -7534,7 +7516,7 @@ static void update_blocked_averages(int cpu)
>  	 * Iterates the task_group tree in a bottom up fashion, see
>  	 * list_add_leaf_cfs_rq() for details.
>  	 */
> -	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
> +	for_each_leaf_cfs_rq(rq, cfs_rq) {
>  		struct sched_entity *se;
>
>  		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
> @@ -7545,13 +7527,6 @@ static void update_blocked_averages(int cpu)
>  		if (se && !skip_blocked_update(se))
>  			update_load_avg(cfs_rq_of(se), se, 0);
>
> -		/*
> -		 * There can be a lot of idle CPU cgroups.  Don't let fully
> -		 * decayed cfs_rqs linger on the list.
> -		 */
> -		if (cfs_rq_is_decayed(cfs_rq))
> -			list_del_leaf_cfs_rq(cfs_rq);
> -
>  		/* Don't need periodic decay once load/util_avg are null */
>  		if (cfs_rq_has_blocked(cfs_rq))
>  			done = false;
> @@ -10444,10 +10419,10 @@ const struct sched_class fair_sched_class = {
>  #ifdef CONFIG_SCHED_DEBUG
>  void print_cfs_stats(struct seq_file *m, int cpu)
>  {
> -	struct cfs_rq *cfs_rq, *pos;
> +	struct cfs_rq *cfs_rq;
>
>  	rcu_read_lock();
> -	for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
> +	for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
>  		print_cfs_rq(m, cpu, cfs_rq);
>  	rcu_read_unlock();
>  }
> --
> 2.7.4
>