linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Doug Smythies" <dsmythies@telus.net>
To: "'Vincent Guittot'" <vincent.guittot@linaro.org>
Cc: "'linux-kernel'" <linux-kernel@vger.kernel.org>,
	"'open list:THERMAL'" <linux-pm@vger.kernel.org>,
	"'Peter Zijlstra'" <peterz@infradead.org>,
	"'Ingo Molnar'" <mingo@kernel.org>,
	"'Linus Torvalds'" <torvalds@linux-foundation.org>,
	"'Thomas Gleixner'" <tglx@linutronix.de>,
	"'Sargun Dhillon'" <sargun@sargun.me>,
	"'Tejun Heo'" <tj@kernel.org>,
	"'Xie XiuQi'" <xiexiuqi@huawei.com>, <xiezhipeng1@huawei.com>,
	"'Srinivas Pandruvada'" <srinivas.pandruvada@linux.intel.com>,
	"'Rik van Riel'" <riel@surriel.com>
Subject: RE: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"
Date: Fri, 25 Oct 2019 23:59:41 -0700	[thread overview]
Message-ID: <000c01d58bca$f5709b30$e051d190$@net> (raw)
In-Reply-To: <CAKfTPtDFAS3TiNaaPoEXFZbqdMt_-tfGm9ffVcQAN=Mu_KbRdQ@mail.gmail.com>

Hi Vincent,

Thank you for your quick reply.

On 2010.10.25 09:51 Vincent Guittot wrote:
> On Fri, 25 Oct 2019 at 17:55, Doug Smythies <doug.smythies@gmail.com> wrote:
>>
>> This reverts commit 039ae8bcf7a5f4476f4487e6bf816885fb3fb617,
>> which, in turn, was a re-apply of
>> commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
>> after it was reverted via
>> commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")
>>
>> For an idle system, the cfs_rq_is_decayed function components can underflow to 0 and
>> incorrectly return TRUE, when the item should not be deleted from the list.
>
> The patch from Rik solves the problem of cfs_rq_is_decayed wrongly returns true
> https://lore.kernel.org/lkml/20190906191237.27006-6-riel@surriel.com/

Not for my use case.

I applied Rik's patch to kernel 5.4-rc2 (since all my other reference
test data had been acquired against a base of 5.4-rc2). Tests results
are similar to the non-reverted kernel (results added in-line
below).

>>
>> Signed-off-by: Doug Smythies <dsmythies@telus.net>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Cc: Ingo Molnar <mingo@kernel.org>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: sargun@sargun.me
>> Cc: tj@kernel.org
>> Cc: xiexiuqi@huawei.com
>> Cc: xiezhipeng1@huawei.com
>> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
>>
>> ---
>> Note 1: Both this reversion and just deleting the cfs_rq_is_decayed function
>> and it's call and leaving the other changes have been tested. I do not know
>> which solution is better. (ie for the "list_for_each_entry_rcu" part of it.)
>>
>> Note 2: Previous controversy over this patch was based on heavy workloads,
>> but this is based on minimal or no workload, or "idle".
>> Where "idle" on my test server, with no gui and many services disabled,
>> tends to mean more "idle" than most systems.
>>
>> Note 3: While this supporting data only involves the intel_pstate CPU
>> frequency scaling driver as a casualty, it is beyond my capabilities
>> to determine what other tasks that should be running might be omitted.
>>
>> Use case example 1:
>> System Idle: The intel pstate CPU frequency scaling driver:
>> Mode: Active, non-hwp, powersave governor.
>> Expected behaviour: There is never ever a duration (time between calls to
>> the driver / per CPU) longer than 4 seconds (the watchdog time, I think).
>> Actual behaviour: There are long long gaps between calls to the driver:
>>
>> Kernel: 5.4-rc2 CPU:7
>> duration: 327.17 Seconds. (this is one of many hundreds of examples.)
>> mpref: 44023326
>> apref: 20716861
>> tsc: 1.11604E+12
>> load: 0
>> CPU frequency: 1.6053 GHz (average over this 327 second sample period).
>> old pstate: 16 (the lowest for my processor)
>> new pstate: 16
>>
>> Kernel: 5.4-rc2 + reversion (either method)
>> After several hours of testing, maximum durations were never more
>> than 4 seconds (well plus some jitter).
>> reversion method: max=4.07908 seconds
>> CPU:7
>> mperf: 492578
>> apref: 231813 (56,829 per second average is consistent with other tests)
>> tsc: 13914264074
>> load: 0
>> CPU frequency: 1.6052 GHz
>> old pstate: 16 (the lowest for my precessor)
>> new pstate: 16
>>
>> On average, the non-reverted kernel executes the driver 25% less
>> than the reverted kernel during idle.

On (shorter)average, the Rik patched kernel executes the driver
14% less than the reverted kernel during idle.

Longer and repeated testing would be required to determine if
this is a trend or simply non-repeatable noise.

>> O.K. so who cares, the requested pstate doesn't change?
>> First, one wonders if the math could overflow.
>> (although 7180ddd suggests maybe it won't)
>> Second, the sample is largely dominated by obsolete information.
>> Third, this can be problematic, and potentially wastes energy,
>> for the busy to idle transition.
>>
>> Use case example 2:
>> The busy to idle transition:
>>
>> Typically, the pstate request response to a busy to idle transition
>> is very slow because the duration suddenly goes from, typically,
>> 10 milliseconds to much much longer, up to 4 seconds. Transition
>> times to the system being fully idle, with all requested pstates
>> being at minimum, takes around 8 seconds with this reversion,
>> and, potentially, a very very long time (over 100 seconds has been
>> measured) without.
>>
>> Again, so who cares, if the processor is in a deep idle state anyway,
>> not consuming much energy? O.K. but what if it is in an idle state
>> where energy consumption is a function of the requested pstate?
>> For example, for my processor (i7-2600K), idle state 1, then processor
>> package energy can be over double what it should be for many 10s of
>> seconds.
>>
>> Experiment method:
>>
>> enable only idle state 1
>> Dountil stopped
>>   apply a 100% load (all CPUs)
>>   after awhile (about 50 seconds) remove the load.
>>   allow a short transient delay (1 second).
>>   measure the processor package joules used over the next 149 seconds.
>> Enduntil
>>
>> Kernel k5.4-rc2 + reversion (this method)
>> Average processor package power: 9.148 watts (128 samples, > 7 hours)
>> Minimum: 9.02 watts
>> Maximum: 9.29 watts
>> Note: outlyer data point group removed, as it was assumed the computer
>> had something to do and wasn't actually "idle".
>>
>> Kernel 5.4-rc2:
>> Average processor package power: 9.969 watts (150 samples, > 8 hours)
>> Or 9% more energy for the idle phases of the work load.
>> Minimum: 9.15 watts
>> Maximum: 13.79 watts (51% more power)

Kernel 5.4-rc2 + Rik-patch:
Average processor package power: 9.85 watts (53 samples, < 3 hours)
Or 7.7% more energy for the idle phases of the work load.
Minimum: 9.23 watts
Maximum: 12.79 watts (40% more power)




  reply	other threads:[~2019-10-26  6:59 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-25 15:55 [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path" Doug Smythies
2019-10-25 16:51 ` Vincent Guittot
2019-10-26  6:59   ` Doug Smythies [this message]
2019-10-28  8:21     ` Vincent Guittot
2019-10-29 14:55       ` Doug Smythies
2019-10-29 15:36         ` Peter Zijlstra
2019-10-29 16:20           ` Vincent Guittot
2019-10-29 16:49             ` Peter Zijlstra
2019-10-29 17:00               ` Vincent Guittot
2019-10-29 17:09                 ` Vincent Guittot
2019-10-29 16:02         ` Vincent Guittot
2019-10-30 14:04           ` Doug Smythies
2019-10-30 15:27             ` Vincent Guittot
2019-11-08  9:18               ` Vincent Guittot
2019-11-09 16:47                 ` Doug Smythies
2019-11-10 15:13                   ` Vincent Guittot
2019-10-29 19:34 ` Srinivas Pandruvada
2019-10-29 19:59   ` Srinivas Pandruvada
2019-10-30 15:54     ` Doug Smythies

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='000c01d58bca$f5709b30$e051d190$@net' \
    --to=dsmythies@telus.net \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=riel@surriel.com \
    --cc=sargun@sargun.me \
    --cc=srinivas.pandruvada@linux.intel.com \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=vincent.guittot@linaro.org \
    --cc=xiexiuqi@huawei.com \
    --cc=xiezhipeng1@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).