From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752143AbdECHzX (ORCPT );
	Wed, 3 May 2017 03:55:23 -0400
Received: from mail-oi0-f53.google.com ([209.85.218.53]:33219 "EHLO
	mail-oi0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751621AbdECHzQ (ORCPT );
	Wed, 3 May 2017 03:55:16 -0400
MIME-Version: 1.0
In-Reply-To: 
References: <20170425184941.GB15593@wtj.duckdns.org>
	<20170425210810.GB20255@wtj.duckdns.org>
	<20170427003020.GD11348@wtj.duckdns.org>
	<20170428161450.GA4628@htj.duckdns.org>
	<20170502205639.GB5335@htj.duckdns.org>
From: Vincent Guittot 
Date: Wed, 3 May 2017 09:54:54 +0200
Message-ID: 
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
To: Tejun Heo 
Cc: Ingo Molnar , Peter Zijlstra , linux-kernel ,
	Linus Torvalds , Mike Galbraith , Paul Turner , Chris Mason ,
	kernel-team@fb.com
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On 3 May 2017 at 09:25, Vincent Guittot wrote:
> On 2 May 2017 at 22:56, Tejun Heo wrote:
>> Hello, Vincent.
>>
>> On Tue, May 02, 2017 at 08:56:52AM +0200, Vincent Guittot wrote:
>>> On 28 April 2017 at 18:14, Tejun Heo wrote:
>>> > I'll follow up in the other subthread but there really is a fundamental
>>> > difference in how we calculate runnable_avg w/ and w/o cgroups.
>>> > Independent of whether we can improve the load balancer further, it is
>>> > an existing bug.
>>>
>>> I'd like to weigh in on that a bit.
>>> runnable_load_avg works correctly as it is because it correctly reflects
>>> the load of the runnable entities at the root domain.
>>> If you start to propagate the runnable_load_avg onto the load_avg of the
>>> group entity, the load will become unstable.
>>> runnable_load_avg was added to fix load_balance being unable to
>>> select the right busiest rq. So the goal is to rely more and more on
>>> load_avg, not the opposite.
>>
>> I have a hard time understanding what you're trying to say here.
>>
>> Without cgroup, the load balancer uses the sum of load_avgs of the
>> running tasks on the queue. As shown by the debug trace, the load
>> balancer repeatedly ends up picking the wrong CPU when cgroup is
>> involved because it ends up including the load_avgs of nested blocked
>> tasks into runnable_load_avg of root - we lose the distinction between
>> running and blocked load_avgs when we pass through a nested cfs_rq.
>>
>> We can further improve the load balancer all we want; for example,
>> right now, we would end up picking a CPU with one task which has a
>> really high weight over another CPU with two normal weight tasks,
>> which isn't ideal. However, there is something obviously broken in the
>> existing mechanism and we want to fix that first, independent of
>> further improvements, and it won't be a good idea to paper over an
>> existing problem with a different mechanism either.
>
> That's probably where I disagree.
> IMO, runnable_load_avg is already a fix to compensate for the fact that,
> when PELT was rewritten, the load balancer was not updated accordingly
> and was unable to make the right choice between 2 groups: one group
> having only 1 task but a higher load, and another group with several
> tasks but a lower load (either because of the blocked load or because
> one runnable task has a far higher load than the others).
> runnable_load_avg was put back to quickly fix the 1st case but it
> doesn't fix the other.
> There were also some issues with propagating load updates on task
> migration, but those have been fixed now.
> So we have 2 use cases with the exact same behavior: a group with a
> higher load (runnable_load_avg or load_avg) but only 1 task, and
> another one with several tasks but a lower load. Right now,
> runnable_load_avg can only fix one of them, at the cost of breaking
> load_avg behavior, whereas one fix in load_balance can fix both
> without breaking load_avg behavior.
> We would be better off working on removing runnable_load_avg
> completely rather than adding more stuff on top of it.
>
>>
>>> >> I always leave time between 2 consecutive runs, and the 10 consecutive
>>> >> runs take around 2 min to execute.
>>> >>
>>> >> Then I have run these 10 consecutive tests several times and the
>>> >> results stay the same.
>>> >
>>> > Can you please try the patch at the end of this message? I'm
>>> > wondering whether you're seeing the errors accumulating from the wrong
>>> > min().
>>>
>>> I still have the regression with the patch below.
>>> The regression comes from using runnable_load_avg for the propagation.
>>> As load_avg becomes null at some point, it breaks the computation of
>>> the shares and the load_avg stays very low.
>>
>> That's surprising given that what the patch does is bringing the
>> cgroup behavior closer to !cgroup behavior. It'd be great to be able
>
> But it also breaks load_avg, which is used to calculate the per-cfs_rq
> share, and this change generates the regression on my board.
>
>> to reproduce the problem and trace it. It looks like the board is
>> pretty standardized. Would the following be equivalent to the one you
>> have?
>>
>> http://a.co/f3dD1lm
>
> Yes, it's this board. I have the previous version but it should not
> change the behavior.
>>
>> If so, I can just buy it, get your test image and repro it here and
>> trace why the regression is happening with the setup. We might be
>> hitting something else.
>
> I have pushed my branch which includes v4.11-rc8 + your patches + some
> debug trace here:
> https://git.linaro.org/people/vincent.guittot/kernel.git/log/?h=test-tejun-patches
>
> and uploaded a trace-cmd trace that shows the regression:
> http://people.linaro.org/~vincent.guittot/trace-tejun-patches

I have also uploaded the trace without your patches so you can compare both:
http://people.linaro.org/~vincent.guittot/trace-v4.8-rc11

>
> As an example, we can easily see around time stamp 94.826469 sec that
> cpu5 is idle while CPU2 and CPU3 share their time between several
> tasks.
> This doesn't happen without your patch.
>
>>
>> Thanks.
>>
>> --
>> tejun
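
To make the disagreement above concrete, here is a minimal standalone
sketch (plain C; the struct, field names and numbers are invented for
illustration and are not the kernel's cfs_rq code) of the two signals
being argued about: a load_avg-style sum that still counts blocked
contributions, versus a runnable_load_avg-style sum that only counts
tasks that are enqueued right now.

/*
 * Toy model only: "toy_cfs_rq" and its fields are illustrative, not
 * the kernel's data structures. It shows how folding a child group's
 * load_avg (runnable + blocked) into the root makes blocked load look
 * runnable, whereas propagating only the runnable part does not.
 */
#include <stdio.h>

struct toy_cfs_rq {
	long runnable_load;	/* load of tasks currently enqueued */
	long blocked_load;	/* load of recently-run, now sleeping tasks */
};

/* load_avg-style signal: blocked contributions are still included */
static long toy_load_avg(const struct toy_cfs_rq *rq)
{
	return rq->runnable_load + rq->blocked_load;
}

int main(void)
{
	/* a child group with one running task and a lot of blocked load */
	struct toy_cfs_rq child = { .runnable_load = 1024, .blocked_load = 3072 };

	/* propagate via load_avg: the root sees 4096 as if it were runnable */
	long root_via_load_avg = toy_load_avg(&child);

	/* propagate only the runnable part: the root sees 1024 */
	long root_via_runnable = child.runnable_load;

	printf("root sees %ld via load_avg, %ld via runnable-only propagation\n",
	       root_via_load_avg, root_via_runnable);
	return 0;
}

Run as-is, the sketch prints 4096 for the load_avg-style path and 1024
for the runnable-only path; that gap is the kind of difference the load
balancer ends up comparing when blocked load is folded into a group
entity's contribution at the root.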