In-Reply-To: <20200608123102.6sdhdhit7lac5cfl@e107158-lin.cambridge.arm.com>
From: Vincent Guittot
Date: Tue, 9 Jun 2020 19:10:07 +0200
Subject: Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
To: Qais Yousef
Cc: Mel Gorman, Patrick Bellasi, Dietmar Eggemann, Peter Zijlstra, Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli, Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook, Iurii Zaikin, Quentin Perret,
Valentin Schneider, Pavan Kondeti, linux-doc@vger.kernel.org, linux-kernel, linux-fs

On Mon, 8 Jun 2020 at 14:31, Qais Yousef wrote:
>
> On 06/04/20 14:14, Vincent Guittot wrote:
>
> [...]
>
> > I have tried your patch and I don't see any difference compared to
> > previous tests. Let me give you more details of my setup:
> > I create 3 levels of cgroups and usually run the tests at the 4 levels
> > (which includes root). The results above are for the root level.
> >
> > But I see a difference at the other levels:
> >
> >                           root            level 1         level 2         level 3
> >
> > /w patch uclamp disable   50097           46615           43806           41078
> > tip uclamp enable         48706 (-2.78%)  45583 (-2.21%)  42851 (-2.18%)  40313 (-1.86%)
> > /w patch uclamp enable    48882 (-2.43%)  45774 (-1.80%)  43108 (-1.59%)  40667 (-1.00%)
> >
> > Whereas tip with uclamp stays around 2% behind tip without uclamp, the
> > diff of uclamp with your patch tends to decrease as we increase the
> > number of levels.
>
> So I did try to dig more into this, but I think it's either not a good
> reproducer, or what we're observing here is uArch-level latencies caused by
> the new code that produce a bigger knock-on effect than they really should.
>
> First, CONFIG_FAIR_GROUP_SCHED is 'expensive', for some definition of
> expensive..

yes, enabling CONFIG_FAIR_GROUP_SCHED adds an overhead

> *** uclamp disabled/fair group enabled ***
>
> # Executed 50000 pipe operations between two threads
> Total time: 0.958 [sec]
>
>       19.177100 usecs/op
>           52145 ops/sec
>
> *** uclamp disabled/fair group disabled ***
>
> # Executed 50000 pipe operations between two threads
> Total time: 0.808 [sec]
>
>       16.176200 usecs/op
>           61819 ops/sec
>
> So there's a 15.6% drop in ops/sec when enabling this option. I think it's
> good to look at the absolute number of usecs/op: fair group adds around
> 3 usecs/op.
>
> I dropped FAIR_GROUP_SCHED from my config to eliminate this overhead and
> focus solely on uclamp overhead.

Have you checked that both tests run at the root level? Your function-graph
log below shows several calls to update_cfs_group(), which means that your
trace was not taken at the root level but most probably at the 3rd level, and
I wonder if you used the same setup for running the benchmark above. This
could explain such a huge difference, because I don't see such a difference
on my platform, but more around 2%:

For uclamp disable/fair group enable/function graph enable:  47994 ops/sec
For uclamp disable/fair group disable/function graph enable: 49107 ops/sec

> With uclamp enabled but no fair group I get:
>
> *** uclamp enabled/fair group disabled ***
>
> # Executed 50000 pipe operations between two threads
> Total time: 0.856 [sec]
>
>       17.125740 usecs/op
>           58391 ops/sec
>
> The drop is 5.5% in ops/sec, or 1 usec/op.
>
> I don't know what the expectation is here. 1 us could be a lot, but I don't
> think we expect the new code to take more than a few 100s of ns anyway. If
> you add potential caching effects, reaching 1 us wouldn't be that hard.
>
> Note that in my runs I chose the performance governor and used `taskset 0x2`

You might want to set 2 CPUs in your cpumask instead of 1 in order to have
1 CPU for each thread.

> to force running on a big core to make sure the runs are repeatable.

I also use the performance governor but don't pin tasks because I use SMP.

> On Juno-r2 I managed to scrape back most of the 1 us with the below patch.
> It seems there was weird branching behavior that affects the I$ in my case.
> It'd be good to try it out to see if it makes a difference for you.
The perf results are slightly worse on my setup:

For uclamp enable/fair group disable/function graph enable: 48413 ops/sec
with the patch below:                                       47804 ops/sec

> The I$ effect is my best educated guess. Perf doesn't catch this path and
> I couldn't convince it to look at cache and branch misses between 2 specific
> points.
>
> Other subtle code shuffling did have a weird effect on the result too. One
> worthy one is that making uclamp_rq_dec() noinline gains back ~400 ns.
> Making uclamp_rq_inc() noinline *too* cancels this gain out :-/
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0464569f26a7..0835ee20a3c7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1071,13 +1071,11 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
>
>  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>  {
> -	enum uclamp_id clamp_id;
> -
>  	if (unlikely(!p->sched_class->uclamp_enabled))
>  		return;
>
> -	for_each_clamp_id(clamp_id)
> -		uclamp_rq_inc_id(rq, p, clamp_id);
> +	uclamp_rq_inc_id(rq, p, UCLAMP_MIN);
> +	uclamp_rq_inc_id(rq, p, UCLAMP_MAX);
>
>  	/* Reset clamp idle holding when there is one RUNNABLE task */
>  	if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
> @@ -1086,13 +1084,11 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>
>  static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
>  {
> -	enum uclamp_id clamp_id;
> -
>  	if (unlikely(!p->sched_class->uclamp_enabled))
>  		return;
>
> -	for_each_clamp_id(clamp_id)
> -		uclamp_rq_dec_id(rq, p, clamp_id);
> +	uclamp_rq_dec_id(rq, p, UCLAMP_MIN);
> +	uclamp_rq_dec_id(rq, p, UCLAMP_MAX);
>  }
>
>  static inline void
>
> FWIW I fail to see activate/deactivate_task in perf record. They don't show
> up in the list, which means this micro benchmark doesn't stress them as
> Mel's test does.

Strange, because I have been able to trace them.
> Worth noting that I did try running the same test on a 2 vCPU VirtualBox VM
> and a 64 vCPU qemu and I couldn't spot a difference when uclamp was
> enabled/disabled in these 2 environments.
>
> > Besides this, it's also interesting to notice the ~6% perf impact
> > between each level for the same image.
>
> Besides my observation above, I captured this function_graph when
> FAIR_GROUP_SCHED is enabled. What I pasted below is a particularly bad
> deactivation; it's not always that costly.
>
> This run was recorded with uclamp disabled.
>
> I admit I don't know how much of these numbers is ftrace overhead. When
> trying to capture similar runs for uclamp, the numbers didn't add up
> compared to running the test without ftrace generating the graph. If Juno
> is suffering from bad branching costs in this path, then I suspect ftrace
> will amplify this, as AFAIU it'll cause extra jumps on entry and exit.
>
> sched-pipe-6532 [001] 9407.276302: funcgraph_entry:             | deactivate_task() {
> sched-pipe-6532 [001] 9407.276302: funcgraph_entry:             | dequeue_task_fair() {
> sched-pipe-6532 [001] 9407.276303: funcgraph_entry:             | update_curr() {
> sched-pipe-6532 [001] 9407.276304: funcgraph_entry:  0.780 us   | update_min_vruntime();
> sched-pipe-6532 [001] 9407.276306: funcgraph_entry:             | cpuacct_charge() {
> sched-pipe-6532 [001] 9407.276306: funcgraph_entry:  0.820 us   | __rcu_read_lock();
> sched-pipe-6532 [001] 9407.276308: funcgraph_entry:  0.740 us   | __rcu_read_unlock();
> sched-pipe-6532 [001] 9407.276309: funcgraph_exit:   3.980 us   | }
> sched-pipe-6532 [001] 9407.276310: funcgraph_entry:  0.720 us   | __rcu_read_lock();
> sched-pipe-6532 [001] 9407.276312: funcgraph_entry:  0.720 us   | __rcu_read_unlock();
> sched-pipe-6532 [001] 9407.276313: funcgraph_exit:   9.840 us   | }
> sched-pipe-6532 [001] 9407.276314: funcgraph_entry:             | __update_load_avg_se() {
> sched-pipe-6532 [001] 9407.276315: funcgraph_entry:  0.720 us   | __accumulate_pelt_segments();
> sched-pipe-6532 [001] 9407.276316: funcgraph_exit:   2.260 us   | }
> sched-pipe-6532 [001] 9407.276317: funcgraph_entry:             | __update_load_avg_cfs_rq() {
> sched-pipe-6532 [001] 9407.276318: funcgraph_entry:  0.860 us   | __accumulate_pelt_segments();
> sched-pipe-6532 [001] 9407.276319: funcgraph_exit:   2.340 us   | }
> sched-pipe-6532 [001] 9407.276320: funcgraph_entry:  0.760 us   | clear_buddies();
> sched-pipe-6532 [001] 9407.276321: funcgraph_entry:  0.800 us   | account_entity_dequeue();
> sched-pipe-6532 [001] 9407.276323: funcgraph_entry:  0.720 us   | update_cfs_group();
> sched-pipe-6532 [001] 9407.276324: funcgraph_entry:  0.740 us   | update_min_vruntime();
> sched-pipe-6532 [001] 9407.276326: funcgraph_entry:  0.720 us   | set_next_buddy();
> sched-pipe-6532 [001] 9407.276327: funcgraph_entry:             | __update_load_avg_se() {
> sched-pipe-6532 [001] 9407.276328: funcgraph_entry:  0.740 us   | __accumulate_pelt_segments();
> sched-pipe-6532 [001] 9407.276329: funcgraph_exit:   2.220 us   | }
> sched-pipe-6532 [001] 9407.276330: funcgraph_entry:             | __update_load_avg_cfs_rq() {
> sched-pipe-6532 [001] 9407.276331: funcgraph_entry:  0.740 us   | __accumulate_pelt_segments();
> sched-pipe-6532 [001] 9407.276332: funcgraph_exit:   2.180 us   | }
> sched-pipe-6532 [001] 9407.276333: funcgraph_entry:             | update_cfs_group() {
> sched-pipe-6532 [001] 9407.276334: funcgraph_entry:             | reweight_entity() {
> sched-pipe-6532 [001] 9407.276335: funcgraph_entry:             | update_curr() {
> sched-pipe-6532 [001] 9407.276335: funcgraph_entry:  0.720 us   | __calc_delta();
> sched-pipe-6532 [001] 9407.276337: funcgraph_entry:  0.740 us   | update_min_vruntime();
> sched-pipe-6532 [001] 9407.276338: funcgraph_exit:   3.560 us   | }
> sched-pipe-6532 [001] 9407.276339: funcgraph_entry:  0.720 us   | account_entity_dequeue();
> sched-pipe-6532 [001] 9407.276340: funcgraph_entry:  0.720 us   | account_entity_enqueue();
> sched-pipe-6532 [001] 9407.276342: funcgraph_exit:   7.860 us   | }
> sched-pipe-6532 [001] 9407.276342: funcgraph_exit:   9.280 us   | }
> sched-pipe-6532 [001] 9407.276343: funcgraph_entry:             | __update_load_avg_se() {
> sched-pipe-6532 [001] 9407.276344: funcgraph_entry:  0.720 us   | __accumulate_pelt_segments();
> sched-pipe-6532 [001] 9407.276345: funcgraph_exit:   2.180 us   | }
> sched-pipe-6532 [001] 9407.276346: funcgraph_entry:             | __update_load_avg_cfs_rq() {
> sched-pipe-6532 [001] 9407.276347: funcgraph_entry:  0.740 us   | __accumulate_pelt_segments();
> sched-pipe-6532 [001] 9407.276348: funcgraph_exit:   2.180 us   | }
> sched-pipe-6532 [001] 9407.276349: funcgraph_entry:             | update_cfs_group() {
> sched-pipe-6532 [001] 9407.276350: funcgraph_entry:             | reweight_entity() {
> sched-pipe-6532 [001] 9407.276350: funcgraph_entry:             | update_curr() {
> sched-pipe-6532 [001] 9407.276351: funcgraph_entry:  0.740 us   | __calc_delta();
> sched-pipe-6532 [001] 9407.276353: funcgraph_entry:  0.720 us   | update_min_vruntime();
> sched-pipe-6532 [001] 9407.276354: funcgraph_exit:   3.580 us   | }
> sched-pipe-6532 [001] 9407.276355: funcgraph_entry:  0.740 us   | account_entity_dequeue();
> sched-pipe-6532 [001] 9407.276356: funcgraph_entry:  0.720 us   | account_entity_enqueue();
> sched-pipe-6532 [001] 9407.276358: funcgraph_exit:   7.960 us   | }
> sched-pipe-6532 [001] 9407.276358: funcgraph_exit:   9.400 us   | }
> sched-pipe-6532 [001] 9407.276360: funcgraph_entry:             | __update_load_avg_se() {
> sched-pipe-6532 [001] 9407.276360: funcgraph_entry:  0.740 us   | __accumulate_pelt_segments();
> sched-pipe-6532 [001] 9407.276362: funcgraph_exit:   2.220 us   | }
> sched-pipe-6532 [001] 9407.276362: funcgraph_entry:             | __update_load_avg_cfs_rq() {
> sched-pipe-6532 [001] 9407.276363: funcgraph_entry:  0.740 us   | __accumulate_pelt_segments();
> sched-pipe-6532 [001] 9407.276365: funcgraph_exit:   2.160 us   | }
> sched-pipe-6532 [001] 9407.276366: funcgraph_entry:             | update_cfs_group() {
> sched-pipe-6532 [001] 9407.276367: funcgraph_entry:             | reweight_entity() {
> sched-pipe-6532 [001] 9407.276368: funcgraph_entry:             | update_curr() {
> sched-pipe-6532 [001] 9407.276368: funcgraph_entry:  0.720 us   | __calc_delta();
> sched-pipe-6532 [001] 9407.276370: funcgraph_entry:  0.720 us   | update_min_vruntime();
> sched-pipe-6532 [001] 9407.276371: funcgraph_exit:   3.540 us   | }
> sched-pipe-6532 [001] 9407.276372: funcgraph_entry:  0.740 us   | account_entity_dequeue();
> sched-pipe-6532 [001] 9407.276373: funcgraph_entry:  0.720 us   | account_entity_enqueue();
> sched-pipe-6532 [001] 9407.276375: funcgraph_exit:   7.840 us   | }
> sched-pipe-6532 [001] 9407.276375: funcgraph_exit:   9.300 us   | }
> sched-pipe-6532 [001] 9407.276376: funcgraph_entry:  0.720 us   | hrtick_update();
> sched-pipe-6532 [001] 9407.276377: funcgraph_exit: + 75.000 us  | }
> sched-pipe-6532 [001] 9407.276378: funcgraph_exit: + 76.700 us  | }
>
> Cheers
>
> --
> Qais Yousef