Date: Tue, 4 Jul 2023 14:58:48 +0100
From: Qais Yousef
To: Lukasz Luba
Cc: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	rafael@kernel.org, linux-pm@vger.kernel.org, rostedt@goodmis.org,
	mhiramat@kernel.org, mingo@redhat.com, peterz@infradead.org,
	juri.lelli@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, bsegall@google.com, mgorman@suse.de,
	bristot@redhat.com, vschneid@redhat.com, delyank@fb.com,
	qyousef@google.com, kernel test robot
Subject: Re: [RESEND][PATCH v2 3/3] schedutil: trace: Add tracing to capture filter out requests
Message-ID: <20230704135848.jfogmar4o6lmycog@airbuntu>
References: <20230522145702.2419654-1-lukasz.luba@arm.com>
	<20230522145702.2419654-4-lukasz.luba@arm.com>
	<20230531183105.r5tqpdx5axoogkzp@airbuntu>
	<20230630120122.oqy4q42bl5gy6bfd@airbuntu>
	<20230630132507.hggzz5g2odrabii6@airbuntu>

On 07/04/23 09:23, Lukasz Luba wrote:
> Hi Qais,
> 
> On 6/30/23 14:25, Qais Yousef wrote:
> > On 06/30/23 13:01, Qais Yousef wrote:
> > > On 06/20/23 18:52, Lukasz Luba wrote:
> > > > Hi Qais,
> > > > 
> > > > I have somehow missed your feedback on this series.
> > > > 
> > > > On 5/31/23 19:31, Qais Yousef wrote:
> > > > > On 05/22/23 15:57, Lukasz Luba wrote:
> > > > > > Some of the frequency update requests coming from the task scheduler
> > > > > > might be filtered out. This can happen when the previous request was
> > > > > > served not long ago (within a period smaller than the minimum for
> > > > > > frequency updates provided by the cpufreq driver). In such a case, we
> > > > > > want to know that some of the frequency updates could not make it
> > > > > > through.
> > > > > > Export the new tracepoint as well, so that it can be handled by a
> > > > > > toolkit for trace analysis.
> > > > > > 
> > > > > > Reported-by: kernel test robot # solved tricky build
> > > > > > Signed-off-by: Lukasz Luba
> > > > > > ---
> > > > > >  include/trace/events/sched.h     |  4 ++++
> > > > > >  kernel/sched/cpufreq_schedutil.c | 10 ++++++++--
> > > > > >  2 files changed, 12 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> > > > > > index dbfb30809f15..e34b7cd5de73 100644
> > > > > > --- a/include/trace/events/sched.h
> > > > > > +++ b/include/trace/events/sched.h
> > > > > > @@ -739,6 +739,10 @@ DECLARE_TRACE(uclamp_update_tsk_tp,
> > > > > >  	TP_PROTO(struct task_struct *tsk, int uclamp_id, unsigned int value),
> > > > > >  	TP_ARGS(tsk, uclamp_id, value));
> > > > > >  
> > > > > > +DECLARE_TRACE(schedutil_update_filtered_tp,
> > > > > > +	TP_PROTO(int cpu),
> > > > > > +	TP_ARGS(cpu));
> > > > > > +
> > > > > >  #endif /* _TRACE_SCHED_H */
> > > > > >  
> > > > > >  /* This part must be outside protection */
> > > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > > > > index f462496e5c07..4f9daf258a65 100644
> > > > > > --- a/kernel/sched/cpufreq_schedutil.c
> > > > > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > > > > @@ -6,6 +6,8 @@
> > > > > >   * Author: Rafael J. Wysocki
> > > > > >   */
> > > > > >  
> > > > > > +EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp);
> > > > > > +
> > > > > >  #define IOWAIT_BOOST_MIN	(SCHED_CAPACITY_SCALE / 8)
> > > > > >  
> > > > > >  struct sugov_tunables {
> > > > > > @@ -318,8 +320,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
> > > > > >  
> > > > > >  	ignore_dl_rate_limit(sg_cpu);
> > > > > >  
> > > > > > -	if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
> > > > > > +	if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) {
> > > > > > +		trace_schedutil_update_filtered_tp(sg_cpu->cpu);
> > > > > >  		return false;
> > > > > > +	}
> > > > > 
> > > > > Can't we have something more generic here too? Are you interested in
> > > > > counting these events? How do you plan to use it?
> > > > 
> > > > The plan is to record those events, count them and maybe adjust the FW
> > > > if the frequency switching capabilities are too low, e.g. 4ms...
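
FWIW, with the tracepoint symbol exported, counting these per CPU from an
out-of-tree module should be straightforward. A completely untested sketch
(the probe and counter names below are made up, not from the patch):

	#include <linux/module.h>
	#include <linux/percpu.h>
	#include <linux/tracepoint.h>
	#include <trace/events/sched.h>

	static DEFINE_PER_CPU(unsigned long, filtered_count);

	/* 'cpu' is the CPU whose frequency update request was filtered out. */
	static void probe_filtered(void *data, int cpu)
	{
		/* Racy cross-CPU increment, tolerable for rough statistics. */
		per_cpu(filtered_count, cpu)++;
	}

	static int __init filtered_count_init(void)
	{
		/* register_trace_<name>() is generated by DECLARE_TRACE(). */
		return register_trace_schedutil_update_filtered_tp(probe_filtered, NULL);
	}

	static void __exit filtered_count_exit(void)
	{
		unregister_trace_schedutil_update_filtered_tp(probe_filtered, NULL);
		tracepoint_synchronize_unregister();
	}

	module_init(filtered_count_init);
	module_exit(filtered_count_exit);
	MODULE_LICENSE("GPL");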
> > > 
> > > You mean as part of a tuning step for the system, or at runtime? The
> > > latter seems to call for a proper interface instead.
> 
> Not at runtime; the FW change, or maybe even a new uC, would be needed
> for this. Therefore, our client who experiments with the new platform
> can run the analysis and spot this. Then they can change the FW
> if there was an issue, or maybe even upgrade the HW if there are severe
> issues with delivering the needed performance on time (e.g. with high
> display refresh rates and first-frame drops).
> 
> > > 
> > > IMHO I think the current filtering mechanism needs a bit of a massage.
> > > 
> > > One thing I think we must do is to ignore the filter if there's a big
> > > sudden change in requested frequency. Like for instance if a big task
> > > migrates. Then prev_cpu should go to a lower freq sooner, and new_cpu
> > > should change to a higher frequency sooner too. The filtering makes
> > > sense only in a steady state situation where we are ramping up or down
> > > gradually.
> 
> This is kind of a heuristic, which is biased towards mobiles IMO.

How come? Big tasks are not only on mobile; a task with 500+ utilization can
exist on any system.

> > > 
> > > If no one beats me to it, I'll propose something in that regard.
> > > 
> > > > 
> > > > We need those numbers to point out that there is a need for a faster
> > > > FW micro-controller to serve those incoming requests.
> > > 
> > > I think there's a big assumption here that the filter is always set
> > > correctly ;-)
> 
> In our case it is set correctly; the value comes directly from the FW
> [1]. It is the FW+HW limit, so not much to do with this in the kernel.
> 
> > > > > 
> > > > > I think this will be a very noisy event by the way.
> > > > 
> > > > Could be, but on the other hand, for this kind of statistical
> > > > analysis 'the more the better'. It will also depend on the number of
> > > > CPUs in the cluster, e.g. 4 CPUs vs 1 CPU.
> > > > 
> > > > I don't know when we will switch to the per-CPU cpufreq mode
> > > > where all CPUs behave like independent DVFS domains. The Juno
> > > > mainline kernel and FW support that mode. We would have to compare
> > > > those two modes and measure how much we gain/lose when using one and
> > > > not the other.
> > > > 
> > > > Furthermore, we already suspect some of our integration testing for
> > > > EAS-mainline (on Juno) is failing due to filtered-out requests. How
> > > > much that would impact other boards - it would be nice to see in
> > > > traces.
> > > 
> > > Another problem I think we have is that the DVFS headroom value should
> > > be a function of this filter. At the moment it is hardcoded to a random
> > > value which causes power issues.
> 
> It's not a random value, as you can see in [1].

I'm referring to the 25% in map_util_perf().
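That is, this hardcoded headroom (include/linux/sched/cpufreq.h, as of v6.4):

	static inline unsigned long map_util_perf(unsigned long util)
	{
		/* The fixed 25% headroom on top of the raw utilization. */
		return util + (util >> 2);
	}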
> This is the main goal for this $subject - to provide information, backed
> by a proper test, that a FW or HW change is needed. If you'd like to have
> decent performance in your Linux-based solution, faster FW/HW would be
> needed. I don't want to put in more hacks or heuristics which try to work
> around performance issues with the HW. E.g. if someone, instead of a
> 200MHz uC running fast FW, put in a 100MHz uC, then they should get
> quality data from integration tests showing that such a design might not
> work well with recent OSes and apps. Currently those kinds of 'design'
> checks are very hard because they require sophisticated knowledge, and we
> are trying to lower that bar for more engineers.

I think we're talking about different things ;-)

> > > 
> > > So to summarize, I think there are two improvements required (and if
> > > anyone has the time to try them out, go ahead; otherwise I'll get to it):
> > > 
> > > 1. The filter should only be applied if the history hasn't changed,
> > >    i.e. we are gradually increasing or decreasing PELT. Otherwise we
> > >    should honour sudden changes ASAP.
> > > 2. DVFS headroom should be a function of the filter. 25% is too high
> > >    for 500us. It could be too low for 10ms (I don't know).
> > 
> > To expand a bit more, since it's related: our migration margins should
> > depend on the tick value instead of the random magic numbers they are
> > today - more precisely, on balance_interval. If there's a misfit task,
> > then we should upmigrate it at wake-up only if we think it'll become
> > misfit before the load balancer kicks in. Otherwise the load balancer
> > should do the correction if it becomes long-running/misfit. If the
> > sysadmin wants to speed up/slow down migration, it should be through
> > controlling PELT IMO, and not these magic margin values - which are
> > hardcoded to arbitrary values at the moment anyway and are not suitable
> > for every system.
> > 
> > And since we decoupled overutilized from misfit load balancing, I think
> > our definition should improve to better detect when the system needs to
> > disable packing and start spreading. The current check for overutilized,
> > based on misfit, is no longer adequate IMO, especially when there's a
> > single misfit task in the system.
> > 
> > Again, just sharing thoughts in case someone is interested in working on
> > this before I get a chance to share any patches ;-)
> 
> Those are all heuristics, and some of your ideas go well beyond
> the $subject. As I said, the main goal for the $subject is to
> deliver information that the FW/HW might need a re-design (maybe even
> more silicon for the uC).
> 
> I cannot stop you from creating a dedicated thread with your patches,
> though ;)

Fair enough. /me goes away

--
Qais Yousef

> 
> Regards,
> Lukasz
> 
> [1] https://elixir.bootlin.com/linux/v6.4/source/drivers/cpufreq/scmi-cpufreq.c#L224