linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RESEND][PATCH v2 0/3] Add basic tracing for uclamp and schedutil
@ 2023-05-22 14:56 Lukasz Luba
  2023-05-22 14:57 ` [RESEND][PATCH v2 1/3] sched/tp: Add new tracepoint to track uclamp set from user-space Lukasz Luba
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Lukasz Luba @ 2023-05-22 14:56 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, rafael, linux-pm
  Cc: rostedt, mhiramat, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, vschneid, delyank,
	lukasz.luba, qyousef, qyousef

Hi all,

The task scheduler feature uclamp is beginning to take off. To better
understand the dynamics between the task scheduler and CPU frequency
requests we need better tracing.
In schedutil (the cpufreq governor) the scheduler is allowed to trigger
a frequency change. However, there is a limit on how often this can
happen: the minimum period is provided by the cpufreq driver. Thus, some
of the cpufreq requests might be filtered out and the frequency won't be
changed (hopefully it will be set a bit later). We would like to know
about those situations, especially in the context of the user-space
hints made via uclamp for particular tasks.
This patch set aims to add a base for our toolkits and post-processing
trace analysis.

Changelog:
v2:
- solved the issue from the CI build warning: dropped schedutil.h and
  re-used sched.h, which is available in build_utility.c where
  cpufreq_schedutil.c is included
- added a tag to the last patch 3/3 for the CI robot's helping hand
- re-based on top of v6.4-rc1
v1:
- implementation can be found here [1]


Regards,
Lukasz Luba

[1] https://lore.kernel.org/lkml/20230322151843.14390-1-lukasz.luba@arm.com/


Lukasz Luba (3):
  sched/tp: Add new tracepoint to track uclamp set from user-space
  cpufreq: schedutil: Refactor sugov_update_shared() internals
  schedutil: trace: Add tracing to capture filter out requests

 include/trace/events/sched.h     |  8 ++++++++
 kernel/sched/core.c              |  5 +++++
 kernel/sched/cpufreq_schedutil.c | 28 ++++++++++++++++++----------
 3 files changed, 31 insertions(+), 10 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RESEND][PATCH v2 1/3] sched/tp: Add new tracepoint to track uclamp set from user-space
  2023-05-22 14:56 [RESEND][PATCH v2 0/3] Add basic tracing for uclamp and schedutil Lukasz Luba
@ 2023-05-22 14:57 ` Lukasz Luba
       [not found]   ` <20230531182629.nztie5rwhjl53v3d@airbuntu>
  2023-05-22 14:57 ` [RESEND][PATCH v2 2/3] cpufreq: schedutil: Refactor sugov_update_shared() internals Lukasz Luba
  2023-05-22 14:57 ` [RESEND][PATCH v2 3/3] schedutil: trace: Add tracing to capture filter out requests Lukasz Luba
  2 siblings, 1 reply; 18+ messages in thread
From: Lukasz Luba @ 2023-05-22 14:57 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, rafael, linux-pm
  Cc: rostedt, mhiramat, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, vschneid, delyank,
	lukasz.luba, qyousef, qyousef

User-space can set the uclamp value for a given task, which impacts task
placement decisions made by the scheduler. This is very useful information:
it helps to understand system behavior and to track improvements in
middleware and applications which start using the uclamp mechanisms and
report better performance in tests.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/trace/events/sched.h | 4 ++++
 kernel/sched/core.c          | 5 +++++
 2 files changed, 9 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index fbb99a61f714..dbfb30809f15 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -735,6 +735,10 @@ DECLARE_TRACE(sched_update_nr_running_tp,
 	TP_PROTO(struct rq *rq, int change),
 	TP_ARGS(rq, change));
 
+DECLARE_TRACE(uclamp_update_tsk_tp,
+	TP_PROTO(struct task_struct *tsk, int uclamp_id,  unsigned int value),
+	TP_ARGS(tsk, uclamp_id, value));
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 944c3ae39861..7b9b800ebb6c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -114,6 +114,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_overutilized_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_cfs_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
+EXPORT_TRACEPOINT_SYMBOL_GPL(uclamp_update_tsk_tp);
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
@@ -1956,12 +1957,16 @@ static void __setscheduler_uclamp(struct task_struct *p,
 	    attr->sched_util_min != -1) {
 		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
 			      attr->sched_util_min, true);
+		trace_uclamp_update_tsk_tp(p, UCLAMP_MIN,
+					   attr->sched_util_min);
 	}
 
 	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX &&
 	    attr->sched_util_max != -1) {
 		uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
 			      attr->sched_util_max, true);
+		trace_uclamp_update_tsk_tp(p, UCLAMP_MAX,
+					   attr->sched_util_max);
 	}
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RESEND][PATCH v2 2/3] cpufreq: schedutil: Refactor sugov_update_shared() internals
  2023-05-22 14:56 [RESEND][PATCH v2 0/3] Add basic tracing for uclamp and schedutil Lukasz Luba
  2023-05-22 14:57 ` [RESEND][PATCH v2 1/3] sched/tp: Add new tracepoint to track uclamp set from user-space Lukasz Luba
@ 2023-05-22 14:57 ` Lukasz Luba
  2023-06-20 17:36   ` Rafael J. Wysocki
  2023-05-22 14:57 ` [RESEND][PATCH v2 3/3] schedutil: trace: Add tracing to capture filter out requests Lukasz Luba
  2 siblings, 1 reply; 18+ messages in thread
From: Lukasz Luba @ 2023-05-22 14:57 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, rafael, linux-pm
  Cc: rostedt, mhiramat, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, vschneid, delyank,
	lukasz.luba, qyousef, qyousef

Remove the 'if' section block. Use a simple check to bail out and jump
to the unlock label at the end. That makes the code more readable and
ready for future tracing.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 kernel/sched/cpufreq_schedutil.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e3211455b203..f462496e5c07 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -446,17 +446,19 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
 
 	ignore_dl_rate_limit(sg_cpu);
 
-	if (sugov_should_update_freq(sg_policy, time)) {
-		next_f = sugov_next_freq_shared(sg_cpu, time);
+	if (!sugov_should_update_freq(sg_policy, time))
+		goto unlock;
 
-		if (!sugov_update_next_freq(sg_policy, time, next_f))
-			goto unlock;
+	next_f = sugov_next_freq_shared(sg_cpu, time);
+
+	if (!sugov_update_next_freq(sg_policy, time, next_f))
+		goto unlock;
+
+	if (sg_policy->policy->fast_switch_enabled)
+		cpufreq_driver_fast_switch(sg_policy->policy, next_f);
+	else
+		sugov_deferred_update(sg_policy);
 
-		if (sg_policy->policy->fast_switch_enabled)
-			cpufreq_driver_fast_switch(sg_policy->policy, next_f);
-		else
-			sugov_deferred_update(sg_policy);
-	}
 unlock:
 	raw_spin_unlock(&sg_policy->update_lock);
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RESEND][PATCH v2 3/3] schedutil: trace: Add tracing to capture filter out requests
  2023-05-22 14:56 [RESEND][PATCH v2 0/3] Add basic tracing for uclamp and schedutil Lukasz Luba
  2023-05-22 14:57 ` [RESEND][PATCH v2 1/3] sched/tp: Add new tracepoint to track uclamp set from user-space Lukasz Luba
  2023-05-22 14:57 ` [RESEND][PATCH v2 2/3] cpufreq: schedutil: Refactor sugov_update_shared() internals Lukasz Luba
@ 2023-05-22 14:57 ` Lukasz Luba
  2023-06-20 17:40   ` Rafael J. Wysocki
       [not found]   ` <20230531183105.r5tqpdx5axoogkzp@airbuntu>
  2 siblings, 2 replies; 18+ messages in thread
From: Lukasz Luba @ 2023-05-22 14:57 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, rafael, linux-pm
  Cc: rostedt, mhiramat, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, vschneid, delyank,
	lukasz.luba, qyousef, qyousef, kernel test robot

Some of the frequency update requests coming from the task scheduler
might be filtered out. This can happen when the previous request was
served not long ago (within a period smaller than the minimum
frequency-update interval provided by the cpufreq driver). In such
cases, we want to know which frequency updates did not make it through.
Export the new tracepoint as well, so that it can be handled by a
toolkit for trace analysis.

Reported-by: kernel test robot <lkp@intel.com> # solved tricky build
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/trace/events/sched.h     |  4 ++++
 kernel/sched/cpufreq_schedutil.c | 10 ++++++++--
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index dbfb30809f15..e34b7cd5de73 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -739,6 +739,10 @@ DECLARE_TRACE(uclamp_update_tsk_tp,
 	TP_PROTO(struct task_struct *tsk, int uclamp_id,  unsigned int value),
 	TP_ARGS(tsk, uclamp_id, value));
 
+DECLARE_TRACE(schedutil_update_filtered_tp,
+	TP_PROTO(int cpu),
+	TP_ARGS(cpu));
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index f462496e5c07..4f9daf258a65 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -6,6 +6,8 @@
  * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  */
 
+EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp);
+
 #define IOWAIT_BOOST_MIN	(SCHED_CAPACITY_SCALE / 8)
 
 struct sugov_tunables {
@@ -318,8 +320,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
 
 	ignore_dl_rate_limit(sg_cpu);
 
-	if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
+	if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) {
+		trace_schedutil_update_filtered_tp(sg_cpu->cpu);
 		return false;
+	}
 
 	sugov_get_util(sg_cpu);
 	sugov_iowait_apply(sg_cpu, time, max_cap);
@@ -446,8 +450,10 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
 
 	ignore_dl_rate_limit(sg_cpu);
 
-	if (!sugov_should_update_freq(sg_policy, time))
+	if (!sugov_should_update_freq(sg_policy, time)) {
+		trace_schedutil_update_filtered_tp(sg_cpu->cpu);
 		goto unlock;
+	}
 
 	next_f = sugov_next_freq_shared(sg_cpu, time);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RESEND][PATCH v2 2/3] cpufreq: schedutil: Refactor sugov_update_shared() internals
  2023-05-22 14:57 ` [RESEND][PATCH v2 2/3] cpufreq: schedutil: Refactor sugov_update_shared() internals Lukasz Luba
@ 2023-06-20 17:36   ` Rafael J. Wysocki
  0 siblings, 0 replies; 18+ messages in thread
From: Rafael J. Wysocki @ 2023-06-20 17:36 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-trace-kernel, rafael, linux-pm, rostedt,
	mhiramat, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, vschneid, delyank,
	qyousef, qyousef

On Mon, May 22, 2023 at 4:57 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Remove the 'if' section block. Use a simple check to bail out and jump
> to the unlock label at the end. That makes the code more readable and
> ready for future tracing.
>
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  kernel/sched/cpufreq_schedutil.c | 20 +++++++++++---------
>  1 file changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index e3211455b203..f462496e5c07 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -446,17 +446,19 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
>
>         ignore_dl_rate_limit(sg_cpu);
>
> -       if (sugov_should_update_freq(sg_policy, time)) {
> -               next_f = sugov_next_freq_shared(sg_cpu, time);
> +       if (!sugov_should_update_freq(sg_policy, time))
> +               goto unlock;
>
> -               if (!sugov_update_next_freq(sg_policy, time, next_f))
> -                       goto unlock;
> +       next_f = sugov_next_freq_shared(sg_cpu, time);
> +
> +       if (!sugov_update_next_freq(sg_policy, time, next_f))
> +               goto unlock;
> +
> +       if (sg_policy->policy->fast_switch_enabled)
> +               cpufreq_driver_fast_switch(sg_policy->policy, next_f);
> +       else
> +               sugov_deferred_update(sg_policy);
>
> -               if (sg_policy->policy->fast_switch_enabled)
> -                       cpufreq_driver_fast_switch(sg_policy->policy, next_f);
> -               else
> -                       sugov_deferred_update(sg_policy);
> -       }
>  unlock:
>         raw_spin_unlock(&sg_policy->update_lock);
>  }
> --

The first patch in the series needs some feedback from the scheduler
people, but I can apply this one right away if you want me to.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RESEND][PATCH v2 3/3] schedutil: trace: Add tracing to capture filter out requests
  2023-05-22 14:57 ` [RESEND][PATCH v2 3/3] schedutil: trace: Add tracing to capture filter out requests Lukasz Luba
@ 2023-06-20 17:40   ` Rafael J. Wysocki
  2023-06-20 18:08     ` Lukasz Luba
       [not found]   ` <20230531183105.r5tqpdx5axoogkzp@airbuntu>
  1 sibling, 1 reply; 18+ messages in thread
From: Rafael J. Wysocki @ 2023-06-20 17:40 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-trace-kernel, rafael, linux-pm, rostedt,
	mhiramat, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, vschneid, delyank,
	qyousef, qyousef, kernel test robot

On Mon, May 22, 2023 at 4:57 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Some of the frequency update requests coming from the task scheduler
> might be filtered out. This can happen when the previous request was
> served not long ago (within a period smaller than the minimum
> frequency-update interval provided by the cpufreq driver). In such
> cases, we want to know which frequency updates did not make it through.
> Export the new tracepoint as well, so that it can be handled by a
> toolkit for trace analysis.
>
> Reported-by: kernel test robot <lkp@intel.com> # solved tricky build
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  include/trace/events/sched.h     |  4 ++++
>  kernel/sched/cpufreq_schedutil.c | 10 ++++++++--
>  2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index dbfb30809f15..e34b7cd5de73 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -739,6 +739,10 @@ DECLARE_TRACE(uclamp_update_tsk_tp,
>         TP_PROTO(struct task_struct *tsk, int uclamp_id,  unsigned int value),
>         TP_ARGS(tsk, uclamp_id, value));
>
> +DECLARE_TRACE(schedutil_update_filtered_tp,
> +       TP_PROTO(int cpu),
> +       TP_ARGS(cpu));
> +
>  #endif /* _TRACE_SCHED_H */
>
>  /* This part must be outside protection */
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index f462496e5c07..4f9daf258a65 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -6,6 +6,8 @@
>   * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>   */
>
> +EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp);
> +
>  #define IOWAIT_BOOST_MIN       (SCHED_CAPACITY_SCALE / 8)
>
>  struct sugov_tunables {
> @@ -318,8 +320,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
>
>         ignore_dl_rate_limit(sg_cpu);
>
> -       if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
> +       if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) {
> +               trace_schedutil_update_filtered_tp(sg_cpu->cpu);

It looks like the tracepoint can be added to
sugov_should_update_freq() for less code duplication.

>                 return false;
> +       }
>
>         sugov_get_util(sg_cpu);
>         sugov_iowait_apply(sg_cpu, time, max_cap);
> @@ -446,8 +450,10 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
>
>         ignore_dl_rate_limit(sg_cpu);
>
> -       if (!sugov_should_update_freq(sg_policy, time))
> +       if (!sugov_should_update_freq(sg_policy, time)) {
> +               trace_schedutil_update_filtered_tp(sg_cpu->cpu);
>                 goto unlock;
> +       }
>
>         next_f = sugov_next_freq_shared(sg_cpu, time);
>
> --

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RESEND][PATCH v2 3/3] schedutil: trace: Add tracing to capture filter out requests
       [not found]   ` <20230531183105.r5tqpdx5axoogkzp@airbuntu>
@ 2023-06-20 17:52     ` Lukasz Luba
  2023-06-30 12:01       ` Qais Yousef
  0 siblings, 1 reply; 18+ messages in thread
From: Lukasz Luba @ 2023-06-20 17:52 UTC (permalink / raw)
  To: Qais Yousef
  Cc: linux-kernel, linux-trace-kernel, rafael, linux-pm, rostedt,
	mhiramat, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, vschneid, delyank,
	qyousef, kernel test robot

Hi Qais,

I have somehow missed your feedback on this series.

On 5/31/23 19:31, Qais Yousef wrote:
> On 05/22/23 15:57, Lukasz Luba wrote:
>> Some of the frequency update requests coming from the task scheduler
>> might be filtered out. This can happen when the previous request was
>> served not long ago (within a period smaller than the minimum
>> frequency-update interval provided by the cpufreq driver). In such
>> cases, we want to know which frequency updates did not make it through.
>> Export the new tracepoint as well, so that it can be handled by a
>> toolkit for trace analysis.
>>
>> Reported-by: kernel test robot <lkp@intel.com> # solved tricky build
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   include/trace/events/sched.h     |  4 ++++
>>   kernel/sched/cpufreq_schedutil.c | 10 ++++++++--
>>   2 files changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
>> index dbfb30809f15..e34b7cd5de73 100644
>> --- a/include/trace/events/sched.h
>> +++ b/include/trace/events/sched.h
>> @@ -739,6 +739,10 @@ DECLARE_TRACE(uclamp_update_tsk_tp,
>>   	TP_PROTO(struct task_struct *tsk, int uclamp_id,  unsigned int value),
>>   	TP_ARGS(tsk, uclamp_id, value));
>>   
>> +DECLARE_TRACE(schedutil_update_filtered_tp,
>> +	TP_PROTO(int cpu),
>> +	TP_ARGS(cpu));
>> +
>>   #endif /* _TRACE_SCHED_H */
>>   
>>   /* This part must be outside protection */
>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>> index f462496e5c07..4f9daf258a65 100644
>> --- a/kernel/sched/cpufreq_schedutil.c
>> +++ b/kernel/sched/cpufreq_schedutil.c
>> @@ -6,6 +6,8 @@
>>    * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>    */
>>   
>> +EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp);
>> +
>>   #define IOWAIT_BOOST_MIN	(SCHED_CAPACITY_SCALE / 8)
>>   
>>   struct sugov_tunables {
>> @@ -318,8 +320,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
>>   
>>   	ignore_dl_rate_limit(sg_cpu);
>>   
>> -	if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
>> +	if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) {
>> +		trace_schedutil_update_filtered_tp(sg_cpu->cpu);
>>   		return false;
>> +	}
> 
> Can't we have something more generic here too? Are you interested in
> counting these events? How do you plan to use it?

The plan is to record those events, count them and maybe adjust the FW
if the frequency switching capabilities are too low, e.g. a 4ms period...

We need those numbers to show that there is a need for a faster
FW micro-controller to serve those incoming requests.

> 
> I think this will be a very noisy event by the way.

Could be, but on the other hand for this kind of statistical analysis
'the more the better'. It will also depend on the number of
CPUs in the cluster, e.g. 4 CPUs vs 1 CPU.

I don't know when we will switch to the per-CPU cpufreq mode
where all CPUs behave as independent DVFS domains. The Juno mainline
kernel and FW support that mode. We would have to compare the two modes
and measure how much we gain/lose when using one instead of the other.

Furthermore, we already suspect some of our integration tests for
EAS-mainline (on Juno) are failing due to filtered-out requests. It
would be nice to see in the traces how much this impacts other boards.

Thanks for your feedback!
Lukasz

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RESEND][PATCH v2 3/3] schedutil: trace: Add tracing to capture filter out requests
  2023-06-20 17:40   ` Rafael J. Wysocki
@ 2023-06-20 18:08     ` Lukasz Luba
  0 siblings, 0 replies; 18+ messages in thread
From: Lukasz Luba @ 2023-06-20 18:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, linux-trace-kernel, linux-pm, rostedt, mhiramat,
	mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	bsegall, mgorman, bristot, vschneid, delyank, qyousef, qyousef,
	kernel test robot

Hi Rafael,

On 6/20/23 18:40, Rafael J. Wysocki wrote:
> On Mon, May 22, 2023 at 4:57 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> Some of the frequency update requests coming from the task scheduler
>> might be filtered out. This can happen when the previous request was
>> served not long ago (within a period smaller than the minimum
>> frequency-update interval provided by the cpufreq driver). In such
>> cases, we want to know which frequency updates did not make it through.
>> Export the new tracepoint as well, so that it can be handled by a
>> toolkit for trace analysis.
>>
>> Reported-by: kernel test robot <lkp@intel.com> # solved tricky build
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   include/trace/events/sched.h     |  4 ++++
>>   kernel/sched/cpufreq_schedutil.c | 10 ++++++++--
>>   2 files changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
>> index dbfb30809f15..e34b7cd5de73 100644
>> --- a/include/trace/events/sched.h
>> +++ b/include/trace/events/sched.h
>> @@ -739,6 +739,10 @@ DECLARE_TRACE(uclamp_update_tsk_tp,
>>          TP_PROTO(struct task_struct *tsk, int uclamp_id,  unsigned int value),
>>          TP_ARGS(tsk, uclamp_id, value));
>>
>> +DECLARE_TRACE(schedutil_update_filtered_tp,
>> +       TP_PROTO(int cpu),
>> +       TP_ARGS(cpu));
>> +
>>   #endif /* _TRACE_SCHED_H */
>>
>>   /* This part must be outside protection */
>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>> index f462496e5c07..4f9daf258a65 100644
>> --- a/kernel/sched/cpufreq_schedutil.c
>> +++ b/kernel/sched/cpufreq_schedutil.c
>> @@ -6,6 +6,8 @@
>>    * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>    */
>>
>> +EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp);
>> +
>>   #define IOWAIT_BOOST_MIN       (SCHED_CAPACITY_SCALE / 8)
>>
>>   struct sugov_tunables {
>> @@ -318,8 +320,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
>>
>>          ignore_dl_rate_limit(sg_cpu);
>>
>> -       if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
>> +       if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) {
>> +               trace_schedutil_update_filtered_tp(sg_cpu->cpu);
> 
> It looks like the tracepoint can be added to
> sugov_should_update_freq() for less code duplication.
> 

Makes sense. I will move that trace call there.

Even with the trace call moved, based on your comment on patch 2/3
I got the impression that you still want that patch. The move also
looks more 'aligned' with patch 2/3 to me: the code flows of the two
functions, sugov_update_shared() and sugov_update_single_common(),
i.e. how they call and interpret the result from
sugov_should_update_freq(), are clearer IMO.

So I will keep patch 2/3 in the next version. Although, if you don't
like it, please tell me and I will drop it.

Thanks for the review!

Lukasz

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RESEND][PATCH v2 1/3] sched/tp: Add new tracepoint to track uclamp set from user-space
       [not found]   ` <20230531182629.nztie5rwhjl53v3d@airbuntu>
@ 2023-06-21  3:25     ` Masami Hiramatsu
  2023-06-30 11:49       ` Qais Yousef
  2023-07-06 11:14     ` Peter Zijlstra
  1 sibling, 1 reply; 18+ messages in thread
From: Masami Hiramatsu @ 2023-06-21  3:25 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Lukasz Luba, linux-kernel, linux-trace-kernel, rafael, linux-pm,
	rostedt, mhiramat, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, vschneid, delyank,
	qyousef

On Wed, 31 May 2023 19:26:29 +0100
Qais Yousef <qyousef@layalina.io> wrote:

> On 05/22/23 15:57, Lukasz Luba wrote:
> > User-space can set the uclamp value for a given task, which impacts task
> > placement decisions made by the scheduler. This is very useful information:
> > it helps to understand system behavior and to track improvements in
> > middleware and applications which start using the uclamp mechanisms and
> > report better performance in tests.
> 
> Do you mind adding a generic one instead please? And explain why we can't just
> attach to the syscall via kprobes? I think you want to bypass the permission
> checks, so maybe a generic tracepoint after that might be justifiable?

Could you tell me more about this point? I would like to know what kind of
permission checks can be bypassed with tracepoints.

> Then anyone can use it to track how userspace has changed any attributes for
> a task, not just uclamp.

I guess uclamp is controlled not only by syscall but also by the
kernel-internal sched_setattr()/setscheduler(). Anyway, I agree that it
could be a more generic tracepoint, something like
trace_sched_set_scheduler(task, attr).

Thank you,

> 
> 
> Thanks
> 
> --
> Qais Yousef
> 
> > 
> > Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> > ---
> >  include/trace/events/sched.h | 4 ++++
> >  kernel/sched/core.c          | 5 +++++
> >  2 files changed, 9 insertions(+)
> > 
> > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> > index fbb99a61f714..dbfb30809f15 100644
> > --- a/include/trace/events/sched.h
> > +++ b/include/trace/events/sched.h
> > @@ -735,6 +735,10 @@ DECLARE_TRACE(sched_update_nr_running_tp,
> >  	TP_PROTO(struct rq *rq, int change),
> >  	TP_ARGS(rq, change));
> >  
> > +DECLARE_TRACE(uclamp_update_tsk_tp,
> > +	TP_PROTO(struct task_struct *tsk, int uclamp_id,  unsigned int value),
> > +	TP_ARGS(tsk, uclamp_id, value));
> > +
> >  #endif /* _TRACE_SCHED_H */
> >  
> >  /* This part must be outside protection */
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 944c3ae39861..7b9b800ebb6c 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -114,6 +114,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_overutilized_tp);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_cfs_tp);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
> > +EXPORT_TRACEPOINT_SYMBOL_GPL(uclamp_update_tsk_tp);
> >  
> >  DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> >  
> > @@ -1956,12 +1957,16 @@ static void __setscheduler_uclamp(struct task_struct *p,
> >  	    attr->sched_util_min != -1) {
> >  		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
> >  			      attr->sched_util_min, true);
> > +		trace_uclamp_update_tsk_tp(p, UCLAMP_MIN,
> > +					   attr->sched_util_min);
> >  	}
> >  
> >  	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX &&
> >  	    attr->sched_util_max != -1) {
> >  		uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
> >  			      attr->sched_util_max, true);
> > +		trace_uclamp_update_tsk_tp(p, UCLAMP_MAX,
> > +					   attr->sched_util_max);
> >  	}
> >  }
> >  
> > -- 
> > 2.25.1
> > 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RESEND][PATCH v2 1/3] sched/tp: Add new tracepoint to track uclamp set from user-space
  2023-06-21  3:25     ` Masami Hiramatsu
@ 2023-06-30 11:49       ` Qais Yousef
  2023-07-04  7:49         ` Lukasz Luba
  0 siblings, 1 reply; 18+ messages in thread
From: Qais Yousef @ 2023-06-30 11:49 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Lukasz Luba, linux-kernel, linux-trace-kernel, rafael, linux-pm,
	rostedt, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, vschneid, delyank,
	qyousef

On 06/21/23 12:25, Masami Hiramatsu wrote:
> On Wed, 31 May 2023 19:26:29 +0100
> Qais Yousef <qyousef@layalina.io> wrote:
> 
> > On 05/22/23 15:57, Lukasz Luba wrote:
> > > User-space can set the uclamp value for a given task, which impacts task
> > > placement decisions made by the scheduler. This is very useful information:
> > > it helps to understand system behavior and to track improvements in
> > > middleware and applications which start using the uclamp mechanisms and
> > > report better performance in tests.
> > 
> > Do you mind adding a generic one instead please? And explain why we can't just
> > attach to the syscall via kprobes? I think you want to bypass the permission
> > checks, so maybe a generic tracepoint after that might be justifiable?
> 
> Could you tell me more about this point? I would like to know what kind of
> permission checks can be bypassed with tracepoints.

Sorry, bad usage of English on my end.

The syscall can fail if the caller doesn't have permission to change the
attribute (some of them are protected by CAP_SYS_NICE) or if the boundary
check fails. The desire here is to emit a tracepoint when the user
successfully changes an attribute of a task.

Lukasz would like to have this tracepoint to help debug and analyse
workloads. We are not really bypassing anything. So to rephrase: emit the
tracepoint when the syscall successfully changes an attribute.

> 
> > Then anyone can use it to track how userspace has changed any attributes for
> > a task, not just uclamp.
> 
> I guess Uclamp is not controlled by syscall but from kernel internal
> sched_setattr/setscheduler() too. Anyway I agree that it can be more generic
> tracepoint, something like trace_sched_set_scheduer(task, attr).

Yes, which is something that worries me, and I had a series in the past to
hide it. The uclamp range is abstract, has no meaning in general, and should
be set specifically for each system. E.g. 512 means half the system's
performance level, but if the system is overpowered this could be too fast,
and if it's underpowered it could be too slow. It must be set by userspace;
though, if kernel threads need to manage their performance level, I'm not
sure how this can be achieved.


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RESEND][PATCH v2 3/3] schedutil: trace: Add tracing to capture filter out requests
  2023-06-20 17:52     ` Lukasz Luba
@ 2023-06-30 12:01       ` Qais Yousef
  2023-06-30 13:25         ` Qais Yousef
  0 siblings, 1 reply; 18+ messages in thread
From: Qais Yousef @ 2023-06-30 12:01 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-trace-kernel, rafael, linux-pm, rostedt,
	mhiramat, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, vschneid, delyank,
	qyousef, kernel test robot

On 06/20/23 18:52, Lukasz Luba wrote:
> Hi Qais,
> 
> I have somehow missed your feedback on this series.
> 
> On 5/31/23 19:31, Qais Yousef wrote:
> > On 05/22/23 15:57, Lukasz Luba wrote:
> > > Some of the frequency update requests coming from the task scheduler
> > > might be filtered out. This can happen when the previous request was served
> > > not long ago (within a period smaller than the minimum provided by the
> > > cpufreq driver for frequency updates). In such cases, we want to know if
> > > some of the frequency updates cannot make it through.
> > > Export the new tracepoint as well. That would allow it to be handled by a
> > > toolkit for trace analysis.
> > > 
> > > Reported-by: kernel test robot <lkp@intel.com> # solved tricky build
> > > Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> > > ---
> > >   include/trace/events/sched.h     |  4 ++++
> > >   kernel/sched/cpufreq_schedutil.c | 10 ++++++++--
> > >   2 files changed, 12 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> > > index dbfb30809f15..e34b7cd5de73 100644
> > > --- a/include/trace/events/sched.h
> > > +++ b/include/trace/events/sched.h
> > > @@ -739,6 +739,10 @@ DECLARE_TRACE(uclamp_update_tsk_tp,
> > >   	TP_PROTO(struct task_struct *tsk, int uclamp_id,  unsigned int value),
> > >   	TP_ARGS(tsk, uclamp_id, value));
> > > +DECLARE_TRACE(schedutil_update_filtered_tp,
> > > +	TP_PROTO(int cpu),
> > > +	TP_ARGS(cpu));
> > > +
> > >   #endif /* _TRACE_SCHED_H */
> > >   /* This part must be outside protection */
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index f462496e5c07..4f9daf258a65 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -6,6 +6,8 @@
> > >    * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >    */
> > > +EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp);
> > > +
> > >   #define IOWAIT_BOOST_MIN	(SCHED_CAPACITY_SCALE / 8)
> > >   struct sugov_tunables {
> > > @@ -318,8 +320,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
> > >   	ignore_dl_rate_limit(sg_cpu);
> > > -	if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
> > > +	if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) {
> > > +		trace_schedutil_update_filtered_tp(sg_cpu->cpu);
> > >   		return false;
> > > +	}
> > 
> > Can't we have something more generic here too? Are you interested in
> > counting these events? How do you plan to use it?
> 
> The plan is to record those events, count them and maybe adjust the FW
> if the frequency switching capabilities are too low, e.g. 4ms...

You mean as part of a tuning step for the system, or at runtime? The latter
seems to call for a proper interface instead.

IMHO the current filtering mechanism needs a bit of a massage.

One thing I think we must do is ignore the filter if there's a big sudden
change in requested frequency, for instance when a big task migrates. Then
prev_cpu should go to a lower freq sooner, and new_cpu should change to a
higher frequency sooner too. The filtering makes sense only in a steady-state
situation where we are ramping up or down gradually.

If no one beats me to it, I'll propose something in that regard.

> 
> We need those numbers to point out that there is a need for faster
> FW micro-controller to serve those incoming requests.

I think there's a big assumption here that the filter is always set correctly
;-)

> 
> > 
> > I think this will be a very noisy event by the way.
> 
> Could be, but on the other hand, for that kind of statistical analysis
> 'the more the better'. It will also depend on the number of
> CPUs in the cluster, e.g. 4 CPUs vs 1 CPU.
> 
> I don't know when we will switch to this per-cpu cpufreq mode
> where all CPUs behave as independent DVFS domains. The Juno mainline kernel
> and FW support that mode. We would have to compare those two modes and
> measure how much we gain/lose when using one and not the other.
> 
> Furthermore, we already suspect some of our integration tests for
> EAS-mainline (on Juno) are failing due to filtered-out requests. It would be
> nice to see in traces how much that impacts other boards.

Another problem I think we have is that the DVFS headroom value should be
a function of this filter. At the moment it is hardcoded to a random value,
which causes power issues.

So to summarize, I think there are two improvements required (and if anyone has
the time to try them out go ahead, otherwise I'll get to it):

 1. The filter should only be applied if the history hasn't changed, i.e. we
    are gradually increasing or decreasing PELT. Otherwise we should honour
    sudden changes ASAP.
 2. The DVFS headroom should be a function of the filter. 25% is too high for
    500us. It could be too low for 10ms (I don't know).


Thanks!

--
Qais Yousef

> 
> Thanks for your feedback!
> Lukasz


* Re: [RESEND][PATCH v2 3/3] schedutil: trace: Add tracing to capture filter out requests
  2023-06-30 12:01       ` Qais Yousef
@ 2023-06-30 13:25         ` Qais Yousef
  2023-07-04  8:23           ` Lukasz Luba
  0 siblings, 1 reply; 18+ messages in thread
From: Qais Yousef @ 2023-06-30 13:25 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-trace-kernel, rafael, linux-pm, rostedt,
	mhiramat, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, vschneid, delyank,
	qyousef, kernel test robot

On 06/30/23 13:01, Qais Yousef wrote:
> On 06/20/23 18:52, Lukasz Luba wrote:
> > Hi Qais,
> > 
> > I have somehow missed your feedback on this series.
> > 
> > On 5/31/23 19:31, Qais Yousef wrote:
> > > On 05/22/23 15:57, Lukasz Luba wrote:
> > > > Some of the frequency update requests coming from the task scheduler
> > > > might be filtered out. This can happen when the previous request was served
> > > > not long ago (within a period smaller than the minimum provided by the
> > > > cpufreq driver for frequency updates). In such cases, we want to know if
> > > > some of the frequency updates cannot make it through.
> > > > Export the new tracepoint as well. That would allow it to be handled by a
> > > > toolkit for trace analysis.
> > > > 
> > > > Reported-by: kernel test robot <lkp@intel.com> # solved tricky build
> > > > Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> > > > ---
> > > >   include/trace/events/sched.h     |  4 ++++
> > > >   kernel/sched/cpufreq_schedutil.c | 10 ++++++++--
> > > >   2 files changed, 12 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> > > > index dbfb30809f15..e34b7cd5de73 100644
> > > > --- a/include/trace/events/sched.h
> > > > +++ b/include/trace/events/sched.h
> > > > @@ -739,6 +739,10 @@ DECLARE_TRACE(uclamp_update_tsk_tp,
> > > >   	TP_PROTO(struct task_struct *tsk, int uclamp_id,  unsigned int value),
> > > >   	TP_ARGS(tsk, uclamp_id, value));
> > > > +DECLARE_TRACE(schedutil_update_filtered_tp,
> > > > +	TP_PROTO(int cpu),
> > > > +	TP_ARGS(cpu));
> > > > +
> > > >   #endif /* _TRACE_SCHED_H */
> > > >   /* This part must be outside protection */
> > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > > index f462496e5c07..4f9daf258a65 100644
> > > > --- a/kernel/sched/cpufreq_schedutil.c
> > > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > > @@ -6,6 +6,8 @@
> > > >    * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > >    */
> > > > +EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp);
> > > > +
> > > >   #define IOWAIT_BOOST_MIN	(SCHED_CAPACITY_SCALE / 8)
> > > >   struct sugov_tunables {
> > > > @@ -318,8 +320,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
> > > >   	ignore_dl_rate_limit(sg_cpu);
> > > > -	if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
> > > > +	if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) {
> > > > +		trace_schedutil_update_filtered_tp(sg_cpu->cpu);
> > > >   		return false;
> > > > +	}
> > > 
> > > Can't we have something more generic here too? Are you interested to count
> > > these events? How do you plan to use it?
> > 
> > The plan is to record those events, count them and maybe adjust the FW
> > if the frequency switching capabilities are too low, e.g. 4ms...
> 
> You mean as part of tuning step for the system or at runtime? The latter seems
> to indicate for a proper interface instead.
> 
> IMHO I think the current filtering mechanism needs a bit of a massage.
> 
> One thing I think we must do is to ignore the filter if there's a big sudden
> change in requested frequency. Like for instance if a big task migrates. Then
> prev_cpu should go to lower freq sooner, and new_cpu should change to higher
> frequency sooner too. The filtering makes sense only in steady state situation
> where we are ramping up or down gradually.
> 
> If no one beats me to it, I'll propose something in that regard.
> 
> > 
> > We need those numbers to point out that there is a need for faster
> > FW micro-controller to serve those incoming requests.
> 
> I think there's a big assumption here that the filter is always set correctly
> ;-)
> 
> > 
> > > 
> > > I think this will be a very noisy event by the way.
> > 
> > Could be, but on the other hand for those statistical analysis
> > 'the more the better'. It will also depend on number of
> > CPUs in the cluster, e.g. 4 CPUs vs 1 CPU.
> > 
> > I don't know when we will switch to this per-cpu cpufreq mode
> > when all CPUs behave like independent DVFS. Juno mainline kernel and FW
> > supports that mode. We would have to compare those two modes and
> > measure how much we gain/loose when using one and not the other.
> > 
> > Furthermore, we already suspect some of our integration testing for
> > EAS-mainline (on Juno) failing due to filtered out requests. How much
> > that would impact other boards - it would be nice to see in traces.
> 
> Another problem I think we have is that the DVFS headroom value should be
> a function of this filter. At the moment it is hardcoded to a random value
> which causes power issue.
> 
> So to summarize I think there are two improvements required (and if anyone has
> the time to try them out go ahead otherwise I'll get to it):
> 
>  1. The filter should only be applied if the history hasn't changed. ie: we are
>     gradually increasing or decreasing PELT. Otherwise we should honour sudden
>     changes ASAP.
>  2. DVFS headroom should be a function of the filter. 25% is too high for
>     500us. It could be too low for 10ms (I don't know).

To expand a bit more, since it's related: our migration margins should depend
on the tick value instead of the random magic numbers they are today - more
precisely, on the balance_interval. If there's a misfit task, then we should
upmigrate it at wake-up only if we think it'll become misfit before the load
balancer kicks in. Otherwise the load balancer should do the correction if it
becomes long-running/misfit. If the sys admin wants to speed up/slow down
migration, it should be through controlling PELT IMO, and not these magic
margin values - which are hardcoded to random values at the moment anyway and
are not suitable for every system.

And since we decoupled overutilized from misfit lb, I think our definition
should improve to better detect when the system needs to disable packing and
start spreading. The current check for overutilized based on misfit is no
longer adequate IMO, especially when there's a single misfit task in the
system.

Again, just sharing thoughts in case someone is interested in working on this
before I get a chance to share any patches ;-)


Cheers

--
Qais Yousef


* Re: [RESEND][PATCH v2 1/3] sched/tp: Add new tracepoint to track uclamp set from user-space
  2023-06-30 11:49       ` Qais Yousef
@ 2023-07-04  7:49         ` Lukasz Luba
  2023-07-04 14:02           ` Qais Yousef
  0 siblings, 1 reply; 18+ messages in thread
From: Lukasz Luba @ 2023-07-04  7:49 UTC (permalink / raw)
  To: Qais Yousef, Masami Hiramatsu
  Cc: linux-kernel, linux-trace-kernel, rafael, linux-pm, rostedt,
	mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	bsegall, mgorman, bristot, vschneid, delyank, qyousef

Hi Masami, Qais,

On 6/30/23 12:49, Qais Yousef wrote:
> On 06/21/23 12:25, Masami Hiramatsu wrote:
>> On Wed, 31 May 2023 19:26:29 +0100
>> Qais Yousef <qyousef@layalina.io> wrote:
>>
>>> On 05/22/23 15:57, Lukasz Luba wrote:
>>>> The user-space can set uclamp value for a given task. It impacts task
>>>> placement decisions made by the scheduler. This is very useful information
>>>> and helps to understand the system behavior or track improvements in
>>>> middleware and applications which start using uclamp mechanisms and report
>>>> better performance in tests.
>>>
>>> Do you mind adding a generic one instead please? And explain why we can't just
>>> attach to the syscall via kprobes? I think you want to bypass the permission
>>> checks, so maybe a generic tracepoint after that might be justifiable?
>>
>> Could you tell me more about this point? I would like to know what kind of
>> permission checks can be bypassed with tracepoints.
> 
> Sorry bad usage of English from my end.
> 
> The syscall can fail if the caller doesn't have permission to change the
> attribute (some of them are protected with CAP_NICE) or if the boundary check
> fails. The desire here is to emit a tracepoint() when the user successfully
> changes an attribute of a task.
> 
> Lukasz would like to have this tracepoint to help debug and analyse workloads.
> We are not really bypassing anything. So to rephrase: emit the tracepoint if
> the syscall successfully changes an attribute.

Sorry, but no. As I said, I don't want to add more dependencies in our
tooling and kernel configuration.

> 
>>
>>> Then anyone can use it to track how userspace has changed any attributes for
>>> a task, not just uclamp.

Is this a real-life use case?
Is there user-space SW that dynamically changes those attributes in a
way which affects task scheduler decisions that we have a hard time
understanding?

This syscall is quite old, and I haven't heard that there is a need to
know what and how often it changes for apps.

On the other hand (in real life):
We started facing issues because some smart middleware very often
(a few hundred times in a few minutes) changes the uclamp for apps.
That daemon works autonomously and tries to figure out the best values.
To understand those decisions we need some tricky offline processing
of the trace. Since uclamp affects a lot of mechanisms, we need to
know exactly the time and value when it is set.

Unless you point me to SW which changes the other attributes
so often that you need to record and post-process them,
I would keep the current approach.


>>
>> I guess Uclamp is not controlled by syscall but from kernel internal
>> sched_setattr/setscheduler() too. Anyway I agree that it can be more generic
>> tracepoint, something like trace_sched_set_scheduer(task, attr).
> 
> Yes. Which is something worries me and I had a series in the past to hide it.
> The uclamp range is abstracted and has no meaning in general and should be set
> specifically to each system. e.g: 512 means half the system performance level,
> but if the system is over powered this could be too fast, and if it's
> underpowered it could be too slow. It must be set by userspace; though not sure
> if kernel threads need to manage their performance level how this can be
> achieved.

In the mainline kernel I don't see any place where uclamp is set for kernel
threads. It's not the use case, and I hope it won't be anytime soon.

Regards,
Lukasz


* Re: [RESEND][PATCH v2 3/3] schedutil: trace: Add tracing to capture filter out requests
  2023-06-30 13:25         ` Qais Yousef
@ 2023-07-04  8:23           ` Lukasz Luba
  2023-07-04 13:58             ` Qais Yousef
  0 siblings, 1 reply; 18+ messages in thread
From: Lukasz Luba @ 2023-07-04  8:23 UTC (permalink / raw)
  To: Qais Yousef
  Cc: linux-kernel, linux-trace-kernel, rafael, linux-pm, rostedt,
	mhiramat, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, vschneid, delyank,
	qyousef, kernel test robot

Hi Qais,

On 6/30/23 14:25, Qais Yousef wrote:
> On 06/30/23 13:01, Qais Yousef wrote:
>> On 06/20/23 18:52, Lukasz Luba wrote:
>>> Hi Qais,
>>>
>>> I have somehow missed your feedback on this series.
>>>
>>> On 5/31/23 19:31, Qais Yousef wrote:
>>>> On 05/22/23 15:57, Lukasz Luba wrote:
>>>>> Some of the frequency update requests coming from the task scheduler
>>>>> might be filtered out. This can happen when the previous request was served
>>>>> not long ago (within a period smaller than the minimum provided by the
>>>>> cpufreq driver for frequency updates). In such cases, we want to know if
>>>>> some of the frequency updates cannot make it through.
>>>>> Export the new tracepoint as well. That would allow it to be handled by a
>>>>> toolkit for trace analysis.
>>>>>
>>>>> Reported-by: kernel test robot <lkp@intel.com> # solved tricky build
>>>>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>>>>> ---
>>>>>    include/trace/events/sched.h     |  4 ++++
>>>>>    kernel/sched/cpufreq_schedutil.c | 10 ++++++++--
>>>>>    2 files changed, 12 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
>>>>> index dbfb30809f15..e34b7cd5de73 100644
>>>>> --- a/include/trace/events/sched.h
>>>>> +++ b/include/trace/events/sched.h
>>>>> @@ -739,6 +739,10 @@ DECLARE_TRACE(uclamp_update_tsk_tp,
>>>>>    	TP_PROTO(struct task_struct *tsk, int uclamp_id,  unsigned int value),
>>>>>    	TP_ARGS(tsk, uclamp_id, value));
>>>>> +DECLARE_TRACE(schedutil_update_filtered_tp,
>>>>> +	TP_PROTO(int cpu),
>>>>> +	TP_ARGS(cpu));
>>>>> +
>>>>>    #endif /* _TRACE_SCHED_H */
>>>>>    /* This part must be outside protection */
>>>>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>>>>> index f462496e5c07..4f9daf258a65 100644
>>>>> --- a/kernel/sched/cpufreq_schedutil.c
>>>>> +++ b/kernel/sched/cpufreq_schedutil.c
>>>>> @@ -6,6 +6,8 @@
>>>>>     * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>>>     */
>>>>> +EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp);
>>>>> +
>>>>>    #define IOWAIT_BOOST_MIN	(SCHED_CAPACITY_SCALE / 8)
>>>>>    struct sugov_tunables {
>>>>> @@ -318,8 +320,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
>>>>>    	ignore_dl_rate_limit(sg_cpu);
>>>>> -	if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
>>>>> +	if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) {
>>>>> +		trace_schedutil_update_filtered_tp(sg_cpu->cpu);
>>>>>    		return false;
>>>>> +	}
>>>>
>>>> Can't we have something more generic here too? Are you interested to count
>>>> these events? How do you plan to use it?
>>>
>>> The plan is to record those events, count them and maybe adjust the FW
>>> if the frequency switching capabilities are too low, e.g. 4ms...
>>
>> You mean as part of tuning step for the system or at runtime? The latter seems
>> to indicate for a proper interface instead.

Not at runtime; a FW change, or maybe even a new uC, would be needed for
this. Therefore, our client which experiments with the new platform
can run the analysis and spot this. Then it can change the FW
if there was an issue, or maybe even upgrade the HW if there are severe
issues with delivering the needed performance on time (e.g. at high display
refresh rates, with first-frame drops).

>>
>> IMHO I think the current filtering mechanism needs a bit of a massage.
>>
>> One thing I think we must do is to ignore the filter if there's a big sudden
>> change in requested frequency. Like for instance if a big task migrates. Then
>> prev_cpu should go to lower freq sooner, and new_cpu should change to higher
>> frequency sooner too. The filtering makes sense only in steady state situation
>> where we are ramping up or down gradually.

This is kind of a heuristic, which is biased toward mobiles IMO.

>>
>> If no one beats me to it, I'll propose something in that regard.
>>
>>>
>>> We need those numbers to point out that there is a need for faster
>>> FW micro-controller to serve those incoming requests.
>>
>> I think there's a big assumption here that the filter is always set correctly
>> ;-)

In our case it is set correctly; the value comes directly from the FW
[1]. It is a FW+HW limit, so there's not much to do about it in the kernel.

>>
>>>
>>>>
>>>> I think this will be a very noisy event by the way.
>>>
>>> Could be, but on the other hand for those statistical analysis
>>> 'the more the better'. It will also depend on number of
>>> CPUs in the cluster, e.g. 4 CPUs vs 1 CPU.
>>>
>>> I don't know when we will switch to this per-cpu cpufreq mode
>>> when all CPUs behave like independent DVFS. Juno mainline kernel and FW
>>> supports that mode. We would have to compare those two modes and
>>> measure how much we gain/loose when using one and not the other.
>>>
>>> Furthermore, we already suspect some of our integration testing for
>>> EAS-mainline (on Juno) failing due to filtered out requests. How much
>>> that would impact other boards - it would be nice to see in traces.
>>
>> Another problem I think we have is that the DVFS headroom value should be
>> a function of this filter. At the moment it is hardcoded to a random value
>> which causes power issue.

It's not a random value, as you can see in [1]. This is the main goal
of this $subject: provide information, with a proper test, that a FW
or HW change is needed. If you'd like to have decent performance in
your Linux-based solution, faster FW/HW would be needed. I don't
want to put in more hacks or heuristics which try to work around performance
issues with the HW. E.g. if someone put in a 100 MHz uC instead of a 200 MHz
uC running fast FW, they should get quality data from integration tests
showing that such a design might not work well with recent OSes and apps.
Currently those kinds of 'design' checks are very hard, because they require
sophisticated knowledge, and we are trying to lower that bar for more
engineers.

>>
>> So to summarize I think there are two improvements required (and if anyone has
>> the time to try them out go ahead otherwise I'll get to it):
>>
>>   1. The filter should only be applied if the history hasn't changed. ie: we are
>>      gradually increasing or decreasing PELT. Otherwise we should honour sudden
>>      changes ASAP.
>>   2. DVFS headroom should be a function of the filter. 25% is too high for
>>      500us. It could be too low for 10ms (I don't know).
> 
> To expand a bit more since it's related. Our  migration margins should depend
> on the tick value instead of random magic numbers they are today. More
> precisely the balance_interval. If there's a misfit task, then we should
> upmigrate it at wake up only if we think it'll become misfit before the load
> balancer kicks in. Otherwise the load balancer should do the correction if it
> becomes long running/misfit. If the sys admin wants to speed up/slow down
> migration it should be throw controlling PELT IMO and not these magic margin
> values - which are hardcoded to random values at the moment anyway that are not
> suitable for every system.
> 
> And since we decoupled overutilized from misfit lb; I think our definition
> should improve to better detect when the system needs to disable packing and
> starts spreading. Current check for overutilized based on misfit is no longer
> adequate IMO. Especially when there's a single misfit task in the system.
> 
> Again, just sharing thoughts in case someone interested to work on this before
> I get a chance to share any patches ;-)

Those are all heuristics, and some of your ideas go well beyond
the $subject. As I said, the main goal of the $subject is to
deliver information that the FW/HW might need a re-design (maybe even
more silicon for the uC).

I cannot stop you from creating a dedicated thread with your patches,
though ;)

Regards,
Lukasz

[1] https://elixir.bootlin.com/linux/v6.4/source/drivers/cpufreq/scmi-cpufreq.c#L224


* Re: [RESEND][PATCH v2 3/3] schedutil: trace: Add tracing to capture filter out requests
  2023-07-04  8:23           ` Lukasz Luba
@ 2023-07-04 13:58             ` Qais Yousef
  0 siblings, 0 replies; 18+ messages in thread
From: Qais Yousef @ 2023-07-04 13:58 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-trace-kernel, rafael, linux-pm, rostedt,
	mhiramat, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, vschneid, delyank,
	qyousef, kernel test robot

On 07/04/23 09:23, Lukasz Luba wrote:
> Hi Qais,
> 
> On 6/30/23 14:25, Qais Yousef wrote:
> > On 06/30/23 13:01, Qais Yousef wrote:
> > > On 06/20/23 18:52, Lukasz Luba wrote:
> > > > Hi Qais,
> > > > 
> > > > I have somehow missed your feedback on this series.
> > > > 
> > > > On 5/31/23 19:31, Qais Yousef wrote:
> > > > > On 05/22/23 15:57, Lukasz Luba wrote:
> > > > > > Some of the frequency update requests coming from the task scheduler
> > > > > > might be filtered out. This can happen when the previous request was served
> > > > > > not long ago (within a period smaller than the minimum provided by the
> > > > > > cpufreq driver for frequency updates). In such cases, we want to know if
> > > > > > some of the frequency updates cannot make it through.
> > > > > > Export the new tracepoint as well. That would allow it to be handled by a
> > > > > > toolkit for trace analysis.
> > > > > > 
> > > > > > Reported-by: kernel test robot <lkp@intel.com> # solved tricky build
> > > > > > Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> > > > > > ---
> > > > > >    include/trace/events/sched.h     |  4 ++++
> > > > > >    kernel/sched/cpufreq_schedutil.c | 10 ++++++++--
> > > > > >    2 files changed, 12 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> > > > > > index dbfb30809f15..e34b7cd5de73 100644
> > > > > > --- a/include/trace/events/sched.h
> > > > > > +++ b/include/trace/events/sched.h
> > > > > > @@ -739,6 +739,10 @@ DECLARE_TRACE(uclamp_update_tsk_tp,
> > > > > >    	TP_PROTO(struct task_struct *tsk, int uclamp_id,  unsigned int value),
> > > > > >    	TP_ARGS(tsk, uclamp_id, value));
> > > > > > +DECLARE_TRACE(schedutil_update_filtered_tp,
> > > > > > +	TP_PROTO(int cpu),
> > > > > > +	TP_ARGS(cpu));
> > > > > > +
> > > > > >    #endif /* _TRACE_SCHED_H */
> > > > > >    /* This part must be outside protection */
> > > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > > > > index f462496e5c07..4f9daf258a65 100644
> > > > > > --- a/kernel/sched/cpufreq_schedutil.c
> > > > > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > > > > @@ -6,6 +6,8 @@
> > > > > >     * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > >     */
> > > > > > +EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp);
> > > > > > +
> > > > > >    #define IOWAIT_BOOST_MIN	(SCHED_CAPACITY_SCALE / 8)
> > > > > >    struct sugov_tunables {
> > > > > > @@ -318,8 +320,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
> > > > > >    	ignore_dl_rate_limit(sg_cpu);
> > > > > > -	if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
> > > > > > +	if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) {
> > > > > > +		trace_schedutil_update_filtered_tp(sg_cpu->cpu);
> > > > > >    		return false;
> > > > > > +	}
> > > > > 
> > > > > Can't we have something more generic here too? Are you interested to count
> > > > > these events? How do you plan to use it?
> > > > 
> > > > The plan is to record those events, count them and maybe adjust the FW
> > > > if the frequency switching capabilities are too low, e.g. 4ms...
> > > 
> > > You mean as part of tuning step for the system or at runtime? The latter seems
> > > to indicate for a proper interface instead.
> 
> Not at runtime, the FW change or maybe even the uC would be needed for
> this. Therefore, our client which experiments with the new platform
> can run the analysis and spot this. Then it can change the FW
> if there was an issue, or maybe even upgrade the HW if there are severe
> issues with delivering needed performance on time (e.g. in high display
> refresh rates and first-frame drops).
> 
> > > 
> > > IMHO I think the current filtering mechanism needs a bit of a massage.
> > > 
> > > One thing I think we must do is to ignore the filter if there's a big sudden
> > > change in requested frequency. Like for instance if a big task migrates. Then
> > > prev_cpu should go to lower freq sooner, and new_cpu should change to higher
> > > frequency sooner too. The filtering makes sense only in steady state situation
> > > where we are ramping up or down gradually.
> 
> This is kind of a heuristic, which is biased for mobiles IMO.

How come? Big tasks are not only on mobile; a 500+ task can exist on any
system.

> 
> > > 
> > > If no one beats me to it, I'll propose something in that regard.
> > > 
> > > > 
> > > > We need those numbers to point out that there is a need for faster
> > > > FW micro-controller to serve those incoming requests.
> > > 
> > > I think there's a big assumption here that the filter is always set correctly
> > > ;-)
> 
> In our case, it is set correctly, the value is coming directly from FW
> [1]. It is the FW+HW limit, so not much to do with this in the kernel.
> 
> > > 
> > > > 
> > > > > 
> > > > > I think this will be a very noisy event by the way.
> > > > 
> > > > Could be, but on the other hand for those statistical analysis
> > > > 'the more the better'. It will also depend on number of
> > > > CPUs in the cluster, e.g. 4 CPUs vs 1 CPU.
> > > > 
> > > > I don't know when we will switch to this per-cpu cpufreq mode
> > > > when all CPUs behave like independent DVFS. Juno mainline kernel and FW
> > > > supports that mode. We would have to compare those two modes and
> > > > measure how much we gain/loose when using one and not the other.
> > > > 
> > > > Furthermore, we already suspect some of our integration testing for
> > > > EAS-mainline (on Juno) failing due to filtered out requests. How much
> > > > that would impact other boards - it would be nice to see in traces.
> > > 
> > > Another problem I think we have is that the DVFS headroom value should be
> > > a function of this filter. At the moment it is hardcoded to a random value
> > > which causes power issue.
> 
> It's not a random value, as you can see in [1]. This is the main goal

I'm referring to the 25% in map_util_perf().

> for this $subject - provide information with proper test that the FW
> or HW change is needed. If you like to have a decent performance in
> your Linux based solution, the faster FW/HW would be needed. I don't
> want to put more hacks or heuristics which try to workaround performance
> issues with the HW. E.g. if someone instead of a 200MHz uC running fast
> FW would put 100MHz uC than should get quality data from integration
> tests, that such a design might not work well with recent OSes and apps.
> Currently those kind of 'design' checks are very hard, because require
> sophisticated knowledge and we are trying to lower that bar for more
> engineers.

I think we're talking about different things ;-)

> 
> > > 
> > > So to summarize I think there are two improvements required (and if anyone has
> > > the time to try them out go ahead otherwise I'll get to it):
> > > 
> > >   1. The filter should only be applied if the history hasn't changed. ie: we are
> > >      gradually increasing or decreasing PELT. Otherwise we should honour sudden
> > >      changes ASAP.
> > >   2. DVFS headroom should be a function of the filter. 25% is too high for
> > >      500us. It could be too low for 10ms (I don't know).
> > 
> > To expand a bit more since it's related. Our migration margins should depend
> > on the tick value instead of the random magic numbers they are today. More
> > precisely the balance_interval. If there's a misfit task, then we should
> > upmigrate it at wake up only if we think it'll become misfit before the load
> > balancer kicks in. Otherwise the load balancer should do the correction if it
> > becomes long running/misfit. If the sys admin wants to speed up/slow down
> > migration it should be through controlling PELT IMO and not these magic margin
> > values - which are hardcoded at the moment to values that are not suitable
> > for every system anyway.
> > 
> > And since we decoupled overutilized from misfit lb; I think our definition
> > should improve to better detect when the system needs to disable packing and
> > start spreading. The current check for overutilized based on misfit is no
> > longer adequate IMO. Especially when there's a single misfit task in the system.
> > 
> > Again, just sharing thoughts in case someone is interested in working on this
> > before I get a chance to share any patches ;-)
> 
> Those are all heuristics, and some of your ideas go well beyond
> the $subject. As I said, the main goal of the $subject is to
> deliver information that the FW/HW might need a re-design (maybe even
> more silicon for the uC).
> 
> I cannot stop you from creating a dedicated thread with your patches,
> though ;)

Fair enough.

/me goes away

--
Qais Yousef

> 
> Regards,
> Lukasz
> 
> [1] https://elixir.bootlin.com/linux/v6.4/source/drivers/cpufreq/scmi-cpufreq.c#L224

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RESEND][PATCH v2 1/3] sched/tp: Add new tracepoint to track uclamp set from user-space
  2023-07-04  7:49         ` Lukasz Luba
@ 2023-07-04 14:02           ` Qais Yousef
  0 siblings, 0 replies; 18+ messages in thread
From: Qais Yousef @ 2023-07-04 14:02 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel, rafael,
	linux-pm, rostedt, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, vschneid, delyank,
	qyousef

On 07/04/23 08:49, Lukasz Luba wrote:
> Hi Masami, Qais,
> 
> On 6/30/23 12:49, Qais Yousef wrote:
> > On 06/21/23 12:25, Masami Hiramatsu wrote:
> > > On Wed, 31 May 2023 19:26:29 +0100
> > > Qais Yousef <qyousef@layalina.io> wrote:
> > > 
> > > > On 05/22/23 15:57, Lukasz Luba wrote:
> > > > > The user-space can set uclamp value for a given task. It impacts task
> > > > > placement decisions made by the scheduler. This is very useful information
> > > > > and helps to understand the system behavior or track improvements in
> > > > > middleware and applications which start using uclamp mechanisms and report
> > > > > better performance in tests.
> > > > 
> > > > Do you mind adding a generic one instead please? And explain why we can't just
> > > > attach to the syscall via kprobes? I think you want to bypass the permission
> > > > checks, so maybe a generic tracepoint after that might be justifiable?
> > > 
> > > Could you tell me more about this point? I would like to know what kind of
> > > permission checks can be bypassed with tracepoints.
> > 
> > Sorry, bad usage of English on my end.
> > 
> > The syscall can fail if the caller doesn't have permission to change the
> > attribute (some of them are protected with CAP_NICE) or if the boundary check
> > fails. The desire here is to emit a tracepoint() when the user successfully
> > changes an attribute of a task.
> > 
> > Lukasz would like to have this tracepoint to help debug and analyse workloads.
> > We are not really bypassing anything. So to rephrase: emit the tracepoint when
> > the syscall successfully changes an attribute.
> 
> Sorry, but no. As I said, I don't want to add more dependencies in our
> tooling and kernel configuration.

Fair enough. But as I said before, a dedicated uclamp-only tracepoint doesn't
make sense to me. If the maintainers are convinced, then so be it. My point of
view is that we want generic tracepoints that scale to other use cases, and it
makes sense to go this way to accommodate all potential future users, not just
you.


Cheers

--
Qais Yousef


* Re: [RESEND][PATCH v2 1/3] sched/tp: Add new tracepoint to track uclamp set from user-space
       [not found]   ` <20230531182629.nztie5rwhjl53v3d@airbuntu>
  2023-06-21  3:25     ` Masami Hiramatsu
@ 2023-07-06 11:14     ` Peter Zijlstra
  2023-07-19 13:18       ` Lukasz Luba
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2023-07-06 11:14 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Lukasz Luba, linux-kernel, linux-trace-kernel, rafael, linux-pm,
	rostedt, mhiramat, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, vschneid, delyank,
	qyousef

On Wed, May 31, 2023 at 07:26:29PM +0100, Qais Yousef wrote:
> On 05/22/23 15:57, Lukasz Luba wrote:
> > The user-space can set uclamp value for a given task. It impacts task
> > placement decisions made by the scheduler. This is very useful information
> > and helps to understand the system behavior or track improvements in
> > middleware and applications which start using uclamp mechanisms and report
> > better performance in tests.
> 
> Do you mind adding a generic one instead please? And explain why we can't just
> attach to the syscall via kprobes? I think you want to bypass the permission
> checks, so maybe a generic tracepoint after that might be justifiable?
> Then anyone can use it to track how userspace has changed any attributes for
> a task, not just uclamp.

Yeah, so I'm leaning towards the same: if you want to put a tracepoint
in __sched_setscheduler(), just trace the whole attr and leave it at
that:

	trace_update_sched_attr_tp(p, attr);

or somesuch.
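
Following the style of the existing bare tracepoints in
include/trace/events/sched.h, the declaration and call site would look roughly
like this (a sketch only, not merged code; the name follows the suggestion
above):

```c
/* include/trace/events/sched.h: bare tracepoint, no trace event format,
 * so only in-kernel/module users can attach to it */
DECLARE_TRACE(update_sched_attr_tp,
	TP_PROTO(struct task_struct *p, const struct sched_attr *attr),
	TP_ARGS(p, attr));

/* kernel/sched/core.c, __sched_setscheduler(): emit it once the new
 * attributes have been successfully applied */
trace_update_sched_attr_tp(p, attr);
```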



* Re: [RESEND][PATCH v2 1/3] sched/tp: Add new tracepoint to track uclamp set from user-space
  2023-07-06 11:14     ` Peter Zijlstra
@ 2023-07-19 13:18       ` Lukasz Luba
  0 siblings, 0 replies; 18+ messages in thread
From: Lukasz Luba @ 2023-07-19 13:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-trace-kernel, rafael, linux-pm, rostedt,
	mhiramat, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	bsegall, mgorman, bristot, vschneid, delyank, qyousef,
	Qais Yousef



On 7/6/23 12:14, Peter Zijlstra wrote:
> On Wed, May 31, 2023 at 07:26:29PM +0100, Qais Yousef wrote:
>> On 05/22/23 15:57, Lukasz Luba wrote:
>>> The user-space can set uclamp value for a given task. It impacts task
>>> placement decisions made by the scheduler. This is very useful information
>>> and helps to understand the system behavior or track improvements in
>>> middleware and applications which start using uclamp mechanisms and report
>>> better performance in tests.
>>
>> Do you mind adding a generic one instead please? And explain why we can't just
>> attach to the syscall via kprobes? I think you want to bypass the permission
>> checks, so maybe a generic tracepoint after that might be justifiable?
>> Then anyone can use it to track how userspace has changed any attributes for
>> a task, not just uclamp.
> 
> Yeah, so I'm leaning towards the same, if you want to put a tracepoint
> in __sched_setscheduler(), just trace the whole attr and leave it at
> that:
> 
> 	trace_update_sched_attr_tp(p, attr);
> 
> or somesuch.
> 

OK, fair enough, I'll do that. Thanks Peter!
(I'm sorry for the delay, I was on vacation)

Lukasz


end of thread, other threads:[~2023-07-19 13:18 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-22 14:56 [RESEND][PATCH v2 0/3] Add basic tracing for uclamp and schedutil Lukasz Luba
2023-05-22 14:57 ` [RESEND][PATCH v2 1/3] sched/tp: Add new tracepoint to track uclamp set from user-space Lukasz Luba
     [not found]   ` <20230531182629.nztie5rwhjl53v3d@airbuntu>
2023-06-21  3:25     ` Masami Hiramatsu
2023-06-30 11:49       ` Qais Yousef
2023-07-04  7:49         ` Lukasz Luba
2023-07-04 14:02           ` Qais Yousef
2023-07-06 11:14     ` Peter Zijlstra
2023-07-19 13:18       ` Lukasz Luba
2023-05-22 14:57 ` [RESEND][PATCH v2 2/3] cpufreq: schedutil: Refactor sugov_update_shared() internals Lukasz Luba
2023-06-20 17:36   ` Rafael J. Wysocki
2023-05-22 14:57 ` [RESEND][PATCH v2 3/3] schedutil: trace: Add tracing to capture filter out requests Lukasz Luba
2023-06-20 17:40   ` Rafael J. Wysocki
2023-06-20 18:08     ` Lukasz Luba
     [not found]   ` <20230531183105.r5tqpdx5axoogkzp@airbuntu>
2023-06-20 17:52     ` Lukasz Luba
2023-06-30 12:01       ` Qais Yousef
2023-06-30 13:25         ` Qais Yousef
2023-07-04  8:23           ` Lukasz Luba
2023-07-04 13:58             ` Qais Yousef
