linux-kernel.vger.kernel.org archive mirror
* [PATCH] psi: remove CPU full metric at system level
From: Chengming Zhou @ 2022-03-03  5:58 UTC
  To: corbet, hannes, mingo, peterz, surenb, ebiggers
  Cc: linux-doc, linux-kernel, songmuchun, Chengming Zhou, Martin Steigerwald

Martin found the /proc/pressure/cpu output confusing, and there is no
hint about the CPU "full" line in the psi documentation.

% cat /proc/pressure/cpu
some avg10=0.92 avg60=0.91 avg300=0.73 total=933490489
full avg10=0.22 avg60=0.23 avg300=0.16 total=358783277

The PSI_CPU_FULL state was introduced by commit e7fcd7622823
("psi: Add PSI_CPU_FULL state"), mainly for the cgroup level,
but it is also counted at the system level as a side effect.

Naturally, the FULL state doesn't exist for the CPU resource at
the system level. These "full" numbers can come from CPU idle
scheduling latency. For example, if t1 is the time when a task wakes
up on an idle CPU and t2 is the time when the CPU picks it and
switches to it, the interval (t2 - t1) is accounted as CPU_FULL.
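
To make the wake-up example concrete, here is a minimal userspace
sketch of the per-CPU state tests. It assumes the predicates added by
e7fcd7622823 compare the number of runnable tasks with the number of
tasks currently executing; struct cpu_counts and the two helpers are
hypothetical names used for illustration, not kernel code.

```c
#include <stdbool.h>

/* Hypothetical per-CPU counters, loosely named after psi's NR_* indices. */
struct cpu_counts {
	unsigned int nr_running;	/* runnable tasks, including the one on the CPU */
	unsigned int nr_oncpu;		/* 1 while a task is actually executing */
};

/* "some": at least one runnable task is waiting for the CPU. */
static bool cpu_some(const struct cpu_counts *c)
{
	return c->nr_running > c->nr_oncpu;
}

/* "full": tasks are runnable, but none of them is executing. */
static bool cpu_full(const struct cpu_counts *c)
{
	return c->nr_running && !c->nr_oncpu;
}
```

Between t1 and t2 the counters are { .nr_running = 1, .nr_oncpu = 0 },
so cpu_full() is true; once the task is switched in, nr_oncpu becomes 1
and the state clears.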

Another case in which all processes can be stalled is when all cgroups
are throttled at the same time, which is unlikely to happen.

Either way, the CPU_FULL metric is meaningless and confusing at the
system level, so this patch removes the CPU full metric at the
system level, along with its monitor function. The psi
documentation is updated accordingly.
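
For reference, a userspace monitor registers a trigger by writing a
"<some|full> <stall us> <window us>" string to the pressure file and
then polling the descriptor for POLLPRI. The sketch below is only an
illustration under that assumption; psi_format_trigger() and
psi_register_trigger() are hypothetical helper names.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Build a trigger string such as "some 150000 1000000".
 * Returns its length, or -1 if it does not fit into buf. */
static int psi_format_trigger(char *buf, size_t len, const char *kind,
			      unsigned int stall_us, unsigned int window_us)
{
	int n = snprintf(buf, len, "%s %u %u", kind, stall_us, window_us);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Register a trigger on a pressure file.
 * Returns a descriptor to poll() for POLLPRI, or a negative errno. */
static int psi_register_trigger(const char *path, const char *kind,
				unsigned int stall_us, unsigned int window_us)
{
	char buf[64];
	int fd, n;

	n = psi_format_trigger(buf, sizeof(buf), kind, stall_us, window_us);
	if (n < 0)
		return -EINVAL;

	fd = open(path, O_RDWR | O_NONBLOCK);
	if (fd < 0)
		return -errno;

	/* The trailing NUL is written as well. */
	if (write(fd, buf, (size_t)n + 1) < 0) {
		int err = errno;

		close(fd);
		return -err;
	}
	return fd;
}
```

With this patch applied, I would expect
psi_register_trigger("/proc/pressure/cpu", "full", 150000, 1000000)
to fail with -EINVAL, while a "some" trigger on the same file and a
"full" trigger on a cgroup's cpu.pressure keep working.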

Fixes: e7fcd7622823 ("psi: Add PSI_CPU_FULL state")
Reported-by: Martin Steigerwald <Martin.Steigerwald@proact.de>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 Documentation/accounting/psi.rst | 18 +++++++++++++++---
 kernel/sched/psi.c               | 10 +++++++++-
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/Documentation/accounting/psi.rst b/Documentation/accounting/psi.rst
index 860fe651d645..519652c06d7d 100644
--- a/Documentation/accounting/psi.rst
+++ b/Documentation/accounting/psi.rst
@@ -178,8 +178,20 @@ Cgroup2 interface
 In a system with a CONFIG_CGROUP=y kernel and the cgroup2 filesystem
 mounted, pressure stall information is also tracked for tasks grouped
 into cgroups. Each subdirectory in the cgroupfs mountpoint contains
-cpu.pressure, memory.pressure, and io.pressure files; the format is
-the same as the /proc/pressure/ files.
+cpu.pressure, memory.pressure, and io.pressure files; the format of
+memory.pressure and io.pressure is the same as the /proc/pressure/ files.
+
+But the format of cpu.pressure is as such::
+	some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+	full avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+The "some" line indicates the share of time in which at least some tasks
+in the cgroup are stalled on CPU resource.
+
+The "full" line indicates the share of time in which all non-idle tasks
+in the cgroup are stalled on CPU resource, which is being used by others
+outside of the cgroup or throttled by the cgroup cpu.max configuration.
 
 Per-cgroup psi monitors can be specified and used the same way as
-system-wide ones.
+system-wide ones, except that users can also monitor full pressure on
+CPU resource at the cgroup level.
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index e14358178849..d1baeb07d08c 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1047,6 +1047,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 
 int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 {
+	int full_max = 2;
 	int full;
 	u64 now;
 
@@ -1061,7 +1062,11 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 		group->avg_next_update = update_averages(group, now);
 	mutex_unlock(&group->avgs_lock);
 
-	for (full = 0; full < 2; full++) {
+	/* CPU_FULL state doesn't exist at system level */
+	if (res == PSI_CPU && group == &psi_system)
+		full_max = 1;
+
+	for (full = 0; full < full_max; full++) {
 		unsigned long avg[3];
 		u64 total;
 		int w;
@@ -1103,6 +1108,9 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
 	if (state >= PSI_NONIDLE)
 		return ERR_PTR(-EINVAL);
 
+	if (state == PSI_CPU_FULL && group == &psi_system)
+		return ERR_PTR(-EINVAL);
+
 	if (window_us < WINDOW_MIN_US ||
 		window_us > WINDOW_MAX_US)
 		return ERR_PTR(-EINVAL);
-- 
2.20.1


* Re: [PATCH] psi: remove CPU full metric at system level
From: Johannes Weiner @ 2022-03-03 13:26 UTC
  To: Chengming Zhou
  Cc: corbet, mingo, peterz, surenb, ebiggers, linux-doc, linux-kernel,
	songmuchun, Martin Steigerwald

On Thu, Mar 03, 2022 at 01:58:14PM +0800, Chengming Zhou wrote:
> Martin find it confusing when look at the /proc/pressure/cpu output,
> and found no hint about that CPU "full" line in psi Documentation.
> 
> % cat /proc/pressure/cpu
> some avg10=0.92 avg60=0.91 avg300=0.73 total=933490489
> full avg10=0.22 avg60=0.23 avg300=0.16 total=358783277
> 
> The PSI_CPU_FULL state is introduced by commit e7fcd7622823
> ("psi: Add PSI_CPU_FULL state"), which mainly for cgroup level,
> but also counted at the system level as a side effect.
> 
> Naturally, the FULL state doesn't exist for the CPU resource at
> the system level. These "full" numbers can come from CPU idle
> schedule latency. For example, t1 is the time when task wakeup
> on an idle CPU, t2 is the time when CPU pick and switch to it.
> The delta of (t2 - t1) will be in CPU_FULL state.
> 
> Another case all processes can be stalled is when all cgroups
> have been throttled at the same time, which unlikely to happen.
> 
> Anyway, CPU_FULL metric is meaningless and confusing at the
> system level. So this patch removed CPU full metric at the
> system level, and removed it's monitor function too. The psi
> Documentation has also been updated accordingly.
> 
> Fixes: e7fcd7622823 ("psi: Add PSI_CPU_FULL state")
> Reported-by: Martin Steigerwald <Martin.Steigerwald@proact.de>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  Documentation/accounting/psi.rst | 18 +++++++++++++++---
>  kernel/sched/psi.c               | 10 +++++++++-
>  2 files changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/accounting/psi.rst b/Documentation/accounting/psi.rst
> index 860fe651d645..519652c06d7d 100644
> --- a/Documentation/accounting/psi.rst
> +++ b/Documentation/accounting/psi.rst
> @@ -178,8 +178,20 @@ Cgroup2 interface
>  In a system with a CONFIG_CGROUP=y kernel and the cgroup2 filesystem
>  mounted, pressure stall information is also tracked for tasks grouped
>  into cgroups. Each subdirectory in the cgroupfs mountpoint contains
> -cpu.pressure, memory.pressure, and io.pressure files; the format is
> -the same as the /proc/pressure/ files.
> +cpu.pressure, memory.pressure, and io.pressure files; the format of
> +memory.pressure and io.pressure is the same as the /proc/pressure/ files.
> +
> +But the format of cpu.pressure is as such::
> +	some avg10=0.00 avg60=0.00 avg300=0.00 total=0
> +	full avg10=0.00 avg60=0.00 avg300=0.00 total=0

It's the format of cpu.pressure, except when it's
/sys/fs/cgroup/cpu.pressure... I think this is getting maybe a tad too
difficult to write parsers for. Plus, we added the line over a year
ago so we might break somebody by removing it again.

How about reporting zeroes at the system level?
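
As a rough illustration of the parser concern, the sketch below reads
one pressure line with a single sscanf() format. If the system-level
CPU "full" line reports zeroes instead of disappearing, the same code
handles /proc/pressure/cpu and every cgroup's cpu.pressure without a
special case. psi_parse_line() and struct psi_line are hypothetical
names, not an existing library interface.

```c
#include <stdio.h>
#include <string.h>

struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* stall percentages */
	unsigned long long total;	/* stall time in microseconds */
};

/* Parse one line of a pressure file. Returns 0 on success, -1 on error. */
static int psi_parse_line(const char *s, struct psi_line *out)
{
	int n = sscanf(s, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total);

	if (n != 5)
		return -1;
	if (strcmp(out->kind, "some") && strcmp(out->kind, "full"))
		return -1;
	return 0;
}
```

A caller can simply invoke psi_parse_line() on each of the two lines
and treat an all-zero "full" record like any other value.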

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index e14358178849..86824de404bc 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1062,14 +1062,17 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 	mutex_unlock(&group->avgs_lock);
 
 	for (full = 0; full < 2; full++) {
-		unsigned long avg[3];
-		u64 total;
+		unsigned long avg[3] = { 0, };
+		u64 total = 0;
 		int w;
 
-		for (w = 0; w < 3; w++)
-			avg[w] = group->avg[res * 2 + full][w];
-		total = div_u64(group->total[PSI_AVGS][res * 2 + full],
-				NSEC_PER_USEC);
+		/* CPU FULL is undefined at the system level */
+		if (!(group == &psi_system && res == PSI_CPU && full)) {
+			for (w = 0; w < 3; w++)
+				avg[w] = group->avg[res * 2 + full][w];
+			total = div_u64(group->total[PSI_AVGS][res * 2 + full],
+					NSEC_PER_USEC);
+		}
 
 		seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n",
 			   full ? "full" : "some",

* Re: [External] Re: [PATCH] psi: remove CPU full metric at system level
From: Chengming Zhou @ 2022-03-04 14:43 UTC
  To: Johannes Weiner
  Cc: corbet, mingo, peterz, surenb, ebiggers, linux-doc, linux-kernel,
	songmuchun, Martin Steigerwald

On 2022/3/3 9:26 PM, Johannes Weiner wrote:
> On Thu, Mar 03, 2022 at 01:58:14PM +0800, Chengming Zhou wrote:
>> Martin find it confusing when look at the /proc/pressure/cpu output,
>> and found no hint about that CPU "full" line in psi Documentation.
>>
>> % cat /proc/pressure/cpu
>> some avg10=0.92 avg60=0.91 avg300=0.73 total=933490489
>> full avg10=0.22 avg60=0.23 avg300=0.16 total=358783277
>>
>> The PSI_CPU_FULL state is introduced by commit e7fcd7622823
>> ("psi: Add PSI_CPU_FULL state"), which mainly for cgroup level,
>> but also counted at the system level as a side effect.
>>
>> Naturally, the FULL state doesn't exist for the CPU resource at
>> the system level. These "full" numbers can come from CPU idle
>> schedule latency. For example, t1 is the time when task wakeup
>> on an idle CPU, t2 is the time when CPU pick and switch to it.
>> The delta of (t2 - t1) will be in CPU_FULL state.
>>
>> Another case all processes can be stalled is when all cgroups
>> have been throttled at the same time, which unlikely to happen.
>>
>> Anyway, CPU_FULL metric is meaningless and confusing at the
>> system level. So this patch removed CPU full metric at the
>> system level, and removed it's monitor function too. The psi
>> Documentation has also been updated accordingly.
>>
>> Fixes: e7fcd7622823 ("psi: Add PSI_CPU_FULL state")
>> Reported-by: Martin Steigerwald <Martin.Steigerwald@proact.de>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>>  Documentation/accounting/psi.rst | 18 +++++++++++++++---
>>  kernel/sched/psi.c               | 10 +++++++++-
>>  2 files changed, 24 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/accounting/psi.rst b/Documentation/accounting/psi.rst
>> index 860fe651d645..519652c06d7d 100644
>> --- a/Documentation/accounting/psi.rst
>> +++ b/Documentation/accounting/psi.rst
>> @@ -178,8 +178,20 @@ Cgroup2 interface
>>  In a system with a CONFIG_CGROUP=y kernel and the cgroup2 filesystem
>>  mounted, pressure stall information is also tracked for tasks grouped
>>  into cgroups. Each subdirectory in the cgroupfs mountpoint contains
>> -cpu.pressure, memory.pressure, and io.pressure files; the format is
>> -the same as the /proc/pressure/ files.
>> +cpu.pressure, memory.pressure, and io.pressure files; the format of
>> +memory.pressure and io.pressure is the same as the /proc/pressure/ files.
>> +
>> +But the format of cpu.pressure is as such::
>> +	some avg10=0.00 avg60=0.00 avg300=0.00 total=0
>> +	full avg10=0.00 avg60=0.00 avg300=0.00 total=0
> 
> It's the format of cpu.pressure, except when it's
> /sys/fs/cgroup/cpu.pressure... I think this is getting maybe a tad too
> difficult to write parsers for. Plus, we added the line over a year
> ago so we might break somebody by removing it again.
> 
> How about reporting zeroes at the system level?

OK, that is indeed better for userspace parsers. I will change the
patch to report zeroes and send a new version later.

Thanks.

> 
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index e14358178849..86824de404bc 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -1062,14 +1062,17 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
>  	mutex_unlock(&group->avgs_lock);
>  
>  	for (full = 0; full < 2; full++) {
> -		unsigned long avg[3];
> -		u64 total;
> +		unsigned long avg[3] = { 0, };
> +		u64 total = 0;
>  		int w;
>  
> -		for (w = 0; w < 3; w++)
> -			avg[w] = group->avg[res * 2 + full][w];
> -		total = div_u64(group->total[PSI_AVGS][res * 2 + full],
> -				NSEC_PER_USEC);
> +		/* CPU FULL is undefined at the system level */
> +		if (!(group == &psi_system && res == PSI_CPU && full)) {
> +			for (w = 0; w < 3; w++)
> +				avg[w] = group->avg[res * 2 + full][w];
> +			total = div_u64(group->total[PSI_AVGS][res * 2 + full],
> +					NSEC_PER_USEC);
> +		}
>  
>  		seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n",
>  			   full ? "full" : "some",
