* [PATCH 0/9] sched/psi: some optimization and extension
@ 2022-07-21  4:04 Chengming Zhou
  2022-07-21  4:04 ` [PATCH 1/9] sched/psi: fix periodic aggregation shut off Chengming Zhou
                   ` (8 more replies)
  0 siblings, 9 replies; 38+ messages in thread
From: Chengming Zhou @ 2022-07-21  4:04 UTC (permalink / raw)
  To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap
  Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou

Hi all,

This patch series contains some optimizations and extensions for PSI.

Patch 1/9 fixes the periodic aggregation shutoff problem introduced by the
earlier commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups").

Patch 2/9 optimizes task switch inside shared cgroups when the in_memstall
status of the prev task and the next task differs.

Patches 3-4 optimize and simplify PSI status tracking by not changing task
psi_flags on CPU/cgroup migration.

Patches 7-8 introduce the new kernel cmdline parameter "psi_inner_cgroup=" to
configure whether or not to track PSI stall information for inner cgroups.

Patch 9/9 introduces the new PSI resource PSI_IRQ to track IRQ/SOFTIRQ
pressure stall information when CONFIG_IRQ_TIME_ACCOUNTING is enabled.
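
As a quick illustration (not part of the series), here is a minimal
userspace reader sketch for the new interface. It assumes patch 9/9 is
applied and CONFIG_IRQ_TIME_ACCOUNTING is enabled, and simply dumps the
"full" line that /proc/pressure/irq exposes:

  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
          char line[256];
          /* /proc/pressure/irq only reports "full", see patch 9/9 */
          FILE *f = fopen("/proc/pressure/irq", "r");

          if (!f) {
                  perror("fopen");
                  return EXIT_FAILURE;
          }
          while (fgets(line, sizeof(line), f))
                  fputs(line, stdout);
          fclose(f);
          return EXIT_SUCCESS;
  }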

Performance test on an Intel Xeon Platinum with 3 levels of cgroup, running
the mmtests config-scheduler-perfpipe workload:

                                  tip                    tip                    tip                patched                patched                patched                patched
                               default   cgroup_disable=pressure  IRQ_TIME_ACCOUNTING              default      psi_inner_cgroup=off             PSI_IRQ    PSI_IRQ + psi_inner_cgroup=off
Min       Time        9.89 (   0.00%)        8.99 (   9.12%)       10.04 (  -1.53%)        9.63 (   2.58%)        9.27 (   6.22%)       10.09 (  -2.04%)        9.45 (   4.41%)
1st-qrtle Time       10.01 (   0.00%)        9.15 (   8.66%)       10.16 (  -1.45%)        9.72 (   2.89%)        9.35 (   6.61%)       10.20 (  -1.81%)        9.54 (   4.77%)
2nd-qrtle Time       10.07 (   0.00%)        9.25 (   8.12%)       10.19 (  -1.21%)        9.79 (   2.73%)        9.38 (   6.78%)       10.24 (  -1.75%)        9.59 (   4.68%)
3rd-qrtle Time       10.14 (   0.00%)        9.30 (   8.32%)       10.23 (  -0.88%)        9.84 (   3.00%)        9.44 (   6.92%)       10.27 (  -1.21%)        9.62 (   5.18%)
Max-1     Time        9.89 (   0.00%)        8.99 (   9.12%)       10.04 (  -1.53%)        9.63 (   2.58%)        9.27 (   6.22%)       10.09 (  -2.04%)        9.45 (   4.41%)
Max-5     Time        9.89 (   0.00%)        8.99 (   9.12%)       10.04 (  -1.53%)        9.63 (   2.58%)        9.27 (   6.22%)       10.09 (  -2.04%)        9.45 (   4.41%)
Max-10    Time        9.92 (   0.00%)        9.09 (   8.33%)       10.11 (  -1.97%)        9.67 (   2.51%)        9.29 (   6.29%)       10.15 (  -2.30%)        9.48 (   4.46%)
Max-90    Time       10.20 (   0.00%)        9.33 (   8.53%)       10.33 (  -1.24%)        9.87 (   3.29%)        9.49 (   6.99%)       10.29 (  -0.85%)        9.66 (   5.32%)
Max-95    Time       10.23 (   0.00%)        9.34 (   8.70%)       10.37 (  -1.39%)        9.94 (   2.83%)        9.53 (   6.88%)       10.30 (  -0.65%)        9.67 (   5.51%)
Max-99    Time       10.23 (   0.00%)        9.37 (   8.43%)       10.40 (  -1.63%)        9.99 (   2.41%)        9.76 (   4.57%)       10.31 (  -0.74%)        9.69 (   5.25%)
Max       Time       10.34 (   0.00%)        9.46 (   8.50%)       10.43 (  -0.83%)       17.04 ( -64.80%)        9.79 (   5.36%)       10.32 (   0.20%)        9.71 (   6.07%)
Amean     Time       10.08 (   0.00%)        9.23 *   8.39%*       10.21 *  -1.33%*       10.03 (   0.47%)        9.41 *   6.59%*       10.23 *  -1.53%*        9.59 *   4.87%*

Thanks!

Chengming Zhou (9):
  sched/psi: fix periodic aggregation shut off
  sched/psi: optimize task switch inside shared cgroups again
  sched/psi: move private helpers to sched/stats.h
  sched/psi: don't change task psi_flags when migrate CPU/group
  sched/psi: don't create cgroup PSI files when psi_disabled
  sched/psi: save percpu memory when !psi_cgroups_enabled
  sched/psi: cache parent psi_group to speed up groups iterate
  sched/psi: add kernel cmdline parameter psi_inner_cgroup
  sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure

 .../admin-guide/kernel-parameters.txt         |  11 +
 include/linux/psi.h                           |   5 +-
 include/linux/psi_types.h                     |   9 +-
 include/linux/sched.h                         |   3 -
 kernel/cgroup/cgroup.c                        |  30 +++
 kernel/sched/core.c                           |   2 +
 kernel/sched/psi.c                            | 194 +++++++++++++-----
 kernel/sched/stats.h                          |  71 ++++---
 8 files changed, 232 insertions(+), 93 deletions(-)

-- 
2.36.1


* [PATCH 1/9] sched/psi: fix periodic aggregation shut off
  2022-07-21  4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou
@ 2022-07-21  4:04 ` Chengming Zhou
  2022-07-25 15:34   ` Johannes Weiner
  2022-07-25 15:39   ` Johannes Weiner
  2022-07-21  4:04 ` [PATCH 2/9] sched/psi: optimize task switch inside shared cgroups again Chengming Zhou
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 38+ messages in thread
From: Chengming Zhou @ 2022-07-21  4:04 UTC (permalink / raw)
  To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap
  Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou

We don't want to wake the periodic aggregation work back up if the
task change is the aggregation worker itself going to sleep, or
we'll ping-pong forever.

Previously, we used psi_task_change() in psi_dequeue() when the task
was going to sleep, so this check was put in psi_task_change().

But commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
deferred the task sleep handling to psi_task_switch(), which doesn't go
through psi_task_change() anymore.

So this patch moves the check to psi_task_switch(). Note that for the
deferred sleep case, we should still wake the periodic avgs work for the
common ancestor groups, since those groups have the next task scheduled in.

Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/psi.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index a337f3e35997..c8a4e644cd2c 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -800,7 +800,6 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
 	struct psi_group *group;
-	bool wake_clock = true;
 	void *iter = NULL;
 	u64 now;
 
@@ -810,19 +809,9 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 	psi_flags_change(task, clear, set);
 
 	now = cpu_clock(cpu);
-	/*
-	 * Periodic aggregation shuts off if there is a period of no
-	 * task changes, so we wake it back up if necessary. However,
-	 * don't do this if the task change is the aggregation worker
-	 * itself going to sleep, or we'll ping-pong forever.
-	 */
-	if (unlikely((clear & TSK_RUNNING) &&
-		     (task->flags & PF_WQ_WORKER) &&
-		     wq_worker_last_func(task) == psi_avgs_work))
-		wake_clock = false;
 
 	while ((group = iterate_groups(task, &iter)))
-		psi_group_change(group, cpu, clear, set, now, wake_clock);
+		psi_group_change(group, cpu, clear, set, now, true);
 }
 
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -858,6 +847,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 
 	if (prev->pid) {
 		int clear = TSK_ONCPU, set = 0;
+		bool wake_clock = true;
 
 		/*
 		 * When we're going to sleep, psi_dequeue() lets us
@@ -871,13 +861,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 				clear |= TSK_MEMSTALL_RUNNING;
 			if (prev->in_iowait)
 				set |= TSK_IOWAIT;
+
+			/*
+			 * Periodic aggregation shuts off if there is a period of no
+			 * task changes, so we wake it back up if necessary. However,
+			 * don't do this if the task change is the aggregation worker
+			 * itself going to sleep, or we'll ping-pong forever.
+			 */
+			if (unlikely((prev->flags & PF_WQ_WORKER) &&
+				     wq_worker_last_func(prev) == psi_avgs_work))
+				wake_clock = false;
 		}
 
 		psi_flags_change(prev, clear, set);
 
 		iter = NULL;
 		while ((group = iterate_groups(prev, &iter)) && group != common)
-			psi_group_change(group, cpu, clear, set, now, true);
+			psi_group_change(group, cpu, clear, set, now, wake_clock);
 
 		/*
 		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
-- 
2.36.1


* [PATCH 2/9] sched/psi: optimize task switch inside shared cgroups again
  2022-07-21  4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou
  2022-07-21  4:04 ` [PATCH 1/9] sched/psi: fix periodic aggregation shut off Chengming Zhou
@ 2022-07-21  4:04 ` Chengming Zhou
  2022-07-21  4:04 ` [PATCH 3/9] sched/psi: move private helpers to sched/stats.h Chengming Zhou
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Chengming Zhou @ 2022-07-21  4:04 UTC (permalink / raw)
  To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap
  Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou

Commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
deferred the prev task sleep handling to psi_task_switch(), so we no longer
need to clear and set the TSK_ONCPU state for the common cgroups.

    A
    |
    B
   / \
  C   D
 /     \
prev   next

After that commit, psi_task_switch() does:
1. psi_group_change(next, .set=TSK_ONCPU) for D
2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C
3. psi_group_change(prev, .clear=TSK_RUNNING) for B, A

But there is a limitation: if "prev->psi_flags == next->psi_flags" is not
satisfied, this common-cgroup optimization is unusable for both the sleep
switch and the running switch cases. For example:

prev->in_memstall != next->in_memstall when sleep switch:
1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A
2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C, B, A

prev->in_memstall != next->in_memstall when running switch:
1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A
2. psi_group_change(prev, .clear=TSK_ONCPU) for C, B, A

The reason this limitation exists is that we consider a group to be
PSI_MEM_FULL if the CPU is actively reclaiming and nothing productive
could run even if it were runnable. So when the CPU's curr changes from
prev to next and their in_memstall status differs, we have to update the
PSI_MEM_FULL status of their common cgroups.

This patch removes this limitation by making psi_group_change() derive
the PSI_MEM_FULL status from the CPU's curr->in_memstall status.
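
The hunks quoted below only show the psi_task_switch() side; the state
test change described above is not in this excerpt. As a toy model only
(illustrative names and a simplified condition, not the kernel code),
the reworked MEM_FULL test behaves roughly like this, which also shows
why common groups must be updated when in_memstall differs across the
switch:

  #include <stdbool.h>
  #include <stdio.h>

  /*
   * MEM_FULL: memstalled tasks exist and the task currently on the
   * CPU is itself stalled on memory, so nothing productive runs.
   */
  static bool mem_full(unsigned int nr_memstall, bool curr_in_memstall)
  {
          return nr_memstall && curr_in_memstall;
  }

  int main(void)
  {
          printf("memstall prev on cpu: %d\n", mem_full(1, true));  /* 1 */
          printf("plain next on cpu:    %d\n", mem_full(1, false)); /* 0 */
          return 0;
  }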

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/psi.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index c8a4e644cd2c..e04041d8251b 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -823,8 +823,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	u64 now = cpu_clock(cpu);
 
 	if (next->pid) {
-		bool identical_state;
-
 		psi_flags_change(next, 0, TSK_ONCPU);
 		/*
 		 * When switching between tasks that have an identical
@@ -832,11 +830,9 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * we reach the first common ancestor. Iterate @next's
 		 * ancestors only until we encounter @prev's ONCPU.
 		 */
-		identical_state = prev->psi_flags == next->psi_flags;
 		iter = NULL;
 		while ((group = iterate_groups(next, &iter))) {
-			if (identical_state &&
-			    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
 				common = group;
 				break;
 			}
@@ -883,7 +879,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
 		 * with dequeuing too, finish that for the rest of the hierarchy.
 		 */
-		if (sleep) {
+		if (sleep || unlikely(prev->in_memstall != next->in_memstall)) {
 			clear &= ~TSK_ONCPU;
 			for (; group; group = iterate_groups(prev, &iter))
 				psi_group_change(group, cpu, clear, set, now, true);
-- 
2.36.1


* [PATCH 3/9] sched/psi: move private helpers to sched/stats.h
  2022-07-21  4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou
  2022-07-21  4:04 ` [PATCH 1/9] sched/psi: fix periodic aggregation shut off Chengming Zhou
  2022-07-21  4:04 ` [PATCH 2/9] sched/psi: optimize task switch inside shared cgroups again Chengming Zhou
@ 2022-07-21  4:04 ` Chengming Zhou
  2022-07-25 16:39   ` Johannes Weiner
  2022-07-21  4:04 ` [PATCH 4/9] sched/psi: don't change task psi_flags when migrate CPU/group Chengming Zhou
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Chengming Zhou @ 2022-07-21  4:04 UTC (permalink / raw)
  To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap
  Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou

This patch moves the psi_task_change()/psi_task_switch() declarations out
of the public PSI header, since they are only needed to implement the PSI
stats tracking in sched/stats.h.

psi_task_switch() is obviously private; psi_task_change() can't be a public
helper either, since it doesn't check the psi_disabled static key. And there
are no external users now, so put it in sched/stats.h too.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 include/linux/psi.h  | 4 ----
 kernel/sched/stats.h | 4 ++++
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index 89784763d19e..aa168a038242 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -18,10 +18,6 @@ extern struct psi_group psi_system;
 
 void psi_init(void);
 
-void psi_task_change(struct task_struct *task, int clear, int set);
-void psi_task_switch(struct task_struct *prev, struct task_struct *next,
-		     bool sleep);
-
 void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);
 
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index baa839c1ba96..c39b467ece43 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -107,6 +107,10 @@ __schedstats_from_se(struct sched_entity *se)
 }
 
 #ifdef CONFIG_PSI
+void psi_task_change(struct task_struct *task, int clear, int set);
+void psi_task_switch(struct task_struct *prev, struct task_struct *next,
+		     bool sleep);
+
 /*
  * PSI tracks state that persists across sleeps, such as iowaits and
  * memory stalls. As a result, it has to distinguish between sleeps,
-- 
2.36.1


* [PATCH 4/9] sched/psi: don't change task psi_flags when migrate CPU/group
  2022-07-21  4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou
                   ` (2 preceding siblings ...)
  2022-07-21  4:04 ` [PATCH 3/9] sched/psi: move private helpers to sched/stats.h Chengming Zhou
@ 2022-07-21  4:04 ` Chengming Zhou
  2022-07-21  4:04 ` [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Chengming Zhou @ 2022-07-21  4:04 UTC (permalink / raw)
  To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap
  Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou

The current code uses psi_task_change() at every scheduling point, which
changes the task's psi_flags and then updates all of its psi_groups.

So we have to rely heavily on the task's scheduling state to calculate
what to set and what to clear at every scheduling point, which makes the
PSI stats tracking code complex and error prone.

In fact, the task's psi_flags only change at wakeup and sleep (except for
the ONCPU state at switch); they don't change at all on CPU/group migration.

If we keep psi_flags unchanged on CPU/group migration, we can just use
task->psi_flags as the mask to clear (migrate out) or set (migrate in),
which makes PSI stats tracking much simpler and more efficient.

Note: ENQUEUE_WAKEUP only means waking a task up from the sleep state; it
doesn't cover waking up a new task, so add psi_enqueue() in
wake_up_new_task().
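
As a toy model only (plain userspace C with illustrative flag values,
not the kernel code), the invariant this patch establishes looks like
this: on migration, whatever is in task->psi_flags is cleared from the
old groups and set on the new ones, with no per-state recomputation:

  #include <stdio.h>

  #define TSK_IOWAIT   (1 << 0)
  #define TSK_MEMSTALL (1 << 1)
  #define TSK_RUNNING  (1 << 2)

  /* stand-in for psi_change_groups(task, clear, set) */
  static void change_groups(const char *side, int clear, int set)
  {
          printf("%s: clear=0x%x set=0x%x\n", side, clear, set);
  }

  int main(void)
  {
          int psi_flags = TSK_RUNNING | TSK_MEMSTALL;

          change_groups("old rq", psi_flags, 0); /* psi_dequeue(p, !sleep) */
          change_groups("new rq", 0, psi_flags); /* psi_enqueue(p, !wakeup) */
          return 0;
  }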

Performance test on Intel Xeon Platinum with 3 levels of cgroup:

1. before the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.034 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 8.210 [sec]

       8.210600 usecs/op
         121793 ops/sec

2. after the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.032 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 8.077 [sec]

       8.077648 usecs/op
         123798 ops/sec

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 include/linux/sched.h |  3 ---
 kernel/sched/core.c   |  1 +
 kernel/sched/psi.c    | 24 ++++++++++---------
 kernel/sched/stats.h  | 54 +++++++++++++++++++++----------------------
 4 files changed, 40 insertions(+), 42 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 88b8817b827d..20a94786cad8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -879,9 +879,6 @@ struct task_struct {
 	unsigned			sched_reset_on_fork:1;
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
-#ifdef CONFIG_PSI
-	unsigned			sched_psi_wake_requeue:1;
-#endif
 
 	/* Force alignment to the next boundary: */
 	unsigned			:0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a463dbc92fcd..f5f2d3542b05 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4642,6 +4642,7 @@ void wake_up_new_task(struct task_struct *p)
 	post_init_entity_util_avg(p);
 
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
+	psi_enqueue(p, true);
 	trace_sched_wakeup_new(p);
 	check_preempt_curr(rq, p, WF_FORK);
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index e04041d8251b..6ba159fe2a4f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -796,22 +796,24 @@ static void psi_flags_change(struct task_struct *task, int clear, int set)
 	task->psi_flags |= set;
 }
 
-void psi_task_change(struct task_struct *task, int clear, int set)
+void psi_change_groups(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
 	struct psi_group *group;
 	void *iter = NULL;
-	u64 now;
+	u64 now = cpu_clock(cpu);
+
+	while ((group = iterate_groups(task, &iter)))
+		psi_group_change(group, cpu, clear, set, now, true);
+}
 
+void psi_task_change(struct task_struct *task, int clear, int set)
+{
 	if (!task->pid)
 		return;
 
 	psi_flags_change(task, clear, set);
-
-	now = cpu_clock(cpu);
-
-	while ((group = iterate_groups(task, &iter)))
-		psi_group_change(group, cpu, clear, set, now, true);
+	psi_change_groups(task, clear, set);
 }
 
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -1015,9 +1017,9 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 	 *   pick_next_task()
 	 *     rq_unlock()
 	 *                                rq_lock()
-	 *                                psi_task_change() // old cgroup
+	 *                                psi_change_groups() // old cgroup
 	 *                                task->cgroups = to
-	 *                                psi_task_change() // new cgroup
+	 *                                psi_change_groups() // new cgroup
 	 *                                rq_unlock()
 	 *     rq_lock()
 	 *   psi_sched_switch() // does deferred updates in new cgroup
@@ -1027,13 +1029,13 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 	task_flags = task->psi_flags;
 
 	if (task_flags)
-		psi_task_change(task, task_flags, 0);
+		psi_change_groups(task, task_flags, 0);
 
 	/* See comment above */
 	rcu_assign_pointer(task->cgroups, to);
 
 	if (task_flags)
-		psi_task_change(task, 0, task_flags);
+		psi_change_groups(task, 0, task_flags);
 
 	task_rq_unlock(rq, task, &rf);
 }
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index c39b467ece43..e930b8fa6253 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -107,6 +107,7 @@ __schedstats_from_se(struct sched_entity *se)
 }
 
 #ifdef CONFIG_PSI
+void psi_change_groups(struct task_struct *task, int clear, int set);
 void psi_task_change(struct task_struct *task, int clear, int set);
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep);
@@ -124,42 +125,46 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)
 	if (static_branch_likely(&psi_disabled))
 		return;
 
-	if (p->in_memstall)
-		set |= TSK_MEMSTALL_RUNNING;
+	if (!wakeup) {
+		if (p->psi_flags)
+			psi_change_groups(p, 0, p->psi_flags);
+		return;
+	}
 
-	if (!wakeup || p->sched_psi_wake_requeue) {
-		if (p->in_memstall)
+	/*
+	 * wakeup (including wakeup migrate) need to change task psi_flags,
+	 * specifically need to set TSK_RUNNING or TSK_MEMSTALL_RUNNING.
+	 * Since we clear task->psi_flags for wakeup migrated task, we need
+	 * to check task->psi_flags to see what should be set and clear.
+	 */
+	if (unlikely(p->in_memstall)) {
+		set |= TSK_MEMSTALL_RUNNING;
+		if (!(p->psi_flags & TSK_MEMSTALL))
 			set |= TSK_MEMSTALL;
-		if (p->sched_psi_wake_requeue)
-			p->sched_psi_wake_requeue = 0;
-	} else {
-		if (p->in_iowait)
-			clear |= TSK_IOWAIT;
 	}
+	if (p->psi_flags & TSK_IOWAIT)
+		clear |= TSK_IOWAIT;
 
 	psi_task_change(p, clear, set);
 }
 
 static inline void psi_dequeue(struct task_struct *p, bool sleep)
 {
-	int clear = TSK_RUNNING;
-
 	if (static_branch_likely(&psi_disabled))
 		return;
 
+	if (!sleep) {
+		if (p->psi_flags)
+			psi_change_groups(p, p->psi_flags, 0);
+		return;
+	}
+
 	/*
 	 * A voluntary sleep is a dequeue followed by a task switch. To
 	 * avoid walking all ancestors twice, psi_task_switch() handles
 	 * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
 	 * Do nothing here.
 	 */
-	if (sleep)
-		return;
-
-	if (p->in_memstall)
-		clear |= (TSK_MEMSTALL | TSK_MEMSTALL_RUNNING);
-
-	psi_task_change(p, clear, 0);
 }
 
 static inline void psi_ttwu_dequeue(struct task_struct *p)
@@ -169,21 +174,14 @@ static inline void psi_ttwu_dequeue(struct task_struct *p)
 	/*
 	 * Is the task being migrated during a wakeup? Make sure to
 	 * deregister its sleep-persistent psi states from the old
-	 * queue, and let psi_enqueue() know it has to requeue.
+	 * queue.
 	 */
-	if (unlikely(p->in_iowait || p->in_memstall)) {
+	if (unlikely(p->psi_flags)) {
 		struct rq_flags rf;
 		struct rq *rq;
-		int clear = 0;
-
-		if (p->in_iowait)
-			clear |= TSK_IOWAIT;
-		if (p->in_memstall)
-			clear |= TSK_MEMSTALL;
 
 		rq = __task_rq_lock(p, &rf);
-		psi_task_change(p, clear, 0);
-		p->sched_psi_wake_requeue = 1;
+		psi_task_change(p, p->psi_flags, 0);
 		__task_rq_unlock(rq, &rf);
 	}
 }
-- 
2.36.1


* [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled
  2022-07-21  4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou
                   ` (3 preceding siblings ...)
  2022-07-21  4:04 ` [PATCH 4/9] sched/psi: don't change task psi_flags when migrate CPU/group Chengming Zhou
@ 2022-07-21  4:04 ` Chengming Zhou
  2022-07-25 16:41   ` Johannes Weiner
  2022-07-21  4:04 ` [PATCH 6/9] sched/psi: save percpu memory when !psi_cgroups_enabled Chengming Zhou
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Chengming Zhou @ 2022-07-21  4:04 UTC (permalink / raw)
  To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap
  Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou

Commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
made it possible to configure PSI to skip per-cgroup stall accounting, in
which case the PSI files are not exposed in the cgroup hierarchy.

This patch does the same thing when psi_disabled.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/cgroup/cgroup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1779ccddb734..1424da7ed2c4 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3700,6 +3700,9 @@ static void cgroup_pressure_release(struct kernfs_open_file *of)
 
 bool cgroup_psi_enabled(void)
 {
+	if (static_branch_likely(&psi_disabled))
+		return false;
+
 	return (cgroup_feature_disable_mask & (1 << OPT_FEATURE_PRESSURE)) == 0;
 }
 
-- 
2.36.1


* [PATCH 6/9] sched/psi: save percpu memory when !psi_cgroups_enabled
  2022-07-21  4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou
                   ` (4 preceding siblings ...)
  2022-07-21  4:04 ` [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou
@ 2022-07-21  4:04 ` Chengming Zhou
  2022-07-25 16:47   ` Johannes Weiner
  2022-07-21  4:04 ` [PATCH 7/9] sched/psi: cache parent psi_group to speed up groups iterate Chengming Zhou
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Chengming Zhou @ 2022-07-21  4:04 UTC (permalink / raw)
  To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap
  Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou

We won't use the cgroup psi_group when !psi_cgroups_enabled, so don't
bother allocating and initializing percpu memory for it.

We also don't need to migrate task PSI stats between cgroups in
cgroup_move_task() in that case.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/psi.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 6ba159fe2a4f..aa40bf888102 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -205,6 +205,7 @@ void __init psi_init(void)
 {
 	if (!psi_enable) {
 		static_branch_enable(&psi_disabled);
+		static_branch_disable(&psi_cgroups_enabled);
 		return;
 	}
 
@@ -952,7 +953,7 @@ void psi_memstall_leave(unsigned long *flags)
 #ifdef CONFIG_CGROUPS
 int psi_cgroup_alloc(struct cgroup *cgroup)
 {
-	if (static_branch_likely(&psi_disabled))
+	if (!static_branch_likely(&psi_cgroups_enabled))
 		return 0;
 
 	cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu);
@@ -964,7 +965,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 
 void psi_cgroup_free(struct cgroup *cgroup)
 {
-	if (static_branch_likely(&psi_disabled))
+	if (!static_branch_likely(&psi_cgroups_enabled))
 		return;
 
 	cancel_delayed_work_sync(&cgroup->psi.avgs_work);
@@ -991,7 +992,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 	struct rq_flags rf;
 	struct rq *rq;
 
-	if (static_branch_likely(&psi_disabled)) {
+	if (!static_branch_likely(&psi_cgroups_enabled)) {
 		/*
 		 * Lame to do this here, but the scheduler cannot be locked
 		 * from the outside, so we move cgroups from inside sched/.
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 7/9] sched/psi: cache parent psi_group to speed up groups iterate
  2022-07-21  4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou
                   ` (5 preceding siblings ...)
  2022-07-21  4:04 ` [PATCH 6/9] sched/psi: save percpu memory when !psi_cgroups_enabled Chengming Zhou
@ 2022-07-21  4:04 ` Chengming Zhou
  2022-07-21  4:04 ` [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup Chengming Zhou
  2022-07-21  4:04 ` [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou
  8 siblings, 0 replies; 38+ messages in thread
From: Chengming Zhou @ 2022-07-21  4:04 UTC (permalink / raw)
  To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap
  Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou

We use iterate_groups() to walk each level's psi_group when updating
PSI stats, which is a very hot path.

In the current code, iterate_groups() has to use multiple branches and
cgroup_parent() to find the parent psi_group at each level, which is
not very efficient.

This patch caches the parent psi_group: we only need to look up the
task's own psi_group once, then simply follow group->parent to iterate
(see the toy model below).

This is also preparation for the following patch, in which PSI can be
configured to account only for leaf cgroups and system-wide.
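
As a self-contained toy (userspace C with illustrative names, not the
kernel code), the data-structure change amounts to resolving the chain
once at init by caching the parent pointer, so the hot path becomes a
plain linked-list walk:

  #include <stddef.h>
  #include <stdio.h>

  struct group {
          const char *name;
          struct group *parent; /* cached at cgroup alloc time */
  };

  #define for_each_group(g) for (; (g); (g) = (g)->parent)

  int main(void)
  {
          struct group system = { "psi_system", NULL };
          struct group b      = { "B",   &system }; /* top-level cgroup */
          struct group c      = { "B/C", &b };      /* leaf cgroup */
          struct group *g     = &c;  /* what task_psi_group() returns */

          /* hot path: no cgroup_parent() calls, just pointer chasing */
          for_each_group(g)
                  printf("update %s\n", g->name);
          return 0;
  }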

Performance test on Intel Xeon Platinum with 3 levels of cgroup:

1. before the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.032 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 8.077 [sec]

       8.077648 usecs/op
         123798 ops/sec

2. after the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.032 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 7.758 [sec]

       7.758354 usecs/op
         128893 ops/sec

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 include/linux/psi_types.h |  2 ++
 kernel/sched/psi.c        | 48 ++++++++++++++++++++-------------------
 2 files changed, 27 insertions(+), 23 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index c7fe7c089718..c124f7d186d0 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -147,6 +147,8 @@ struct psi_trigger {
 };
 
 struct psi_group {
+	struct psi_group *parent;
+
 	/* Protects data used by the aggregator */
 	struct mutex avgs_lock;
 
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index aa40bf888102..2228cbf3bdd3 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -758,30 +758,22 @@ static void psi_group_change(struct psi_group *group, int cpu,
 		schedule_delayed_work(&group->avgs_work, PSI_FREQ);
 }
 
-static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
+static inline struct psi_group *task_psi_group(struct task_struct *task)
 {
-	if (*iter == &psi_system)
-		return NULL;
-
 #ifdef CONFIG_CGROUPS
 	if (static_branch_likely(&psi_cgroups_enabled)) {
-		struct cgroup *cgroup = NULL;
-
-		if (!*iter)
-			cgroup = task->cgroups->dfl_cgrp;
-		else
-			cgroup = cgroup_parent(*iter);
+		struct cgroup *cgroup = task_dfl_cgroup(task);
 
-		if (cgroup && cgroup_parent(cgroup)) {
-			*iter = cgroup;
+		if (cgroup && cgroup_parent(cgroup))
 			return cgroup_psi(cgroup);
-		}
 	}
 #endif
-	*iter = &psi_system;
 	return &psi_system;
 }
 
+#define for_each_psi_group(group) \
+	for (; group; group = group->parent)
+
 static void psi_flags_change(struct task_struct *task, int clear, int set)
 {
 	if (((task->psi_flags & set) ||
@@ -799,12 +791,11 @@ static void psi_flags_change(struct task_struct *task, int clear, int set)
 
 void psi_change_groups(struct task_struct *task, int clear, int set)
 {
+	struct psi_group *group = task_psi_group(task);
 	int cpu = task_cpu(task);
-	struct psi_group *group;
-	void *iter = NULL;
 	u64 now = cpu_clock(cpu);
 
-	while ((group = iterate_groups(task, &iter)))
+	for_each_psi_group(group)
 		psi_group_change(group, cpu, clear, set, now, true);
 }
 
@@ -822,7 +813,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 {
 	struct psi_group *group, *common = NULL;
 	int cpu = task_cpu(prev);
-	void *iter;
 	u64 now = cpu_clock(cpu);
 
 	if (next->pid) {
@@ -833,8 +823,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * we reach the first common ancestor. Iterate @next's
 		 * ancestors only until we encounter @prev's ONCPU.
 		 */
-		iter = NULL;
-		while ((group = iterate_groups(next, &iter))) {
+		group = task_psi_group(next);
+		for_each_psi_group(group) {
 			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
 				common = group;
 				break;
@@ -874,9 +864,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 
 		psi_flags_change(prev, clear, set);
 
-		iter = NULL;
-		while ((group = iterate_groups(prev, &iter)) && group != common)
+		group = task_psi_group(prev);
+		for_each_psi_group(group) {
+			if (group == common)
+				break;
 			psi_group_change(group, cpu, clear, set, now, wake_clock);
+		}
 
 		/*
 		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
@@ -884,7 +877,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 */
 		if (sleep || unlikely(prev->in_memstall != next->in_memstall)) {
 			clear &= ~TSK_ONCPU;
-			for (; group; group = iterate_groups(prev, &iter))
+
+			for_each_psi_group(group)
 				psi_group_change(group, cpu, clear, set, now, true);
 		}
 	}
@@ -953,6 +947,8 @@ void psi_memstall_leave(unsigned long *flags)
 #ifdef CONFIG_CGROUPS
 int psi_cgroup_alloc(struct cgroup *cgroup)
 {
+	struct cgroup *parent;
+
 	if (!static_branch_likely(&psi_cgroups_enabled))
 		return 0;
 
@@ -960,6 +956,12 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 	if (!cgroup->psi.pcpu)
 		return -ENOMEM;
 	group_init(&cgroup->psi);
+
+	parent = cgroup_parent(cgroup);
+	if (parent && cgroup_parent(parent))
+		cgroup->psi.parent = cgroup_psi(parent);
+	else
+		cgroup->psi.parent = &psi_system;
 	return 0;
 }
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup
  2022-07-21  4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou
                   ` (6 preceding siblings ...)
  2022-07-21  4:04 ` [PATCH 7/9] sched/psi: cache parent psi_group to speed up groups iterate Chengming Zhou
@ 2022-07-21  4:04 ` Chengming Zhou
  2022-07-25 16:52   ` Johannes Weiner
  2022-07-21  4:04 ` [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou
  8 siblings, 1 reply; 38+ messages in thread
From: Chengming Zhou @ 2022-07-21  4:04 UTC (permalink / raw)
  To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap
  Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou

PSI accounts stalls for each cgroup separately and aggregates them at
each level of the hierarchy. This may cause non-negligible overhead for
some workloads when running deep in the hierarchy.

Commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
made it possible for PSI to skip per-cgroup stall accounting and only
account system-wide, avoiding this per-level overhead.

For our use case, we also want leaf cgroup PSI accounted, so that
userspace can make adjustments on that cgroup, in addition to
system-wide management.

So this patch adds the kernel cmdline parameter "psi_inner_cgroup" to
control whether or not inner cgroups are accounted, which defaults to
true for compatibility.

Performance test on Intel Xeon Platinum with 3 levels of cgroup:

1. default (psi_inner_cgroup=true)

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.032 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 7.758 [sec]

       7.758354 usecs/op
         128893 ops/sec

2. psi_inner_cgroup=false

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.032 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 7.309 [sec]

       7.309436 usecs/op
         136809 ops/sec

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  6 ++++++
 kernel/sched/psi.c                              | 11 ++++++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 8090130b544b..6beef5b8bc36 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4419,6 +4419,12 @@
 			tracking.
 			Format: <bool>
 
+	psi_inner_cgroup=
+			[KNL] Enable or disable pressure stall information
+			tracking for the inner cgroups.
+			Format: <bool>
+			default: enabled
+
 	psmouse.proto=	[HW,MOUSE] Highest PS2 mouse protocol extension to
 			probe for; one of (bare|imps|exps|lifebook|any).
 	psmouse.rate=	[HW,MOUSE] Set desired mouse report rate, in reports
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 2228cbf3bdd3..8d76920f47b3 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -147,12 +147,21 @@ static bool psi_enable;
 #else
 static bool psi_enable = true;
 #endif
+
+static bool psi_inner_cgroup __read_mostly = true;
+
 static int __init setup_psi(char *str)
 {
 	return kstrtobool(str, &psi_enable) == 0;
 }
 __setup("psi=", setup_psi);
 
+static int __init setup_psi_inner_cgroup(char *str)
+{
+	return kstrtobool(str, &psi_inner_cgroup) == 0;
+}
+__setup("psi_inner_cgroup=", setup_psi_inner_cgroup);
+
 /* Running averages - we need to be higher-res than loadavg */
 #define PSI_FREQ	(2*HZ+1)	/* 2 sec intervals */
 #define EXP_10s		1677		/* 1/exp(2s/10s) as fixed-point */
@@ -958,7 +967,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 	group_init(&cgroup->psi);
 
 	parent = cgroup_parent(cgroup);
-	if (parent && cgroup_parent(parent))
+	if (parent && cgroup_parent(parent) && psi_inner_cgroup)
 		cgroup->psi.parent = cgroup_psi(parent);
 	else
 		cgroup->psi.parent = &psi_system;
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
  2022-07-21  4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou
                   ` (7 preceding siblings ...)
  2022-07-21  4:04 ` [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup Chengming Zhou
@ 2022-07-21  4:04 ` Chengming Zhou
  2022-07-21 10:00   ` kernel test robot
                     ` (4 more replies)
  8 siblings, 5 replies; 38+ messages in thread
From: Chengming Zhou @ 2022-07-21  4:04 UTC (permalink / raw)
  To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap
  Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou

PSI already tracks workload pressure stall information for CPU,
memory and IO. Apart from these, IRQ/SOFTIRQ can have an obvious
impact on the productivity of some workloads, such as web service
workloads.

With CONFIG_IRQ_TIME_ACCOUNTING, we can get the IRQ/SOFTIRQ delta time
from update_rq_clock_task(), where we record that delta against the
cgroups of the CPU's curr task as PSI_IRQ_FULL time.

Note that we don't use PSI_IRQ_SOME: IRQ/SOFTIRQ always executes in the
context of the current task on the CPU, so nothing productive can run
even if it were runnable; hence we only use PSI_IRQ_FULL.

Out of consideration for the performance impact, this is enabled by
default when CONFIG_IRQ_TIME_ACCOUNTING, but can be disabled with the
kernel cmdline parameter "psi_irq="; a monitor sketch follows below.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 .../admin-guide/kernel-parameters.txt         |  5 ++
 include/linux/psi.h                           |  1 +
 include/linux/psi_types.h                     |  7 +-
 kernel/cgroup/cgroup.c                        | 27 +++++++
 kernel/sched/core.c                           |  1 +
 kernel/sched/psi.c                            | 76 ++++++++++++++++++-
 kernel/sched/stats.h                          | 13 ++++
 7 files changed, 126 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6beef5b8bc36..1067dde299a0 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4425,6 +4425,11 @@
 			Format: <bool>
 			default: enabled
 
+	psi_irq=	[KNL] Enable or disable IRQ/SOFTIRQ pressure stall
+			information tracking.
+			Format: <bool>
+			default: enabled when CONFIG_IRQ_TIME_ACCOUNTING.
+
 	psmouse.proto=	[HW,MOUSE] Highest PS2 mouse protocol extension to
 			probe for; one of (bare|imps|exps|lifebook|any).
 	psmouse.rate=	[HW,MOUSE] Set desired mouse report rate, in reports
diff --git a/include/linux/psi.h b/include/linux/psi.h
index aa168a038242..f5cf3e45d5a5 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -14,6 +14,7 @@ struct css_set;
 #ifdef CONFIG_PSI
 
 extern struct static_key_false psi_disabled;
+extern struct static_key_true psi_irq_enabled;
 extern struct psi_group psi_system;
 
 void psi_init(void);
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index c124f7d186d0..195f123b1cd1 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -47,7 +47,8 @@ enum psi_res {
 	PSI_IO,
 	PSI_MEM,
 	PSI_CPU,
-	NR_PSI_RESOURCES = 3,
+	PSI_IRQ,
+	NR_PSI_RESOURCES = 4,
 };
 
 /*
@@ -63,9 +64,11 @@ enum psi_states {
 	PSI_MEM_FULL,
 	PSI_CPU_SOME,
 	PSI_CPU_FULL,
+	PSI_IRQ_SOME,
+	PSI_IRQ_FULL,
 	/* Only per-CPU, to weigh the CPU in the global average: */
 	PSI_NONIDLE,
-	NR_PSI_STATES = 7,
+	NR_PSI_STATES = 9,
 };
 
 enum psi_aggregators {
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1424da7ed2c4..cf61df0ac892 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3683,6 +3683,23 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
 	return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
 }
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int cgroup_irq_pressure_show(struct seq_file *seq, void *v)
+{
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+	struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi;
+
+	return psi_show(seq, psi, PSI_IRQ);
+}
+
+static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of,
+					  char *buf, size_t nbytes,
+					  loff_t off)
+{
+	return cgroup_pressure_write(of, buf, nbytes, PSI_IRQ);
+}
+#endif
+
 static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
 					  poll_table *pt)
 {
@@ -5079,6 +5096,16 @@ static struct cftype cgroup_base_files[] = {
 		.poll = cgroup_pressure_poll,
 		.release = cgroup_pressure_release,
 	},
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+	{
+		.name = "irq.pressure",
+		.flags = CFTYPE_PRESSURE,
+		.seq_show = cgroup_irq_pressure_show,
+		.write = cgroup_irq_pressure_write,
+		.poll = cgroup_pressure_poll,
+		.release = cgroup_pressure_release,
+	},
+#endif
 #endif /* CONFIG_PSI */
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5f2d3542b05..08637cfb7ed9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -708,6 +708,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 
 	rq->prev_irq_time += irq_delta;
 	delta -= irq_delta;
+	psi_account_irqtime(rq->curr, irq_delta);
 #endif
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
 	if (static_key_false((&paravirt_steal_rq_enabled))) {
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 8d76920f47b3..6a0894e28780 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -141,6 +141,7 @@ static int psi_bug __read_mostly;
 
 DEFINE_STATIC_KEY_FALSE(psi_disabled);
 DEFINE_STATIC_KEY_TRUE(psi_cgroups_enabled);
+DEFINE_STATIC_KEY_TRUE(psi_irq_enabled);
 
 #ifdef CONFIG_PSI_DEFAULT_DISABLED
 static bool psi_enable;
@@ -150,6 +151,12 @@ static bool psi_enable = true;
 
 static bool psi_inner_cgroup __read_mostly = true;
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static bool psi_irq_enable = true;
+#else
+static bool psi_irq_enable;
+#endif
+
 static int __init setup_psi(char *str)
 {
 	return kstrtobool(str, &psi_enable) == 0;
@@ -162,6 +169,12 @@ static int __init setup_psi_inner_cgroup(char *str)
 }
 __setup("psi_inner_cgroup=", setup_psi_inner_cgroup);
 
+static int __init setup_psi_irq(char *str)
+{
+	return kstrtobool(str, &psi_irq_enable) == 0;
+}
+__setup("psi_irq=", setup_psi_irq);
+
 /* Running averages - we need to be higher-res than loadavg */
 #define PSI_FREQ	(2*HZ+1)	/* 2 sec intervals */
 #define EXP_10s		1677		/* 1/exp(2s/10s) as fixed-point */
@@ -215,12 +228,16 @@ void __init psi_init(void)
 	if (!psi_enable) {
 		static_branch_enable(&psi_disabled);
 		static_branch_disable(&psi_cgroups_enabled);
+		static_branch_disable(&psi_irq_enabled);
 		return;
 	}
 
 	if (!cgroup_psi_enabled())
 		static_branch_disable(&psi_cgroups_enabled);
 
+	if (!psi_irq_enable)
+		static_branch_disable(&psi_irq_enabled);
+
 	psi_period = jiffies_to_nsecs(PSI_FREQ);
 	group_init(&psi_system);
 }
@@ -893,6 +910,28 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	}
 }
 
+void psi_groups_account_irqtime(struct task_struct *task, u32 delta)
+{
+	struct psi_group *group = task_psi_group(task);
+	int cpu = task_cpu(task);
+	u64 now = cpu_clock(cpu);
+	struct psi_group_cpu *groupc;
+
+	for_each_psi_group(group) {
+		groupc = per_cpu_ptr(group->pcpu, cpu);
+
+		write_seqcount_begin(&groupc->seq);
+
+		record_times(groupc, now);
+		groupc->times[PSI_IRQ_FULL] += delta;
+
+		write_seqcount_end(&groupc->seq);
+
+		if (group->poll_states & (1 << PSI_IRQ_FULL))
+			psi_schedule_poll_work(group, 1);
+	}
+}
+
 /**
  * psi_memstall_enter - mark the beginning of a memory stall section
  * @flags: flags to handle nested sections
@@ -1069,7 +1108,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 		group->avg_next_update = update_averages(group, now);
 	mutex_unlock(&group->avgs_lock);
 
-	for (full = 0; full < 2; full++) {
+	for (full = (res == PSI_IRQ); full < 2; full++) {
 		unsigned long avg[3] = { 0, };
 		u64 total = 0;
 		int w;
@@ -1111,9 +1150,12 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
 	else
 		return ERR_PTR(-EINVAL);
 
-	if (state >= PSI_NONIDLE)
+	if (state >= PSI_NONIDLE || state == PSI_IRQ_SOME)
 		return ERR_PTR(-EINVAL);
 
+	if (!static_branch_likely(&psi_irq_enabled) && state == PSI_IRQ_FULL)
+		return ERR_PTR(-EOPNOTSUPP);
+
 	if (window_us < WINDOW_MIN_US ||
 		window_us > WINDOW_MAX_US)
 		return ERR_PTR(-EINVAL);
@@ -1395,6 +1437,33 @@ static const struct proc_ops psi_cpu_proc_ops = {
 	.proc_release	= psi_fop_release,
 };
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int psi_irq_show(struct seq_file *m, void *v)
+{
+	return psi_show(m, &psi_system, PSI_IRQ);
+}
+
+static int psi_irq_open(struct inode *inode, struct file *file)
+{
+	return psi_open(file, psi_irq_show);
+}
+
+static ssize_t psi_irq_write(struct file *file, const char __user *user_buf,
+			     size_t nbytes, loff_t *ppos)
+{
+	return psi_write(file, user_buf, nbytes, PSI_IRQ);
+}
+
+static const struct proc_ops psi_irq_proc_ops = {
+	.proc_open	= psi_irq_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= psi_irq_write,
+	.proc_poll	= psi_fop_poll,
+	.proc_release	= psi_fop_release,
+};
+#endif
+
 static int __init psi_proc_init(void)
 {
 	if (psi_enable) {
@@ -1402,6 +1471,9 @@ static int __init psi_proc_init(void)
 		proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
 		proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops);
 		proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+		proc_create("pressure/irq", 0666, NULL, &psi_irq_proc_ops);
+#endif
 	}
 	return 0;
 }
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index e930b8fa6253..10926cdaccc8 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -111,6 +111,7 @@ void psi_change_groups(struct task_struct *task, int clear, int set);
 void psi_task_change(struct task_struct *task, int clear, int set);
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep);
+void psi_groups_account_irqtime(struct task_struct *task, u32 delta);
 
 /*
  * PSI tracks state that persists across sleeps, such as iowaits and
@@ -196,6 +197,17 @@ static inline void psi_sched_switch(struct task_struct *prev,
 	psi_task_switch(prev, next, sleep);
 }
 
+static inline void psi_account_irqtime(struct task_struct *task, u32 delta)
+{
+	if (!static_branch_likely(&psi_irq_enabled))
+		return;
+
+	if (!task->pid)
+		return;
+
+	psi_groups_account_irqtime(task, delta);
+}
+
 #else /* CONFIG_PSI */
 static inline void psi_enqueue(struct task_struct *p, bool wakeup) {}
 static inline void psi_dequeue(struct task_struct *p, bool sleep) {}
@@ -203,6 +215,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
 static inline void psi_sched_switch(struct task_struct *prev,
 				    struct task_struct *next,
 				    bool sleep) {}
+static inline void psi_account_irqtime(struct task_struct *curr, u32 delta) {}
 #endif /* CONFIG_PSI */
 
 #ifdef CONFIG_SCHED_INFO
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
  2022-07-21  4:04 ` [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou
@ 2022-07-21 10:00   ` kernel test robot
  2022-07-21 22:10   ` kernel test robot
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 38+ messages in thread
From: kernel test robot @ 2022-07-21 10:00 UTC (permalink / raw)
  To: Chengming Zhou, hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap
  Cc: kbuild-all, linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou

Hi Chengming,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on akpm-mm/mm-everything linus/master v5.19-rc7]
[cannot apply to tj-cgroup/for-next next-20220720]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Chengming-Zhou/sched-psi-some-optimization-and-extension/20220721-120833
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 401e4963bf45c800e3e9ea0d3a0289d738005fd4
config: riscv-randconfig-s032-20220718 (https://download.01.org/0day-ci/archive/20220721/202207211726.ilPYe7AO-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 12.1.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # apt-get install sparse
        # sparse version: v0.6.4-39-gce1a6720-dirty
        # https://github.com/intel-lab-lkp/linux/commit/d14f2a9ff31fefc5b28a16addaa832dc80d84189
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Chengming-Zhou/sched-psi-some-optimization-and-extension/20220721-120833
        git checkout d14f2a9ff31fefc5b28a16addaa832dc80d84189
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=riscv SHELL=/bin/bash kernel/sched/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
>> kernel/sched/core.c:711:31: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct task_struct *task @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/core.c:711:31: sparse:     expected struct task_struct *task
   kernel/sched/core.c:711:31: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/core.c:1028:38: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/core.c:1028:38: sparse:     expected struct task_struct *curr
   kernel/sched/core.c:1028:38: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/core.c:2192:33: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct task_struct *p @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/core.c:2192:33: sparse:     expected struct task_struct *p
   kernel/sched/core.c:2192:33: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/core.c:2192:68: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct task_struct *tsk @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/core.c:2192:68: sparse:     expected struct task_struct *tsk
   kernel/sched/core.c:2192:68: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/core.c:3592:17: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/core.c:3592:17: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/core.c:3592:17: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/core.c:3789:28: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct task_struct const *p @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/core.c:3789:28: sparse:     expected struct task_struct const *p
   kernel/sched/core.c:3789:28: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/core.c:9089:43: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *push_task @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/core.c:9089:43: sparse:     expected struct task_struct *push_task
   kernel/sched/core.c:9089:43: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/core.c:5376:38: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/core.c:5376:38: sparse:     expected struct task_struct *curr
   kernel/sched/core.c:5376:38: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/core.c:6322:14: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct task_struct *prev @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/core.c:6322:14: sparse:     expected struct task_struct *prev
   kernel/sched/core.c:6322:14: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/core.c:6848:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/core.c:6848:17: sparse:    struct task_struct *
   kernel/sched/core.c:6848:17: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/core.c:7064:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/core.c:7064:22: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/core.c:7064:22: sparse:    struct task_struct *
   kernel/sched/core.c:11121:25: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct task_struct *p @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/core.c:11121:25: sparse:     expected struct task_struct *p
   kernel/sched/core.c:11121:25: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/core.c:537:6: sparse: sparse: context imbalance in 'raw_spin_rq_lock_nested' - wrong count at exit
   kernel/sched/core.c:562:6: sparse: sparse: context imbalance in 'raw_spin_rq_trylock' - wrong count at exit
   kernel/sched/core.c:586:6: sparse: sparse: context imbalance in 'raw_spin_rq_unlock' - unexpected unlock
   kernel/sched/core.c: note: in included file:
   kernel/sched/sched.h:1580:9: sparse: sparse: context imbalance in '__task_rq_lock' - wrong count at exit
   kernel/sched/sched.h:1580:9: sparse: sparse: context imbalance in 'task_rq_lock' - wrong count at exit
   kernel/sched/core.c: note: in included file:
   kernel/sched/pelt.h:97:13: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct task_struct const *p @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/pelt.h:97:13: sparse:     expected struct task_struct const *p
   kernel/sched/pelt.h:97:13: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/core.c:2183:33: sparse: sparse: dereference of noderef expression
   kernel/sched/core.c:2184:19: sparse: sparse: dereference of noderef expression
   kernel/sched/core.c:2185:18: sparse: sparse: dereference of noderef expression
   kernel/sched/core.c: note: in included file:
   kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct *
   kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct *
   kernel/sched/core.c:2158:38: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/core.c:2158:38: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/core.c:2158:38: sparse:    struct task_struct const *
   kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct *
   kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct *
   kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct *
   kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct *
   kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct *
   kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct *
   kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct *
   kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct *
   kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct *
   kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct *
   kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct *

vim +711 kernel/sched/core.c

   675	
   676	/*
   677	 * RQ-clock updating methods:
   678	 */
   679	
   680	static void update_rq_clock_task(struct rq *rq, s64 delta)
   681	{
   682	/*
   683	 * In theory, the compiler should just see 0 here, and optimize out the call
   684	 * to sched_rt_avg_update. But I don't trust it...
   685	 */
   686		s64 __maybe_unused steal = 0, irq_delta = 0;
   687	
   688	#ifdef CONFIG_IRQ_TIME_ACCOUNTING
   689		irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
   690	
   691		/*
   692		 * Since irq_time is only updated on {soft,}irq_exit, we might run into
   693		 * this case when a previous update_rq_clock() happened inside a
   694		 * {soft,}irq region.
   695		 *
   696		 * When this happens, we stop ->clock_task and only update the
   697		 * prev_irq_time stamp to account for the part that fit, so that a next
   698		 * update will consume the rest. This ensures ->clock_task is
   699		 * monotonic.
   700		 *
   701		 * It does however cause some slight miss-attribution of {soft,}irq
   702		 * time, a more accurate solution would be to update the irq_time using
   703		 * the current rq->clock timestamp, except that would require using
   704		 * atomic ops.
   705		 */
   706		if (irq_delta > delta)
   707			irq_delta = delta;
   708	
   709		rq->prev_irq_time += irq_delta;
   710		delta -= irq_delta;
 > 711		psi_account_irqtime(rq->curr, irq_delta);
   712	#endif
   713	#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
   714		if (static_key_false((&paravirt_steal_rq_enabled))) {
   715			steal = paravirt_steal_clock(cpu_of(rq));
   716			steal -= rq->prev_steal_time_rq;
   717	
   718			if (unlikely(steal > delta))
   719				steal = delta;
   720	
   721			rq->prev_steal_time_rq += steal;
   722			delta -= steal;
   723		}
   724	#endif
   725	
   726		rq->clock_task += delta;
   727	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
  2022-07-21  4:04 ` [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou
  2022-07-21 10:00   ` kernel test robot
@ 2022-07-21 22:10   ` kernel test robot
  2022-07-22  3:30   ` Abel Wu
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 38+ messages in thread
From: kernel test robot @ 2022-07-21 22:10 UTC (permalink / raw)
  To: Chengming Zhou, hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap
  Cc: kbuild-all, linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou

Hi Chengming,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on akpm-mm/mm-everything linus/master v5.19-rc7]
[cannot apply to tj-cgroup/for-next next-20220721]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Chengming-Zhou/sched-psi-some-optimization-and-extension/20220721-120833
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 401e4963bf45c800e3e9ea0d3a0289d738005fd4
config: x86_64-randconfig-s022-20220718 (https://download.01.org/0day-ci/archive/20220722/202207220642.sbCB4Bcf-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-3) 11.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-39-gce1a6720-dirty
        # https://github.com/intel-lab-lkp/linux/commit/d14f2a9ff31fefc5b28a16addaa832dc80d84189
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Chengming-Zhou/sched-psi-some-optimization-and-extension/20220721-120833
        git checkout d14f2a9ff31fefc5b28a16addaa832dc80d84189
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add the following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

sparse warnings: (new ones prefixed by >>)
>> kernel/sched/core.c:711:31: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/core.c:711:31: sparse:     expected struct task_struct *curr
   kernel/sched/core.c:711:31: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/core.c:781:48: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected struct task_struct *p @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/core.c:781:48: sparse:     expected struct task_struct *p
   kernel/sched/core.c:781:48: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/core.c:1028:38: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/core.c:1028:38: sparse:     expected struct task_struct *curr
   kernel/sched/core.c:1028:38: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/core.c:2192:33: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct task_struct *p @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/core.c:2192:33: sparse:     expected struct task_struct *p
   kernel/sched/core.c:2192:33: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/core.c:2192:68: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct task_struct *tsk @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/core.c:2192:68: sparse:     expected struct task_struct *tsk
   kernel/sched/core.c:2192:68: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/core.c:5376:38: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/core.c:5376:38: sparse:     expected struct task_struct *curr
   kernel/sched/core.c:5376:38: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/core.c:6322:14: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct task_struct *prev @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/core.c:6322:14: sparse:     expected struct task_struct *prev
   kernel/sched/core.c:6322:14: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/core.c:6848:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/core.c:6848:17: sparse:    struct task_struct *
   kernel/sched/core.c:6848:17: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/core.c:7064:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/core.c:7064:22: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/core.c:7064:22: sparse:    struct task_struct *
   kernel/sched/core.c:11121:25: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct task_struct *p @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/core.c:11121:25: sparse:     expected struct task_struct *p
   kernel/sched/core.c:11121:25: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/core.c:537:6: sparse: sparse: context imbalance in 'raw_spin_rq_lock_nested' - wrong count at exit
   kernel/sched/core.c:570:23: sparse: sparse: context imbalance in 'raw_spin_rq_trylock' - wrong count at exit
   kernel/sched/core.c:586:6: sparse: sparse: context imbalance in 'raw_spin_rq_unlock' - unexpected unlock
   kernel/sched/core.c:624:36: sparse: sparse: context imbalance in '__task_rq_lock' - wrong count at exit
   kernel/sched/core.c:665:36: sparse: sparse: context imbalance in 'task_rq_lock' - wrong count at exit
   kernel/sched/core.c:781:11: sparse: sparse: dereference of noderef expression
   kernel/sched/core.c:2183:33: sparse: sparse: dereference of noderef expression
   kernel/sched/core.c:2184:19: sparse: sparse: dereference of noderef expression
   kernel/sched/core.c:2185:18: sparse: sparse: dereference of noderef expression
   kernel/sched/core.c: note: in included file:
   kernel/sched/sched.h:1592:9: sparse: sparse: context imbalance in 'ttwu_runnable' - wrong count at exit
   kernel/sched/core.c:4262:9: sparse: sparse: context imbalance in 'task_call_func' - wrong count at exit
   kernel/sched/sched.h:1592:9: sparse: sparse: context imbalance in 'wake_up_new_task' - wrong count at exit
   kernel/sched/core.c:5035:9: sparse: sparse: context imbalance in 'finish_task_switch' - wrong count at exit
   kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct *
   kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct *
   kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct *
   kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct *
   kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct *
   kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct *
   kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct *
   kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct *
   kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2210:9: sparse:    struct task_struct *
   kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2053:25: sparse:    struct task_struct *

vim +711 kernel/sched/core.c

   675	
   676	/*
   677	 * RQ-clock updating methods:
   678	 */
   679	
   680	static void update_rq_clock_task(struct rq *rq, s64 delta)
   681	{
   682	/*
   683	 * In theory, the compiler should just see 0 here, and optimize out the call
   684	 * to sched_rt_avg_update. But I don't trust it...
   685	 */
   686		s64 __maybe_unused steal = 0, irq_delta = 0;
   687	
   688	#ifdef CONFIG_IRQ_TIME_ACCOUNTING
   689		irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
   690	
   691		/*
   692		 * Since irq_time is only updated on {soft,}irq_exit, we might run into
   693		 * this case when a previous update_rq_clock() happened inside a
   694		 * {soft,}irq region.
   695		 *
   696		 * When this happens, we stop ->clock_task and only update the
   697		 * prev_irq_time stamp to account for the part that fit, so that a next
   698		 * update will consume the rest. This ensures ->clock_task is
   699		 * monotonic.
   700		 *
   701		 * It does however cause some slight miss-attribution of {soft,}irq
   702		 * time, a more accurate solution would be to update the irq_time using
   703		 * the current rq->clock timestamp, except that would require using
   704		 * atomic ops.
   705		 */
   706		if (irq_delta > delta)
   707			irq_delta = delta;
   708	
   709		rq->prev_irq_time += irq_delta;
   710		delta -= irq_delta;
 > 711		psi_account_irqtime(rq->curr, irq_delta);
   712	#endif
   713	#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
   714		if (static_key_false((&paravirt_steal_rq_enabled))) {
   715			steal = paravirt_steal_clock(cpu_of(rq));
   716			steal -= rq->prev_steal_time_rq;
   717	
   718			if (unlikely(steal > delta))
   719				steal = delta;
   720	
   721			rq->prev_steal_time_rq += steal;
   722			delta -= steal;
   723		}
   724	#endif
   725	
   726		rq->clock_task += delta;
   727	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
  2022-07-21  4:04 ` [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou
  2022-07-21 10:00   ` kernel test robot
  2022-07-21 22:10   ` kernel test robot
@ 2022-07-22  3:30   ` Abel Wu
  2022-07-22  6:13     ` Chengming Zhou
  2022-07-25 18:26   ` Johannes Weiner
  2022-07-27 16:07   ` Peter Zijlstra
  4 siblings, 1 reply; 38+ messages in thread
From: Abel Wu @ 2022-07-22  3:30 UTC (permalink / raw)
  To: Chengming Zhou, hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap
  Cc: linux-doc, linux-kernel, songmuchun, cgroups

Hi Chengming,

On 7/21/22 12:04 PM, Chengming Zhou Wrote:
> PSI already tracks workload pressure stall information for
> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ can have an
> obvious impact on the productivity of some workloads, such as
> web service workloads.
> 
> When CONFIG_IRQ_TIME_ACCOUNTING is enabled, we can get the IRQ/SOFTIRQ
> delta time from update_rq_clock_task(), where we can record that delta
> to the CPU curr task's cgroups as PSI_IRQ_FULL status.

The {soft,}irq effect applies equally to all the runnable tasks
on that cpu, not only rq->curr. Further, I think irq stall is per-cpu
rather than per-cgroup.

Abel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
  2022-07-22  3:30   ` Abel Wu
@ 2022-07-22  6:13     ` Chengming Zhou
  2022-07-22  7:14       ` Abel Wu
  0 siblings, 1 reply; 38+ messages in thread
From: Chengming Zhou @ 2022-07-22  6:13 UTC (permalink / raw)
  To: Abel Wu, hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap
  Cc: linux-doc, linux-kernel, songmuchun, cgroups

On 2022/7/22 11:30, Abel Wu wrote:
> Hi Chengming,
> 
> On 7/21/22 12:04 PM, Chengming Zhou Wrote:
>> PSI already tracks workload pressure stall information for
>> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ can have an
>> obvious impact on the productivity of some workloads, such as
>> web service workloads.
>>
>> When CONFIG_IRQ_TIME_ACCOUNTING is enabled, we can get the IRQ/SOFTIRQ
>> delta time from update_rq_clock_task(), where we can record that delta
>> to the CPU curr task's cgroups as PSI_IRQ_FULL status.
> 
> The {soft,}irq effect applies equally to all the runnable tasks
> on that cpu, not only rq->curr. Further, I think irq stall is per-cpu
> rather than per-cgroup.

Although IRQ/SOFTIRQ is per-cpu, it's rq->curr who owns the CPU at the time
and pays for it; meanwhile, other groups would see it as PSI_CPU_FULL.

So I think it's reasonable to account this IRQ/SOFTIRQ delta to rq->curr's groups
as a PSI_IRQ_FULL pressure stall. And the per-cpu IRQ stall is still available from psi_system.
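
To make the bookkeeping concrete, here is a userspace model of that idea
(a minimal sketch only: the struct layout, the field names and the
account_irqtime() helper below are made up for illustration, not the
kernel implementation):

#include <stdio.h>
#include <stdint.h>

struct psi_group {
	const char *name;
	uint64_t irq_full_time;		/* ns of IRQ/SOFTIRQ "full" stall */
	struct psi_group *parent;
};

/* System-wide aggregate at the top of every hierarchy. */
static struct psi_group psi_system = { "psi_system", 0, NULL };

/*
 * Charge an irq delta to every group of the task that owned the CPU
 * while the interrupt time accrued, up to and including psi_system.
 */
static void account_irqtime(struct psi_group *curr_group, uint64_t delta)
{
	struct psi_group *g;

	for (g = curr_group; g; g = g->parent)
		g->irq_full_time += delta;
}

int main(void)
{
	struct psi_group leaf = { "curr's cgroup", 0, &psi_system };

	/* as if update_rq_clock_task() computed irq_delta = 150us */
	account_irqtime(&leaf, 150000);

	/* both the owning cgroup and psi_system observe the stall */
	printf("%s: %llu ns\n", leaf.name,
	       (unsigned long long)leaf.irq_full_time);
	printf("%s: %llu ns\n", psi_system.name,
	       (unsigned long long)psi_system.irq_full_time);
	return 0;
}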

Thanks.

> 
> Abel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
  2022-07-22  6:13     ` Chengming Zhou
@ 2022-07-22  7:14       ` Abel Wu
  2022-07-22  7:33         ` Chengming Zhou
  0 siblings, 1 reply; 38+ messages in thread
From: Abel Wu @ 2022-07-22  7:14 UTC (permalink / raw)
  To: Chengming Zhou, hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap
  Cc: linux-doc, linux-kernel, songmuchun, cgroups

On 7/22/22 2:13 PM, Chengming Zhou Wrote:
> On 2022/7/22 11:30, Abel Wu wrote:
>> Hi Chengming,
>>
>> On 7/21/22 12:04 PM, Chengming Zhou Wrote:
>>> PSI already tracks workload pressure stall information for
>>> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ can have an
>>> obvious impact on the productivity of some workloads, such as
>>> web service workloads.
>>>
>>> When CONFIG_IRQ_TIME_ACCOUNTING is enabled, we can get the IRQ/SOFTIRQ
>>> delta time from update_rq_clock_task(), where we can record that delta
>>> to the CPU curr task's cgroups as PSI_IRQ_FULL status.
>>
>> The {soft,}irq effect applies equally to all the runnable tasks
>> on that cpu, not only rq->curr. Further, I think irq stall is per-cpu
>> rather than per-cgroup.
> 
>> Although IRQ/SOFTIRQ is per-cpu, it's rq->curr who owns the CPU at the time
>> and pays for it; meanwhile, other groups would see it as PSI_CPU_FULL.

I don't think rq->curr pays for it, if you mean consuming quota here.
And it doesn't seem appropriate to let other groups treat it as cpu
stall, because rq->curr is also a victim rather than the one
causing the stall (so it's different from rq->curr causing memstall and
being observed as cpustall by others).

> 
>> So I think it's reasonable to account this IRQ/SOFTIRQ delta to rq->curr's groups
>> as a PSI_IRQ_FULL pressure stall. And the per-cpu IRQ stall is still available from psi_system.
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
  2022-07-22  7:14       ` Abel Wu
@ 2022-07-22  7:33         ` Chengming Zhou
  0 siblings, 0 replies; 38+ messages in thread
From: Chengming Zhou @ 2022-07-22  7:33 UTC (permalink / raw)
  To: Abel Wu, hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap
  Cc: linux-doc, linux-kernel, songmuchun, cgroups

On 2022/7/22 15:14, Abel Wu wrote:
> On 7/22/22 2:13 PM, Chengming Zhou Wrote:
>> On 2022/7/22 11:30, Abel Wu wrote:
>>> Hi Chengming,
>>>
>>> On 7/21/22 12:04 PM, Chengming Zhou Wrote:
>>>> PSI already tracks workload pressure stall information for
>>>> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ can have an
>>>> obvious impact on the productivity of some workloads, such as
>>>> web service workloads.
>>>>
>>>> When CONFIG_IRQ_TIME_ACCOUNTING is enabled, we can get the IRQ/SOFTIRQ
>>>> delta time from update_rq_clock_task(), where we can record that delta
>>>> to the CPU curr task's cgroups as PSI_IRQ_FULL status.
>>>
>>> The {soft,}irq effect applies equally to all the runnable tasks
>>> on that cpu, not only rq->curr. Further, I think irq stall is per-cpu
>>> rather than per-cgroup.
>>
>> Although IRQ/SOFTIRQ is per-cpu, it's rq->curr who owns the CPU at the time
>> and pays for it; meanwhile, other groups would see it as PSI_CPU_FULL.
> 
> I don't think rq->curr pays for it, if you mean consuming quota here.

Yes, it makes rq->curr's groups look more productive than they actually are,
which is clearly different from other groups.

> And it doesn't seem appropriate to let other groups treat it as cpu
> stall, because rq->curr is also a victim rather than the one
> causing the stall (so it's different from rq->curr causing memstall and
> being observed as cpustall by others).

IMHO, we don't care who causes the stall; instead, we care about what
affects workload productivity.


> 
>>
>> So I think it's reasonable to account this IRQ/SOFTIRQ delta to rq->curr's groups
>> as a PSI_IRQ_FULL pressure stall. And the per-cpu IRQ stall is still available from psi_system.
>>
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/9] sched/psi: fix periodic aggregation shut off
  2022-07-21  4:04 ` [PATCH 1/9] sched/psi: fix periodic aggregation shut off Chengming Zhou
@ 2022-07-25 15:34   ` Johannes Weiner
  2022-07-25 15:39   ` Johannes Weiner
  1 sibling, 0 replies; 38+ messages in thread
From: Johannes Weiner @ 2022-07-25 15:34 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc,
	linux-kernel, songmuchun, cgroups

On Thu, Jul 21, 2022 at 12:04:31PM +0800, Chengming Zhou wrote:
> We don't want to wake periodic aggregation work back up if the
> task change is the aggregation worker itself going to sleep, or
> we'll ping-pong forever.
> 
> Previously, we would use psi_task_change() in psi_dequeue() when a
> task was going to sleep, so this check was put in psi_task_change().
> 
> But commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
> deferred task sleep handling to psi_task_switch(), which doesn't go
> through psi_task_change() anymore.
> 
> So this patch moves the check to psi_task_switch(). Note that for the
> deferred sleep case, we should still wake the periodic avgs work for the
> common ancestor groups, since those groups have the next task sched_in.
> 
> Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Good catch!

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/9] sched/psi: fix periodic aggregation shut off
  2022-07-21  4:04 ` [PATCH 1/9] sched/psi: fix periodic aggregation shut off Chengming Zhou
  2022-07-25 15:34   ` Johannes Weiner
@ 2022-07-25 15:39   ` Johannes Weiner
  2022-07-26 13:28     ` Chengming Zhou
  1 sibling, 1 reply; 38+ messages in thread
From: Johannes Weiner @ 2022-07-25 15:39 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc,
	linux-kernel, songmuchun, cgroups

On Thu, Jul 21, 2022 at 12:04:31PM +0800, Chengming Zhou wrote:
> @@ -871,13 +861,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>  				clear |= TSK_MEMSTALL_RUNNING;
>  			if (prev->in_iowait)
>  				set |= TSK_IOWAIT;
> +
> +			/*
> +			 * Periodic aggregation shuts off if there is a period of no
> +			 * task changes, so we wake it back up if necessary. However,
> +			 * don't do this if the task change is the aggregation worker
> +			 * itself going to sleep, or we'll ping-pong forever.
> +			 */
> +			if (unlikely((prev->flags & PF_WQ_WORKER) &&
> +				     wq_worker_last_func(prev) == psi_avgs_work))
> +				wake_clock = false;
>  		}
>  
>  		psi_flags_change(prev, clear, set);
>  
>  		iter = NULL;
>  		while ((group = iterate_groups(prev, &iter)) && group != common)
> -			psi_group_change(group, cpu, clear, set, now, true);
> +			psi_group_change(group, cpu, clear, set, now, wake_clock);
>  
>  		/*
>  		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked

Wait, there is another psi_group_change() below this, which handles
the clearing of TSK_RUNNING for common ancestors. We don't want to
wake those either, so it needs s/true/wake_clock/ as well.
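
I.e., roughly this for the hunk below the quoted one (a sketch, untested;
the context lines here are paraphrased from the patch, not copied from a
posted version):

 		/*
 		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
 		 * with dequeuing too, finish that for the rest of the hierarchy.
 		 */
 		if (sleep) {
 			clear &= ~TSK_ONCPU;
 			for (; group; group = iterate_groups(prev, &iter))
-				psi_group_change(group, cpu, clear, set, now, true);
+				psi_group_change(group, cpu, clear, set, now, wake_clock);
 		}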

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/9] sched/psi: move private helpers to sched/stats.h
  2022-07-21  4:04 ` [PATCH 3/9] sched/psi: move private helpers to sched/stats.h Chengming Zhou
@ 2022-07-25 16:39   ` Johannes Weiner
  0 siblings, 0 replies; 38+ messages in thread
From: Johannes Weiner @ 2022-07-25 16:39 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc,
	linux-kernel, songmuchun, cgroups

On Thu, Jul 21, 2022 at 12:04:33PM +0800, Chengming Zhou wrote:
> This patch moves the psi_task_change/psi_task_switch declarations out
> of the PSI public header, since they are only needed for implementing
> the PSI stats tracking in sched/stats.h.
> 
> psi_task_switch is obviously private; psi_task_change can't be a public
> helper since it doesn't check the psi_disabled static key. And there
> are no users now, so put it in sched/stats.h too.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled
  2022-07-21  4:04 ` [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou
@ 2022-07-25 16:41   ` Johannes Weiner
  0 siblings, 0 replies; 38+ messages in thread
From: Johannes Weiner @ 2022-07-25 16:41 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc,
	linux-kernel, songmuchun, cgroups

On Thu, Jul 21, 2022 at 12:04:35PM +0800, Chengming Zhou wrote:
> commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
> made it possible to configure PSI to skip per-cgroup stall accounting,
> in which case the PSI files are not exposed in the cgroup hierarchy.
> 
> This patch does the same thing when psi_disabled.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/9] sched/psi: save percpu memory when !psi_cgroups_enabled
  2022-07-21  4:04 ` [PATCH 6/9] sched/psi: save percpu memory when !psi_cgroups_enabled Chengming Zhou
@ 2022-07-25 16:47   ` Johannes Weiner
  0 siblings, 0 replies; 38+ messages in thread
From: Johannes Weiner @ 2022-07-25 16:47 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc,
	linux-kernel, songmuchun, cgroups

On Thu, Jul 21, 2022 at 12:04:36PM +0800, Chengming Zhou wrote:
> We won't use the cgroup psi_group when !psi_cgroups_enabled, so don't
> bother to allocate percpu memory for it or initialize it.
> 
> We also don't need to migrate task PSI stats between cgroups in
> cgroup_move_task().
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup
  2022-07-21  4:04 ` [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup Chengming Zhou
@ 2022-07-25 16:52   ` Johannes Weiner
  2022-07-26 13:38     ` [External] " Chengming Zhou
  2022-07-26 17:54     ` Tejun Heo
  0 siblings, 2 replies; 38+ messages in thread
From: Johannes Weiner @ 2022-07-25 16:52 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc,
	linux-kernel, songmuchun, cgroups

On Thu, Jul 21, 2022 at 12:04:38PM +0800, Chengming Zhou wrote:
> PSI accounts stalls for each cgroup separately and aggregates them
> at each level of the hierarchy. This may cause non-negligible overhead
> for some workloads under deep levels of the hierarchy.
> 
> commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
> made it possible for PSI to skip per-cgroup stall accounting and only
> account system-wide, to avoid this per-level overhead.
> 
> For our use case, we also want leaf cgroup PSI accounted for userspace
> adjustment of that cgroup, apart from system-wide management alone.

I hear the overhead argument. But skipping accounting in intermediate
levels is a bit odd and unprecedented in the cgroup interface. Once we
do this, it's conceivable people would like to do the same thing for
other stats and accounting, like for instance memory.stat.

Tejun, what are your thoughts on this?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
  2022-07-21  4:04 ` [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou
                     ` (2 preceding siblings ...)
  2022-07-22  3:30   ` Abel Wu
@ 2022-07-25 18:26   ` Johannes Weiner
  2022-07-26 13:55     ` [External] " Chengming Zhou
  2022-07-27 11:28     ` Chengming Zhou
  2022-07-27 16:07   ` Peter Zijlstra
  4 siblings, 2 replies; 38+ messages in thread
From: Johannes Weiner @ 2022-07-25 18:26 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc,
	linux-kernel, songmuchun, cgroups

On Thu, Jul 21, 2022 at 12:04:39PM +0800, Chengming Zhou wrote:
> PSI already tracks workload pressure stall information for
> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ can have an
> obvious impact on the productivity of some workloads, such as
> web service workloads.
> 
> When CONFIG_IRQ_TIME_ACCOUNTING is enabled, we can get the IRQ/SOFTIRQ
> delta time from update_rq_clock_task(), where we can record that delta
> to the CPU curr task's cgroups as PSI_IRQ_FULL status.
>
> Note we don't use PSI_IRQ_SOME, since IRQ/SOFTIRQ always happens in the
> context of the current task on the CPU, so nothing productive could run
> even if it were runnable; hence we only use PSI_IRQ_FULL.

That sounds reasonable.

> Out of consideration for the performance impact, this is enabled by
> default when CONFIG_IRQ_TIME_ACCOUNTING is enabled, but it can be
> disabled by the kernel cmdline parameter "psi_irq=".

If there isn't a concrete usecase already, let's not add another
commandline parameter for now.

> @@ -63,9 +64,11 @@ enum psi_states {
>  	PSI_MEM_FULL,
>  	PSI_CPU_SOME,
>  	PSI_CPU_FULL,
> +	PSI_IRQ_SOME,
> +	PSI_IRQ_FULL,
>  	/* Only per-CPU, to weigh the CPU in the global average: */
>  	PSI_NONIDLE,
> -	NR_PSI_STATES = 7,
> +	NR_PSI_STATES = 9,
>  };

Unfortunately, this grows the psi state touched by the scheduler into
a second cacheline. :( Please reclaim space first.

> I think we can remove the NR_ONCPU task count, which frees up one
u32. Something like the below diff should work (untested!)

And you should be able to remove PSI_IRQ_SOME, since it's not used
anyway. Then we'd be good to go.

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index c7fe7c089718..31dc76e2d8ea 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -15,13 +15,6 @@ enum psi_task_count {
 	NR_IOWAIT,
 	NR_MEMSTALL,
 	NR_RUNNING,
-	/*
-	 * This can't have values other than 0 or 1 and could be
-	 * implemented as a bit flag. But for now we still have room
-	 * in the first cacheline of psi_group_cpu, and this way we
-	 * don't have to special case any state tracking for it.
-	 */
-	NR_ONCPU,
 	/*
 	 * For IO and CPU stalls the presence of running/oncpu tasks
 	 * in the domain means a partial rather than a full stall.
@@ -39,9 +32,11 @@ enum psi_task_count {
 #define TSK_IOWAIT	(1 << NR_IOWAIT)
 #define TSK_MEMSTALL	(1 << NR_MEMSTALL)
 #define TSK_RUNNING	(1 << NR_RUNNING)
-#define TSK_ONCPU	(1 << NR_ONCPU)
 #define TSK_MEMSTALL_RUNNING	(1 << NR_MEMSTALL_RUNNING)
 
+/* Only one task can be scheduled, no corresponding task count */
+#define TSK_ONCPU	(1 << NR_PSI_TASK_COUNTS)
+
 /* Resources that workloads could be stalled on */
 enum psi_res {
 	PSI_IO,
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index a4fa3aadfcba..232e1dbfad46 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -215,7 +215,7 @@ void __init psi_init(void)
 	group_init(&psi_system);
 }
 
-static bool test_state(unsigned int *tasks, enum psi_states state)
+static bool test_state(unsigned int *tasks, enum psi_states state, bool oncpu)
 {
 	switch (state) {
 	case PSI_IO_SOME:
@@ -228,9 +228,9 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
 		return unlikely(tasks[NR_MEMSTALL] &&
 			tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);
 	case PSI_CPU_SOME:
-		return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
+		return unlikely(tasks[NR_RUNNING] > oncpu);
 	case PSI_CPU_FULL:
-		return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
+		return unlikely(tasks[NR_RUNNING] && !oncpu);
 	case PSI_NONIDLE:
 		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
 			tasks[NR_RUNNING];
@@ -692,9 +692,9 @@ static void psi_group_change(struct psi_group *group, int cpu,
 			     bool wake_clock)
 {
 	struct psi_group_cpu *groupc;
-	u32 state_mask = 0;
 	unsigned int t, m;
 	enum psi_states s;
+	u32 state_mask;
 
 	groupc = per_cpu_ptr(group->pcpu, cpu);
 
@@ -710,6 +710,26 @@ static void psi_group_change(struct psi_group *group, int cpu,
 
 	record_times(groupc, now);
 
+	/*
+	 * Start with TSK_ONCPU, which doesn't have a corresponding
+	 * task count - it's just a boolean flag directly encoded in
+	 * the state mask. Clear, set, or carry the current state if
+	 * no changes are requested.
+	 */
+	if (clear & TSK_ONCPU) {
+		state_mask = 0;
+		clear &= ~TSK_ONCPU;
+	} else if (set & TSK_ONCPU) {
+		state_mask = TSK_ONCPU;
+		set &= ~TSK_ONCPU;
+	} else {
+		state_mask = groupc->state_mask & TSK_ONCPU;
+	}
+
+	/*
+	 * The rest of the state mask is calculated based on the task
+	 * counts. Update those first, then construct the mask.
+	 */
 	for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
 		if (!(m & (1 << t)))
 			continue;
@@ -729,9 +749,8 @@ static void psi_group_change(struct psi_group *group, int cpu,
 		if (set & (1 << t))
 			groupc->tasks[t]++;
 
-	/* Calculate state mask representing active states */
 	for (s = 0; s < NR_PSI_STATES; s++) {
-		if (test_state(groupc->tasks, s))
+		if (test_state(groupc->tasks, s, state_mask & TSK_ONCPU))
 			state_mask |= (1 << s);
 	}
 
@@ -743,7 +762,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	 * task in a cgroup is in_memstall, the corresponding groupc
 	 * on that cpu is in PSI_MEM_FULL state.
 	 */
-	if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
+	if (unlikely((state_mask & TSK_ONCPU) && cpu_curr(cpu)->in_memstall))
 		state_mask |= (1 << PSI_MEM_FULL);
 
 	groupc->state_mask = state_mask;
@@ -847,7 +866,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		iter = NULL;
 		while ((group = iterate_groups(next, &iter))) {
 			if (identical_state &&
-			    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+			    (per_cpu_ptr(group->pcpu, cpu)->state_mask &
+			     TSK_ONCPU)) {
 				common = group;
 				break;
 			}

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/9] sched/psi: fix periodic aggregation shut off
  2022-07-25 15:39   ` Johannes Weiner
@ 2022-07-26 13:28     ` Chengming Zhou
  0 siblings, 0 replies; 38+ messages in thread
From: Chengming Zhou @ 2022-07-26 13:28 UTC (permalink / raw)
  To: Johannes Weiner, Chengming Zhou
  Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc,
	linux-kernel, songmuchun, cgroups

On 2022/7/25 23:39, Johannes Weiner wrote:
> On Thu, Jul 21, 2022 at 12:04:31PM +0800, Chengming Zhou wrote:
>> @@ -871,13 +861,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>>  				clear |= TSK_MEMSTALL_RUNNING;
>>  			if (prev->in_iowait)
>>  				set |= TSK_IOWAIT;
>> +
>> +			/*
>> +			 * Periodic aggregation shuts off if there is a period of no
>> +			 * task changes, so we wake it back up if necessary. However,
>> +			 * don't do this if the task change is the aggregation worker
>> +			 * itself going to sleep, or we'll ping-pong forever.
>> +			 */
>> +			if (unlikely((prev->flags & PF_WQ_WORKER) &&
>> +				     wq_worker_last_func(prev) == psi_avgs_work))
>> +				wake_clock = false;
>>  		}
>>  
>>  		psi_flags_change(prev, clear, set);
>>  
>>  		iter = NULL;
>>  		while ((group = iterate_groups(prev, &iter)) && group != common)
>> -			psi_group_change(group, cpu, clear, set, now, true);
>> +			psi_group_change(group, cpu, clear, set, now, wake_clock);
>>  
>>  		/*
>>  		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
> 
> Wait, there is another psi_group_change() below this, which handles
> the clearing of TSK_RUNNING for common ancestors. We don't want to
> wake those either, so it needs s/true/wake_clock/ as well.

Yes, I was wrong, will fix.

Thanks!

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [External] Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup
  2022-07-25 16:52   ` Johannes Weiner
@ 2022-07-26 13:38     ` Chengming Zhou
  2022-07-26 17:54     ` Tejun Heo
  1 sibling, 0 replies; 38+ messages in thread
From: Chengming Zhou @ 2022-07-26 13:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc,
	linux-kernel, songmuchun, cgroups

On 2022/7/26 00:52, Johannes Weiner wrote:
> On Thu, Jul 21, 2022 at 12:04:38PM +0800, Chengming Zhou wrote:
>> PSI accounts stalls for each cgroup separately and aggregates them
>> at each level of the hierarchy. This may cause non-negligible overhead
>> for some workloads under deep levels of the hierarchy.
>>
>> commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
>> made it possible for PSI to skip per-cgroup stall accounting and only
>> account system-wide, to avoid this per-level overhead.
>>
>> For our use case, we also want leaf cgroup PSI accounted for userspace
>> adjustment of that cgroup, apart from system-wide management alone.
> 
> I hear the overhead argument. But skipping accounting in intermediate
> levels is a bit odd and unprecedented in the cgroup interface. Once we
> do this, it's conceivable people would like to do the same thing for
> other stats and accounting, like for instance memory.stat.

Right, it's a bit odd... We don't use PSI stats at intermediate levels
in our use case, but we don't know what other usage scenarios exist. If
intermediate levels are useful for other people, this patch can be dropped.

Thanks.

> 
> Tejun, what are your thoughts on this?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [External] Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
  2022-07-25 18:26   ` Johannes Weiner
@ 2022-07-26 13:55     ` Chengming Zhou
  2022-07-27 11:28     ` Chengming Zhou
  1 sibling, 0 replies; 38+ messages in thread
From: Chengming Zhou @ 2022-07-26 13:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc,
	linux-kernel, songmuchun, cgroups

On 2022/7/26 02:26, Johannes Weiner wrote:
> On Thu, Jul 21, 2022 at 12:04:39PM +0800, Chengming Zhou wrote:
>> PSI already tracks workload pressure stall information for
>> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ can have an
>> obvious impact on the productivity of some workloads, such as
>> web service workloads.
>>
>> When CONFIG_IRQ_TIME_ACCOUNTING is enabled, we can get the IRQ/SOFTIRQ
>> delta time from update_rq_clock_task(), where we can record that delta
>> to the CPU curr task's cgroups as PSI_IRQ_FULL status.
>>
>> Note we don't use PSI_IRQ_SOME, since IRQ/SOFTIRQ always happens in the
>> context of the current task on the CPU, so nothing productive could run
>> even if it were runnable; hence we only use PSI_IRQ_FULL.
> 
> That sounds reasonable.
> 
>> Out of consideration for the performance impact, this is enabled by
>> default when CONFIG_IRQ_TIME_ACCOUNTING is enabled, but it can be
>> disabled by the kernel cmdline parameter "psi_irq=".
> 
> If there isn't a concrete usecase already, let's not add another
> commandline parameter for now.

Ok, will remove it.

> 
>> @@ -63,9 +64,11 @@ enum psi_states {
>>  	PSI_MEM_FULL,
>>  	PSI_CPU_SOME,
>>  	PSI_CPU_FULL,
>> +	PSI_IRQ_SOME,
>> +	PSI_IRQ_FULL,
>>  	/* Only per-CPU, to weigh the CPU in the global average: */
>>  	PSI_NONIDLE,
>> -	NR_PSI_STATES = 7,
>> +	NR_PSI_STATES = 9,
>>  };
> 
> Unfortunately, this grows the psi state touched by the scheduler into
> a second cacheline. :( Please reclaim space first.

Thank you for pointing this out!

> 
> I think we can remove the NR_ONCPU task count, which frees up one
> u32. Something like the below diff should work (untested!)
> 
> And you should be able to remove PSI_IRQ_SOME, since it's not used
> anyway. Then we'd be good to go.

Very good solution for this; I will test it later.

Thanks!

> 
> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> index c7fe7c089718..31dc76e2d8ea 100644
> --- a/include/linux/psi_types.h
> +++ b/include/linux/psi_types.h
> @@ -15,13 +15,6 @@ enum psi_task_count {
>  	NR_IOWAIT,
>  	NR_MEMSTALL,
>  	NR_RUNNING,
> -	/*
> -	 * This can't have values other than 0 or 1 and could be
> -	 * implemented as a bit flag. But for now we still have room
> -	 * in the first cacheline of psi_group_cpu, and this way we
> -	 * don't have to special case any state tracking for it.
> -	 */
> -	NR_ONCPU,
>  	/*
>  	 * For IO and CPU stalls the presence of running/oncpu tasks
>  	 * in the domain means a partial rather than a full stall.
> @@ -39,9 +32,11 @@ enum psi_task_count {
>  #define TSK_IOWAIT	(1 << NR_IOWAIT)
>  #define TSK_MEMSTALL	(1 << NR_MEMSTALL)
>  #define TSK_RUNNING	(1 << NR_RUNNING)
> -#define TSK_ONCPU	(1 << NR_ONCPU)
>  #define TSK_MEMSTALL_RUNNING	(1 << NR_MEMSTALL_RUNNING)
>  
> +/* Only one task can be scheduled, no corresponding task count */
> +#define TSK_ONCPU	(1 << NR_PSI_TASK_COUNTS)
> +
>  /* Resources that workloads could be stalled on */
>  enum psi_res {
>  	PSI_IO,
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index a4fa3aadfcba..232e1dbfad46 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -215,7 +215,7 @@ void __init psi_init(void)
>  	group_init(&psi_system);
>  }
>  
> -static bool test_state(unsigned int *tasks, enum psi_states state)
> +static bool test_state(unsigned int *tasks, enum psi_states state, bool oncpu)
>  {
>  	switch (state) {
>  	case PSI_IO_SOME:
> @@ -228,9 +228,9 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
>  		return unlikely(tasks[NR_MEMSTALL] &&
>  			tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);
>  	case PSI_CPU_SOME:
> -		return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
> +		return unlikely(tasks[NR_RUNNING] > oncpu);
>  	case PSI_CPU_FULL:
> -		return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
> +		return unlikely(tasks[NR_RUNNING] && !oncpu);
>  	case PSI_NONIDLE:
>  		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
>  			tasks[NR_RUNNING];
> @@ -692,9 +692,9 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  			     bool wake_clock)
>  {
>  	struct psi_group_cpu *groupc;
> -	u32 state_mask = 0;
>  	unsigned int t, m;
>  	enum psi_states s;
> +	u32 state_mask;
>  
>  	groupc = per_cpu_ptr(group->pcpu, cpu);
>  
> @@ -710,6 +710,26 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  
>  	record_times(groupc, now);
>  
> +	/*
> +	 * Start with TSK_ONCPU, which doesn't have a corresponding
> +	 * task count - it's just a boolean flag directly encoded in
> +	 * the state mask. Clear, set, or carry the current state if
> +	 * no changes are requested.
> +	 */
> +	if (clear & TSK_ONCPU) {
> +		state_mask = 0;
> +		clear &= ~TSK_ONCPU;
> +	} else if (set & TSK_ONCPU) {
> +		state_mask = TSK_ONCPU;
> +		set &= ~TSK_ONCPU;
> +	} else {
> +		state_mask = groupc->state_mask & TSK_ONCPU;
> +	}
> +
> +	/*
> +	 * The rest of the state mask is calculated based on the task
> +	 * counts. Update those first, then construct the mask.
> +	 */
>  	for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
>  		if (!(m & (1 << t)))
>  			continue;
> @@ -729,9 +749,8 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  		if (set & (1 << t))
>  			groupc->tasks[t]++;
>  
> -	/* Calculate state mask representing active states */
>  	for (s = 0; s < NR_PSI_STATES; s++) {
> -		if (test_state(groupc->tasks, s))
> +		if (test_state(groupc->tasks, s, state_mask & TSK_ONCPU))
>  			state_mask |= (1 << s);
>  	}
>  
> @@ -743,7 +762,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  	 * task in a cgroup is in_memstall, the corresponding groupc
>  	 * on that cpu is in PSI_MEM_FULL state.
>  	 */
> -	if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
> +	if (unlikely((state_mask & TSK_ONCPU) && cpu_curr(cpu)->in_memstall))
>  		state_mask |= (1 << PSI_MEM_FULL);
>  
>  	groupc->state_mask = state_mask;
> @@ -847,7 +866,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>  		iter = NULL;
>  		while ((group = iterate_groups(next, &iter))) {
>  			if (identical_state &&
> -			    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
> +			    (per_cpu_ptr(group->pcpu, cpu)->state_mask &
> +			     TSK_ONCPU)) {
>  				common = group;
>  				break;
>  			}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup
  2022-07-25 16:52   ` Johannes Weiner
  2022-07-26 13:38     ` [External] " Chengming Zhou
@ 2022-07-26 17:54     ` Tejun Heo
  2022-08-03 12:17       ` Chengming Zhou
  1 sibling, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2022-07-26 17:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Chengming Zhou, surenb, mingo, peterz, corbet, akpm, rdunlap,
	linux-doc, linux-kernel, songmuchun, cgroups

Hello,

On Mon, Jul 25, 2022 at 12:52:17PM -0400, Johannes Weiner wrote:
> On Thu, Jul 21, 2022 at 12:04:38PM +0800, Chengming Zhou wrote:
> > PSI accounts stalls for each cgroup separately and aggregates them
> > at each level of the hierarchy. This may cause non-negligible overhead
> > for some workloads under deep levels of the hierarchy.
> > 
> > commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
> > made it possible for PSI to skip per-cgroup stall accounting and only
> > account system-wide, to avoid this per-level overhead.
> > 
> > For our use case, we also want leaf cgroup PSI accounted for userspace
> > adjustment of that cgroup, apart from system-wide management alone.
> 
> I hear the overhead argument. But skipping accounting in intermediate
> levels is a bit odd and unprecedented in the cgroup interface. Once we
> do this, it's conceivable people would like to do the same thing for
> other stats and accounting, like for instance memory.stat.
> 
> Tejun, what are your thoughts on this?

Given that PSI requires on-the-spot recursive accumulation unlike other
stats, it can add quite a bit of overhead, so I'm sympathetic to the
argument, because PSI can't be made cheaper by the kernel being better (or
at least we don't know how to yet).

That said, "leaf-only" feels really hacky to me. My memory is hazy, but
there's nothing preventing any cgroup from being skipped over when updating
PSI states, right? The state count propagation is recursive, but it's each
task's state being propagated upwards, not the child cgroup's, so we can
skip over any cgroup arbitrarily. I.e., we can at least turn off PSI
reporting on any given cgroup without worrying about affecting others. Am
I correct?

Assuming the above isn't wrong, if we can figure out how we can re-enable
it, which is more difficult as the counters need to be resynchronized with
the current state, that'd be ideal. Then, we can just allow each cgroup to
enable / disable PSI reporting dynamically as they see fit.
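
To illustrate the skipping part (a userspace sketch only -- the 'enabled'
flag and these helpers are hypothetical, not the kernel code): since it's
the task's state that is walked up the chain, a disabled group can simply
be stepped over without corrupting its ancestors' counts.

#include <stdbool.h>
#include <stdio.h>

struct psi_group {
	const char *name;
	bool enabled;			/* hypothetical per-cgroup PSI toggle */
	unsigned int nr_running;
	struct psi_group *parent;
};

/*
 * Propagate one task's state change up the hierarchy, skipping any
 * group that has PSI reporting turned off. Skipped groups go stale
 * (which is why re-enabling needs a resync), but ancestors stay correct.
 */
static void task_change(struct psi_group *leaf, int delta)
{
	struct psi_group *g;

	for (g = leaf; g; g = g->parent) {
		if (!g->enabled)
			continue;
		g->nr_running += delta;
	}
}

int main(void)
{
	struct psi_group root = { "root", true,  0, NULL  };
	struct psi_group mid  = { "mid",  false, 0, &root };	/* PSI off */
	struct psi_group leaf = { "leaf", true,  0, &mid  };

	task_change(&leaf, +1);

	/* prints leaf=1 mid=0 root=1: the skipped level affects no one */
	printf("leaf=%u mid=%u root=%u\n",
	       leaf.nr_running, mid.nr_running, root.nr_running);
	return 0;
}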

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
  2022-07-25 18:26   ` Johannes Weiner
  2022-07-26 13:55     ` [External] " Chengming Zhou
@ 2022-07-27 11:28     ` Chengming Zhou
  2022-07-27 13:00       ` Johannes Weiner
  1 sibling, 1 reply; 38+ messages in thread
From: Chengming Zhou @ 2022-07-27 11:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc,
	linux-kernel, songmuchun, cgroups

On 2022/7/26 02:26, Johannes Weiner wrote:
> On Thu, Jul 21, 2022 at 12:04:39PM +0800, Chengming Zhou wrote:
>> PSI already tracks workload pressure stall information for
>> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ can have an
>> obvious impact on the productivity of some workloads, such as
>> web service workloads.
>>
>> When CONFIG_IRQ_TIME_ACCOUNTING is enabled, we can get the IRQ/SOFTIRQ
>> delta time from update_rq_clock_task(), where we can record that delta
>> to the CPU curr task's cgroups as PSI_IRQ_FULL status.
>>
>> Note we don't use PSI_IRQ_SOME, since IRQ/SOFTIRQ always happens in the
>> context of the current task on the CPU, so nothing productive could run
>> even if it were runnable; hence we only use PSI_IRQ_FULL.
> 
> That sounds reasonable.
> 
>> Out of consideration for the performance impact, this is enabled by
>> default when CONFIG_IRQ_TIME_ACCOUNTING is enabled, but it can be
>> disabled by the kernel cmdline parameter "psi_irq=".
> 
> If there isn't a concrete usecase already, let's not add another
> commandline parameter for now.
> 
>> @@ -63,9 +64,11 @@ enum psi_states {
>>  	PSI_MEM_FULL,
>>  	PSI_CPU_SOME,
>>  	PSI_CPU_FULL,
>> +	PSI_IRQ_SOME,
>> +	PSI_IRQ_FULL,
>>  	/* Only per-CPU, to weigh the CPU in the global average: */
>>  	PSI_NONIDLE,
>> -	NR_PSI_STATES = 7,
>> +	NR_PSI_STATES = 9,
>>  };
> 
> Unfortunately, this grows the psi state touched by the scheduler into
> a second cacheline. :( Please reclaim space first.
> 
> I think we can remove the NR_ONCPU task count, which frees up one
> u32. Something like the below diff should work (untested!)

Hi, I tested it OK. Would you mind if I put this patch in this series?


Subject: [PATCH] sched/psi: remove NR_ONCPU task accounting

We put all fields updated by the scheduler in the first cacheline of
struct psi_group_cpu for performance.

Since we want to add another PSI_IRQ_FULL to track IRQ/SOFTIRQ pressure,
we need to reclaim space first. This patch removes the NR_ONCPU task
accounting in struct psi_group_cpu and uses TSK_ONCPU in state_mask to
track it instead.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>


> 
> And you should be able to remove PSI_IRQ_SOME, since it's not used
> anyway. Then we'd be good to go.
> 
> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> index c7fe7c089718..31dc76e2d8ea 100644
> --- a/include/linux/psi_types.h
> +++ b/include/linux/psi_types.h
> @@ -15,13 +15,6 @@ enum psi_task_count {
>  	NR_IOWAIT,
>  	NR_MEMSTALL,
>  	NR_RUNNING,
> -	/*
> -	 * This can't have values other than 0 or 1 and could be
> -	 * implemented as a bit flag. But for now we still have room
> -	 * in the first cacheline of psi_group_cpu, and this way we
> -	 * don't have to special case any state tracking for it.
> -	 */
> -	NR_ONCPU,
>  	/*
>  	 * For IO and CPU stalls the presence of running/oncpu tasks
>  	 * in the domain means a partial rather than a full stall.
> @@ -39,9 +32,11 @@ enum psi_task_count {
>  #define TSK_IOWAIT	(1 << NR_IOWAIT)
>  #define TSK_MEMSTALL	(1 << NR_MEMSTALL)
>  #define TSK_RUNNING	(1 << NR_RUNNING)
> -#define TSK_ONCPU	(1 << NR_ONCPU)
>  #define TSK_MEMSTALL_RUNNING	(1 << NR_MEMSTALL_RUNNING)
>  
> +/* Only one task can be scheduled, no corresponding task count */
> +#define TSK_ONCPU	(1 << NR_PSI_TASK_COUNTS)
> +
>  /* Resources that workloads could be stalled on */
>  enum psi_res {
>  	PSI_IO,
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index a4fa3aadfcba..232e1dbfad46 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -215,7 +215,7 @@ void __init psi_init(void)
>  	group_init(&psi_system);
>  }
>  
> -static bool test_state(unsigned int *tasks, enum psi_states state)
> +static bool test_state(unsigned int *tasks, enum psi_states state, bool oncpu)
>  {
>  	switch (state) {
>  	case PSI_IO_SOME:
> @@ -228,9 +228,9 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
>  		return unlikely(tasks[NR_MEMSTALL] &&
>  			tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);
>  	case PSI_CPU_SOME:
> -		return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
> +		return unlikely(tasks[NR_RUNNING] > oncpu);
>  	case PSI_CPU_FULL:
> -		return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
> +		return unlikely(tasks[NR_RUNNING] && !oncpu);
>  	case PSI_NONIDLE:
>  		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
>  			tasks[NR_RUNNING];
> @@ -692,9 +692,9 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  			     bool wake_clock)
>  {
>  	struct psi_group_cpu *groupc;
> -	u32 state_mask = 0;
>  	unsigned int t, m;
>  	enum psi_states s;
> +	u32 state_mask;
>  
>  	groupc = per_cpu_ptr(group->pcpu, cpu);
>  
> @@ -710,6 +710,26 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  
>  	record_times(groupc, now);
>  
> +	/*
> +	 * Start with TSK_ONCPU, which doesn't have a corresponding
> +	 * task count - it's just a boolean flag directly encoded in
> +	 * the state mask. Clear, set, or carry the current state if
> +	 * no changes are requested.
> +	 */
> +	if (clear & TSK_ONCPU) {
> +		state_mask = 0;
> +		clear &= ~TSK_ONCPU;
> +	} else if (set & TSK_ONCPU) {
> +		state_mask = TSK_ONCPU;
> +		set &= ~TSK_ONCPU;
> +	} else {
> +		state_mask = groupc->state_mask & TSK_ONCPU;
> +	}
> +
> +	/*
> +	 * The rest of the state mask is calculated based on the task
> +	 * counts. Update those first, then construct the mask.
> +	 */
>  	for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
>  		if (!(m & (1 << t)))
>  			continue;
> @@ -729,9 +749,8 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  		if (set & (1 << t))
>  			groupc->tasks[t]++;
>  
> -	/* Calculate state mask representing active states */
>  	for (s = 0; s < NR_PSI_STATES; s++) {
> -		if (test_state(groupc->tasks, s))
> +		if (test_state(groupc->tasks, s, state_mask & TSK_ONCPU))
>  			state_mask |= (1 << s);
>  	}
>  
> @@ -743,7 +762,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  	 * task in a cgroup is in_memstall, the corresponding groupc
>  	 * on that cpu is in PSI_MEM_FULL state.
>  	 */
> -	if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
> +	if (unlikely((state_mask & TSK_ONCPU) && cpu_curr(cpu)->in_memstall))
>  		state_mask |= (1 << PSI_MEM_FULL);
>  
>  	groupc->state_mask = state_mask;
> @@ -847,7 +866,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>  		iter = NULL;
>  		while ((group = iterate_groups(next, &iter))) {
>  			if (identical_state &&
> -			    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
> +			    (per_cpu_ptr(group->pcpu, cpu)->state_mask &
> +			     TSK_ONCPU)) {
>  				common = group;
>  				break;
>  			}
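
For reference, a back-of-the-envelope sketch of the layout this change is
aiming for, assuming the diff above is combined with dropping PSI_IRQ_SOME
as suggested (the byte counts are untested arithmetic, not pahole output):

struct psi_group_cpu {
	seqcount_t	seq;		/* offset  0,  4 bytes */
	unsigned int	tasks[4];	/* offset  4, 16 bytes: NR_ONCPU gone */
	u32		state_mask;	/* offset 20,  4 bytes: TSK_ONCPU bit lives here */
	u32		times[8];	/* offset 24, 32 bytes: 7 states + PSI_IRQ_FULL */
	u64		state_start;	/* offset 56,  8 bytes */
	/* 64 bytes: the scheduler-hot fields fit in one cacheline again */
	u32		times_prev[2][8] ____cacheline_aligned_in_smp;
} ____cacheline_aligned_in_smp;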

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
  2022-07-27 11:28     ` Chengming Zhou
@ 2022-07-27 13:00       ` Johannes Weiner
  2022-07-27 15:09         ` Chengming Zhou
  0 siblings, 1 reply; 38+ messages in thread
From: Johannes Weiner @ 2022-07-27 13:00 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc,
	linux-kernel, songmuchun, cgroups

On Wed, Jul 27, 2022 at 07:28:37PM +0800, Chengming Zhou wrote:
> On 2022/7/26 02:26, Johannes Weiner wrote:
> > I think we can remove the NR_ONCPU task count, which frees up one
> > u32. Something like the below diff should work (untested!)
> 
> Hi, I tested it and it works. Would you mind if I put this patch in this series?
> 
> Subject: [PATCH] sched/psi: remove NR_ONCPU task accounting
> 
> We put all fields updated by the scheduler in the first cacheline of
> struct psi_group_cpu for performance.
> 
> Since we want to add another PSI_IRQ_FULL to track IRQ/SOFTIRQ pressure,
> we need to reclaim space first. This patch removes the NR_ONCPU task
> accounting in struct psi_group_cpu and uses TSK_ONCPU in state_mask to
> track it instead.

Thanks for testing it, that sounds good.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
> Tested-by: Chengming Zhou <zhouchengming@bytedance.com>

Since you're handling the patch, you need to add your own
Signed-off-by: as well. And keep From: Johannes (git commit --author).

Thanks!

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
  2022-07-27 13:00       ` Johannes Weiner
@ 2022-07-27 15:09         ` Chengming Zhou
  0 siblings, 0 replies; 38+ messages in thread
From: Chengming Zhou @ 2022-07-27 15:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc,
	linux-kernel, songmuchun, cgroups

On 2022/7/27 21:00, Johannes Weiner wrote:
> On Wed, Jul 27, 2022 at 07:28:37PM +0800, Chengming Zhou wrote:
>> On 2022/7/26 02:26, Johannes Weiner wrote:
>>> I think we can remove the NR_ONCPU task count, which frees up one
>>> u32. Something like the below diff should work (untested!)
>>
>> Hi, I tested it and it works. Would you mind if I put this patch in this series?
>>
>> Subject: [PATCH] sched/psi: remove NR_ONCPU task accounting
>>
>> We put all fields updated by the scheduler in the first cacheline of
>> struct psi_group_cpu for performance.
>>
>> Since we want to add another PSI_IRQ_FULL to track IRQ/SOFTIRQ pressure,
>> we need to reclaim space first. This patch removes the NR_ONCPU task
>> accounting in struct psi_group_cpu and uses TSK_ONCPU in state_mask to
>> track it instead.
> 
> Thanks for testing it, that sounds good.
> 
>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
>> Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
> 
> Since you're handling the patch, you need to add your own
> Signed-off-by: as well. And keep From: Johannes (git commit --author).

Got it. Thanks!


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
  2022-07-21  4:04 ` [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou
                     ` (3 preceding siblings ...)
  2022-07-25 18:26   ` Johannes Weiner
@ 2022-07-27 16:07   ` Peter Zijlstra
  4 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2022-07-27 16:07 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: hannes, surenb, mingo, tj, corbet, akpm, rdunlap, linux-doc,
	linux-kernel, songmuchun, cgroups

On Thu, Jul 21, 2022 at 12:04:39PM +0800, Chengming Zhou wrote:
> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> index c124f7d186d0..195f123b1cd1 100644
> --- a/include/linux/psi_types.h
> +++ b/include/linux/psi_types.h
> @@ -47,7 +47,8 @@ enum psi_res {
>  	PSI_IO,
>  	PSI_MEM,
>  	PSI_CPU,
> -	NR_PSI_RESOURCES = 3,
> +	PSI_IRQ,
> +	NR_PSI_RESOURCES = 4,
>  };
>  
>  /*
> @@ -63,9 +64,11 @@ enum psi_states {
>  	PSI_MEM_FULL,
>  	PSI_CPU_SOME,
>  	PSI_CPU_FULL,
> +	PSI_IRQ_SOME,
> +	PSI_IRQ_FULL,
>  	/* Only per-CPU, to weigh the CPU in the global average: */
>  	PSI_NONIDLE,
> -	NR_PSI_STATES = 7,
> +	NR_PSI_STATES = 9,
>  };
>  
>  enum psi_aggregators {

$ pahole -EC psi_group_cpu defconfig-build/kernel/sched/build_utility.o
struct psi_group_cpu {
        /* typedef seqcount_t */ struct seqcount {
                unsigned int       sequence;                                             /*     0     4 */
        } seq __attribute__((__aligned__(64))); /*     0     4 */
        unsigned int               tasks[5];                                             /*     4    20 */
        /* typedef u32 -> __u32 */ unsigned int               state_mask;                /*    24     4 */
        /* typedef u32 -> __u32 */ unsigned int               times[7];                  /*    28    28 */
        /* typedef u64 -> __u64 */ long long unsigned int     state_start;               /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        /* typedef u32 -> __u32 */ unsigned int               times_prev[2][7] __attribute__((__aligned__(64))); /*    64    56 */

        /* size: 128, cachelines: 2, members: 6 */
        /* padding: 8 */
        /* forced alignments: 2 */
} __attribute__((__aligned__(64)));


$ pahole -EC psi_group_cpu defconfig-build/kernel/sched/build_utility.o
struct psi_group_cpu {
        /* typedef seqcount_t */ struct seqcount {
                unsigned int       sequence;                                             /*     0     4 */
        } seq __attribute__((__aligned__(64))); /*     0     4 */
        unsigned int               tasks[5];                                             /*     4    20 */
        /* typedef u32 -> __u32 */ unsigned int               state_mask;                /*    24     4 */
        /* typedef u32 -> __u32 */ unsigned int               times[9];                  /*    28    36 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        /* typedef u64 -> __u64 */ long long unsigned int     state_start;               /*    64     8 */

        /* XXX 56 bytes hole, try to pack */

        /* --- cacheline 2 boundary (128 bytes) --- */
        /* typedef u32 -> __u32 */ unsigned int               times_prev[2][9] __attribute__((__aligned__(64))); /*   128    72 */

        /* size: 256, cachelines: 4, members: 6 */
        /* sum members: 144, holes: 1, sum holes: 56 */
        /* padding: 56 */
        /* forced alignments: 2, forced holes: 1, sum forced holes: 56 */
} __attribute__((__aligned__(64)));


So yeah, I think not.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup
  2022-07-26 17:54     ` Tejun Heo
@ 2022-08-03 12:17       ` Chengming Zhou
  2022-08-03 17:58         ` Tejun Heo
  0 siblings, 1 reply; 38+ messages in thread
From: Chengming Zhou @ 2022-08-03 12:17 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner
  Cc: surenb, mingo, peterz, corbet, akpm, rdunlap, linux-doc,
	linux-kernel, songmuchun, cgroups

On 2022/7/27 01:54, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jul 25, 2022 at 12:52:17PM -0400, Johannes Weiner wrote:
>> On Thu, Jul 21, 2022 at 12:04:38PM +0800, Chengming Zhou wrote:
>>> PSI accounts stalls for each cgroup separately and aggregates them
>>> at each level of the hierarchy. This may cause non-negligible overhead
>>> for some workloads under deep levels of the hierarchy.
>>>
>>> commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
>>> made PSI skip per-cgroup stall accounting and only account system-wide,
>>> to avoid this per-level overhead.
>>>
>>> For our use case, we also want leaf cgroup PSI accounted, so that
>>> userspace can make adjustments on that cgroup, beyond system-wide
>>> management alone.
>>
>> I hear the overhead argument. But skipping accounting in intermediate
>> levels is a bit odd and unprecedented in the cgroup interface. Once we
>> do this, it's conceivable people would like to do the same thing for
>> other stats and accounting, like for instance memory.stat.
>>
>> Tejun, what are your thoughts on this?
> 
> Given that PSI requires on-the-spot recursive accumulation unlike other
> stats, it can add quite a bit of overhead, so I'm sympathetic to the
> argument because PSI can't be made cheaper by the kernel being better (or at
> least we don't know how to yet).
> 
> That said, "leaf-only" feels really hacky to me. My memory is hazy but
> there's nothing preventing any cgroup from being skipped over when updating
> PSI states, right? The state count propagation is recursive but it's each
> task's state being propagated upwards, not the child cgroup's, so we can skip
> over any cgroup arbitrarily. ie. we can at least turn off PSI reporting on
> any given cgroup without worrying about affecting others. Am I correct?

Yes, I think it's correct.
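
A minimal sketch of what that skipping could look like in the task state
update paths (untested; locals as in psi_task_change(), and the per-group
"enabled" flag is hypothetical):

	while ((group = iterate_groups(task, &iter))) {
		/* hypothetical per-group opt-out: skip this level only */
		if (!READ_ONCE(group->enabled))
			continue;
		psi_group_change(group, cpu, clear, set, now, true);
	}

Parents above a skipped level still see the task's state, since each level
is updated from the task directly rather than from its child cgroup.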

> 
> Assuming the above isn't wrong, if we can figure out how we can re-enable
> it, which is more difficult as the counters need to be resynchronized with
> the current state, that'd be ideal. Then, we can just allow each cgroup to
> enable / disable PSI reporting dynamically as they see fit.

This method is more fine-grained but more difficult, like you said above.
I think disabling PSI stats in intermediate cgroups may meet most needs?

Thanks!

> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup
  2022-08-03 12:17       ` Chengming Zhou
@ 2022-08-03 17:58         ` Tejun Heo
  2022-08-03 19:22           ` Johannes Weiner
  2022-08-04  2:02           ` Chengming Zhou
  0 siblings, 2 replies; 38+ messages in thread
From: Tejun Heo @ 2022-08-03 17:58 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Johannes Weiner, surenb, mingo, peterz, corbet, akpm, rdunlap,
	linux-doc, linux-kernel, songmuchun, cgroups

Hello,

On Wed, Aug 03, 2022 at 08:17:22PM +0800, Chengming Zhou wrote:
> > Assuming the above isn't wrong, if we can figure out how we can re-enable
> > it, which is more difficult as the counters need to be resynchronized with
> > the current state, that'd be ideal. Then, we can just allow each cgroup to
> > enable / disable PSI reporting dynamically as they see fit.
> 
> This method is more fine-grained but more difficult, like you said above.
> I think disabling PSI stats in intermediate cgroups may meet most needs?

So, I'm not necessarily against implementing something easier but we at
least wanna get the interface right, so that if we decide to do the full
thing later we can easily expand on the existing interface. ie. let's please
not be too hacky. I don't think it'd be that difficult to implement
per-cgroup disable-only operation that we can later expand to allow
re-enabling, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup
  2022-08-03 17:58         ` Tejun Heo
@ 2022-08-03 19:22           ` Johannes Weiner
  2022-08-03 19:48             ` Tejun Heo
  2022-08-04 13:51             ` Chengming Zhou
  2022-08-04  2:02           ` Chengming Zhou
  1 sibling, 2 replies; 38+ messages in thread
From: Johannes Weiner @ 2022-08-03 19:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Chengming Zhou, surenb, mingo, peterz, corbet, akpm, rdunlap,
	linux-doc, linux-kernel, songmuchun, cgroups

On Wed, Aug 03, 2022 at 07:58:27AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Aug 03, 2022 at 08:17:22PM +0800, Chengming Zhou wrote:
> > > Assuming the above isn't wrong, if we can figure out how we can re-enable
> > > it, which is more difficult as the counters need to be resynchronized with
> > > the current state, that'd be ideal. Then, we can just allow each cgroup to
> > > enable / disable PSI reporting dynamically as they see fit.
> > 
> > This method is more fine-grained but more difficult, like you said above.
> > I think disabling PSI stats in intermediate cgroups may meet most needs?
> 
> So, I'm not necessarily against implementing something easier but we at
> least wanna get the interface right, so that if we decide to do the full
> thing later we can easily expand on the existing interface. ie. let's please
> not be too hacky. I don't think it'd be that difficult to implement
> per-cgroup disable-only operation that we can later expand to allow
> re-enabling, right?

It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level, if
we can live with losing history from while it was disabled. I.e. the
avgs will restart from 0, total= will have gaps - should be okay, IMO.

Where it gets trickier is also stopping the tracking of task counts in
a cgroup. For re-enabling afterwards, we'd have to freeze scheduler
and cgroup state and find all tasks of interest across all CPUs for
the given cgroup to recreate the counts. I'm not quite sure whether
that's feasible, and if so, whether it's worth the savings.

It might be good to benchmark the two disabling steps independently.
Maybe stopping aggregation while keeping task counts is good enough,
and we can commit to a disable/re-enable interface from the start.

Or maybe it's all in the cachelines and iteration, and stopping the
aggregation while still writing task counts isn't saving much. In that
case we'd have to look closer at reconstructing task counts, to see if
later re-enabling is actually a practical option or whether a one-off
kill switch is more realistic.

Chengming, can you experiment with disabling: record_times(), the
test_state() loop and state_mask construction, and the averaging
worker - while keeping the groupc->tasks updates?
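
A sketch of that experiment (untested; the "psi_aggregate" toggle is
hypothetical, purely to mark which parts would be switched off and which
would stay):

static bool psi_aggregate;	/* hypothetical kill switch for the experiment */

static void psi_group_change(struct psi_group *group, int cpu,
			     unsigned int clear, unsigned int set,
			     u64 now, bool wake_clock)
{
	struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
	enum psi_states s;
	u32 state_mask = 0;

	write_seqcount_begin(&groupc->seq);

	/* ... the groupc->tasks[] updates stay exactly as they are ... */

	if (psi_aggregate) {
		record_times(groupc, now);
		for (s = 0; s < NR_PSI_STATES; s++) {
			if (test_state(groupc->tasks, s))
				state_mask |= (1 << s);
		}
		groupc->state_mask = state_mask;
	}

	write_seqcount_end(&groupc->seq);
}

The averaging worker would be gated off the same way.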

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup
  2022-08-03 19:22           ` Johannes Weiner
@ 2022-08-03 19:48             ` Tejun Heo
  2022-08-04 13:51             ` Chengming Zhou
  1 sibling, 0 replies; 38+ messages in thread
From: Tejun Heo @ 2022-08-03 19:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Chengming Zhou, surenb, mingo, peterz, corbet, akpm, rdunlap,
	linux-doc, linux-kernel, songmuchun, cgroups

Hello,

On Wed, Aug 03, 2022 at 03:22:16PM -0400, Johannes Weiner wrote:
> Where it gets trickier is also stopping the tracking of task counts in
> a cgroup. For re-enabling afterwards, we'd have to freeze scheduler
> and cgroup state and find all tasks of interest across all CPUs for
> the given cgroup to recreate the counts. I'm not quite sure whether
> that's feasible, and if so, whether it's worth the savings.

If this turns out to be necessary, I wonder whether we can just be
opportunistic. ie. don't bother with walking all the tasks but only remember
whether a task is accounted at a given level or not (this can be a bitmap
which is allocated at cgroup attach time and in most cases will be pretty
small). Then, maybe we can just start accounting them as they cycle through
state transitions - we ignore the ones leaving states that they weren't
accounted for and start remembering the new states they enter.
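
Roughly, that opportunistic scheme could look like the following (untested
sketch; "psi_acct_levels" and the "level" parameter are hypothetical):

	/*
	 * On a task state transition at ancestor depth "level":
	 * ignore states the task was never accounted for at this
	 * level, and start remembering the ones it newly enters.
	 */
	if (clear && !test_bit(level, task->psi_acct_levels))
		clear = 0;
	if (set)
		set_bit(level, task->psi_acct_levels);
	if (clear || set)
		psi_group_change(group, cpu, clear, set, now, true);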

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup
  2022-08-03 17:58         ` Tejun Heo
  2022-08-03 19:22           ` Johannes Weiner
@ 2022-08-04  2:02           ` Chengming Zhou
  1 sibling, 0 replies; 38+ messages in thread
From: Chengming Zhou @ 2022-08-04  2:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, surenb, mingo, peterz, corbet, akpm, rdunlap,
	linux-doc, linux-kernel, songmuchun, cgroups

On 2022/8/4 01:58, Tejun Heo wrote:
> Hello,
> 
> On Wed, Aug 03, 2022 at 08:17:22PM +0800, Chengming Zhou wrote:
>>> Assuming the above isn't wrong, if we can figure out how we can re-enable
>>> it, which is more difficult as the counters need to be resynchronized with
>>> the current state, that'd be ideal. Then, we can just allow each cgroup to
>>> enable / disable PSI reporting dynamically as they see fit.
>>
>> This method is more fine-grained but more difficult, like you said above.
>> I think disabling PSI stats in intermediate cgroups may meet most needs?
> 
> So, I'm not necessarily against implementing something easier but we at
> least wanna get the interface right, so that if we decide to do the full
> thing later we can easily expand on the existing interface. ie. let's please
> not be too hacky. I don't think it'd be that difficult to implement
> per-cgroup disable-only operation that we can later expand to allow
> re-enabling, right?

Agreed, the interface is important; a per-cgroup disable-only operation may be easier
to implement. I will look into this more.

Thanks!


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup
  2022-08-03 19:22           ` Johannes Weiner
  2022-08-03 19:48             ` Tejun Heo
@ 2022-08-04 13:51             ` Chengming Zhou
  2022-08-04 16:56               ` Johannes Weiner
  1 sibling, 1 reply; 38+ messages in thread
From: Chengming Zhou @ 2022-08-04 13:51 UTC (permalink / raw)
  To: Johannes Weiner, Tejun Heo
  Cc: surenb, mingo, peterz, corbet, akpm, rdunlap, linux-doc,
	linux-kernel, songmuchun, cgroups

On 2022/8/4 03:22, Johannes Weiner wrote:
> On Wed, Aug 03, 2022 at 07:58:27AM -1000, Tejun Heo wrote:
>> Hello,
>>
>> On Wed, Aug 03, 2022 at 08:17:22PM +0800, Chengming Zhou wrote:
>>>> Assuming the above isn't wrong, if we can figure out how we can re-enable
>>>> it, which is more difficult as the counters need to be resynchronized with
>>>> the current state, that'd be ideal. Then, we can just allow each cgroup to
>>>> enable / disable PSI reporting dynamically as they see fit.
>>>
>>> This method is more fine-grained but more difficult, like you said above.
>>> I think disabling PSI stats in intermediate cgroups may meet most needs?
>>
>> So, I'm not necessarily against implementing something easier but we at
>> least wanna get the interface right, so that if we decide to do the full
>> thing later we can easily expand on the existing interface. ie. let's please
>> not be too hacky. I don't think it'd be that difficult to implement
>> per-cgroup disable-only operation that we can later expand to allow
>> re-enabling, right?
> 
> It should be relatively straight-forward to disable and re-enable
> state aggregation, time tracking, averaging on a per-cgroup level, if
> we can live with losing history from while it was disabled. I.e. the
> avgs will restart from 0, total= will have gaps - should be okay, IMO.
> 
> Where it gets trickier is also stopping the tracking of task counts in
> a cgroup. For re-enabling afterwards, we'd have to freeze scheduler
> and cgroup state and find all tasks of interest across all CPUs for
> the given cgroup to recreate the counts. I'm not quite sure whether
> that's feasible, and if so, whether it's worth the savings.
> 
> It might be good to benchmark the two disabling steps independently.
> Maybe stopping aggregation while keeping task counts is good enough,
> and we can commit to a disable/re-enable interface from the start.
> 
> Or maybe it's all in the cachelines and iteration, and stopping the
> aggregation while still writing task counts isn't saving much. In that
> case we'd have to look closer at reconstructing task counts, to see if
> later re-enabling is actually a practical option or whether a one-off
> kill switch is more realistic.
> 
> Chengming, can you experiment with disabling: record_times(), the
> test_state() loop and state_mask construction, and the averaging
> worker - while keeping the groupc->tasks updates?

Hello,

I did this experiment today, disabling record_times(), the test_state()
loop, and the averaging worker while keeping only the groupc->tasks[]
updates. The results look promising.

mmtests/config-scheduler-perfpipe on Intel Xeon Platinum with 3 levels of cgroup:

perfpipe
                                  tip                    tip                patched
                              psi=off                 psi=on      only groupc->tasks[]
Min       Time        7.99 (   0.00%)        8.86 ( -10.95%)        8.31 (  -4.08%)
1st-qrtle Time        8.11 (   0.00%)        8.94 ( -10.22%)        8.39 (  -3.46%)
2nd-qrtle Time        8.17 (   0.00%)        9.02 ( -10.42%)        8.44 (  -3.37%)
3rd-qrtle Time        8.20 (   0.00%)        9.08 ( -10.72%)        8.48 (  -3.43%)
Max-1     Time        7.99 (   0.00%)        8.86 ( -10.95%)        8.31 (  -4.08%)
Max-5     Time        7.99 (   0.00%)        8.86 ( -10.95%)        8.31 (  -4.08%)
Max-10    Time        8.09 (   0.00%)        8.89 (  -9.96%)        8.35 (  -3.22%)
Max-90    Time        8.31 (   0.00%)        9.13 (  -9.90%)        8.55 (  -2.95%)
Max-95    Time        8.32 (   0.00%)        9.14 (  -9.88%)        8.55 (  -2.81%)
Max-99    Time        8.39 (   0.00%)        9.26 ( -10.30%)        8.57 (  -2.09%)
Max       Time        8.56 (   0.00%)        9.26 (  -8.23%)        8.72 (  -1.90%)
Amean     Time        8.19 (   0.00%)        9.03 * -10.26%*        8.45 *  -3.27%*


Tejun suggested using a per-task bitmap to remember whether the task is
accounted at a given level or not, which I think is also a very good idea,
but I haven't clearly figured out how to do it yet.

The above performance test result looks good to me, so I think we can implement this
per-cgroup "cgroup.psi" interface to disable/re-enable PSI stats from the start,
and we can change to a better implementation if needed later?
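
As a sketch, such a knob could be exposed as a regular cgroup interface
file (handler names are hypothetical, untested):

	{
		.name = "cgroup.psi",
		.flags = CFTYPE_NOT_ON_ROOT,
		.seq_show = cgroup_psi_show,	/* print current 0/1 state */
		.write = cgroup_psi_write,	/* toggle per-cgroup tracking */
	},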

Thanks!


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup
  2022-08-04 13:51             ` Chengming Zhou
@ 2022-08-04 16:56               ` Johannes Weiner
  0 siblings, 0 replies; 38+ messages in thread
From: Johannes Weiner @ 2022-08-04 16:56 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Tejun Heo, surenb, mingo, peterz, corbet, akpm, rdunlap,
	linux-doc, linux-kernel, songmuchun, cgroups

On Thu, Aug 04, 2022 at 09:51:31PM +0800, Chengming Zhou wrote:
> On 2022/8/4 03:22, Johannes Weiner wrote:
> > On Wed, Aug 03, 2022 at 07:58:27AM -1000, Tejun Heo wrote:
> >> Hello,
> >>
> >> On Wed, Aug 03, 2022 at 08:17:22PM +0800, Chengming Zhou wrote:
> >>>> Assuming the above isn't wrong, if we can figure out how we can re-enable
> >>>> it, which is more difficult as the counters need to be resynchronized with
> >>>> the current state, that'd be ideal. Then, we can just allow each cgroup to
> >>>> enable / disable PSI reporting dynamically as they see fit.
> >>>
> >>> This method is more fine-grained but more difficult, like you said above.
> >>> I think disabling PSI stats in intermediate cgroups may meet most needs?
> >>
> >> So, I'm not necessarily against implementing something easier but we at
> >> least wanna get the interface right, so that if we decide to do the full
> >> thing later we can easily expand on the existing interface. ie. let's please
> >> not be too hacky. I don't think it'd be that difficult to implement
> >> per-cgroup disable-only operation that we can later expand to allow
> >> re-enabling, right?
> > 
> > It should be relatively straight-forward to disable and re-enable
> > state aggregation, time tracking, averaging on a per-cgroup level, if
> > we can live with losing history from while it was disabled. I.e. the
> > avgs will restart from 0, total= will have gaps - should be okay, IMO.
> > 
> > Where it gets trickier is also stopping the tracking of task counts in
> > a cgroup. For re-enabling afterwards, we'd have to freeze scheduler
> > and cgroup state and find all tasks of interest across all CPUs for
> > the given cgroup to recreate the counts. I'm not quite sure whether
> > that's feasible, and if so, whether it's worth the savings.
> > 
> > It might be good to benchmark the two disabling steps independently.
> > Maybe stopping aggregation while keeping task counts is good enough,
> > and we can commit to a disable/re-enable interface from the start.
> > 
> > Or maybe it's all in the cachelines and iteration, and stopping the
> > aggregation while still writing task counts isn't saving much. In that
> > case we'd have to look closer at reconstructing task counts, to see if
> > later re-enabling is actually a practical option or whether a one-off
> > kill switch is more realistic.
> > 
> > Chengming, can you experiment with disabling: record_times(), the
> > test_state() loop and state_mask construction, and the averaging
> > worker - while keeping the groupc->tasks updates?
> 
> Hello,
> 
> I did this experiment today, disabling record_times(), the test_state()
> loop, and the averaging worker while keeping only the groupc->tasks[]
> updates. The results look promising.
> 
> mmtests/config-scheduler-perfpipe on Intel Xeon Platinum with 3 levels of cgroup:
> 
> perfpipe
>                                   tip                    tip                patched
>                               psi=off                 psi=on      only groupc->tasks[]
> Min       Time        7.99 (   0.00%)        8.86 ( -10.95%)        8.31 (  -4.08%)
> 1st-qrtle Time        8.11 (   0.00%)        8.94 ( -10.22%)        8.39 (  -3.46%)
> 2nd-qrtle Time        8.17 (   0.00%)        9.02 ( -10.42%)        8.44 (  -3.37%)
> 3rd-qrtle Time        8.20 (   0.00%)        9.08 ( -10.72%)        8.48 (  -3.43%)
> Max-1     Time        7.99 (   0.00%)        8.86 ( -10.95%)        8.31 (  -4.08%)
> Max-5     Time        7.99 (   0.00%)        8.86 ( -10.95%)        8.31 (  -4.08%)
> Max-10    Time        8.09 (   0.00%)        8.89 (  -9.96%)        8.35 (  -3.22%)
> Max-90    Time        8.31 (   0.00%)        9.13 (  -9.90%)        8.55 (  -2.95%)
> Max-95    Time        8.32 (   0.00%)        9.14 (  -9.88%)        8.55 (  -2.81%)
> Max-99    Time        8.39 (   0.00%)        9.26 ( -10.30%)        8.57 (  -2.09%)
> Max       Time        8.56 (   0.00%)        9.26 (  -8.23%)        8.72 (  -1.90%)
> Amean     Time        8.19 (   0.00%)        9.03 * -10.26%*        8.45 *  -3.27%*

Fantastic!

> Tejun suggested using a per-task bitmap to remember whether the task is
> accounted at a given level or not, which I think is also a very good idea,
> but I haven't clearly figured out how to do it yet.
> 
> The above performance test result looks good to me, so I think we can implement this
> per-cgroup "cgroup.psi" interface to disable/re-enable PSI stats from the start,
> and we can change to a better implementation if needed later?

Yes, that sounds good to me.

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread

Thread overview: 38+ messages
2022-07-21  4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou
2022-07-21  4:04 ` [PATCH 1/9] sched/psi: fix periodic aggregation shut off Chengming Zhou
2022-07-25 15:34   ` Johannes Weiner
2022-07-25 15:39   ` Johannes Weiner
2022-07-26 13:28     ` Chengming Zhou
2022-07-21  4:04 ` [PATCH 2/9] sched/psi: optimize task switch inside shared cgroups again Chengming Zhou
2022-07-21  4:04 ` [PATCH 3/9] sched/psi: move private helpers to sched/stats.h Chengming Zhou
2022-07-25 16:39   ` Johannes Weiner
2022-07-21  4:04 ` [PATCH 4/9] sched/psi: don't change task psi_flags when migrate CPU/group Chengming Zhou
2022-07-21  4:04 ` [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou
2022-07-25 16:41   ` Johannes Weiner
2022-07-21  4:04 ` [PATCH 6/9] sched/psi: save percpu memory when !psi_cgroups_enabled Chengming Zhou
2022-07-25 16:47   ` Johannes Weiner
2022-07-21  4:04 ` [PATCH 7/9] sched/psi: cache parent psi_group to speed up groups iterate Chengming Zhou
2022-07-21  4:04 ` [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup Chengming Zhou
2022-07-25 16:52   ` Johannes Weiner
2022-07-26 13:38     ` [External] " Chengming Zhou
2022-07-26 17:54     ` Tejun Heo
2022-08-03 12:17       ` Chengming Zhou
2022-08-03 17:58         ` Tejun Heo
2022-08-03 19:22           ` Johannes Weiner
2022-08-03 19:48             ` Tejun Heo
2022-08-04 13:51             ` Chengming Zhou
2022-08-04 16:56               ` Johannes Weiner
2022-08-04  2:02           ` Chengming Zhou
2022-07-21  4:04 ` [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou
2022-07-21 10:00   ` kernel test robot
2022-07-21 22:10   ` kernel test robot
2022-07-22  3:30   ` Abel Wu
2022-07-22  6:13     ` Chengming Zhou
2022-07-22  7:14       ` Abel Wu
2022-07-22  7:33         ` Chengming Zhou
2022-07-25 18:26   ` Johannes Weiner
2022-07-26 13:55     ` [External] " Chengming Zhou
2022-07-27 11:28     ` Chengming Zhou
2022-07-27 13:00       ` Johannes Weiner
2022-07-27 15:09         ` Chengming Zhou
2022-07-27 16:07   ` Peter Zijlstra
