linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 00/10] sched/psi: some optimization and extension
@ 2022-08-08 11:03 Chengming Zhou
  2022-08-08 11:03 ` [PATCH v2 01/10] sched/psi: fix periodic aggregation shut off Chengming Zhou
                   ` (10 more replies)
  0 siblings, 11 replies; 32+ messages in thread
From: Chengming Zhou @ 2022-08-08 11:03 UTC (permalink / raw)
  To: hannes, tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall
  Cc: cgroups, linux-doc, linux-kernel, songmuchun, Chengming Zhou

Hi all,

This patch series are some optimization and extension for PSI, based on
the tip/sched/core branch.

patch 1/10 fix periodic aggregation shut off problem introduced by earlier
commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups").

patch 2/10 optimize task switch inside shared cgroups when in_memstall status
of prev task and next task are different.

patch 3-4 optimize and simplify PSI status tracking by don't change task
psi_flags when migrate CPU/cgroup.

patch 7/10 remove NR_ONCPU task accounting to save 4 bytes in the first
cacheline to be used by the following patch 8/10, which introduce new
PSI resource PSI_IRQ to track IRQ/SOFTIRQ pressure stall information.

patch 9/10 introduce a per-cgroup interface "cgroup.psi" to disable
or re-enable PSI stats accounting in the cgroup level.

patch 10/10 cache parent psi_group in struct psi_group to speed up
the hot iteration path.

Thanks!

Changes in v2:
 - Add Acked-by tags from Johannes Weiner. Thanks for review!
 - Fix periodic aggregation wakeup for common ancestors in
   psi_task_switch().
 - Add patch 7/10 from Johannes Weiner, which remove NR_ONCPU
   task accounting to save 4 bytes in the first cacheline.
 - Remove "psi_irq=" kernel cmdline parameter in last version.
 - Add per-cgroup interface "cgroup.psi" to disable/re-enable
   PSI stats accounting in the cgroup level.

Chengming Zhou (9):
  sched/psi: fix periodic aggregation shut off
  sched/psi: optimize task switch inside shared cgroups again
  sched/psi: move private helpers to sched/stats.h
  sched/psi: don't change task psi_flags when migrate CPU/group
  sched/psi: don't create cgroup PSI files when psi_disabled
  sched/psi: save percpu memory when !psi_cgroups_enabled
  sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
  sched/psi: per-cgroup PSI stats disable/re-enable interface
  sched/psi: cache parent psi_group to speed up groups iterate

Johannes Weiner (1):
  sched/psi: remove NR_ONCPU task accounting

 Documentation/admin-guide/cgroup-v2.rst |  13 ++
 include/linux/psi.h                     |   6 +-
 include/linux/psi_types.h               |  25 +--
 include/linux/sched.h                   |   3 -
 kernel/cgroup/cgroup.c                  |  73 +++++++
 kernel/sched/core.c                     |   2 +
 kernel/sched/psi.c                      | 247 +++++++++++++++++-------
 kernel/sched/stats.h                    |  60 +++---
 8 files changed, 313 insertions(+), 116 deletions(-)

-- 
2.36.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 01/10] sched/psi: fix periodic aggregation shut off
  2022-08-08 11:03 [PATCH v2 00/10] sched/psi: some optimization and extension Chengming Zhou
@ 2022-08-08 11:03 ` Chengming Zhou
  2022-08-08 11:03 ` [PATCH v2 02/10] sched/psi: optimize task switch inside shared cgroups again Chengming Zhou
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Chengming Zhou @ 2022-08-08 11:03 UTC (permalink / raw)
  To: hannes, tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall
  Cc: cgroups, linux-doc, linux-kernel, songmuchun, Chengming Zhou

We don't want to wake periodic aggregation work back up if the
task change is the aggregation worker itself going to sleep, or
we'll ping-pong forever.

Previously, we would use psi_task_change() in psi_dequeue() when
task going to sleep, so this check was put in psi_task_change().

But commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
defer task sleep handling to psi_task_switch(), won't go through
psi_task_change() anymore.

So this patch move this check to psi_task_switch().

Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/psi.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index a337f3e35997..115a7e52fa23 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -800,7 +800,6 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
 	struct psi_group *group;
-	bool wake_clock = true;
 	void *iter = NULL;
 	u64 now;
 
@@ -810,19 +809,9 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 	psi_flags_change(task, clear, set);
 
 	now = cpu_clock(cpu);
-	/*
-	 * Periodic aggregation shuts off if there is a period of no
-	 * task changes, so we wake it back up if necessary. However,
-	 * don't do this if the task change is the aggregation worker
-	 * itself going to sleep, or we'll ping-pong forever.
-	 */
-	if (unlikely((clear & TSK_RUNNING) &&
-		     (task->flags & PF_WQ_WORKER) &&
-		     wq_worker_last_func(task) == psi_avgs_work))
-		wake_clock = false;
 
 	while ((group = iterate_groups(task, &iter)))
-		psi_group_change(group, cpu, clear, set, now, wake_clock);
+		psi_group_change(group, cpu, clear, set, now, true);
 }
 
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -858,6 +847,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 
 	if (prev->pid) {
 		int clear = TSK_ONCPU, set = 0;
+		bool wake_clock = true;
 
 		/*
 		 * When we're going to sleep, psi_dequeue() lets us
@@ -871,13 +861,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 				clear |= TSK_MEMSTALL_RUNNING;
 			if (prev->in_iowait)
 				set |= TSK_IOWAIT;
+
+			/*
+			 * Periodic aggregation shuts off if there is a period of no
+			 * task changes, so we wake it back up if necessary. However,
+			 * don't do this if the task change is the aggregation worker
+			 * itself going to sleep, or we'll ping-pong forever.
+			 */
+			if (unlikely((prev->flags & PF_WQ_WORKER) &&
+				     wq_worker_last_func(prev) == psi_avgs_work))
+				wake_clock = false;
 		}
 
 		psi_flags_change(prev, clear, set);
 
 		iter = NULL;
 		while ((group = iterate_groups(prev, &iter)) && group != common)
-			psi_group_change(group, cpu, clear, set, now, true);
+			psi_group_change(group, cpu, clear, set, now, wake_clock);
 
 		/*
 		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
@@ -886,7 +886,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		if (sleep) {
 			clear &= ~TSK_ONCPU;
 			for (; group; group = iterate_groups(prev, &iter))
-				psi_group_change(group, cpu, clear, set, now, true);
+				psi_group_change(group, cpu, clear, set, now, wake_clock);
 		}
 	}
 }
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 02/10] sched/psi: optimize task switch inside shared cgroups again
  2022-08-08 11:03 [PATCH v2 00/10] sched/psi: some optimization and extension Chengming Zhou
  2022-08-08 11:03 ` [PATCH v2 01/10] sched/psi: fix periodic aggregation shut off Chengming Zhou
@ 2022-08-08 11:03 ` Chengming Zhou
  2022-08-08 11:03 ` [PATCH v2 03/10] sched/psi: move private helpers to sched/stats.h Chengming Zhou
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Chengming Zhou @ 2022-08-08 11:03 UTC (permalink / raw)
  To: hannes, tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall
  Cc: cgroups, linux-doc, linux-kernel, songmuchun, Chengming Zhou

commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
defer prev task sleep handling to psi_task_switch(), so we don't need
to clear and set TSK_ONCPU state for common cgroups.

    A
    |
    B
   / \
  C   D
 /     \
prev   next

After that commit psi_task_switch() do:
1. psi_group_change(next, .set=TSK_ONCPU) for D
2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C
3. psi_group_change(prev, .clear=TSK_RUNNING) for B, A

But there is a limitation "prev->psi_flags == next->psi_flags" that
if not satisfied, will make this cgroups optimization unusable for both
sleep switch or running switch cases. For example:

prev->in_memstall != next->in_memstall when sleep switch:
1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A
2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C, B, A

prev->in_memstall != next->in_memstall when running switch:
1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A
2. psi_group_change(prev, .clear=TSK_ONCPU) for C, B, A

The reason why this limitation exist is that we consider a group is
PSI_MEM_FULL if the CPU is actively reclaiming and nothing productive
could run even if it were runnable. So when CPU curr changed from prev
to next and their in_memstall status is different, we have to change
PSI_MEM_FULL status for their common cgroups.

This patch remove this limitation by making psi_group_change() change
PSI_MEM_FULL status depend on CPU curr->in_memstall status.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/psi.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 115a7e52fa23..9e8c5d9e585c 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -823,8 +823,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	u64 now = cpu_clock(cpu);
 
 	if (next->pid) {
-		bool identical_state;
-
 		psi_flags_change(next, 0, TSK_ONCPU);
 		/*
 		 * When switching between tasks that have an identical
@@ -832,11 +830,9 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * we reach the first common ancestor. Iterate @next's
 		 * ancestors only until we encounter @prev's ONCPU.
 		 */
-		identical_state = prev->psi_flags == next->psi_flags;
 		iter = NULL;
 		while ((group = iterate_groups(next, &iter))) {
-			if (identical_state &&
-			    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
 				common = group;
 				break;
 			}
@@ -883,7 +879,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
 		 * with dequeuing too, finish that for the rest of the hierarchy.
 		 */
-		if (sleep) {
+		if (sleep || unlikely(prev->in_memstall != next->in_memstall)) {
 			clear &= ~TSK_ONCPU;
 			for (; group; group = iterate_groups(prev, &iter))
 				psi_group_change(group, cpu, clear, set, now, wake_clock);
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 03/10] sched/psi: move private helpers to sched/stats.h
  2022-08-08 11:03 [PATCH v2 00/10] sched/psi: some optimization and extension Chengming Zhou
  2022-08-08 11:03 ` [PATCH v2 01/10] sched/psi: fix periodic aggregation shut off Chengming Zhou
  2022-08-08 11:03 ` [PATCH v2 02/10] sched/psi: optimize task switch inside shared cgroups again Chengming Zhou
@ 2022-08-08 11:03 ` Chengming Zhou
  2022-08-08 11:03 ` [PATCH v2 04/10] sched/psi: don't change task psi_flags when migrate CPU/group Chengming Zhou
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Chengming Zhou @ 2022-08-08 11:03 UTC (permalink / raw)
  To: hannes, tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall
  Cc: cgroups, linux-doc, linux-kernel, songmuchun, Chengming Zhou

This patch move psi_task_change/psi_task_switch declarations out of
PSI public header, since they are only needed for implementing the
PSI stats tracking in sched/stats.h

psi_task_switch is obvious, psi_task_change can't be public helper
since it doesn't check psi_disabled static key. And there is no
any user now, so put it in sched/stats.h too.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/psi.h  | 4 ----
 kernel/sched/stats.h | 4 ++++
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index 89784763d19e..aa168a038242 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -18,10 +18,6 @@ extern struct psi_group psi_system;
 
 void psi_init(void);
 
-void psi_task_change(struct task_struct *task, int clear, int set);
-void psi_task_switch(struct task_struct *prev, struct task_struct *next,
-		     bool sleep);
-
 void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);
 
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index baa839c1ba96..c39b467ece43 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -107,6 +107,10 @@ __schedstats_from_se(struct sched_entity *se)
 }
 
 #ifdef CONFIG_PSI
+void psi_task_change(struct task_struct *task, int clear, int set);
+void psi_task_switch(struct task_struct *prev, struct task_struct *next,
+		     bool sleep);
+
 /*
  * PSI tracks state that persists across sleeps, such as iowaits and
  * memory stalls. As a result, it has to distinguish between sleeps,
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 04/10] sched/psi: don't change task psi_flags when migrate CPU/group
  2022-08-08 11:03 [PATCH v2 00/10] sched/psi: some optimization and extension Chengming Zhou
                   ` (2 preceding siblings ...)
  2022-08-08 11:03 ` [PATCH v2 03/10] sched/psi: move private helpers to sched/stats.h Chengming Zhou
@ 2022-08-08 11:03 ` Chengming Zhou
  2022-08-08 11:03 ` [PATCH v2 05/10] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Chengming Zhou @ 2022-08-08 11:03 UTC (permalink / raw)
  To: hannes, tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall
  Cc: cgroups, linux-doc, linux-kernel, songmuchun, Chengming Zhou

The current code use psi_task_change() at every scheduling point,
in which change task psi_flags then change all its psi_groups.

So we have to heavily rely on the task scheduling states to calculate
what to set and what to clear at every scheduling point, which make
the PSI stats tracking code much complex and error prone.

In fact, the task psi_flags only change at wakeup and sleep (except
ONCPU state at switch), it doesn't change at all when migrate CPU/group.

If we keep its psi_flags unchanged when migrate CPU/group, we can
just use task->psi_flags to clear(migrate out) or set(migrate in),
which will make PSI stats tracking much simplier and more efficient.

Note: ENQUEUE_WAKEUP only means wakeup task from sleep state, don't
include wakeup new task, so add psi_enqueue() in wake_up_new_task().

Performance test on Intel Xeon Platinum with 3 levels of cgroup:

1. before the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.034 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 8.210 [sec]

       8.210600 usecs/op
         121793 ops/sec

2. after the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.032 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 8.077 [sec]

       8.077648 usecs/op
         123798 ops/sec

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 include/linux/sched.h |  3 ---
 kernel/sched/core.c   |  1 +
 kernel/sched/psi.c    | 24 ++++++++++---------
 kernel/sched/stats.h  | 54 +++++++++++++++++++++----------------------
 4 files changed, 40 insertions(+), 42 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 88b8817b827d..20a94786cad8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -879,9 +879,6 @@ struct task_struct {
 	unsigned			sched_reset_on_fork:1;
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
-#ifdef CONFIG_PSI
-	unsigned			sched_psi_wake_requeue:1;
-#endif
 
 	/* Force alignment to the next boundary: */
 	unsigned			:0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 64c08993221b..3aa401689f7e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4642,6 +4642,7 @@ void wake_up_new_task(struct task_struct *p)
 	post_init_entity_util_avg(p);
 
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
+	psi_enqueue(p, true);
 	trace_sched_wakeup_new(p);
 	check_preempt_curr(rq, p, WF_FORK);
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 9e8c5d9e585c..974471f212a3 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -796,22 +796,24 @@ static void psi_flags_change(struct task_struct *task, int clear, int set)
 	task->psi_flags |= set;
 }
 
-void psi_task_change(struct task_struct *task, int clear, int set)
+void psi_change_groups(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
 	struct psi_group *group;
 	void *iter = NULL;
-	u64 now;
+	u64 now = cpu_clock(cpu);
+
+	while ((group = iterate_groups(task, &iter)))
+		psi_group_change(group, cpu, clear, set, now, true);
+}
 
+void psi_task_change(struct task_struct *task, int clear, int set)
+{
 	if (!task->pid)
 		return;
 
 	psi_flags_change(task, clear, set);
-
-	now = cpu_clock(cpu);
-
-	while ((group = iterate_groups(task, &iter)))
-		psi_group_change(group, cpu, clear, set, now, true);
+	psi_change_groups(task, clear, set);
 }
 
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -1015,9 +1017,9 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 	 *   pick_next_task()
 	 *     rq_unlock()
 	 *                                rq_lock()
-	 *                                psi_task_change() // old cgroup
+	 *                                psi_change_groups() // old cgroup
 	 *                                task->cgroups = to
-	 *                                psi_task_change() // new cgroup
+	 *                                psi_change_groups() // new cgroup
 	 *                                rq_unlock()
 	 *     rq_lock()
 	 *   psi_sched_switch() // does deferred updates in new cgroup
@@ -1027,13 +1029,13 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 	task_flags = task->psi_flags;
 
 	if (task_flags)
-		psi_task_change(task, task_flags, 0);
+		psi_change_groups(task, task_flags, 0);
 
 	/* See comment above */
 	rcu_assign_pointer(task->cgroups, to);
 
 	if (task_flags)
-		psi_task_change(task, 0, task_flags);
+		psi_change_groups(task, 0, task_flags);
 
 	task_rq_unlock(rq, task, &rf);
 }
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index c39b467ece43..e930b8fa6253 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -107,6 +107,7 @@ __schedstats_from_se(struct sched_entity *se)
 }
 
 #ifdef CONFIG_PSI
+void psi_change_groups(struct task_struct *task, int clear, int set);
 void psi_task_change(struct task_struct *task, int clear, int set);
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep);
@@ -124,42 +125,46 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)
 	if (static_branch_likely(&psi_disabled))
 		return;
 
-	if (p->in_memstall)
-		set |= TSK_MEMSTALL_RUNNING;
+	if (!wakeup) {
+		if (p->psi_flags)
+			psi_change_groups(p, 0, p->psi_flags);
+		return;
+	}
 
-	if (!wakeup || p->sched_psi_wake_requeue) {
-		if (p->in_memstall)
+	/*
+	 * wakeup (including wakeup migrate) need to change task psi_flags,
+	 * specifically need to set TSK_RUNNING or TSK_MEMSTALL_RUNNING.
+	 * Since we clear task->psi_flags for wakeup migrated task, we need
+	 * to check task->psi_flags to see what should be set and clear.
+	 */
+	if (unlikely(p->in_memstall)) {
+		set |= TSK_MEMSTALL_RUNNING;
+		if (!(p->psi_flags & TSK_MEMSTALL))
 			set |= TSK_MEMSTALL;
-		if (p->sched_psi_wake_requeue)
-			p->sched_psi_wake_requeue = 0;
-	} else {
-		if (p->in_iowait)
-			clear |= TSK_IOWAIT;
 	}
+	if (p->psi_flags & TSK_IOWAIT)
+		clear |= TSK_IOWAIT;
 
 	psi_task_change(p, clear, set);
 }
 
 static inline void psi_dequeue(struct task_struct *p, bool sleep)
 {
-	int clear = TSK_RUNNING;
-
 	if (static_branch_likely(&psi_disabled))
 		return;
 
+	if (!sleep) {
+		if (p->psi_flags)
+			psi_change_groups(p, p->psi_flags, 0);
+		return;
+	}
+
 	/*
 	 * A voluntary sleep is a dequeue followed by a task switch. To
 	 * avoid walking all ancestors twice, psi_task_switch() handles
 	 * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
 	 * Do nothing here.
 	 */
-	if (sleep)
-		return;
-
-	if (p->in_memstall)
-		clear |= (TSK_MEMSTALL | TSK_MEMSTALL_RUNNING);
-
-	psi_task_change(p, clear, 0);
 }
 
 static inline void psi_ttwu_dequeue(struct task_struct *p)
@@ -169,21 +174,14 @@ static inline void psi_ttwu_dequeue(struct task_struct *p)
 	/*
 	 * Is the task being migrated during a wakeup? Make sure to
 	 * deregister its sleep-persistent psi states from the old
-	 * queue, and let psi_enqueue() know it has to requeue.
+	 * queue.
 	 */
-	if (unlikely(p->in_iowait || p->in_memstall)) {
+	if (unlikely(p->psi_flags)) {
 		struct rq_flags rf;
 		struct rq *rq;
-		int clear = 0;
-
-		if (p->in_iowait)
-			clear |= TSK_IOWAIT;
-		if (p->in_memstall)
-			clear |= TSK_MEMSTALL;
 
 		rq = __task_rq_lock(p, &rf);
-		psi_task_change(p, clear, 0);
-		p->sched_psi_wake_requeue = 1;
+		psi_task_change(p, p->psi_flags, 0);
 		__task_rq_unlock(rq, &rf);
 	}
 }
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 05/10] sched/psi: don't create cgroup PSI files when psi_disabled
  2022-08-08 11:03 [PATCH v2 00/10] sched/psi: some optimization and extension Chengming Zhou
                   ` (3 preceding siblings ...)
  2022-08-08 11:03 ` [PATCH v2 04/10] sched/psi: don't change task psi_flags when migrate CPU/group Chengming Zhou
@ 2022-08-08 11:03 ` Chengming Zhou
  2022-08-08 11:03 ` [PATCH v2 06/10] sched/psi: save percpu memory when !psi_cgroups_enabled Chengming Zhou
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Chengming Zhou @ 2022-08-08 11:03 UTC (permalink / raw)
  To: hannes, tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall
  Cc: cgroups, linux-doc, linux-kernel, songmuchun, Chengming Zhou

commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI can be configured to skip per-cgroup stall accounting. And
doesn't expose PSI files in cgroup hierarchy.

This patch do the same thing when psi_disabled.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/cgroup/cgroup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 13c8e91d7862..5f88117fc81e 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3709,6 +3709,9 @@ static void cgroup_pressure_release(struct kernfs_open_file *of)
 
 bool cgroup_psi_enabled(void)
 {
+	if (static_branch_likely(&psi_disabled))
+		return false;
+
 	return (cgroup_feature_disable_mask & (1 << OPT_FEATURE_PRESSURE)) == 0;
 }
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 06/10] sched/psi: save percpu memory when !psi_cgroups_enabled
  2022-08-08 11:03 [PATCH v2 00/10] sched/psi: some optimization and extension Chengming Zhou
                   ` (4 preceding siblings ...)
  2022-08-08 11:03 ` [PATCH v2 05/10] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou
@ 2022-08-08 11:03 ` Chengming Zhou
  2022-08-08 11:03 ` [PATCH v2 07/10] sched/psi: remove NR_ONCPU task accounting Chengming Zhou
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Chengming Zhou @ 2022-08-08 11:03 UTC (permalink / raw)
  To: hannes, tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall
  Cc: cgroups, linux-doc, linux-kernel, songmuchun, Chengming Zhou

We won't use cgroup psi_group when !psi_cgroups_enabled, so don't
bother to alloc percpu memory and init for it.

Also don't need to migrate task PSI stats between cgroups in
cgroup_move_task().

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/psi.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 974471f212a3..595a6c8230b7 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -205,6 +205,7 @@ void __init psi_init(void)
 {
 	if (!psi_enable) {
 		static_branch_enable(&psi_disabled);
+		static_branch_disable(&psi_cgroups_enabled);
 		return;
 	}
 
@@ -952,7 +953,7 @@ void psi_memstall_leave(unsigned long *flags)
 #ifdef CONFIG_CGROUPS
 int psi_cgroup_alloc(struct cgroup *cgroup)
 {
-	if (static_branch_likely(&psi_disabled))
+	if (!static_branch_likely(&psi_cgroups_enabled))
 		return 0;
 
 	cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu);
@@ -964,7 +965,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 
 void psi_cgroup_free(struct cgroup *cgroup)
 {
-	if (static_branch_likely(&psi_disabled))
+	if (!static_branch_likely(&psi_cgroups_enabled))
 		return;
 
 	cancel_delayed_work_sync(&cgroup->psi.avgs_work);
@@ -991,7 +992,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 	struct rq_flags rf;
 	struct rq *rq;
 
-	if (static_branch_likely(&psi_disabled)) {
+	if (!static_branch_likely(&psi_cgroups_enabled)) {
 		/*
 		 * Lame to do this here, but the scheduler cannot be locked
 		 * from the outside, so we move cgroups from inside sched/.
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 07/10] sched/psi: remove NR_ONCPU task accounting
  2022-08-08 11:03 [PATCH v2 00/10] sched/psi: some optimization and extension Chengming Zhou
                   ` (5 preceding siblings ...)
  2022-08-08 11:03 ` [PATCH v2 06/10] sched/psi: save percpu memory when !psi_cgroups_enabled Chengming Zhou
@ 2022-08-08 11:03 ` Chengming Zhou
  2022-08-16 10:40   ` Chengming Zhou
  2022-08-08 11:03 ` [PATCH v2 08/10] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Chengming Zhou @ 2022-08-08 11:03 UTC (permalink / raw)
  To: hannes, tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall
  Cc: cgroups, linux-doc, linux-kernel, songmuchun, Chengming Zhou

From: Johannes Weiner <hannes@cmpxchg.org>

We put all fields updated by the scheduler in the first cacheline of
struct psi_group_cpu for performance.

Since we want add another PSI_IRQ_FULL to track IRQ/SOFTIRQ pressure,
we need to reclaim space first. This patch remove NR_ONCPU task accounting
in struct psi_group_cpu, use one bit in state_mask to track instead.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 include/linux/psi_types.h | 16 +++++++---------
 kernel/sched/psi.c        | 36 ++++++++++++++++++++++++++++--------
 2 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index c7fe7c089718..54cb74946db4 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -15,13 +15,6 @@ enum psi_task_count {
 	NR_IOWAIT,
 	NR_MEMSTALL,
 	NR_RUNNING,
-	/*
-	 * This can't have values other than 0 or 1 and could be
-	 * implemented as a bit flag. But for now we still have room
-	 * in the first cacheline of psi_group_cpu, and this way we
-	 * don't have to special case any state tracking for it.
-	 */
-	NR_ONCPU,
 	/*
 	 * For IO and CPU stalls the presence of running/oncpu tasks
 	 * in the domain means a partial rather than a full stall.
@@ -32,16 +25,18 @@ enum psi_task_count {
 	 * threads and memstall ones.
 	 */
 	NR_MEMSTALL_RUNNING,
-	NR_PSI_TASK_COUNTS = 5,
+	NR_PSI_TASK_COUNTS = 4,
 };
 
 /* Task state bitmasks */
 #define TSK_IOWAIT	(1 << NR_IOWAIT)
 #define TSK_MEMSTALL	(1 << NR_MEMSTALL)
 #define TSK_RUNNING	(1 << NR_RUNNING)
-#define TSK_ONCPU	(1 << NR_ONCPU)
 #define TSK_MEMSTALL_RUNNING	(1 << NR_MEMSTALL_RUNNING)
 
+/* Only one task can be scheduled, no corresponding task count */
+#define TSK_ONCPU	(1 << NR_PSI_TASK_COUNTS)
+
 /* Resources that workloads could be stalled on */
 enum psi_res {
 	PSI_IO,
@@ -68,6 +63,9 @@ enum psi_states {
 	NR_PSI_STATES = 7,
 };
 
+/* Use one bit in the state mask to track TSK_ONCPU */
+#define PSI_ONCPU	(1 << NR_PSI_STATES)
+
 enum psi_aggregators {
 	PSI_AVGS = 0,
 	PSI_POLL,
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 595a6c8230b7..1c675715ed33 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -216,7 +216,7 @@ void __init psi_init(void)
 	group_init(&psi_system);
 }
 
-static bool test_state(unsigned int *tasks, enum psi_states state)
+static bool test_state(unsigned int *tasks, enum psi_states state, bool oncpu)
 {
 	switch (state) {
 	case PSI_IO_SOME:
@@ -229,9 +229,9 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
 		return unlikely(tasks[NR_MEMSTALL] &&
 			tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);
 	case PSI_CPU_SOME:
-		return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
+		return unlikely(tasks[NR_RUNNING] > oncpu);
 	case PSI_CPU_FULL:
-		return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
+		return unlikely(tasks[NR_RUNNING] && !oncpu);
 	case PSI_NONIDLE:
 		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
 			tasks[NR_RUNNING];
@@ -693,9 +693,9 @@ static void psi_group_change(struct psi_group *group, int cpu,
 			     bool wake_clock)
 {
 	struct psi_group_cpu *groupc;
-	u32 state_mask = 0;
 	unsigned int t, m;
 	enum psi_states s;
+	u32 state_mask;
 
 	groupc = per_cpu_ptr(group->pcpu, cpu);
 
@@ -711,6 +711,26 @@ static void psi_group_change(struct psi_group *group, int cpu,
 
 	record_times(groupc, now);
 
+	/*
+	 * Start with TSK_ONCPU, which doesn't have a corresponding
+	 * task count - it's just a boolean flag directly encoded in
+	 * the state mask. Clear, set, or carry the current state if
+	 * no changes are requested.
+	 */
+	if (clear & TSK_ONCPU) {
+		state_mask = 0;
+		clear &= ~TSK_ONCPU;
+	} else if (set & TSK_ONCPU) {
+		state_mask = PSI_ONCPU;
+		set &= ~TSK_ONCPU;
+	} else {
+		state_mask = groupc->state_mask & PSI_ONCPU;
+	}
+
+	/*
+	 * The rest of the state mask is calculated based on the task
+	 * counts. Update those first, then construct the mask.
+	 */
 	for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
 		if (!(m & (1 << t)))
 			continue;
@@ -730,9 +750,8 @@ static void psi_group_change(struct psi_group *group, int cpu,
 		if (set & (1 << t))
 			groupc->tasks[t]++;
 
-	/* Calculate state mask representing active states */
 	for (s = 0; s < NR_PSI_STATES; s++) {
-		if (test_state(groupc->tasks, s))
+		if (test_state(groupc->tasks, s, state_mask & PSI_ONCPU))
 			state_mask |= (1 << s);
 	}
 
@@ -744,7 +763,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	 * task in a cgroup is in_memstall, the corresponding groupc
 	 * on that cpu is in PSI_MEM_FULL state.
 	 */
-	if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
+	if (unlikely((state_mask & PSI_ONCPU) && cpu_curr(cpu)->in_memstall))
 		state_mask |= (1 << PSI_MEM_FULL);
 
 	groupc->state_mask = state_mask;
@@ -835,7 +854,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 */
 		iter = NULL;
 		while ((group = iterate_groups(next, &iter))) {
-			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+			if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
+			    PSI_ONCPU) {
 				common = group;
 				break;
 			}
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 08/10] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
  2022-08-08 11:03 [PATCH v2 00/10] sched/psi: some optimization and extension Chengming Zhou
                   ` (6 preceding siblings ...)
  2022-08-08 11:03 ` [PATCH v2 07/10] sched/psi: remove NR_ONCPU task accounting Chengming Zhou
@ 2022-08-08 11:03 ` Chengming Zhou
  2022-08-08 11:03 ` [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface Chengming Zhou
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Chengming Zhou @ 2022-08-08 11:03 UTC (permalink / raw)
  To: hannes, tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall
  Cc: cgroups, linux-doc, linux-kernel, songmuchun, Chengming Zhou

Now PSI already tracked workload pressure stall information for
CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have
obvious impact on some workload productivity, such as web service
workload.

When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time
from update_rq_clock_task(), in which we can record that delta
to CPU curr task's cgroups as PSI_IRQ_FULL status.

Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in
the current task on the CPU, make nothing productive could run
even if it were runnable, so we only use PSI_IRQ_FULL.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  6 +++
 include/linux/psi_types.h               |  6 ++-
 kernel/cgroup/cgroup.c                  | 27 ++++++++++
 kernel/sched/core.c                     |  1 +
 kernel/sched/psi.c                      | 65 ++++++++++++++++++++++++-
 kernel/sched/stats.h                    |  2 +
 6 files changed, 103 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 176298f2f4de..dd84e34bc051 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -968,6 +968,12 @@ All cgroup core files are prefixed with "cgroup."
 	killing cgroups is a process directed operation, i.e. it affects
 	the whole thread-group.
 
+  irq.pressure
+	A read-write nested-keyed file.
+
+	Shows pressure stall information for IRQ/SOFTIRQ. See
+	:ref:`Documentation/accounting/psi.rst <psi>` for details.
+
 Controllers
 ===========
 
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 54cb74946db4..4677655f6ca1 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -42,7 +42,8 @@ enum psi_res {
 	PSI_IO,
 	PSI_MEM,
 	PSI_CPU,
-	NR_PSI_RESOURCES = 3,
+	PSI_IRQ,
+	NR_PSI_RESOURCES = 4,
 };
 
 /*
@@ -58,9 +59,10 @@ enum psi_states {
 	PSI_MEM_FULL,
 	PSI_CPU_SOME,
 	PSI_CPU_FULL,
+	PSI_IRQ_FULL,
 	/* Only per-CPU, to weigh the CPU in the global average: */
 	PSI_NONIDLE,
-	NR_PSI_STATES = 7,
+	NR_PSI_STATES = 8,
 };
 
 /* Use one bit in the state mask to track TSK_ONCPU */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 5f88117fc81e..91de8ff7fa50 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3692,6 +3692,23 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
 	return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
 }
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int cgroup_irq_pressure_show(struct seq_file *seq, void *v)
+{
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+	struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi;
+
+	return psi_show(seq, psi, PSI_IRQ);
+}
+
+static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of,
+					  char *buf, size_t nbytes,
+					  loff_t off)
+{
+	return cgroup_pressure_write(of, buf, nbytes, PSI_IRQ);
+}
+#endif
+
 static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
 					  poll_table *pt)
 {
@@ -5088,6 +5105,16 @@ static struct cftype cgroup_base_files[] = {
 		.poll = cgroup_pressure_poll,
 		.release = cgroup_pressure_release,
 	},
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+	{
+		.name = "irq.pressure",
+		.flags = CFTYPE_PRESSURE,
+		.seq_show = cgroup_irq_pressure_show,
+		.write = cgroup_irq_pressure_write,
+		.poll = cgroup_pressure_poll,
+		.release = cgroup_pressure_release,
+	},
+#endif
 #endif /* CONFIG_PSI */
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3aa401689f7e..4cfb6ab32142 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -708,6 +708,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 
 	rq->prev_irq_time += irq_delta;
 	delta -= irq_delta;
+	psi_account_irqtime(rq->curr, irq_delta);
 #endif
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
 	if (static_key_false((&paravirt_steal_rq_enabled))) {
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 1c675715ed33..58f8092c938f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -910,6 +910,34 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	}
 }
 
+void psi_account_irqtime(struct task_struct *task, u32 delta)
+{
+	int cpu = task_cpu(task);
+	void *iter = NULL;
+	struct psi_group *group;
+	struct psi_group_cpu *groupc;
+	u64 now;
+
+	if (!task->pid)
+		return;
+
+	now = cpu_clock(cpu);
+
+	while ((group = iterate_groups(task, &iter))) {
+		groupc = per_cpu_ptr(group->pcpu, cpu);
+
+		write_seqcount_begin(&groupc->seq);
+
+		record_times(groupc, now);
+		groupc->times[PSI_IRQ_FULL] += delta;
+
+		write_seqcount_end(&groupc->seq);
+
+		if (group->poll_states & (1 << PSI_IRQ_FULL))
+			psi_schedule_poll_work(group, 1);
+	}
+}
+
 /**
  * psi_memstall_enter - mark the beginning of a memory stall section
  * @flags: flags to handle nested sections
@@ -1078,7 +1106,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 		group->avg_next_update = update_averages(group, now);
 	mutex_unlock(&group->avgs_lock);
 
-	for (full = 0; full < 2; full++) {
+	for (full = 0; full < 2 - (res == PSI_IRQ); full++) {
 		unsigned long avg[3] = { 0, };
 		u64 total = 0;
 		int w;
@@ -1092,7 +1120,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 		}
 
 		seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n",
-			   full ? "full" : "some",
+			   full || (res == PSI_IRQ) ? "full" : "some",
 			   LOAD_INT(avg[0]), LOAD_FRAC(avg[0]),
 			   LOAD_INT(avg[1]), LOAD_FRAC(avg[1]),
 			   LOAD_INT(avg[2]), LOAD_FRAC(avg[2]),
@@ -1120,6 +1148,9 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
 	else
 		return ERR_PTR(-EINVAL);
 
+	if ((res == PSI_IRQ) && (--state != PSI_IRQ_FULL))
+		return ERR_PTR(-EINVAL);
+
 	if (state >= PSI_NONIDLE)
 		return ERR_PTR(-EINVAL);
 
@@ -1404,6 +1435,33 @@ static const struct proc_ops psi_cpu_proc_ops = {
 	.proc_release	= psi_fop_release,
 };
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int psi_irq_show(struct seq_file *m, void *v)
+{
+	return psi_show(m, &psi_system, PSI_IRQ);
+}
+
+static int psi_irq_open(struct inode *inode, struct file *file)
+{
+	return psi_open(file, psi_irq_show);
+}
+
+static ssize_t psi_irq_write(struct file *file, const char __user *user_buf,
+			     size_t nbytes, loff_t *ppos)
+{
+	return psi_write(file, user_buf, nbytes, PSI_IRQ);
+}
+
+static const struct proc_ops psi_irq_proc_ops = {
+	.proc_open	= psi_irq_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= psi_irq_write,
+	.proc_poll	= psi_fop_poll,
+	.proc_release	= psi_fop_release,
+};
+#endif
+
 static int __init psi_proc_init(void)
 {
 	if (psi_enable) {
@@ -1411,6 +1469,9 @@ static int __init psi_proc_init(void)
 		proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
 		proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops);
 		proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+		proc_create("pressure/irq", 0666, NULL, &psi_irq_proc_ops);
+#endif
 	}
 	return 0;
 }
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index e930b8fa6253..8b6cfc7a56f5 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -111,6 +111,7 @@ void psi_change_groups(struct task_struct *task, int clear, int set);
 void psi_task_change(struct task_struct *task, int clear, int set);
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep);
+void psi_account_irqtime(struct task_struct *task, u32 delta);
 
 /*
  * PSI tracks state that persists across sleeps, such as iowaits and
@@ -203,6 +204,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
 static inline void psi_sched_switch(struct task_struct *prev,
 				    struct task_struct *next,
 				    bool sleep) {}
+static inline void psi_account_irqtime(struct task_struct *task, u32 delta) {}
 #endif /* CONFIG_PSI */
 
 #ifdef CONFIG_SCHED_INFO
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
  2022-08-08 11:03 [PATCH v2 00/10] sched/psi: some optimization and extension Chengming Zhou
                   ` (7 preceding siblings ...)
  2022-08-08 11:03 ` [PATCH v2 08/10] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou
@ 2022-08-08 11:03 ` Chengming Zhou
  2022-08-09 17:48   ` Tejun Heo
                     ` (2 more replies)
  2022-08-08 11:03 ` [PATCH v2 10/10] sched/psi: cache parent psi_group to speed up groups iterate Chengming Zhou
  2022-08-15 13:25 ` [PATCH v2 00/10] sched/psi: some optimization and extension Michal Koutný
  10 siblings, 3 replies; 32+ messages in thread
From: Chengming Zhou @ 2022-08-08 11:03 UTC (permalink / raw)
  To: hannes, tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall
  Cc: cgroups, linux-doc, linux-kernel, songmuchun, Chengming Zhou

PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.

commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI to skip per-cgroup stall accounting, only account system-wide
to avoid this each level overhead.

But for our use case, we also want leaf cgroup PSI stats accounted for
userspace adjustment on that cgroup, apart from only system-wide adjustment.

So this patch introduce a per-cgroup PSI stats disable/re-enable
interface "cgroup.psi", which is a read-write single value file that
allowed values are "0" and "1", the defaults is "1" so per-cgroup
PSI stats is enabled by default.

Implementation details:

It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level,
if we can live with losing history from while it was disabled.
I.e. the avgs will restart from 0, total= will have gaps.

But it's hard or complex to stop/restart groupc->tasks[] updates,
which is not implemented in this patch. So we always update
groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when
the cgroup PSI stats is disabled.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  7 ++++
 include/linux/psi.h                     |  2 ++
 include/linux/psi_types.h               |  2 ++
 kernel/cgroup/cgroup.c                  | 43 +++++++++++++++++++++++++
 kernel/sched/psi.c                      | 40 +++++++++++++++++++----
 5 files changed, 87 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index dd84e34bc051..ade40506ab80 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -968,6 +968,13 @@ All cgroup core files are prefixed with "cgroup."
 	killing cgroups is a process directed operation, i.e. it affects
 	the whole thread-group.
 
+  cgroup.psi
+	A read-write single value file that allowed values are "0" and "1".
+	The default is "1".
+
+	Writing "0" to the file will disable the cgroup PSI stats accounting.
+	Writing "1" to the file will re-enable the cgroup PSI stats accounting.
+
   irq.pressure
 	A read-write nested-keyed file.
 
diff --git a/include/linux/psi.h b/include/linux/psi.h
index aa168a038242..1138ccffd76b 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -33,6 +33,7 @@ __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
 int psi_cgroup_alloc(struct cgroup *cgrp);
 void psi_cgroup_free(struct cgroup *cgrp);
 void cgroup_move_task(struct task_struct *p, struct css_set *to);
+void psi_cgroup_enable(struct psi_group *group, bool enable);
 #endif
 
 #else /* CONFIG_PSI */
@@ -54,6 +55,7 @@ static inline void cgroup_move_task(struct task_struct *p, struct css_set *to)
 {
 	rcu_assign_pointer(p->cgroups, to);
 }
+static inline void psi_cgroup_enable(struct psi_group *group, bool enable) {}
 #endif
 
 #endif /* CONFIG_PSI */
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 4677655f6ca1..fced39e255aa 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -147,6 +147,8 @@ struct psi_trigger {
 };
 
 struct psi_group {
+	bool enabled;
+
 	/* Protects data used by the aggregator */
 	struct mutex avgs_lock;
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 91de8ff7fa50..6ba56983b5a5 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3709,6 +3709,43 @@ static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of,
 }
 #endif
 
+static int cgroup_psi_show(struct seq_file *seq, void *v)
+{
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+	struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi;
+
+	seq_printf(seq, "%d\n", psi->enabled);
+
+	return 0;
+}
+
+static ssize_t cgroup_psi_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	ssize_t ret;
+	int enable;
+	struct cgroup *cgrp;
+	struct psi_group *psi;
+
+	ret = kstrtoint(strstrip(buf), 0, &enable);
+	if (ret)
+		return ret;
+
+	if (enable < 0 || enable > 1)
+		return -ERANGE;
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENOENT;
+
+	psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi;
+	psi_cgroup_enable(psi, enable);
+
+	cgroup_kn_unlock(of->kn);
+
+	return nbytes;
+}
+
 static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
 					  poll_table *pt)
 {
@@ -5115,6 +5152,12 @@ static struct cftype cgroup_base_files[] = {
 		.release = cgroup_pressure_release,
 	},
 #endif
+	{
+		.name = "cgroup.psi",
+		.flags = CFTYPE_PRESSURE,
+		.seq_show = cgroup_psi_show,
+		.write = cgroup_psi_write,
+	},
 #endif /* CONFIG_PSI */
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 58f8092c938f..9df1686ee02d 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -181,6 +181,7 @@ static void group_init(struct psi_group *group)
 {
 	int cpu;
 
+	group->enabled = true;
 	for_each_possible_cpu(cpu)
 		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
 	group->avg_last_update = sched_clock();
@@ -700,17 +701,16 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	groupc = per_cpu_ptr(group->pcpu, cpu);
 
 	/*
-	 * First we assess the aggregate resource states this CPU's
-	 * tasks have been in since the last change, and account any
-	 * SOME and FULL time these may have resulted in.
-	 *
-	 * Then we update the task counts according to the state
+	 * First we update the task counts according to the state
 	 * change requested through the @clear and @set bits.
+	 *
+	 * Then if the cgroup PSI stats accounting enabled, we
+	 * assess the aggregate resource states this CPU's tasks
+	 * have been in since the last change, and account any
+	 * SOME and FULL time these may have resulted in.
 	 */
 	write_seqcount_begin(&groupc->seq);
 
-	record_times(groupc, now);
-
 	/*
 	 * Start with TSK_ONCPU, which doesn't have a corresponding
 	 * task count - it's just a boolean flag directly encoded in
@@ -750,6 +750,14 @@ static void psi_group_change(struct psi_group *group, int cpu,
 		if (set & (1 << t))
 			groupc->tasks[t]++;
 
+	if (!group->enabled) {
+		if (groupc->state_mask & (1 << PSI_NONIDLE))
+			record_times(groupc, now);
+		groupc->state_mask = state_mask;
+		write_seqcount_end(&groupc->seq);
+		return;
+	}
+
 	for (s = 0; s < NR_PSI_STATES; s++) {
 		if (test_state(groupc->tasks, s, state_mask & PSI_ONCPU))
 			state_mask |= (1 << s);
@@ -766,6 +774,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	if (unlikely((state_mask & PSI_ONCPU) && cpu_curr(cpu)->in_memstall))
 		state_mask |= (1 << PSI_MEM_FULL);
 
+	record_times(groupc, now);
 	groupc->state_mask = state_mask;
 
 	write_seqcount_end(&groupc->seq);
@@ -1088,6 +1097,23 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 
 	task_rq_unlock(rq, task, &rf);
 }
+
+void psi_cgroup_enable(struct psi_group *group, bool enable)
+{
+	struct psi_group_cpu *groupc;
+	int cpu;
+	u64 now;
+
+	if (group->enabled == enable)
+		return;
+	group->enabled = enable;
+
+	for_each_possible_cpu(cpu) {
+		groupc = per_cpu_ptr(group->pcpu, cpu);
+		now = cpu_clock(cpu);
+		psi_group_change(group, cpu, 0, 0, now, true);
+	}
+}
 #endif /* CONFIG_CGROUPS */
 
 int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 10/10] sched/psi: cache parent psi_group to speed up groups iterate
  2022-08-08 11:03 [PATCH v2 00/10] sched/psi: some optimization and extension Chengming Zhou
                   ` (8 preceding siblings ...)
  2022-08-08 11:03 ` [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface Chengming Zhou
@ 2022-08-08 11:03 ` Chengming Zhou
  2022-08-15 13:25 ` [PATCH v2 00/10] sched/psi: some optimization and extension Michal Koutný
  10 siblings, 0 replies; 32+ messages in thread
From: Chengming Zhou @ 2022-08-08 11:03 UTC (permalink / raw)
  To: hannes, tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall
  Cc: cgroups, linux-doc, linux-kernel, songmuchun, Chengming Zhou

We use iterate_groups() to iterate each level psi_group to update
PSI stats, which is a very hot path.

In current code, iterate_groups() have to use multiple branches and
cgroup_parent() to get parent psi_group for each level, which is not
very efficient.

This patch cache parent psi_group in struct psi_group, only need to get
psi_group of task itself first, then just use group->parent to iterate.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 include/linux/psi_types.h |  1 +
 kernel/sched/psi.c        | 51 ++++++++++++++++++++-------------------
 2 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index fced39e255aa..7459a47fcb1f 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -148,6 +148,7 @@ struct psi_trigger {
 
 struct psi_group {
 	bool enabled;
+	struct psi_group *parent;
 
 	/* Protects data used by the aggregator */
 	struct mutex avgs_lock;
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 9df1686ee02d..d3c1c49b9bcf 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -786,30 +786,22 @@ static void psi_group_change(struct psi_group *group, int cpu,
 		schedule_delayed_work(&group->avgs_work, PSI_FREQ);
 }
 
-static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
+static inline struct psi_group *task_psi_group(struct task_struct *task)
 {
-	if (*iter == &psi_system)
-		return NULL;
-
 #ifdef CONFIG_CGROUPS
 	if (static_branch_likely(&psi_cgroups_enabled)) {
-		struct cgroup *cgroup = NULL;
-
-		if (!*iter)
-			cgroup = task->cgroups->dfl_cgrp;
-		else
-			cgroup = cgroup_parent(*iter);
+		struct cgroup *cgroup = task_dfl_cgroup(task);
 
-		if (cgroup && cgroup_parent(cgroup)) {
-			*iter = cgroup;
+		if (cgroup && cgroup_parent(cgroup))
 			return cgroup_psi(cgroup);
-		}
 	}
 #endif
-	*iter = &psi_system;
 	return &psi_system;
 }
 
+#define for_each_psi_group(group) \
+	for (; group; group = group->parent)
+
 static void psi_flags_change(struct task_struct *task, int clear, int set)
 {
 	if (((task->psi_flags & set) ||
@@ -827,12 +819,11 @@ static void psi_flags_change(struct task_struct *task, int clear, int set)
 
 void psi_change_groups(struct task_struct *task, int clear, int set)
 {
+	struct psi_group *group = task_psi_group(task);
 	int cpu = task_cpu(task);
-	struct psi_group *group;
-	void *iter = NULL;
 	u64 now = cpu_clock(cpu);
 
-	while ((group = iterate_groups(task, &iter)))
+	for_each_psi_group(group)
 		psi_group_change(group, cpu, clear, set, now, true);
 }
 
@@ -850,7 +841,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 {
 	struct psi_group *group, *common = NULL;
 	int cpu = task_cpu(prev);
-	void *iter;
 	u64 now = cpu_clock(cpu);
 
 	if (next->pid) {
@@ -861,8 +851,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * we reach the first common ancestor. Iterate @next's
 		 * ancestors only until we encounter @prev's ONCPU.
 		 */
-		iter = NULL;
-		while ((group = iterate_groups(next, &iter))) {
+		group = task_psi_group(next);
+		for_each_psi_group(group) {
 			if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
 			    PSI_ONCPU) {
 				common = group;
@@ -903,9 +893,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 
 		psi_flags_change(prev, clear, set);
 
-		iter = NULL;
-		while ((group = iterate_groups(prev, &iter)) && group != common)
+		group = task_psi_group(prev);
+		for_each_psi_group(group) {
+			if (group == common)
+				break;
 			psi_group_change(group, cpu, clear, set, now, wake_clock);
+		}
 
 		/*
 		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
@@ -913,7 +906,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 */
 		if (sleep || unlikely(prev->in_memstall != next->in_memstall)) {
 			clear &= ~TSK_ONCPU;
-			for (; group; group = iterate_groups(prev, &iter))
+			for_each_psi_group(group)
 				psi_group_change(group, cpu, clear, set, now, wake_clock);
 		}
 	}
@@ -922,7 +915,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 void psi_account_irqtime(struct task_struct *task, u32 delta)
 {
 	int cpu = task_cpu(task);
-	void *iter = NULL;
 	struct psi_group *group;
 	struct psi_group_cpu *groupc;
 	u64 now;
@@ -932,7 +924,8 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
 
 	now = cpu_clock(cpu);
 
-	while ((group = iterate_groups(task, &iter))) {
+	group = task_psi_group(task);
+	for_each_psi_group(group) {
 		groupc = per_cpu_ptr(group->pcpu, cpu);
 
 		write_seqcount_begin(&groupc->seq);
@@ -1010,6 +1003,8 @@ void psi_memstall_leave(unsigned long *flags)
 #ifdef CONFIG_CGROUPS
 int psi_cgroup_alloc(struct cgroup *cgroup)
 {
+	struct cgroup *parent;
+
 	if (!static_branch_likely(&psi_cgroups_enabled))
 		return 0;
 
@@ -1017,6 +1012,12 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 	if (!cgroup->psi.pcpu)
 		return -ENOMEM;
 	group_init(&cgroup->psi);
+
+	parent = cgroup_parent(cgroup);
+	if (parent && cgroup_parent(parent))
+		cgroup->psi.parent = cgroup_psi(parent);
+	else
+		cgroup->psi.parent = &psi_system;
 	return 0;
 }
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
  2022-08-08 11:03 ` [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface Chengming Zhou
@ 2022-08-09 17:48   ` Tejun Heo
  2022-08-10  0:39     ` Chengming Zhou
  2022-08-12 10:14   ` Michal Koutný
  2022-08-15 15:49   ` Johannes Weiner
  2 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2022-08-09 17:48 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: hannes, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, cgroups, linux-doc,
	linux-kernel, songmuchun

Hello,

On Mon, Aug 08, 2022 at 07:03:40PM +0800, Chengming Zhou wrote:
> So this patch introduce a per-cgroup PSI stats disable/re-enable
> interface "cgroup.psi", which is a read-write single value file that
> allowed values are "0" and "1", the defaults is "1" so per-cgroup
> PSI stats is enabled by default.

Given that the knobs are named {cpu|memory|io}.pressure, I wonder whether
"cgroup.psi" is the best name. Also, it doesn't convey that it's the
enable/disable knob. I think it needs a better name.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
  2022-08-09 17:48   ` Tejun Heo
@ 2022-08-10  0:39     ` Chengming Zhou
  2022-08-10  1:30       ` Chengming Zhou
  0 siblings, 1 reply; 32+ messages in thread
From: Chengming Zhou @ 2022-08-10  0:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: hannes, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, cgroups, linux-doc,
	linux-kernel, songmuchun

On 2022/8/10 01:48, Tejun Heo wrote:
> Hello,
> 
> On Mon, Aug 08, 2022 at 07:03:40PM +0800, Chengming Zhou wrote:
>> So this patch introduce a per-cgroup PSI stats disable/re-enable
>> interface "cgroup.psi", which is a read-write single value file that
>> allowed values are "0" and "1", the defaults is "1" so per-cgroup
>> PSI stats is enabled by default.
> 
> Given that the knobs are named {cpu|memory|io}.pressure, I wonder whether
> "cgroup.psi" is the best name. Also, it doesn't convey that it's the
> enable/disable knob. I think it needs a better name.

Yes, "cgroup.psi" is not good. What abort "pressure.enable" or "cgroup.psi_enable"?

Thanks.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
  2022-08-10  0:39     ` Chengming Zhou
@ 2022-08-10  1:30       ` Chengming Zhou
  2022-08-10 15:25         ` Johannes Weiner
  0 siblings, 1 reply; 32+ messages in thread
From: Chengming Zhou @ 2022-08-10  1:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: hannes, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, cgroups, linux-doc,
	linux-kernel, songmuchun

On 2022/8/10 08:39, Chengming Zhou wrote:
> On 2022/8/10 01:48, Tejun Heo wrote:
>> Hello,
>>
>> On Mon, Aug 08, 2022 at 07:03:40PM +0800, Chengming Zhou wrote:
>>> So this patch introduce a per-cgroup PSI stats disable/re-enable
>>> interface "cgroup.psi", which is a read-write single value file that
>>> allowed values are "0" and "1", the defaults is "1" so per-cgroup
>>> PSI stats is enabled by default.
>>
>> Given that the knobs are named {cpu|memory|io}.pressure, I wonder whether
>> "cgroup.psi" is the best name. Also, it doesn't convey that it's the
>> enable/disable knob. I think it needs a better name.
> 
> Yes, "cgroup.psi" is not good. What abort "pressure.enable" or "cgroup.psi_enable"?

Doesn't look good either, what do you think of "cgroup.pressure.enable"?

Thanks.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
  2022-08-10  1:30       ` Chengming Zhou
@ 2022-08-10 15:25         ` Johannes Weiner
  2022-08-10 17:27           ` Tejun Heo
  2022-08-15 13:23           ` Michal Koutný
  0 siblings, 2 replies; 32+ messages in thread
From: Johannes Weiner @ 2022-08-10 15:25 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Tejun Heo, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, cgroups, linux-doc,
	linux-kernel, songmuchun

On Wed, Aug 10, 2022 at 09:30:59AM +0800, Chengming Zhou wrote:
> On 2022/8/10 08:39, Chengming Zhou wrote:
> > On 2022/8/10 01:48, Tejun Heo wrote:
> >> Hello,
> >>
> >> On Mon, Aug 08, 2022 at 07:03:40PM +0800, Chengming Zhou wrote:
> >>> So this patch introduce a per-cgroup PSI stats disable/re-enable
> >>> interface "cgroup.psi", which is a read-write single value file that
> >>> allowed values are "0" and "1", the defaults is "1" so per-cgroup
> >>> PSI stats is enabled by default.
> >>
> >> Given that the knobs are named {cpu|memory|io}.pressure, I wonder whether
> >> "cgroup.psi" is the best name. Also, it doesn't convey that it's the
> >> enable/disable knob. I think it needs a better name.
> > 
> > Yes, "cgroup.psi" is not good. What abort "pressure.enable" or "cgroup.psi_enable"?
> 
> Doesn't look good either, what do you think of "cgroup.pressure.enable"?

How about just cgroup.pressure? Too ambiguous?

cgroup.pressure.enable sounds good to me too. Or, because it's
default-enabled and that likely won't change, cgroup.pressure.disable.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
  2022-08-10 15:25         ` Johannes Weiner
@ 2022-08-10 17:27           ` Tejun Heo
  2022-08-11  2:09             ` Chengming Zhou
  2022-08-15 13:23           ` Michal Koutný
  1 sibling, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2022-08-10 17:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Chengming Zhou, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, cgroups, linux-doc,
	linux-kernel, songmuchun

Hello,

On Wed, Aug 10, 2022 at 11:25:07AM -0400, Johannes Weiner wrote:
> How about just cgroup.pressure? Too ambiguous?
> 
> cgroup.pressure.enable sounds good to me too. Or, because it's
> default-enabled and that likely won't change, cgroup.pressure.disable.

.disable sounds more logical but I like .enable better for some reason. As
for just cgroup.pressure, yeah, maybe? The conundrum is that the prettiness
order is the exact reverse of the logical order. So, I'm okay with any of
the three.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
  2022-08-10 17:27           ` Tejun Heo
@ 2022-08-11  2:09             ` Chengming Zhou
  0 siblings, 0 replies; 32+ messages in thread
From: Chengming Zhou @ 2022-08-11  2:09 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner
  Cc: corbet, surenb, mingo, peterz, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, cgroups, linux-doc, linux-kernel, songmuchun

On 2022/8/11 01:27, Tejun Heo wrote:
> Hello,
> 
> On Wed, Aug 10, 2022 at 11:25:07AM -0400, Johannes Weiner wrote:
>> How about just cgroup.pressure? Too ambiguous?
>>
>> cgroup.pressure.enable sounds good to me too. Or, because it's
>> default-enabled and that likely won't change, cgroup.pressure.disable.
> 
> .disable sounds more logical but I like .enable better for some reason. As
> for just cgroup.pressure, yeah, maybe? The conundrum is that the prettiness
> order is the exact reverse of the logical order. So, I'm okay with any of
> the three.

Ok, so I would like to pick the prettiest "cgroup.pressure", it also looks more
consistent with {cpu|memory|io}.pressure.

Thanks!


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
  2022-08-08 11:03 ` [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface Chengming Zhou
  2022-08-09 17:48   ` Tejun Heo
@ 2022-08-12 10:14   ` Michal Koutný
  2022-08-12 12:36     ` Chengming Zhou
  2022-08-15 15:49   ` Johannes Weiner
  2 siblings, 1 reply; 32+ messages in thread
From: Michal Koutný @ 2022-08-12 10:14 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: hannes, tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, cgroups, linux-doc,
	linux-kernel, songmuchun

Hello Chengming.

On Mon, Aug 08, 2022 at 07:03:40PM +0800, Chengming Zhou <zhouchengming@bytedance.com> wrote:
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index dd84e34bc051..ade40506ab80 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -968,6 +968,13 @@ All cgroup core files are prefixed with "cgroup."
>  	killing cgroups is a process directed operation, i.e. it affects
>  	the whole thread-group.
>  
> +  cgroup.psi
> +	A read-write single value file that allowed values are "0" and "1".
> +	The default is "1".
> +
> +	Writing "0" to the file will disable the cgroup PSI stats accounting.
> +	Writing "1" to the file will re-enable the cgroup PSI stats accounting.
> +

I'd suggest explaining here explicitely, this control attribute is not
hierarchical (i.e. PSI accounting in a cgroup does not affect accounting
in descendants and doesn't need pass enablement via ancestors from
root). And the purpose that it "saves" cycles (where).

Regards,
Michal

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
  2022-08-12 10:14   ` Michal Koutný
@ 2022-08-12 12:36     ` Chengming Zhou
  2022-08-15 13:23       ` Michal Koutný
  0 siblings, 1 reply; 32+ messages in thread
From: Chengming Zhou @ 2022-08-12 12:36 UTC (permalink / raw)
  To: Michal Koutný
  Cc: hannes, tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, cgroups, linux-doc,
	linux-kernel, songmuchun

On 2022/8/12 18:14, Michal Koutný wrote:
> Hello Chengming.
> 
> On Mon, Aug 08, 2022 at 07:03:40PM +0800, Chengming Zhou <zhouchengming@bytedance.com> wrote:
>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>> index dd84e34bc051..ade40506ab80 100644
>> --- a/Documentation/admin-guide/cgroup-v2.rst
>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>> @@ -968,6 +968,13 @@ All cgroup core files are prefixed with "cgroup."
>>  	killing cgroups is a process directed operation, i.e. it affects
>>  	the whole thread-group.
>>  
>> +  cgroup.psi
>> +	A read-write single value file that allowed values are "0" and "1".
>> +	The default is "1".
>> +
>> +	Writing "0" to the file will disable the cgroup PSI stats accounting.
>> +	Writing "1" to the file will re-enable the cgroup PSI stats accounting.
>> +
> 
> I'd suggest explaining here explicitely, this control attribute is not
> hierarchical (i.e. PSI accounting in a cgroup does not affect accounting
> in descendants and doesn't need pass enablement via ancestors from
> root). And the purpose that it "saves" cycles (where).

Thanks for the suggestion and explanation!

Could you help take a look if there is anything to improve?


--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -968,6 +968,23 @@ All cgroup core files are prefixed with "cgroup."
        killing cgroups is a process directed operation, i.e. it affects
        the whole thread-group.

+  cgroup.pressure
+       A read-write single value file that allowed values are "0" and "1".
+       The default is "1".
+
+       Writing "0" to the file will disable the cgroup PSI accounting.
+       Writing "1" to the file will re-enable the cgroup PSI accounting.
+
+       This control attribute is not hierarchical, so disable or enable PSI
+       accounting in a cgroup does not affect PSI accounting in descendants
+       and doesn't need pass enablement via ancestors from root.
+
+       The reason this control attribute exists is that PSI accounts stalls for
+       each cgroup separately and aggregates it at each level of the hierarchy.
+       This may cause non-negligible overhead for some workloads when under
+       deep level of the hierarchy, in which case this control attribute can
+       be used to disable PSI accounting in the cgroups.
+

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
  2022-08-12 12:36     ` Chengming Zhou
@ 2022-08-15 13:23       ` Michal Koutný
  0 siblings, 0 replies; 32+ messages in thread
From: Michal Koutný @ 2022-08-15 13:23 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: hannes, tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, cgroups, linux-doc,
	linux-kernel, songmuchun

On Fri, Aug 12, 2022 at 08:36:17PM +0800, Chengming Zhou <zhouchengming@bytedance.com> wrote:
> Could you help take a look if there is anything to improve?

Thanks, just a little nit.

> +       The reason this control attribute exists is that PSI accounts stalls for
> +       each cgroup separately and aggregates it at each level of the hierarchy.
> +       This may cause non-negligible overhead for some workloads when under
> +       deep level of the hierarchy, in which case this control attribute can
> +       be used to disable PSI accounting in the cgroups.

s/in the cgroups/in the non-leaf cgroups/
or
s/in the cgroups/in the uninteresting cgroups/

(I'm concerned that it may result in lots of disabling if you want the
performance. I'll expand on it in 2nd subthread.)

Michal

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
  2022-08-10 15:25         ` Johannes Weiner
  2022-08-10 17:27           ` Tejun Heo
@ 2022-08-15 13:23           ` Michal Koutný
  2022-08-23  6:18             ` Chengming Zhou
  1 sibling, 1 reply; 32+ messages in thread
From: Michal Koutný @ 2022-08-15 13:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Chengming Zhou, Tejun Heo, corbet, surenb, mingo, peterz,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, cgroups,
	linux-doc, linux-kernel, songmuchun

On Wed, Aug 10, 2022 at 11:25:07AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> cgroup.pressure.enable sounds good to me too. Or, because it's
> default-enabled and that likely won't change, cgroup.pressure.disable.

Will it not change?

I'd say that user would be interested in particular level or even just
level in subtree for PSI, so the opt-out may result in lots of explicit
disablements (or even watch for cgroups created and disable PSI there)
to get some performance back.

I have two suggestions based on the above:
1) Make the default globally configurable (mount option?)
2) Allow implicit enablement upon trigger creation

WDYT?

Michal

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/10] sched/psi: some optimization and extension
  2022-08-08 11:03 [PATCH v2 00/10] sched/psi: some optimization and extension Chengming Zhou
                   ` (9 preceding siblings ...)
  2022-08-08 11:03 ` [PATCH v2 10/10] sched/psi: cache parent psi_group to speed up groups iterate Chengming Zhou
@ 2022-08-15 13:25 ` Michal Koutný
  2022-08-16 14:01   ` Chengming Zhou
  10 siblings, 1 reply; 32+ messages in thread
From: Michal Koutný @ 2022-08-15 13:25 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: hannes, tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, cgroups, linux-doc,
	linux-kernel, songmuchun

On Mon, Aug 08, 2022 at 07:03:31PM +0800, Chengming Zhou <zhouchengming@bytedance.com> wrote:
> This patch series are some optimization and extension for PSI,

BTW do you have some numbers/example how much these modifications save
when aggregated together?

Thanks,
Michal

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
  2022-08-08 11:03 ` [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface Chengming Zhou
  2022-08-09 17:48   ` Tejun Heo
  2022-08-12 10:14   ` Michal Koutný
@ 2022-08-15 15:49   ` Johannes Weiner
  2022-08-15 19:50     ` Tejun Heo
  2022-08-16 13:06     ` Chengming Zhou
  2 siblings, 2 replies; 32+ messages in thread
From: Johannes Weiner @ 2022-08-15 15:49 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, cgroups, linux-doc,
	linux-kernel, songmuchun

On Mon, Aug 08, 2022 at 07:03:40PM +0800, Chengming Zhou wrote:
> +static ssize_t cgroup_psi_write(struct kernfs_open_file *of,
> +				char *buf, size_t nbytes, loff_t off)
> +{
> +	ssize_t ret;
> +	int enable;
> +	struct cgroup *cgrp;
> +	struct psi_group *psi;
> +
> +	ret = kstrtoint(strstrip(buf), 0, &enable);
> +	if (ret)
> +		return ret;
> +
> +	if (enable < 0 || enable > 1)
> +		return -ERANGE;
> +
> +	cgrp = cgroup_kn_lock_live(of->kn, false);
> +	if (!cgrp)
> +		return -ENOENT;
> +
> +	psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi;
> +	psi_cgroup_enable(psi, enable);

I think it should also add/remove the pressure files when enabling and
disabling the aggregation, since their contents would be stale and
misleading.

Take a look at cgroup_add_dfl_cftypes() and cgroup_rm_cftypes()

> @@ -5115,6 +5152,12 @@ static struct cftype cgroup_base_files[] = {
>  		.release = cgroup_pressure_release,
>  	},
>  #endif
> +	{
> +		.name = "cgroup.psi",
> +		.flags = CFTYPE_PRESSURE,
> +		.seq_show = cgroup_psi_show,
> +		.write = cgroup_psi_write,
> +	},
>  #endif /* CONFIG_PSI */
>  	{ }	/* terminate */
>  };
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 58f8092c938f..9df1686ee02d 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -181,6 +181,7 @@ static void group_init(struct psi_group *group)
>  {
>  	int cpu;
>  
> +	group->enabled = true;
>  	for_each_possible_cpu(cpu)
>  		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
>  	group->avg_last_update = sched_clock();
> @@ -700,17 +701,16 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  	groupc = per_cpu_ptr(group->pcpu, cpu);
>  
>  	/*
> -	 * First we assess the aggregate resource states this CPU's
> -	 * tasks have been in since the last change, and account any
> -	 * SOME and FULL time these may have resulted in.
> -	 *
> -	 * Then we update the task counts according to the state
> +	 * First we update the task counts according to the state
>  	 * change requested through the @clear and @set bits.
> +	 *
> +	 * Then if the cgroup PSI stats accounting enabled, we
> +	 * assess the aggregate resource states this CPU's tasks
> +	 * have been in since the last change, and account any
> +	 * SOME and FULL time these may have resulted in.
>  	 */
>  	write_seqcount_begin(&groupc->seq);
>  
> -	record_times(groupc, now);
> -
>  	/*
>  	 * Start with TSK_ONCPU, which doesn't have a corresponding
>  	 * task count - it's just a boolean flag directly encoded in
> @@ -750,6 +750,14 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  		if (set & (1 << t))
>  			groupc->tasks[t]++;
>  
> +	if (!group->enabled) {
> +		if (groupc->state_mask & (1 << PSI_NONIDLE))
> +			record_times(groupc, now);

Why record the nonidle time? It's only used for aggregation, which is
stopped as well.

> @@ -1088,6 +1097,23 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
>  
>  	task_rq_unlock(rq, task, &rf);
>  }
> +
> +void psi_cgroup_enable(struct psi_group *group, bool enable)
> +{
> +	struct psi_group_cpu *groupc;
> +	int cpu;
> +	u64 now;
> +
> +	if (group->enabled == enable)
> +		return;
> +	group->enabled = enable;
> +
> +	for_each_possible_cpu(cpu) {
> +		groupc = per_cpu_ptr(group->pcpu, cpu);
> +		now = cpu_clock(cpu);
> +		psi_group_change(group, cpu, 0, 0, now, true);

This loop deserves a comment, IMO.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
  2022-08-15 15:49   ` Johannes Weiner
@ 2022-08-15 19:50     ` Tejun Heo
  2022-08-16 13:06     ` Chengming Zhou
  1 sibling, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2022-08-15 19:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Chengming Zhou, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, cgroups, linux-doc,
	linux-kernel, songmuchun

Hello,

On Mon, Aug 15, 2022 at 11:49:55AM -0400, Johannes Weiner wrote:
> I think it should also add/remove the pressure files when enabling and
> disabling the aggregation, since their contents would be stale and
> misleading.
> 
> Take a look at cgroup_add_dfl_cftypes() and cgroup_rm_cftypes()

The problem with adding cftypes dynamically is that it can fail, which isn't
the end of the world here but still kinda sucks. I think what we actually
wanna do is hiding and unhiding while keeping all the data structures in
place which is needed somewhere else anyway.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 07/10] sched/psi: remove NR_ONCPU task accounting
  2022-08-08 11:03 ` [PATCH v2 07/10] sched/psi: remove NR_ONCPU task accounting Chengming Zhou
@ 2022-08-16 10:40   ` Chengming Zhou
  0 siblings, 0 replies; 32+ messages in thread
From: Chengming Zhou @ 2022-08-16 10:40 UTC (permalink / raw)
  To: hannes, tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall
  Cc: cgroups, linux-doc, linux-kernel, songmuchun

On 2022/8/8 19:03, Chengming Zhou wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> We put all fields updated by the scheduler in the first cacheline of
> struct psi_group_cpu for performance.
> 
> Since we want add another PSI_IRQ_FULL to track IRQ/SOFTIRQ pressure,
> we need to reclaim space first. This patch remove NR_ONCPU task accounting
> in struct psi_group_cpu, use one bit in state_mask to track instead.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
> Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  include/linux/psi_types.h | 16 +++++++---------
>  kernel/sched/psi.c        | 36 ++++++++++++++++++++++++++++--------
>  2 files changed, 35 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> index c7fe7c089718..54cb74946db4 100644
> --- a/include/linux/psi_types.h
> +++ b/include/linux/psi_types.h
> @@ -15,13 +15,6 @@ enum psi_task_count {
>  	NR_IOWAIT,
>  	NR_MEMSTALL,
>  	NR_RUNNING,
> -	/*
> -	 * This can't have values other than 0 or 1 and could be
> -	 * implemented as a bit flag. But for now we still have room
> -	 * in the first cacheline of psi_group_cpu, and this way we
> -	 * don't have to special case any state tracking for it.
> -	 */
> -	NR_ONCPU,
>  	/*
>  	 * For IO and CPU stalls the presence of running/oncpu tasks
>  	 * in the domain means a partial rather than a full stall.
> @@ -32,16 +25,18 @@ enum psi_task_count {
>  	 * threads and memstall ones.
>  	 */
>  	NR_MEMSTALL_RUNNING,
> -	NR_PSI_TASK_COUNTS = 5,
> +	NR_PSI_TASK_COUNTS = 4,
>  };
>  
>  /* Task state bitmasks */
>  #define TSK_IOWAIT	(1 << NR_IOWAIT)
>  #define TSK_MEMSTALL	(1 << NR_MEMSTALL)
>  #define TSK_RUNNING	(1 << NR_RUNNING)
> -#define TSK_ONCPU	(1 << NR_ONCPU)
>  #define TSK_MEMSTALL_RUNNING	(1 << NR_MEMSTALL_RUNNING)
>  
> +/* Only one task can be scheduled, no corresponding task count */
> +#define TSK_ONCPU	(1 << NR_PSI_TASK_COUNTS)
> +
>  /* Resources that workloads could be stalled on */
>  enum psi_res {
>  	PSI_IO,
> @@ -68,6 +63,9 @@ enum psi_states {
>  	NR_PSI_STATES = 7,
>  };
>  
> +/* Use one bit in the state mask to track TSK_ONCPU */
> +#define PSI_ONCPU	(1 << NR_PSI_STATES)
> +
>  enum psi_aggregators {
>  	PSI_AVGS = 0,
>  	PSI_POLL,
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 595a6c8230b7..1c675715ed33 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -216,7 +216,7 @@ void __init psi_init(void)
>  	group_init(&psi_system);
>  }
>  
> -static bool test_state(unsigned int *tasks, enum psi_states state)
> +static bool test_state(unsigned int *tasks, enum psi_states state, bool oncpu)
>  {
>  	switch (state) {
>  	case PSI_IO_SOME:
> @@ -229,9 +229,9 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
>  		return unlikely(tasks[NR_MEMSTALL] &&
>  			tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);
>  	case PSI_CPU_SOME:
> -		return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
> +		return unlikely(tasks[NR_RUNNING] > oncpu);
>  	case PSI_CPU_FULL:
> -		return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
> +		return unlikely(tasks[NR_RUNNING] && !oncpu);
>  	case PSI_NONIDLE:
>  		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
>  			tasks[NR_RUNNING];
> @@ -693,9 +693,9 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  			     bool wake_clock)
>  {
>  	struct psi_group_cpu *groupc;
> -	u32 state_mask = 0;
>  	unsigned int t, m;
>  	enum psi_states s;
> +	u32 state_mask;
>  
>  	groupc = per_cpu_ptr(group->pcpu, cpu);
>  
> @@ -711,6 +711,26 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  
>  	record_times(groupc, now);
>  
> +	/*
> +	 * Start with TSK_ONCPU, which doesn't have a corresponding
> +	 * task count - it's just a boolean flag directly encoded in
> +	 * the state mask. Clear, set, or carry the current state if
> +	 * no changes are requested.
> +	 */
> +	if (clear & TSK_ONCPU) {
> +		state_mask = 0;
> +		clear &= ~TSK_ONCPU;
> +	} else if (set & TSK_ONCPU) {
> +		state_mask = PSI_ONCPU;
> +		set &= ~TSK_ONCPU;

These two conditions should use "unlikely()", which would be better
in the perf bench sched pipe testcase.


> +	} else {
> +		state_mask = groupc->state_mask & PSI_ONCPU;
> +	}
> +
> +	/*
> +	 * The rest of the state mask is calculated based on the task
> +	 * counts. Update those first, then construct the mask.
> +	 */
>  	for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
>  		if (!(m & (1 << t)))
>  			continue;
> @@ -730,9 +750,8 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  		if (set & (1 << t))
>  			groupc->tasks[t]++;
>  
> -	/* Calculate state mask representing active states */
>  	for (s = 0; s < NR_PSI_STATES; s++) {
> -		if (test_state(groupc->tasks, s))
> +		if (test_state(groupc->tasks, s, state_mask & PSI_ONCPU))
>  			state_mask |= (1 << s);
>  	}
>  
> @@ -744,7 +763,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  	 * task in a cgroup is in_memstall, the corresponding groupc
>  	 * on that cpu is in PSI_MEM_FULL state.
>  	 */
> -	if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
> +	if (unlikely((state_mask & PSI_ONCPU) && cpu_curr(cpu)->in_memstall))
>  		state_mask |= (1 << PSI_MEM_FULL);
>  
>  	groupc->state_mask = state_mask;
> @@ -835,7 +854,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>  		 */
>  		iter = NULL;
>  		while ((group = iterate_groups(next, &iter))) {
> -			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
> +			if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
> +			    PSI_ONCPU) {
>  				common = group;
>  				break;
>  			}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
  2022-08-15 15:49   ` Johannes Weiner
  2022-08-15 19:50     ` Tejun Heo
@ 2022-08-16 13:06     ` Chengming Zhou
  1 sibling, 0 replies; 32+ messages in thread
From: Chengming Zhou @ 2022-08-16 13:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, cgroups, linux-doc,
	linux-kernel, songmuchun

On 2022/8/15 23:49, Johannes Weiner wrote:
> On Mon, Aug 08, 2022 at 07:03:40PM +0800, Chengming Zhou wrote:
>> +static ssize_t cgroup_psi_write(struct kernfs_open_file *of,
>> +				char *buf, size_t nbytes, loff_t off)
>> +{
>> +	ssize_t ret;
>> +	int enable;
>> +	struct cgroup *cgrp;
>> +	struct psi_group *psi;
>> +
>> +	ret = kstrtoint(strstrip(buf), 0, &enable);
>> +	if (ret)
>> +		return ret;
>> +
>> +	if (enable < 0 || enable > 1)
>> +		return -ERANGE;
>> +
>> +	cgrp = cgroup_kn_lock_live(of->kn, false);
>> +	if (!cgrp)
>> +		return -ENOENT;
>> +
>> +	psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi;
>> +	psi_cgroup_enable(psi, enable);
> 
> I think it should also add/remove the pressure files when enabling and
> disabling the aggregation, since their contents would be stale and
> misleading.
> 
> Take a look at cgroup_add_dfl_cftypes() and cgroup_rm_cftypes()

Ok, I will look.

> 
>> @@ -5115,6 +5152,12 @@ static struct cftype cgroup_base_files[] = {
>>  		.release = cgroup_pressure_release,
>>  	},
>>  #endif
>> +	{
>> +		.name = "cgroup.psi",
>> +		.flags = CFTYPE_PRESSURE,
>> +		.seq_show = cgroup_psi_show,
>> +		.write = cgroup_psi_write,
>> +	},
>>  #endif /* CONFIG_PSI */
>>  	{ }	/* terminate */
>>  };
>> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>> index 58f8092c938f..9df1686ee02d 100644
>> --- a/kernel/sched/psi.c
>> +++ b/kernel/sched/psi.c
>> @@ -181,6 +181,7 @@ static void group_init(struct psi_group *group)
>>  {
>>  	int cpu;
>>  
>> +	group->enabled = true;
>>  	for_each_possible_cpu(cpu)
>>  		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
>>  	group->avg_last_update = sched_clock();
>> @@ -700,17 +701,16 @@ static void psi_group_change(struct psi_group *group, int cpu,
>>  	groupc = per_cpu_ptr(group->pcpu, cpu);
>>  
>>  	/*
>> -	 * First we assess the aggregate resource states this CPU's
>> -	 * tasks have been in since the last change, and account any
>> -	 * SOME and FULL time these may have resulted in.
>> -	 *
>> -	 * Then we update the task counts according to the state
>> +	 * First we update the task counts according to the state
>>  	 * change requested through the @clear and @set bits.
>> +	 *
>> +	 * Then if the cgroup PSI stats accounting enabled, we
>> +	 * assess the aggregate resource states this CPU's tasks
>> +	 * have been in since the last change, and account any
>> +	 * SOME and FULL time these may have resulted in.
>>  	 */
>>  	write_seqcount_begin(&groupc->seq);
>>  
>> -	record_times(groupc, now);
>> -
>>  	/*
>>  	 * Start with TSK_ONCPU, which doesn't have a corresponding
>>  	 * task count - it's just a boolean flag directly encoded in
>> @@ -750,6 +750,14 @@ static void psi_group_change(struct psi_group *group, int cpu,
>>  		if (set & (1 << t))
>>  			groupc->tasks[t]++;
>>  
>> +	if (!group->enabled) {
>> +		if (groupc->state_mask & (1 << PSI_NONIDLE))
>> +			record_times(groupc, now);
> 
> Why record the nonidle time? It's only used for aggregation, which is
> stopped as well.

I'm considering of this situation: disable at t2 and re-enable at t3

state1(t1) --> state2(t2) --> state3(t3)

If aggregator has get_recent_times() in [t1, t2], groupc->times_prev[aggregator]
will include that delta of (t - t1).

Then re-enable at t3, the delta of (t3-t1) is discarded, may make that aggregator
see times < groupc->times_prev[aggregator] ?

Maybe I missed something, not sure whether this is a problem.


> 
>> @@ -1088,6 +1097,23 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
>>  
>>  	task_rq_unlock(rq, task, &rf);
>>  }
>> +
>> +void psi_cgroup_enable(struct psi_group *group, bool enable)
>> +{
>> +	struct psi_group_cpu *groupc;
>> +	int cpu;
>> +	u64 now;
>> +
>> +	if (group->enabled == enable)
>> +		return;
>> +	group->enabled = enable;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		groupc = per_cpu_ptr(group->pcpu, cpu);
>> +		now = cpu_clock(cpu);
>> +		psi_group_change(group, cpu, 0, 0, now, true);
> 
> This loop deserves a comment, IMO.

I add some comments as below, could you help take a look?

+
+void psi_cgroup_enable(struct psi_group *group, bool enable)
+{
+       int cpu;
+       u64 now;
+
+       if (group->enabled == enable)
+               return;
+       group->enabled = enable;
+
+       /*
+        * We use psi_group_change() to disable or re-enable the
+        * record_times(), test_state() loop and averaging worker
+        * in each psi_group_cpu of the psi_group, use .clear = 0
+        * and .set = 0 here since no task status really changed.
+        */
+       for_each_possible_cpu(cpu) {
+               now = cpu_clock(cpu);
+               psi_group_change(group, cpu, 0, 0, now, true);
+       }
+}

Thanks!


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/10] sched/psi: some optimization and extension
  2022-08-15 13:25 ` [PATCH v2 00/10] sched/psi: some optimization and extension Michal Koutný
@ 2022-08-16 14:01   ` Chengming Zhou
  2022-08-17 15:19     ` Chengming Zhou
  0 siblings, 1 reply; 32+ messages in thread
From: Chengming Zhou @ 2022-08-16 14:01 UTC (permalink / raw)
  To: Michal Koutný
  Cc: hannes, tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, cgroups, linux-doc,
	linux-kernel, songmuchun

On 2022/8/15 21:25, Michal Koutný wrote:
> On Mon, Aug 08, 2022 at 07:03:31PM +0800, Chengming Zhou <zhouchengming@bytedance.com> wrote:
>> This patch series are some optimization and extension for PSI,
> 
> BTW do you have some numbers/example how much these modifications save
> when aggregated together?
> 

I lost my test data last time, I will use mmtests/config-scheduler-perfpipe
to get some performance numbers right now.

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/10] sched/psi: some optimization and extension
  2022-08-16 14:01   ` Chengming Zhou
@ 2022-08-17 15:19     ` Chengming Zhou
  0 siblings, 0 replies; 32+ messages in thread
From: Chengming Zhou @ 2022-08-17 15:19 UTC (permalink / raw)
  To: Michal Koutný
  Cc: hannes, tj, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, cgroups, linux-doc,
	linux-kernel, songmuchun

On 2022/8/16 22:01, Chengming Zhou wrote:
> On 2022/8/15 21:25, Michal Koutný wrote:
>> On Mon, Aug 08, 2022 at 07:03:31PM +0800, Chengming Zhou <zhouchengming@bytedance.com> wrote:
>>> This patch series are some optimization and extension for PSI,
>>
>> BTW do you have some numbers/example how much these modifications save
>> when aggregated together?
>>

Sorry about delay...

Performance test using mmtests/config-scheduler-perfpipe in /user.slice/user-0.slice/session-4.scope

                                 next                patched       patched/only-leaf
Min       Time        8.82 (   0.00%)        8.49 (   3.74%)        8.00 (   9.32%)
1st-qrtle Time        8.90 (   0.00%)        8.58 (   3.63%)        8.05 (   9.58%)
2nd-qrtle Time        8.94 (   0.00%)        8.61 (   3.65%)        8.09 (   9.50%)
3rd-qrtle Time        8.99 (   0.00%)        8.65 (   3.75%)        8.15 (   9.35%)
Max-1     Time        8.82 (   0.00%)        8.49 (   3.74%)        8.00 (   9.32%)
Max-5     Time        8.82 (   0.00%)        8.49 (   3.74%)        8.00 (   9.32%)
Max-10    Time        8.84 (   0.00%)        8.55 (   3.20%)        8.04 (   9.05%)
Max-90    Time        9.04 (   0.00%)        8.67 (   4.10%)        8.18 (   9.51%)
Max-95    Time        9.04 (   0.00%)        8.68 (   4.03%)        8.20 (   9.26%)
Max-99    Time        9.07 (   0.00%)        8.73 (   3.82%)        8.25 (   9.11%)
Max       Time        9.12 (   0.00%)        8.89 (   2.54%)        8.27 (   9.29%)
Amean     Time        8.95 (   0.00%)        8.62 *   3.67%*        8.11 *   9.43%*


Thanks!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
  2022-08-15 13:23           ` Michal Koutný
@ 2022-08-23  6:18             ` Chengming Zhou
  2022-08-23 15:35               ` Johannes Weiner
  0 siblings, 1 reply; 32+ messages in thread
From: Chengming Zhou @ 2022-08-23  6:18 UTC (permalink / raw)
  To: Michal Koutný, Johannes Weiner
  Cc: Tejun Heo, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, cgroups, linux-doc,
	linux-kernel, songmuchun

On 2022/8/15 21:23, Michal Koutný wrote:
> On Wed, Aug 10, 2022 at 11:25:07AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> cgroup.pressure.enable sounds good to me too. Or, because it's
>> default-enabled and that likely won't change, cgroup.pressure.disable.
> 
> Will it not change?
> 
> I'd say that user would be interested in particular level or even just
> level in subtree for PSI, so the opt-out may result in lots of explicit
> disablements (or even watch for cgroups created and disable PSI there)
> to get some performance back.
> 
> I have two suggestions based on the above:
> 1) Make the default globally configurable (mount option?)
> 2) Allow implicit enablement upon trigger creation
> 

I think suggestion 1) make sense in some use case, like make per-cgroup
PSI disabled by default using a mount option, then enable using the
"cgroup.pressure" interface.

But suggestion 2) auto enable upon trigger creation, if we hide the
{cpu,memory,io}.pressure files when disabled, how can we create trigger?

Want to see what do Johannes and Tejun think about these suggestions?

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
  2022-08-23  6:18             ` Chengming Zhou
@ 2022-08-23 15:35               ` Johannes Weiner
  2022-08-23 15:43                 ` Chengming Zhou
  2022-08-23 16:20                 ` Tejun Heo
  0 siblings, 2 replies; 32+ messages in thread
From: Johannes Weiner @ 2022-08-23 15:35 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Michal Koutný,
	Tejun Heo, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, cgroups, linux-doc,
	linux-kernel, songmuchun

On Tue, Aug 23, 2022 at 02:18:21PM +0800, Chengming Zhou wrote:
> On 2022/8/15 21:23, Michal Koutný wrote:
> > On Wed, Aug 10, 2022 at 11:25:07AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >> cgroup.pressure.enable sounds good to me too. Or, because it's
> >> default-enabled and that likely won't change, cgroup.pressure.disable.
> > 
> > Will it not change?
> > 
> > I'd say that user would be interested in particular level or even just
> > level in subtree for PSI, so the opt-out may result in lots of explicit
> > disablements (or even watch for cgroups created and disable PSI there)
> > to get some performance back.
> > 
> > I have two suggestions based on the above:
> > 1) Make the default globally configurable (mount option?)
> > 2) Allow implicit enablement upon trigger creation
> > 
> 
> I think suggestion 1) make sense in some use case, like make per-cgroup
> PSI disabled by default using a mount option, then enable using the
> "cgroup.pressure" interface.
> 
> But suggestion 2) auto enable upon trigger creation, if we hide the
> {cpu,memory,io}.pressure files when disabled, how can we create trigger?
> 
> Want to see what do Johannes and Tejun think about these suggestions?

Re 1: I agree. If desired in the future we can make the default
configurable. Kconfig, mount option, what have you. cgroup.pressure
will work fine as a name regardless of what the default is.

Re 2: Not all consumers of the pressure metrics create trigger. I
would argue that few do. So it isn't the best signal to decide on
whether aggregation should occur. And yes, it's further complicated by
the triggers being written to the very pressure files. If we don't
hide them, we have to come up with another way to mark them as stale,
lest they confuse the heck out of users. Without breaking format...

So IMO, default-enable, "cgroup.pressure" as a name, and hiding the
pressure files should be good for now while allowing to make the
default configurable down the line.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
  2022-08-23 15:35               ` Johannes Weiner
@ 2022-08-23 15:43                 ` Chengming Zhou
  2022-08-23 16:20                 ` Tejun Heo
  1 sibling, 0 replies; 32+ messages in thread
From: Chengming Zhou @ 2022-08-23 15:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Koutný,
	Tejun Heo, corbet, surenb, mingo, peterz, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, cgroups, linux-doc,
	linux-kernel, songmuchun

On 2022/8/23 23:35, Johannes Weiner wrote:
> On Tue, Aug 23, 2022 at 02:18:21PM +0800, Chengming Zhou wrote:
>> On 2022/8/15 21:23, Michal Koutný wrote:
>>> On Wed, Aug 10, 2022 at 11:25:07AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
>>>> cgroup.pressure.enable sounds good to me too. Or, because it's
>>>> default-enabled and that likely won't change, cgroup.pressure.disable.
>>>
>>> Will it not change?
>>>
>>> I'd say that user would be interested in particular level or even just
>>> level in subtree for PSI, so the opt-out may result in lots of explicit
>>> disablements (or even watch for cgroups created and disable PSI there)
>>> to get some performance back.
>>>
>>> I have two suggestions based on the above:
>>> 1) Make the default globally configurable (mount option?)
>>> 2) Allow implicit enablement upon trigger creation
>>>
>>
>> I think suggestion 1) make sense in some use case, like make per-cgroup
>> PSI disabled by default using a mount option, then enable using the
>> "cgroup.pressure" interface.
>>
>> But suggestion 2) auto enable upon trigger creation, if we hide the
>> {cpu,memory,io}.pressure files when disabled, how can we create trigger?
>>
>> Want to see what do Johannes and Tejun think about these suggestions?
> 
> Re 1: I agree. If desired in the future we can make the default
> configurable. Kconfig, mount option, what have you. cgroup.pressure
> will work fine as a name regardless of what the default is.
> 
> Re 2: Not all consumers of the pressure metrics create trigger. I
> would argue that few do. So it isn't the best signal to decide on
> whether aggregation should occur. And yes, it's further complicated by
> the triggers being written to the very pressure files. If we don't
> hide them, we have to come up with another way to mark them as stale,
> lest they confuse the heck out of users. Without breaking format...
> 
> So IMO, default-enable, "cgroup.pressure" as a name, and hiding the
> pressure files should be good for now while allowing to make the
> default configurable down the line.

Agree, it's what we want for now. Thanks for your reply!



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
  2022-08-23 15:35               ` Johannes Weiner
  2022-08-23 15:43                 ` Chengming Zhou
@ 2022-08-23 16:20                 ` Tejun Heo
  1 sibling, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2022-08-23 16:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Chengming Zhou, Michal Koutný,
	corbet, surenb, mingo, peterz, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, cgroups, linux-doc, linux-kernel, songmuchun

Hello,

On Tue, Aug 23, 2022 at 11:35:59AM -0400, Johannes Weiner wrote:
> Re 1: I agree. If desired in the future we can make the default
> configurable. Kconfig, mount option, what have you. cgroup.pressure
> will work fine as a name regardless of what the default is.

Given that there's already cgroup_disable=pressure for cases which want it
fully disabled, I'm not sure we'd need to add more complex disabling
options. The only difference that'd make is for users who are configuring
cgroups manually which is pretty rare and it'd create a clear downside of
increasing confusion as the base assumption becomes dynamic. So, I think the
current default-on with opting-out is and will be just fine.

> Re 2: Not all consumers of the pressure metrics create trigger. I
> would argue that few do. So it isn't the best signal to decide on
> whether aggregation should occur. And yes, it's further complicated by
> the triggers being written to the very pressure files. If we don't
> hide them, we have to come up with another way to mark them as stale,
> lest they confuse the heck out of users. Without breaking format...
> 
> So IMO, default-enable, "cgroup.pressure" as a name, and hiding the
> pressure files should be good for now while allowing to make the
> default configurable down the line.

Sounds great.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2022-08-23 18:10 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-08 11:03 [PATCH v2 00/10] sched/psi: some optimization and extension Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 01/10] sched/psi: fix periodic aggregation shut off Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 02/10] sched/psi: optimize task switch inside shared cgroups again Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 03/10] sched/psi: move private helpers to sched/stats.h Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 04/10] sched/psi: don't change task psi_flags when migrate CPU/group Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 05/10] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 06/10] sched/psi: save percpu memory when !psi_cgroups_enabled Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 07/10] sched/psi: remove NR_ONCPU task accounting Chengming Zhou
2022-08-16 10:40   ` Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 08/10] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface Chengming Zhou
2022-08-09 17:48   ` Tejun Heo
2022-08-10  0:39     ` Chengming Zhou
2022-08-10  1:30       ` Chengming Zhou
2022-08-10 15:25         ` Johannes Weiner
2022-08-10 17:27           ` Tejun Heo
2022-08-11  2:09             ` Chengming Zhou
2022-08-15 13:23           ` Michal Koutný
2022-08-23  6:18             ` Chengming Zhou
2022-08-23 15:35               ` Johannes Weiner
2022-08-23 15:43                 ` Chengming Zhou
2022-08-23 16:20                 ` Tejun Heo
2022-08-12 10:14   ` Michal Koutný
2022-08-12 12:36     ` Chengming Zhou
2022-08-15 13:23       ` Michal Koutný
2022-08-15 15:49   ` Johannes Weiner
2022-08-15 19:50     ` Tejun Heo
2022-08-16 13:06     ` Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 10/10] sched/psi: cache parent psi_group to speed up groups iterate Chengming Zhou
2022-08-15 13:25 ` [PATCH v2 00/10] sched/psi: some optimization and extension Michal Koutný
2022-08-16 14:01   ` Chengming Zhou
2022-08-17 15:19     ` Chengming Zhou

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).