* [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization
@ 2021-03-03  3:46 Chengming Zhou
  2021-03-03  3:46 ` [PATCH v2 1/4] psi: Add PSI_CPU_FULL state Chengming Zhou
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Chengming Zhou @ 2021-03-03  3:46 UTC (permalink / raw)
  To: hannes, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt
  Cc: linux-kernel, songmuchun, zhouchengming

This patch series is a RESEND of the previous patches for the psi
subsystem. A few weeks have passed since the last review, so I have
put them together and resent them for easier review and merging.

Patch 1 adds the PSI_CPU_FULL state, which means all non-idle tasks
in a cgroup are delayed on the CPU resource, which is used by others
outside of the cgroup or throttled by the cgroup cpu.max configuration.

Patch 2 uses the ONCPU state and the current task's in_memstall flag
to detect reclaim, removing the hook in the timer tick to make the
code more concise and maintainable. And patch 3 adds unlikely()
annotations to move the pressure state branches out of line,
eliminating undesirable jumps during wakeups and sleeps.

Patch 4 optimizes the voluntary sleep switch by removing one call of
psi_group_change() for every common cgroup ancestor of the two tasks.

Chengming Zhou (3):
  psi: Add PSI_CPU_FULL state
  psi: Use ONCPU state tracking machinery to detect reclaim
  psi: Optimize task switch inside shared cgroups

Johannes Weiner (1):
  psi: pressure states are unlikely

 include/linux/psi.h       |   1 -
 include/linux/psi_types.h |   3 +-
 kernel/sched/core.c       |   1 -
 kernel/sched/psi.c        | 122 ++++++++++++++++++++++++----------------------
 kernel/sched/stats.h      |  37 +++++---------
 5 files changed, 78 insertions(+), 86 deletions(-)

-- 
2.11.0



* [PATCH v2 1/4] psi: Add PSI_CPU_FULL state
  2021-03-03  3:46 [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization Chengming Zhou
@ 2021-03-03  3:46 ` Chengming Zhou
  2021-03-04  9:09   ` [tip: sched/core] " tip-bot2 for Chengming Zhou
  2021-03-06 11:42   ` tip-bot2 for Chengming Zhou
  2021-03-03  3:46 ` [PATCH v2 2/4] psi: Use ONCPU state tracking machinery to detect reclaim Chengming Zhou
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 23+ messages in thread
From: Chengming Zhou @ 2021-03-03  3:46 UTC (permalink / raw)
  To: hannes, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt
  Cc: linux-kernel, songmuchun, zhouchengming

The FULL state doesn't exist for the CPU resource at the system level,
but it does exist at the cgroup level: all non-idle tasks in a cgroup
are delayed on the CPU resource, which is used by others outside of
the cgroup or throttled by the cgroup cpu.max configuration.
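
A minimal userspace sketch to read the new state from a cgroup (the
cgroup path is just an example, and both lines use the existing PSI
output format):

  #include <stdio.h>

  int main(void)
  {
  	char line[256];
  	FILE *f = fopen("/sys/fs/cgroup/test/cpu.pressure", "r");

  	if (!f)
  		return 1;
  	/* prints the "some" line and, with this patch, "full" too */
  	while (fgets(line, sizeof(line), f))
  		fputs(line, stdout);
  	fclose(f);
  	return 0;
  }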

Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/psi_types.h |  3 ++-
 kernel/sched/psi.c        | 14 +++++++++++---
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index b95f3211566a..0a23300d49af 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -50,9 +50,10 @@ enum psi_states {
 	PSI_MEM_SOME,
 	PSI_MEM_FULL,
 	PSI_CPU_SOME,
+	PSI_CPU_FULL,
 	/* Only per-CPU, to weigh the CPU in the global average: */
 	PSI_NONIDLE,
-	NR_PSI_STATES = 6,
+	NR_PSI_STATES = 7,
 };
 
 enum psi_aggregators {
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 967732c0766c..2293c45d289d 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -34,7 +34,10 @@
  * delayed on that resource such that nobody is advancing and the CPU
  * goes idle. This leaves both workload and CPU unproductive.
  *
- * (Naturally, the FULL state doesn't exist for the CPU resource.)
+ * Naturally, the FULL state doesn't exist for the CPU resource at the
+ * system level, but it does at the cgroup level: all non-idle tasks in
+ * a cgroup are delayed on the CPU resource, which is used by others
+ * outside of the cgroup or throttled by the cgroup cpu.max configuration.
  *
  *	SOME = nr_delayed_tasks != 0
  *	FULL = nr_delayed_tasks != 0 && nr_running_tasks == 0
@@ -225,6 +228,8 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
 		return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
 	case PSI_CPU_SOME:
 		return tasks[NR_RUNNING] > tasks[NR_ONCPU];
+	case PSI_CPU_FULL:
+		return tasks[NR_RUNNING] && !tasks[NR_ONCPU];
 	case PSI_NONIDLE:
 		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
 			tasks[NR_RUNNING];
@@ -678,8 +683,11 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
 		}
 	}
 
-	if (groupc->state_mask & (1 << PSI_CPU_SOME))
+	if (groupc->state_mask & (1 << PSI_CPU_SOME)) {
 		groupc->times[PSI_CPU_SOME] += delta;
+		if (groupc->state_mask & (1 << PSI_CPU_FULL))
+			groupc->times[PSI_CPU_FULL] += delta;
+	}
 
 	if (groupc->state_mask & (1 << PSI_NONIDLE))
 		groupc->times[PSI_NONIDLE] += delta;
@@ -1018,7 +1026,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 		group->avg_next_update = update_averages(group, now);
 	mutex_unlock(&group->avgs_lock);
 
-	for (full = 0; full < 2 - (res == PSI_CPU); full++) {
+	for (full = 0; full < 2; full++) {
 		unsigned long avg[3];
 		u64 total;
 		int w;
-- 
2.11.0



* [PATCH v2 2/4] psi: Use ONCPU state tracking machinery to detect reclaim
  2021-03-03  3:46 [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization Chengming Zhou
  2021-03-03  3:46 ` [PATCH v2 1/4] psi: Add PSI_CPU_FULL state Chengming Zhou
@ 2021-03-03  3:46 ` Chengming Zhou
  2021-03-03 14:56   ` Johannes Weiner
                     ` (3 more replies)
  2021-03-03  3:46 ` [PATCH v2 3/4] psi: pressure states are unlikely Chengming Zhou
                   ` (2 subsequent siblings)
  4 siblings, 4 replies; 23+ messages in thread
From: Chengming Zhou @ 2021-03-03  3:46 UTC (permalink / raw)
  To: hannes, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt
  Cc: linux-kernel, songmuchun, zhouchengming

Move the reclaim detection from the timer tick to the task state
tracking machinery using the recently added ONCPU state. We also
add a check of task psi_flags changes to the psi_task_switch()
optimization so that the parents are updated properly.

In terms of performance and cost, this ONCPU task state tracking is
not cheaper than the previous timer tick in aggregate. But the code
is simpler and shorter this way, so it's a maintainability win.
Johannes did some testing with perf bench; the performance and cost
changes should be acceptable for real workloads.

Thanks to Johannes Weiner for pointing out the psi_task_switch()
optimization and for the clearer changelog.
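
For context, a task's in_memstall flag is set around reclaim sections
via the existing psi_memstall_enter()/psi_memstall_leave() API; a
sketch of a hypothetical caller (do_reclaim_work() is made up):

  unsigned long pflags;

  psi_memstall_enter(&pflags);
  /* While this task is both ONCPU and in_memstall, psi_group_change()
   * now flags its cgroups PSI_MEM_FULL directly, instead of the timer
   * tick sampling it. */
  do_reclaim_work();
  psi_memstall_leave(&pflags);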

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
Updates since v1:
- Fold changes from Johannes that compare task psi_flags instead of
  in_memstall in the psi_task_switch() optimization and move it out
  of the loop
- Include some comments about the performance and cost from Johannes
  too; the detailed bench results can be seen here:
  https://lore.kernel.org/patchwork/patch/1378653/#1577607

 include/linux/psi.h  |  1 -
 kernel/sched/core.c  |  1 -
 kernel/sched/psi.c   | 65 +++++++++++++++++++---------------------------------
 kernel/sched/stats.h |  9 --------
 4 files changed, 24 insertions(+), 52 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index 7361023f3fdd..65eb1476ac70 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -20,7 +20,6 @@ void psi_task_change(struct task_struct *task, int clear, int set);
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep);
 
-void psi_memstall_tick(struct task_struct *task, int cpu);
 void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca2bb629595f..860b006a72bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4544,7 +4544,6 @@ void scheduler_tick(void)
 	update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
 	curr->sched_class->task_tick(rq, curr, 0);
 	calc_global_load_tick(rq);
-	psi_task_tick(rq);
 
 	rq_unlock(rq, &rf);
 
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 2293c45d289d..0fe6ff6a6a15 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -644,8 +644,7 @@ static void poll_timer_fn(struct timer_list *t)
 	wake_up_interruptible(&group->poll_wait);
 }
 
-static void record_times(struct psi_group_cpu *groupc, int cpu,
-			 bool memstall_tick)
+static void record_times(struct psi_group_cpu *groupc, int cpu)
 {
 	u32 delta;
 	u64 now;
@@ -664,23 +663,6 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
 		groupc->times[PSI_MEM_SOME] += delta;
 		if (groupc->state_mask & (1 << PSI_MEM_FULL))
 			groupc->times[PSI_MEM_FULL] += delta;
-		else if (memstall_tick) {
-			u32 sample;
-			/*
-			 * Since we care about lost potential, a
-			 * memstall is FULL when there are no other
-			 * working tasks, but also when the CPU is
-			 * actively reclaiming and nothing productive
-			 * could run even if it were runnable.
-			 *
-			 * When the timer tick sees a reclaiming CPU,
-			 * regardless of runnable tasks, sample a FULL
-			 * tick (or less if it hasn't been a full tick
-			 * since the last state change).
-			 */
-			sample = min(delta, (u32)jiffies_to_nsecs(1));
-			groupc->times[PSI_MEM_FULL] += sample;
-		}
 	}
 
 	if (groupc->state_mask & (1 << PSI_CPU_SOME)) {
@@ -714,7 +696,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	 */
 	write_seqcount_begin(&groupc->seq);
 
-	record_times(groupc, cpu, false);
+	record_times(groupc, cpu);
 
 	for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
 		if (!(m & (1 << t)))
@@ -738,6 +720,18 @@ static void psi_group_change(struct psi_group *group, int cpu,
 		if (test_state(groupc->tasks, s))
 			state_mask |= (1 << s);
 	}
+
+	/*
+	 * Since we care about lost potential, a memstall is FULL
+	 * when there are no other working tasks, but also when
+	 * the CPU is actively reclaiming and nothing productive
+	 * could run even if it were runnable. So when the current
+	 * task in a cgroup is in_memstall, the corresponding groupc
+	 * on that cpu is in PSI_MEM_FULL state.
+	 */
+	if (groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)
+		state_mask |= (1 << PSI_MEM_FULL);
+
 	groupc->state_mask = state_mask;
 
 	write_seqcount_end(&groupc->seq);
@@ -823,17 +817,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	void *iter;
 
 	if (next->pid) {
+		bool identical_state;
+
 		psi_flags_change(next, 0, TSK_ONCPU);
 		/*
-		 * When moving state between tasks, the group that
-		 * contains them both does not change: we can stop
-		 * updating the tree once we reach the first common
-		 * ancestor. Iterate @next's ancestors until we
-		 * encounter @prev's state.
+		 * When switching between tasks that have an identical
+		 * runtime state, the cgroup that contains both tasks
+		 * does not change: we can stop updating the tree once
+		 * we reach the first common ancestor. Iterate @next's
+		 * ancestors only until we encounter @prev's ONCPU.
 		 */
+		identical_state = prev->psi_flags == next->psi_flags;
 		iter = NULL;
 		while ((group = iterate_groups(next, &iter))) {
-			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+			if (identical_state &&
+			    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
 				common = group;
 				break;
 			}
@@ -859,21 +857,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	}
 }
 
-void psi_memstall_tick(struct task_struct *task, int cpu)
-{
-	struct psi_group *group;
-	void *iter = NULL;
-
-	while ((group = iterate_groups(task, &iter))) {
-		struct psi_group_cpu *groupc;
-
-		groupc = per_cpu_ptr(group->pcpu, cpu);
-		write_seqcount_begin(&groupc->seq);
-		record_times(groupc, cpu, true);
-		write_seqcount_end(&groupc->seq);
-	}
-}
-
 /**
  * psi_memstall_enter - mark the beginning of a memory stall section
  * @flags: flags to handle nested sections
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 33d0daf83842..9e4e67a94731 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -144,14 +144,6 @@ static inline void psi_sched_switch(struct task_struct *prev,
 	psi_task_switch(prev, next, sleep);
 }
 
-static inline void psi_task_tick(struct rq *rq)
-{
-	if (static_branch_likely(&psi_disabled))
-		return;
-
-	if (unlikely(rq->curr->in_memstall))
-		psi_memstall_tick(rq->curr, cpu_of(rq));
-}
 #else /* CONFIG_PSI */
 static inline void psi_enqueue(struct task_struct *p, bool wakeup) {}
 static inline void psi_dequeue(struct task_struct *p, bool sleep) {}
@@ -159,7 +151,6 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
 static inline void psi_sched_switch(struct task_struct *prev,
 				    struct task_struct *next,
 				    bool sleep) {}
-static inline void psi_task_tick(struct rq *rq) {}
 #endif /* CONFIG_PSI */
 
 #ifdef CONFIG_SCHED_INFO
-- 
2.11.0



* [PATCH v2 3/4] psi: pressure states are unlikely
  2021-03-03  3:46 [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization Chengming Zhou
  2021-03-03  3:46 ` [PATCH v2 1/4] psi: Add PSI_CPU_FULL state Chengming Zhou
  2021-03-03  3:46 ` [PATCH v2 2/4] psi: Use ONCPU state tracking machinery to detect reclaim Chengming Zhou
@ 2021-03-03  3:46 ` Chengming Zhou
  2021-03-04  9:09   ` [tip: sched/core] psi: Pressure " tip-bot2 for Johannes Weiner
  2021-03-06 11:42   ` tip-bot2 for Johannes Weiner
  2021-03-03  3:46 ` [PATCH v2 4/4] psi: Optimize task switch inside shared cgroups Chengming Zhou
  2021-03-03 14:59 ` [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization Johannes Weiner
  4 siblings, 2 replies; 23+ messages in thread
From: Chengming Zhou @ 2021-03-03  3:46 UTC (permalink / raw)
  To: hannes, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt
  Cc: linux-kernel, songmuchun, zhouchengming

From: Johannes Weiner <hannes@cmpxchg.org>

Move the unlikely branches out of line. This eliminates undesirable
jumps during wakeups and sleeps for workloads that aren't under any
sort of resource pressure.
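
For reference, unlikely() is the standard branch-prediction hint,
simplified from include/linux/compiler.h:

  # define unlikely(x)	__builtin_expect(!!(x), 0)

It tells the compiler to lay out the annotated branch off the hot path.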

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/psi.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 0fe6ff6a6a15..3907a6b847aa 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -219,17 +219,17 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
 {
 	switch (state) {
 	case PSI_IO_SOME:
-		return tasks[NR_IOWAIT];
+		return unlikely(tasks[NR_IOWAIT]);
 	case PSI_IO_FULL:
-		return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
+		return unlikely(tasks[NR_IOWAIT] && !tasks[NR_RUNNING]);
 	case PSI_MEM_SOME:
-		return tasks[NR_MEMSTALL];
+		return unlikely(tasks[NR_MEMSTALL]);
 	case PSI_MEM_FULL:
-		return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
+		return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
 	case PSI_CPU_SOME:
-		return tasks[NR_RUNNING] > tasks[NR_ONCPU];
+		return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
 	case PSI_CPU_FULL:
-		return tasks[NR_RUNNING] && !tasks[NR_ONCPU];
+		return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
 	case PSI_NONIDLE:
 		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
 			tasks[NR_RUNNING];
@@ -729,7 +729,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	 * task in a cgroup is in_memstall, the corresponding groupc
 	 * on that cpu is in PSI_MEM_FULL state.
 	 */
-	if (groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)
+	if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
 		state_mask |= (1 << PSI_MEM_FULL);
 
 	groupc->state_mask = state_mask;
-- 
2.11.0



* [PATCH v2 4/4] psi: Optimize task switch inside shared cgroups
  2021-03-03  3:46 [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization Chengming Zhou
                   ` (2 preceding siblings ...)
  2021-03-03  3:46 ` [PATCH v2 3/4] psi: pressure states are unlikely Chengming Zhou
@ 2021-03-03  3:46 ` Chengming Zhou
  2021-03-03 14:57   ` Johannes Weiner
                     ` (3 more replies)
  2021-03-03 14:59 ` [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization Johannes Weiner
  4 siblings, 4 replies; 23+ messages in thread
From: Chengming Zhou @ 2021-03-03  3:46 UTC (permalink / raw)
  To: hannes, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt
  Cc: linux-kernel, songmuchun, zhouchengming

Commit 36b238d57172 ("psi: Optimize switching tasks inside shared
cgroups") only updates cgroups whose state actually changes during a
task switch in the task preemption case, not in the task sleep case.

We actually don't need to clear and set the TSK_ONCPU state for the
common cgroups of the next and prev tasks in the sleep case; skipping
that saves many psi_group_change() calls, especially when most
activity comes from one leaf cgroup.

sleep before:
psi_dequeue()
  while ((group = iterate_groups(prev)))  # all ancestors
    psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
psi_task_switch()
  while ((group = iterate_groups(next)))  # all ancestors
    psi_group_change(next, .set=TSK_ONCPU)

sleep after:
psi_dequeue()
  nop
psi_task_switch()
  while ((group = iterate_groups(next)))  # until (prev & next)
    psi_group_change(next, .set=TSK_ONCPU)
  while ((group = iterate_groups(prev)))  # all ancestors
    psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)

When a voluntary sleep switches to another task, we remove one call of
psi_group_change() for every common cgroup ancestor of the two tasks.
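
As a worked example (hierarchy assumed for illustration): if prev and
next both run in cgroup /a/b, they share three common ancestors (b, a
and the root group). A voluntary sleep previously made six
psi_group_change() calls: three in psi_dequeue() for prev and three in
psi_task_switch() for next. Now the ONCPU count of the common
ancestors is simply left in place for next, and prev's ancestors are
walked just once with the combined clear mask, for a total of three
calls.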

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
Updates since v1:
- Many improvements in the comments and code from Johannes Weiner.

 kernel/sched/psi.c   | 35 +++++++++++++++++++++++++----------
 kernel/sched/stats.h | 28 ++++++++++++----------------
 2 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 3907a6b847aa..ee3c5b48622f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -840,20 +840,35 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		}
 	}
 
-	/*
-	 * If this is a voluntary sleep, dequeue will have taken care
-	 * of the outgoing TSK_ONCPU alongside TSK_RUNNING already. We
-	 * only need to deal with it during preemption.
-	 */
-	if (sleep)
-		return;
-
 	if (prev->pid) {
-		psi_flags_change(prev, TSK_ONCPU, 0);
+		int clear = TSK_ONCPU, set = 0;
+
+		/*
+		 * When we're going to sleep, psi_dequeue() lets us handle
+		 * TSK_RUNNING and TSK_IOWAIT here, where we can combine it
+		 * with TSK_ONCPU and save walking common ancestors twice.
+		 */
+		if (sleep) {
+			clear |= TSK_RUNNING;
+			if (prev->in_iowait)
+				set |= TSK_IOWAIT;
+		}
+
+		psi_flags_change(prev, clear, set);
 
 		iter = NULL;
 		while ((group = iterate_groups(prev, &iter)) && group != common)
-			psi_group_change(group, cpu, TSK_ONCPU, 0, true);
+			psi_group_change(group, cpu, clear, set, true);
+
+		/*
+		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
+		 * with dequeuing too, finish that for the rest of the hierarchy.
+		 */
+		if (sleep) {
+			clear &= ~TSK_ONCPU;
+			for (; group; group = iterate_groups(prev, &iter))
+				psi_group_change(group, cpu, clear, set, true);
+		}
 	}
 }
 
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 9e4e67a94731..dc218e9f4558 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -84,28 +84,24 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)
 
 static inline void psi_dequeue(struct task_struct *p, bool sleep)
 {
-	int clear = TSK_RUNNING, set = 0;
+	int clear = TSK_RUNNING;
 
 	if (static_branch_likely(&psi_disabled))
 		return;
 
-	if (!sleep) {
-		if (p->in_memstall)
-			clear |= TSK_MEMSTALL;
-	} else {
-		/*
-		 * When a task sleeps, schedule() dequeues it before
-		 * switching to the next one. Merge the clearing of
-		 * TSK_RUNNING and TSK_ONCPU to save an unnecessary
-		 * psi_task_change() call in psi_sched_switch().
-		 */
-		clear |= TSK_ONCPU;
+	/*
+	 * A voluntary sleep is a dequeue followed by a task switch. To
+	 * avoid walking all ancestors twice, psi_task_switch() handles
+	 * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
+	 * Do nothing here.
+	 */
+	if (sleep)
+		return;
 
-		if (p->in_iowait)
-			set |= TSK_IOWAIT;
-	}
+	if (p->in_memstall)
+		clear |= TSK_MEMSTALL;
 
-	psi_task_change(p, clear, set);
+	psi_task_change(p, clear, 0);
 }
 
 static inline void psi_ttwu_dequeue(struct task_struct *p)
-- 
2.11.0



* Re: [PATCH v2 2/4] psi: Use ONCPU state tracking machinery to detect reclaim
  2021-03-03  3:46 ` [PATCH v2 2/4] psi: Use ONCPU state tracking machinery to detect reclaim Chengming Zhou
@ 2021-03-03 14:56   ` Johannes Weiner
  2021-03-03 15:52   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2021-03-03 14:56 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, linux-kernel, songmuchun

On Wed, Mar 03, 2021 at 11:46:57AM +0800, Chengming Zhou wrote:
> Move the reclaim detection from the timer tick to the task state
> tracking machinery using the recently added ONCPU state. We also
> add a check of task psi_flags changes to the psi_task_switch()
> optimization so that the parents are updated properly.
> 
> In terms of performance and cost, this ONCPU task state tracking is
> not cheaper than the previous timer tick in aggregate. But the code
> is simpler and shorter this way, so it's a maintainability win.
> Johannes did some testing with perf bench; the performance and cost
> changes should be acceptable for real workloads.
> 
> Thanks to Johannes Weiner for pointing out the psi_task_switch()
> optimization and for the clearer changelog.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH v2 4/4] psi: Optimize task switch inside shared cgroups
  2021-03-03  3:46 ` [PATCH v2 4/4] psi: Optimize task switch inside shared cgroups Chengming Zhou
@ 2021-03-03 14:57   ` Johannes Weiner
  2021-03-03 15:53   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2021-03-03 14:57 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, linux-kernel, songmuchun

On Wed, Mar 03, 2021 at 11:46:59AM +0800, Chengming Zhou wrote:
> Commit 36b238d57172 ("psi: Optimize switching tasks inside shared
> cgroups") only updates cgroups whose state actually changes during a
> task switch in the task preemption case, not in the task sleep case.
> 
> We actually don't need to clear and set the TSK_ONCPU state for the
> common cgroups of the next and prev tasks in the sleep case; skipping
> that saves many psi_group_change() calls, especially when most
> activity comes from one leaf cgroup.
> 
> sleep before:
> psi_dequeue()
>   while ((group = iterate_groups(prev)))  # all ancestors
>     psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
> psi_task_switch()
>   while ((group = iterate_groups(next)))  # all ancestors
>     psi_group_change(next, .set=TSK_ONCPU)
> 
> sleep after:
> psi_dequeue()
>   nop
> psi_task_switch()
>   while ((group = iterate_groups(next)))  # until (prev & next)
>     psi_group_change(next, .set=TSK_ONCPU)
>   while ((group = iterate_groups(prev)))  # all ancestors
>     psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)
> 
> When a voluntary sleep switches to another task, we remove one call of
> psi_group_change() for every common cgroup ancestor of the two tasks.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization
  2021-03-03  3:46 [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization Chengming Zhou
                   ` (3 preceding siblings ...)
  2021-03-03  3:46 ` [PATCH v2 4/4] psi: Optimize task switch inside shared cgroups Chengming Zhou
@ 2021-03-03 14:59 ` Johannes Weiner
  2021-03-03 15:32   ` Peter Zijlstra
  4 siblings, 1 reply; 23+ messages in thread
From: Johannes Weiner @ 2021-03-03 14:59 UTC (permalink / raw)
  To: peterz
  Cc: Chengming Zhou, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, linux-kernel, songmuchun

On Wed, Mar 03, 2021 at 11:46:55AM +0800, Chengming Zhou wrote:
> This patch series is a RESEND of the previous patches for the psi
> subsystem. A few weeks have passed since the last review, so I have
> put them together and resent them for easier review and merging.
> 
> Patch 1 adds the PSI_CPU_FULL state, which means all non-idle tasks
> in a cgroup are delayed on the CPU resource, which is used by others
> outside of the cgroup or throttled by the cgroup cpu.max configuration.
> 
> Patch 2 uses the ONCPU state and the current task's in_memstall flag
> to detect reclaim, removing the hook in the timer tick to make the
> code more concise and maintainable. And patch 3 adds unlikely()
> annotations to move the pressure state branches out of line,
> eliminating undesirable jumps during wakeups and sleeps.
> 
> Patch 4 optimizes the voluntary sleep switch by removing one call of
> psi_group_change() for every common cgroup ancestor of the two tasks.
> 
> Chengming Zhou (3):
>   psi: Add PSI_CPU_FULL state
>   psi: Use ONCPU state tracking machinery to detect reclaim
>   psi: Optimize task switch inside shared cgroups
> 
> Johannes Weiner (1):
>   psi: pressure states are unlikely

Peter, would you mind routing these through the sched tree for 5.13?


* Re: [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization
  2021-03-03 14:59 ` [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization Johannes Weiner
@ 2021-03-03 15:32   ` Peter Zijlstra
  2021-03-03 16:05     ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2021-03-03 15:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Chengming Zhou, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, linux-kernel, songmuchun

On Wed, Mar 03, 2021 at 09:59:11AM -0500, Johannes Weiner wrote:
> On Wed, Mar 03, 2021 at 11:46:55AM +0800, Chengming Zhou wrote:
> > This patch series is a RESEND of the previous patches for the psi
> > subsystem. A few weeks have passed since the last review, so I have
> > put them together and resent them for easier review and merging.
> > 
> > Patch 1 adds the PSI_CPU_FULL state, which means all non-idle tasks
> > in a cgroup are delayed on the CPU resource, which is used by others
> > outside of the cgroup or throttled by the cgroup cpu.max configuration.
> > 
> > Patch 2 uses the ONCPU state and the current task's in_memstall flag
> > to detect reclaim, removing the hook in the timer tick to make the
> > code more concise and maintainable. And patch 3 adds unlikely()
> > annotations to move the pressure state branches out of line,
> > eliminating undesirable jumps during wakeups and sleeps.
> > 
> > Patch 4 optimizes the voluntary sleep switch by removing one call of
> > psi_group_change() for every common cgroup ancestor of the two tasks.
> > 
> > Chengming Zhou (3):
> >   psi: Add PSI_CPU_FULL state
> >   psi: Use ONCPU state tracking machinery to detect reclaim
> >   psi: Optimize task switch inside shared cgroups
> > 
> > Johannes Weiner (1):
> >   psi: pressure states are unlikely
> 
> Peter, would you mind routing these through the sched tree for 5.13?

Yes, I can do that. Thanks!


* Re: [PATCH v2 2/4] psi: Use ONCPU state tracking machinery to detect reclaim
  2021-03-03  3:46 ` [PATCH v2 2/4] psi: Use ONCPU state tracking machinery to detect reclaim Chengming Zhou
  2021-03-03 14:56   ` Johannes Weiner
@ 2021-03-03 15:52   ` Peter Zijlstra
  2021-03-04  9:09   ` [tip: sched/core] " tip-bot2 for Chengming Zhou
  2021-03-06 11:42   ` tip-bot2 for Chengming Zhou
  3 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2021-03-03 15:52 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: hannes, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, linux-kernel, songmuchun

On Wed, Mar 03, 2021 at 11:46:57AM +0800, Chengming Zhou wrote:
> Move the reclaim detection from the timer tick to the task state
> tracking machinery using the recently added ONCPU state. We also
> add a check of task psi_flags changes to the psi_task_switch()
> optimization so that the parents are updated properly.
> 
> In terms of performance and cost, this ONCPU task state tracking is
> not cheaper than the previous timer tick in aggregate. But the code
> is simpler and shorter this way, so it's a maintainability win.
> Johannes did some testing with perf bench; the performance and cost
> changes should be acceptable for real workloads.
> 
> Thanks to Johannes Weiner for pointing out the psi_task_switch()
> optimization and for the clearer changelog.
> 
Co-developed-by: Muchun ?
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>


* Re: [PATCH v2 4/4] psi: Optimize task switch inside shared cgroups
  2021-03-03  3:46 ` [PATCH v2 4/4] psi: Optimize task switch inside shared cgroups Chengming Zhou
  2021-03-03 14:57   ` Johannes Weiner
@ 2021-03-03 15:53   ` Peter Zijlstra
  2021-03-03 16:01     ` [External] " Muchun Song
  2021-03-04  9:09   ` [tip: sched/core] " tip-bot2 for Chengming Zhou
  2021-03-06 11:42   ` tip-bot2 for Chengming Zhou
  3 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2021-03-03 15:53 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: hannes, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, linux-kernel, songmuchun

On Wed, Mar 03, 2021 at 11:46:59AM +0800, Chengming Zhou wrote:
> Commit 36b238d57172 ("psi: Optimize switching tasks inside shared
> cgroups") only updates cgroups whose state actually changes during a
> task switch in the task preemption case, not in the task sleep case.
> 
> We actually don't need to clear and set the TSK_ONCPU state for the
> common cgroups of the next and prev tasks in the sleep case; skipping
> that saves many psi_group_change() calls, especially when most
> activity comes from one leaf cgroup.
> 
> sleep before:
> psi_dequeue()
>   while ((group = iterate_groups(prev)))  # all ancestors
>     psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
> psi_task_switch()
>   while ((group = iterate_groups(next)))  # all ancestors
>     psi_group_change(next, .set=TSK_ONCPU)
> 
> sleep after:
> psi_dequeue()
>   nop
> psi_task_switch()
>   while ((group = iterate_groups(next)))  # until (prev & next)
>     psi_group_change(next, .set=TSK_ONCPU)
>   while ((group = iterate_groups(prev)))  # all ancestors
>     psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)
> 
> When a voluntary sleep switches to another task, we remove one call of
> psi_group_change() for every common cgroup ancestor of the two tasks.
> 
Co-developed-by: Muchun ?
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>


* Re: [External] Re: [PATCH v2 4/4] psi: Optimize task switch inside shared cgroups
  2021-03-03 15:53   ` Peter Zijlstra
@ 2021-03-03 16:01     ` Muchun Song
  0 siblings, 0 replies; 23+ messages in thread
From: Muchun Song @ 2021-03-03 16:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chengming Zhou, Johannes Weiner, Ingo Molnar, Juri Lelli,
	Vincent Guittot, dietmar.eggemann, Steven Rostedt, LKML

On Wed, Mar 3, 2021 at 11:53 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Mar 03, 2021 at 11:46:59AM +0800, Chengming Zhou wrote:
> > Commit 36b238d57172 ("psi: Optimize switching tasks inside shared
> > cgroups") only updates cgroups whose state actually changes during a
> > task switch in the task preemption case, not in the task sleep case.
> >
> > We actually don't need to clear and set the TSK_ONCPU state for the
> > common cgroups of the next and prev tasks in the sleep case; skipping
> > that saves many psi_group_change() calls, especially when most
> > activity comes from one leaf cgroup.
> >
> > sleep before:
> > psi_dequeue()
> >   while ((group = iterate_groups(prev)))  # all ancestors
> >     psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
> > psi_task_switch()
> >   while ((group = iterate_groups(next)))  # all ancestors
> >     psi_group_change(next, .set=TSK_ONCPU)
> >
> > sleep after:
> > psi_dequeue()
> >   nop
> > psi_task_switch()
> >   while ((group = iterate_groups(next)))  # until (prev & next)
> >     psi_group_change(next, .set=TSK_ONCPU)
> >   while ((group = iterate_groups(prev)))  # all ancestors
> >     psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)
> >
> > When a voluntary sleep switches to another task, we remove one call of
> > psi_group_change() for every common cgroup ancestor of the two tasks.
> >
> Co-developed-by: Muchun ?

Yes. Should Chengming send another version of the patches to add the Co-developed-by tag?

> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>


* Re: [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization
  2021-03-03 15:32   ` Peter Zijlstra
@ 2021-03-03 16:05     ` Peter Zijlstra
  2021-03-03 19:38       ` Johannes Weiner
  2021-03-04  0:14       ` [External] " Chengming Zhou
  0 siblings, 2 replies; 23+ messages in thread
From: Peter Zijlstra @ 2021-03-03 16:05 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Chengming Zhou, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, linux-kernel, songmuchun

On Wed, Mar 03, 2021 at 04:32:18PM +0100, Peter Zijlstra wrote:

> Yes, I can do that. Thanks!

Please double check the patches as found here:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core

I've manually edited the tags.


* Re: [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization
  2021-03-03 16:05     ` Peter Zijlstra
@ 2021-03-03 19:38       ` Johannes Weiner
  2021-03-04  0:14       ` [External] " Chengming Zhou
  1 sibling, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2021-03-03 19:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chengming Zhou, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, linux-kernel, songmuchun

On Wed, Mar 03, 2021 at 05:05:15PM +0100, Peter Zijlstra wrote:
> On Wed, Mar 03, 2021 at 04:32:18PM +0100, Peter Zijlstra wrote:
> 
> > Yes, I can do that. Thanks!
> 
> Please double check the patches as found here:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core
> 
> I've manually edited the tags.

Looks good to me, thanks!


* Re: [External] Re: [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization
  2021-03-03 16:05     ` Peter Zijlstra
  2021-03-03 19:38       ` Johannes Weiner
@ 2021-03-04  0:14       ` Chengming Zhou
  1 sibling, 0 replies; 23+ messages in thread
From: Chengming Zhou @ 2021-03-04  0:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Johannes Weiner, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, linux-kernel, songmuchun

On 2021/3/4 at 12:05 AM, Peter Zijlstra wrote:

> On Wed, Mar 03, 2021 at 04:32:18PM +0100, Peter Zijlstra wrote:
>
>> Yes, I can do that. Thanks!
> Please double check the patches as found here:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core
>
> I've manually edited the tags.
Looks good to me, thanks!


* [tip: sched/core] psi: Pressure states are unlikely
  2021-03-03  3:46 ` [PATCH v2 3/4] psi: pressure states are unlikely Chengming Zhou
@ 2021-03-04  9:09   ` tip-bot2 for Johannes Weiner
  2021-03-06 11:42   ` tip-bot2 for Johannes Weiner
  1 sibling, 0 replies; 23+ messages in thread
From: tip-bot2 for Johannes Weiner @ 2021-03-04  9:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Johannes Weiner, Chengming Zhou, Peter Zijlstra (Intel),
	x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     24f3cb558f59debe0e9159459bb9627b51b47c17
Gitweb:        https://git.kernel.org/tip/24f3cb558f59debe0e9159459bb9627b51b47c17
Author:        Johannes Weiner <hannes@cmpxchg.org>
AuthorDate:    Wed, 03 Mar 2021 11:46:58 +08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 04 Mar 2021 09:56:02 +01:00

psi: Pressure states are unlikely

Move the unlikely branches out of line. This eliminates undesirable
jumps during wakeups and sleeps for workloads that aren't under any
sort of resource pressure.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-4-zhouchengming@bytedance.com
---
 kernel/sched/psi.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 0fe6ff6..3907a6b 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -219,17 +219,17 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
 {
 	switch (state) {
 	case PSI_IO_SOME:
-		return tasks[NR_IOWAIT];
+		return unlikely(tasks[NR_IOWAIT]);
 	case PSI_IO_FULL:
-		return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
+		return unlikely(tasks[NR_IOWAIT] && !tasks[NR_RUNNING]);
 	case PSI_MEM_SOME:
-		return tasks[NR_MEMSTALL];
+		return unlikely(tasks[NR_MEMSTALL]);
 	case PSI_MEM_FULL:
-		return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
+		return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
 	case PSI_CPU_SOME:
-		return tasks[NR_RUNNING] > tasks[NR_ONCPU];
+		return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
 	case PSI_CPU_FULL:
-		return tasks[NR_RUNNING] && !tasks[NR_ONCPU];
+		return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
 	case PSI_NONIDLE:
 		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
 			tasks[NR_RUNNING];
@@ -729,7 +729,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	 * task in a cgroup is in_memstall, the corresponding groupc
 	 * on that cpu is in PSI_MEM_FULL state.
 	 */
-	if (groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)
+	if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
 		state_mask |= (1 << PSI_MEM_FULL);
 
 	groupc->state_mask = state_mask;


* [tip: sched/core] psi: Optimize task switch inside shared cgroups
  2021-03-03  3:46 ` [PATCH v2 4/4] psi: Optimize task switch inside shared cgroups Chengming Zhou
  2021-03-03 14:57   ` Johannes Weiner
  2021-03-03 15:53   ` Peter Zijlstra
@ 2021-03-04  9:09   ` tip-bot2 for Chengming Zhou
  2021-03-06 11:42   ` tip-bot2 for Chengming Zhou
  3 siblings, 0 replies; 23+ messages in thread
From: tip-bot2 for Chengming Zhou @ 2021-03-04  9:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Muchun Song, Chengming Zhou, Peter Zijlstra (Intel),
	Johannes Weiner, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     e6560d58334ca463061ade733674abc8dd0df9bd
Gitweb:        https://git.kernel.org/tip/e6560d58334ca463061ade733674abc8dd0df9bd
Author:        Chengming Zhou <zhouchengming@bytedance.com>
AuthorDate:    Wed, 03 Mar 2021 11:46:59 +08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 04 Mar 2021 09:56:02 +01:00

psi: Optimize task switch inside shared cgroups

Commit 36b238d57172 ("psi: Optimize switching tasks inside shared
cgroups") only updates cgroups whose state actually changes during a
task switch in the task preemption case, not in the task sleep case.

We actually don't need to clear and set the TSK_ONCPU state for the
common cgroups of the next and prev tasks in the sleep case; skipping
that saves many psi_group_change() calls, especially when most
activity comes from one leaf cgroup.

sleep before:
psi_dequeue()
  while ((group = iterate_groups(prev)))  # all ancestors
    psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
psi_task_switch()
  while ((group = iterate_groups(next)))  # all ancestors
    psi_group_change(next, .set=TSK_ONCPU)

sleep after:
psi_dequeue()
  nop
psi_task_switch()
  while ((group = iterate_groups(next)))  # until (prev & next)
    psi_group_change(next, .set=TSK_ONCPU)
  while ((group = iterate_groups(prev)))  # all ancestors
    psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)

When a voluntary sleep switches to another task, we remove one call of
psi_group_change() for every common cgroup ancestor of the two tasks.

Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-5-zhouchengming@bytedance.com
---
 kernel/sched/psi.c   | 35 +++++++++++++++++++++++++----------
 kernel/sched/stats.h | 28 ++++++++++++----------------
 2 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 3907a6b..ee3c5b4 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -840,20 +840,35 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		}
 	}
 
-	/*
-	 * If this is a voluntary sleep, dequeue will have taken care
-	 * of the outgoing TSK_ONCPU alongside TSK_RUNNING already. We
-	 * only need to deal with it during preemption.
-	 */
-	if (sleep)
-		return;
-
 	if (prev->pid) {
-		psi_flags_change(prev, TSK_ONCPU, 0);
+		int clear = TSK_ONCPU, set = 0;
+
+		/*
+		 * When we're going to sleep, psi_dequeue() lets us handle
+		 * TSK_RUNNING and TSK_IOWAIT here, where we can combine it
+		 * with TSK_ONCPU and save walking common ancestors twice.
+		 */
+		if (sleep) {
+			clear |= TSK_RUNNING;
+			if (prev->in_iowait)
+				set |= TSK_IOWAIT;
+		}
+
+		psi_flags_change(prev, clear, set);
 
 		iter = NULL;
 		while ((group = iterate_groups(prev, &iter)) && group != common)
-			psi_group_change(group, cpu, TSK_ONCPU, 0, true);
+			psi_group_change(group, cpu, clear, set, true);
+
+		/*
+		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
+		 * with dequeuing too, finish that for the rest of the hierarchy.
+		 */
+		if (sleep) {
+			clear &= ~TSK_ONCPU;
+			for (; group; group = iterate_groups(prev, &iter))
+				psi_group_change(group, cpu, clear, set, true);
+		}
 	}
 }
 
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 9e4e67a..dc218e9 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -84,28 +84,24 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)
 
 static inline void psi_dequeue(struct task_struct *p, bool sleep)
 {
-	int clear = TSK_RUNNING, set = 0;
+	int clear = TSK_RUNNING;
 
 	if (static_branch_likely(&psi_disabled))
 		return;
 
-	if (!sleep) {
-		if (p->in_memstall)
-			clear |= TSK_MEMSTALL;
-	} else {
-		/*
-		 * When a task sleeps, schedule() dequeues it before
-		 * switching to the next one. Merge the clearing of
-		 * TSK_RUNNING and TSK_ONCPU to save an unnecessary
-		 * psi_task_change() call in psi_sched_switch().
-		 */
-		clear |= TSK_ONCPU;
+	/*
+	 * A voluntary sleep is a dequeue followed by a task switch. To
+	 * avoid walking all ancestors twice, psi_task_switch() handles
+	 * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
+	 * Do nothing here.
+	 */
+	if (sleep)
+		return;
 
-		if (p->in_iowait)
-			set |= TSK_IOWAIT;
-	}
+	if (p->in_memstall)
+		clear |= TSK_MEMSTALL;
 
-	psi_task_change(p, clear, set);
+	psi_task_change(p, clear, 0);
 }
 
 static inline void psi_ttwu_dequeue(struct task_struct *p)


* [tip: sched/core] psi: Add PSI_CPU_FULL state
  2021-03-03  3:46 ` [PATCH v2 1/4] psi: Add PSI_CPU_FULL state Chengming Zhou
@ 2021-03-04  9:09   ` tip-bot2 for Chengming Zhou
  2021-03-06 11:42   ` tip-bot2 for Chengming Zhou
  1 sibling, 0 replies; 23+ messages in thread
From: tip-bot2 for Chengming Zhou @ 2021-03-04  9:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Muchun Song, Chengming Zhou, Peter Zijlstra (Intel),
	Johannes Weiner, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     311b293811a31929c72c790eff48cf767561589f
Gitweb:        https://git.kernel.org/tip/311b293811a31929c72c790eff48cf767561589f
Author:        Chengming Zhou <zhouchengming@bytedance.com>
AuthorDate:    Wed, 03 Mar 2021 11:46:56 +08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 04 Mar 2021 09:56:01 +01:00

psi: Add PSI_CPU_FULL state

The FULL state doesn't exist for the CPU resource at the system level,
but it does exist at the cgroup level: all non-idle tasks in a cgroup
are delayed on the CPU resource, which is used by others outside of
the cgroup or throttled by the cgroup cpu.max configuration.

Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-2-zhouchengming@bytedance.com
---
 include/linux/psi_types.h |  3 ++-
 kernel/sched/psi.c        | 14 +++++++++++---
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index b95f321..0a23300 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -50,9 +50,10 @@ enum psi_states {
 	PSI_MEM_SOME,
 	PSI_MEM_FULL,
 	PSI_CPU_SOME,
+	PSI_CPU_FULL,
 	/* Only per-CPU, to weigh the CPU in the global average: */
 	PSI_NONIDLE,
-	NR_PSI_STATES = 6,
+	NR_PSI_STATES = 7,
 };
 
 enum psi_aggregators {
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 967732c..2293c45 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -34,7 +34,10 @@
  * delayed on that resource such that nobody is advancing and the CPU
  * goes idle. This leaves both workload and CPU unproductive.
  *
- * (Naturally, the FULL state doesn't exist for the CPU resource.)
+ * Naturally, the FULL state doesn't exist for the CPU resource at the
+ * system level, but it does at the cgroup level: all non-idle tasks in
+ * a cgroup are delayed on the CPU resource, which is used by others
+ * outside of the cgroup or throttled by the cgroup cpu.max configuration.
  *
  *	SOME = nr_delayed_tasks != 0
  *	FULL = nr_delayed_tasks != 0 && nr_running_tasks == 0
@@ -225,6 +228,8 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
 		return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
 	case PSI_CPU_SOME:
 		return tasks[NR_RUNNING] > tasks[NR_ONCPU];
+	case PSI_CPU_FULL:
+		return tasks[NR_RUNNING] && !tasks[NR_ONCPU];
 	case PSI_NONIDLE:
 		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
 			tasks[NR_RUNNING];
@@ -678,8 +683,11 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
 		}
 	}
 
-	if (groupc->state_mask & (1 << PSI_CPU_SOME))
+	if (groupc->state_mask & (1 << PSI_CPU_SOME)) {
 		groupc->times[PSI_CPU_SOME] += delta;
+		if (groupc->state_mask & (1 << PSI_CPU_FULL))
+			groupc->times[PSI_CPU_FULL] += delta;
+	}
 
 	if (groupc->state_mask & (1 << PSI_NONIDLE))
 		groupc->times[PSI_NONIDLE] += delta;
@@ -1018,7 +1026,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 		group->avg_next_update = update_averages(group, now);
 	mutex_unlock(&group->avgs_lock);
 
-	for (full = 0; full < 2 - (res == PSI_CPU); full++) {
+	for (full = 0; full < 2; full++) {
 		unsigned long avg[3];
 		u64 total;
 		int w;


* [tip: sched/core] psi: Use ONCPU state tracking machinery to detect reclaim
  2021-03-03  3:46 ` [PATCH v2 2/4] psi: Use ONCPU state tracking machinery to detect reclaim Chengming Zhou
  2021-03-03 14:56   ` Johannes Weiner
  2021-03-03 15:52   ` Peter Zijlstra
@ 2021-03-04  9:09   ` tip-bot2 for Chengming Zhou
  2021-03-06 11:42   ` tip-bot2 for Chengming Zhou
  3 siblings, 0 replies; 23+ messages in thread
From: tip-bot2 for Chengming Zhou @ 2021-03-04  9:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Muchun Song, Chengming Zhou, Peter Zijlstra (Intel),
	Johannes Weiner, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     f3f7feec57b9141dfed9825874d0191b1ac18ad2
Gitweb:        https://git.kernel.org/tip/f3f7feec57b9141dfed9825874d0191b1ac18ad2
Author:        Chengming Zhou <zhouchengming@bytedance.com>
AuthorDate:    Wed, 03 Mar 2021 11:46:57 +08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 04 Mar 2021 09:56:01 +01:00

psi: Use ONCPU state tracking machinery to detect reclaim

Move the reclaim detection from the timer tick to the task state
tracking machinery using the recently added ONCPU state. We also
add a check of task psi_flags changes to the psi_task_switch()
optimization so that the parents are updated properly.

In terms of performance and cost, this ONCPU task state tracking is
not cheaper than the previous timer tick in aggregate. But the code
is simpler and shorter this way, so it's a maintainability win.
Johannes did some testing with perf bench; the performance and cost
changes should be acceptable for real workloads.

Thanks to Johannes Weiner for pointing out the psi_task_switch()
optimization and for the clearer changelog.

Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-3-zhouchengming@bytedance.com
---
 include/linux/psi.h  |  1 +-
 kernel/sched/core.c  |  1 +-
 kernel/sched/psi.c   | 65 +++++++++++++++----------------------------
 kernel/sched/stats.h |  9 +------
 4 files changed, 24 insertions(+), 52 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index 7361023..65eb147 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -20,7 +20,6 @@ void psi_task_change(struct task_struct *task, int clear, int set);
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep);
 
-void psi_memstall_tick(struct task_struct *task, int cpu);
 void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 361974e..d2629fd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4551,7 +4551,6 @@ void scheduler_tick(void)
 	update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
 	curr->sched_class->task_tick(rq, curr, 0);
 	calc_global_load_tick(rq);
-	psi_task_tick(rq);
 
 	rq_unlock(rq, &rf);
 
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 2293c45..0fe6ff6 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -644,8 +644,7 @@ static void poll_timer_fn(struct timer_list *t)
 	wake_up_interruptible(&group->poll_wait);
 }
 
-static void record_times(struct psi_group_cpu *groupc, int cpu,
-			 bool memstall_tick)
+static void record_times(struct psi_group_cpu *groupc, int cpu)
 {
 	u32 delta;
 	u64 now;
@@ -664,23 +663,6 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
 		groupc->times[PSI_MEM_SOME] += delta;
 		if (groupc->state_mask & (1 << PSI_MEM_FULL))
 			groupc->times[PSI_MEM_FULL] += delta;
-		else if (memstall_tick) {
-			u32 sample;
-			/*
-			 * Since we care about lost potential, a
-			 * memstall is FULL when there are no other
-			 * working tasks, but also when the CPU is
-			 * actively reclaiming and nothing productive
-			 * could run even if it were runnable.
-			 *
-			 * When the timer tick sees a reclaiming CPU,
-			 * regardless of runnable tasks, sample a FULL
-			 * tick (or less if it hasn't been a full tick
-			 * since the last state change).
-			 */
-			sample = min(delta, (u32)jiffies_to_nsecs(1));
-			groupc->times[PSI_MEM_FULL] += sample;
-		}
 	}
 
 	if (groupc->state_mask & (1 << PSI_CPU_SOME)) {
@@ -714,7 +696,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	 */
 	write_seqcount_begin(&groupc->seq);
 
-	record_times(groupc, cpu, false);
+	record_times(groupc, cpu);
 
 	for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
 		if (!(m & (1 << t)))
@@ -738,6 +720,18 @@ static void psi_group_change(struct psi_group *group, int cpu,
 		if (test_state(groupc->tasks, s))
 			state_mask |= (1 << s);
 	}
+
+	/*
+	 * Since we care about lost potential, a memstall is FULL
+	 * when there are no other working tasks, but also when
+	 * the CPU is actively reclaiming and nothing productive
+	 * could run even if it were runnable. So when the current
+	 * task in a cgroup is in_memstall, the corresponding groupc
+	 * on that cpu is in PSI_MEM_FULL state.
+	 */
+	if (groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)
+		state_mask |= (1 << PSI_MEM_FULL);
+
 	groupc->state_mask = state_mask;
 
 	write_seqcount_end(&groupc->seq);
@@ -823,17 +817,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	void *iter;
 
 	if (next->pid) {
+		bool identical_state;
+
 		psi_flags_change(next, 0, TSK_ONCPU);
 		/*
-		 * When moving state between tasks, the group that
-		 * contains them both does not change: we can stop
-		 * updating the tree once we reach the first common
-		 * ancestor. Iterate @next's ancestors until we
-		 * encounter @prev's state.
+		 * When switching between tasks that have an identical
+		 * runtime state, the cgroup that contains both tasks
+		 * does not change: we can stop updating the tree once
+		 * we reach the first common ancestor. Iterate @next's
+		 * ancestors only until we encounter @prev's ONCPU.
 		 */
+		identical_state = prev->psi_flags == next->psi_flags;
 		iter = NULL;
 		while ((group = iterate_groups(next, &iter))) {
-			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+			if (identical_state &&
+			    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
 				common = group;
 				break;
 			}
@@ -859,21 +857,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	}
 }
 
-void psi_memstall_tick(struct task_struct *task, int cpu)
-{
-	struct psi_group *group;
-	void *iter = NULL;
-
-	while ((group = iterate_groups(task, &iter))) {
-		struct psi_group_cpu *groupc;
-
-		groupc = per_cpu_ptr(group->pcpu, cpu);
-		write_seqcount_begin(&groupc->seq);
-		record_times(groupc, cpu, true);
-		write_seqcount_end(&groupc->seq);
-	}
-}
-
 /**
  * psi_memstall_enter - mark the beginning of a memory stall section
  * @flags: flags to handle nested sections
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 33d0daf..9e4e67a 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -144,14 +144,6 @@ static inline void psi_sched_switch(struct task_struct *prev,
 	psi_task_switch(prev, next, sleep);
 }
 
-static inline void psi_task_tick(struct rq *rq)
-{
-	if (static_branch_likely(&psi_disabled))
-		return;
-
-	if (unlikely(rq->curr->in_memstall))
-		psi_memstall_tick(rq->curr, cpu_of(rq));
-}
 #else /* CONFIG_PSI */
 static inline void psi_enqueue(struct task_struct *p, bool wakeup) {}
 static inline void psi_dequeue(struct task_struct *p, bool sleep) {}
@@ -159,7 +151,6 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
 static inline void psi_sched_switch(struct task_struct *prev,
 				    struct task_struct *next,
 				    bool sleep) {}
-static inline void psi_task_tick(struct rq *rq) {}
 #endif /* CONFIG_PSI */
 
 #ifdef CONFIG_SCHED_INFO

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip: sched/core] psi: Pressure states are unlikely
  2021-03-03  3:46 ` [PATCH v2 3/4] psi: pressure states are unlikely Chengming Zhou
  2021-03-04  9:09   ` [tip: sched/core] psi: Pressure " tip-bot2 for Johannes Weiner
@ 2021-03-06 11:42   ` tip-bot2 for Johannes Weiner
  1 sibling, 0 replies; 23+ messages in thread
From: tip-bot2 for Johannes Weiner @ 2021-03-06 11:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Johannes Weiner, Chengming Zhou, Peter Zijlstra (Intel),
	Ingo Molnar, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     fddc8bab531e217806b84906681324377d741c6c
Gitweb:        https://git.kernel.org/tip/fddc8bab531e217806b84906681324377d741c6c
Author:        Johannes Weiner <hannes@cmpxchg.org>
AuthorDate:    Wed, 03 Mar 2021 11:46:58 +08:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Sat, 06 Mar 2021 12:40:23 +01:00

psi: Pressure states are unlikely

Move the unlikely branches out of line. This eliminates undesirable
jumps during wakeups and sleeps for workloads that aren't under any
sort of resource pressure.
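
For reference, unlikely() is a thin wrapper around GCC's
__builtin_expect(), which lets the compiler keep the annotated branch
out of the straight-line path. A minimal sketch of the idea (the macro
mirrors include/linux/compiler.h, while the surrounding function is a
made-up example, not kernel code):

/* simplified from include/linux/compiler.h */
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* hypothetical caller: the common no-pressure path stays
 * straight-line, the pressure-handling path is laid out
 * out of line by the compiler */
static int io_under_pressure(const unsigned int *tasks)
{
	if (unlikely(tasks[0] /* NR_IOWAIT */))
		return 1;	/* cold: only taken under pressure */
	return 0;		/* hot: falls through, no jump */
}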

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-4-zhouchengming@bytedance.com
---
 kernel/sched/psi.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 0fe6ff6..3907a6b 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -219,17 +219,17 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
 {
 	switch (state) {
 	case PSI_IO_SOME:
-		return tasks[NR_IOWAIT];
+		return unlikely(tasks[NR_IOWAIT]);
 	case PSI_IO_FULL:
-		return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
+		return unlikely(tasks[NR_IOWAIT] && !tasks[NR_RUNNING]);
 	case PSI_MEM_SOME:
-		return tasks[NR_MEMSTALL];
+		return unlikely(tasks[NR_MEMSTALL]);
 	case PSI_MEM_FULL:
-		return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
+		return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
 	case PSI_CPU_SOME:
-		return tasks[NR_RUNNING] > tasks[NR_ONCPU];
+		return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
 	case PSI_CPU_FULL:
-		return tasks[NR_RUNNING] && !tasks[NR_ONCPU];
+		return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
 	case PSI_NONIDLE:
 		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
 			tasks[NR_RUNNING];
@@ -729,7 +729,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	 * task in a cgroup is in_memstall, the corresponding groupc
 	 * on that cpu is in PSI_MEM_FULL state.
 	 */
-	if (groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)
+	if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
 		state_mask |= (1 << PSI_MEM_FULL);
 
 	groupc->state_mask = state_mask;

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip: sched/core] psi: Optimize task switch inside shared cgroups
  2021-03-03  3:46 ` [PATCH v2 4/4] psi: Optimize task switch inside shared cgroups Chengming Zhou
                     ` (2 preceding siblings ...)
  2021-03-04  9:09   ` [tip: sched/core] " tip-bot2 for Chengming Zhou
@ 2021-03-06 11:42   ` tip-bot2 for Chengming Zhou
  3 siblings, 0 replies; 23+ messages in thread
From: tip-bot2 for Chengming Zhou @ 2021-03-06 11:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Muchun Song, Chengming Zhou, Peter Zijlstra (Intel),
	Ingo Molnar, Johannes Weiner, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     4117cebf1a9fcbf35b9aabf0e37b6c5eea296798
Gitweb:        https://git.kernel.org/tip/4117cebf1a9fcbf35b9aabf0e37b6c5eea296798
Author:        Chengming Zhou <zhouchengming@bytedance.com>
AuthorDate:    Wed, 03 Mar 2021 11:46:59 +08:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Sat, 06 Mar 2021 12:40:23 +01:00

psi: Optimize task switch inside shared cgroups

Commit 36b238d57172 ("psi: Optimize switching tasks inside shared
cgroups") only updates cgroups whose state actually changes during
a task switch in the preempt case, not in the sleep case.

In the sleep case, we don't actually need to clear and set the
TSK_ONCPU state for the common cgroups of the prev and next tasks;
skipping that saves many psi_group_change() calls, especially when
most activity comes from one leaf cgroup.

sleep before:
psi_dequeue()
  while ((group = iterate_groups(prev)))  # all ancestors
    psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
psi_task_switch()
  while ((group = iterate_groups(next)))  # all ancestors
    psi_group_change(next, .set=TSK_ONCPU)

sleep after:
psi_dequeue()
  nop
psi_task_switch()
  while ((group = iterate_groups(next)))  # until (prev & next)
    psi_group_change(next, .set=TSK_ONCPU)
  while ((group = iterate_groups(prev)))  # all ancestors
    psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)

When a voluntary sleep switches to another task, we remove one call of
psi_group_change() for every common cgroup ancestor of the two tasks.
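
To make the new walk concrete, here is a simplified sketch of the
sleep path in psi_task_switch() after this change (locking, the
psi_flags_change() bookkeeping, and the @next walk are omitted; see
the diff below for the real code):

/* sketch: prev goes to sleep; @common is the first ancestor it
 * shares with @next, found while walking @next's groups */
int clear = TSK_ONCPU | TSK_RUNNING;
int set = prev->in_iowait ? TSK_IOWAIT : 0;
void *iter = NULL;
struct psi_group *group;

/* below the common ancestor, prev stops running entirely */
while ((group = iterate_groups(prev, &iter)) && group != common)
	psi_group_change(group, cpu, clear, set, true);

/* at and above the common ancestor, ONCPU simply moved over to
 * @next, so only the dequeue-related bits still need clearing */
clear &= ~TSK_ONCPU;
for (; group; group = iterate_groups(prev, &iter))
	psi_group_change(group, cpu, clear, set, true);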

Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-5-zhouchengming@bytedance.com
---
 kernel/sched/psi.c   | 35 +++++++++++++++++++++++++----------
 kernel/sched/stats.h | 28 ++++++++++++----------------
 2 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 3907a6b..ee3c5b4 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -840,20 +840,35 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		}
 	}
 
-	/*
-	 * If this is a voluntary sleep, dequeue will have taken care
-	 * of the outgoing TSK_ONCPU alongside TSK_RUNNING already. We
-	 * only need to deal with it during preemption.
-	 */
-	if (sleep)
-		return;
-
 	if (prev->pid) {
-		psi_flags_change(prev, TSK_ONCPU, 0);
+		int clear = TSK_ONCPU, set = 0;
+
+		/*
+		 * When we're going to sleep, psi_dequeue() lets us handle
+		 * TSK_RUNNING and TSK_IOWAIT here, where we can combine it
+		 * with TSK_ONCPU and save walking common ancestors twice.
+		 */
+		if (sleep) {
+			clear |= TSK_RUNNING;
+			if (prev->in_iowait)
+				set |= TSK_IOWAIT;
+		}
+
+		psi_flags_change(prev, clear, set);
 
 		iter = NULL;
 		while ((group = iterate_groups(prev, &iter)) && group != common)
-			psi_group_change(group, cpu, TSK_ONCPU, 0, true);
+			psi_group_change(group, cpu, clear, set, true);
+
+		/*
+		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
+		 * with dequeuing too, finish that for the rest of the hierarchy.
+		 */
+		if (sleep) {
+			clear &= ~TSK_ONCPU;
+			for (; group; group = iterate_groups(prev, &iter))
+				psi_group_change(group, cpu, clear, set, true);
+		}
 	}
 }
 
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 9e4e67a..dc218e9 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -84,28 +84,24 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)
 
 static inline void psi_dequeue(struct task_struct *p, bool sleep)
 {
-	int clear = TSK_RUNNING, set = 0;
+	int clear = TSK_RUNNING;
 
 	if (static_branch_likely(&psi_disabled))
 		return;
 
-	if (!sleep) {
-		if (p->in_memstall)
-			clear |= TSK_MEMSTALL;
-	} else {
-		/*
-		 * When a task sleeps, schedule() dequeues it before
-		 * switching to the next one. Merge the clearing of
-		 * TSK_RUNNING and TSK_ONCPU to save an unnecessary
-		 * psi_task_change() call in psi_sched_switch().
-		 */
-		clear |= TSK_ONCPU;
+	/*
+	 * A voluntary sleep is a dequeue followed by a task switch. To
+	 * avoid walking all ancestors twice, psi_task_switch() handles
+	 * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
+	 * Do nothing here.
+	 */
+	if (sleep)
+		return;
 
-		if (p->in_iowait)
-			set |= TSK_IOWAIT;
-	}
+	if (p->in_memstall)
+		clear |= TSK_MEMSTALL;
 
-	psi_task_change(p, clear, set);
+	psi_task_change(p, clear, 0);
 }
 
 static inline void psi_ttwu_dequeue(struct task_struct *p)

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip: sched/core] psi: Use ONCPU state tracking machinery to detect reclaim
  2021-03-03  3:46 ` [PATCH v2 2/4] psi: Use ONCPU state tracking machinery to detect reclaim Chengming Zhou
                     ` (2 preceding siblings ...)
  2021-03-04  9:09   ` [tip: sched/core] " tip-bot2 for Chengming Zhou
@ 2021-03-06 11:42   ` tip-bot2 for Chengming Zhou
  3 siblings, 0 replies; 23+ messages in thread
From: tip-bot2 for Chengming Zhou @ 2021-03-06 11:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Muchun Song, Chengming Zhou, Peter Zijlstra (Intel),
	Ingo Molnar, Johannes Weiner, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     7fae6c8171d20ac55402930ee8ae760cf85dff7b
Gitweb:        https://git.kernel.org/tip/7fae6c8171d20ac55402930ee8ae760cf85dff7b
Author:        Chengming Zhou <zhouchengming@bytedance.com>
AuthorDate:    Wed, 03 Mar 2021 11:46:57 +08:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Sat, 06 Mar 2021 12:40:22 +01:00

psi: Use ONCPU state tracking machinery to detect reclaim

Move reclaim detection from the timer tick to the task state
tracking machinery, using the recently added ONCPU state. We also
add a check for task psi_flags changes to the psi_task_switch()
optimization so that the parent cgroups are updated properly.
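
For context, reclaim sections are already bracketed with
psi_memstall_enter()/psi_memstall_leave(), which set and clear
p->in_memstall; after this patch, psi_group_change() samples that
flag while the task holds ONCPU, so the timer-tick hook can go away.
A schematic caller (the reclaim body is elided):

unsigned long pflags;

psi_memstall_enter(&pflags);
/* ... direct reclaim: nothing productive can run on this CPU,
 * so its time is accounted as PSI_MEM_FULL ... */
psi_memstall_leave(&pflags);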

In terms of performance and cost, this ONCPU task state tracking
is not cheaper than the previous timer tick in aggregate. But the
code is simpler and shorter this way, so it's a maintainability win.
Johannes also did some testing with perf bench; the performance and
cost changes should be acceptable for real workloads.

Thanks to Johannes Weiner for pointing out the psi_task_switch()
optimization and for the clearer changelog.

Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-3-zhouchengming@bytedance.com
---
 include/linux/psi.h  |  1 +-
 kernel/sched/core.c  |  1 +-
 kernel/sched/psi.c   | 65 +++++++++++++++----------------------------
 kernel/sched/stats.h |  9 +------
 4 files changed, 24 insertions(+), 52 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index 7361023..65eb147 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -20,7 +20,6 @@ void psi_task_change(struct task_struct *task, int clear, int set);
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep);
 
-void psi_memstall_tick(struct task_struct *task, int cpu);
 void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 361974e..d2629fd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4551,7 +4551,6 @@ void scheduler_tick(void)
 	update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
 	curr->sched_class->task_tick(rq, curr, 0);
 	calc_global_load_tick(rq);
-	psi_task_tick(rq);
 
 	rq_unlock(rq, &rf);
 
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 2293c45..0fe6ff6 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -644,8 +644,7 @@ static void poll_timer_fn(struct timer_list *t)
 	wake_up_interruptible(&group->poll_wait);
 }
 
-static void record_times(struct psi_group_cpu *groupc, int cpu,
-			 bool memstall_tick)
+static void record_times(struct psi_group_cpu *groupc, int cpu)
 {
 	u32 delta;
 	u64 now;
@@ -664,23 +663,6 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
 		groupc->times[PSI_MEM_SOME] += delta;
 		if (groupc->state_mask & (1 << PSI_MEM_FULL))
 			groupc->times[PSI_MEM_FULL] += delta;
-		else if (memstall_tick) {
-			u32 sample;
-			/*
-			 * Since we care about lost potential, a
-			 * memstall is FULL when there are no other
-			 * working tasks, but also when the CPU is
-			 * actively reclaiming and nothing productive
-			 * could run even if it were runnable.
-			 *
-			 * When the timer tick sees a reclaiming CPU,
-			 * regardless of runnable tasks, sample a FULL
-			 * tick (or less if it hasn't been a full tick
-			 * since the last state change).
-			 */
-			sample = min(delta, (u32)jiffies_to_nsecs(1));
-			groupc->times[PSI_MEM_FULL] += sample;
-		}
 	}
 
 	if (groupc->state_mask & (1 << PSI_CPU_SOME)) {
@@ -714,7 +696,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	 */
 	write_seqcount_begin(&groupc->seq);
 
-	record_times(groupc, cpu, false);
+	record_times(groupc, cpu);
 
 	for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
 		if (!(m & (1 << t)))
@@ -738,6 +720,18 @@ static void psi_group_change(struct psi_group *group, int cpu,
 		if (test_state(groupc->tasks, s))
 			state_mask |= (1 << s);
 	}
+
+	/*
+	 * Since we care about lost potential, a memstall is FULL
+	 * when there are no other working tasks, but also when
+	 * the CPU is actively reclaiming and nothing productive
+	 * could run even if it were runnable. So when the current
+	 * task in a cgroup is in_memstall, the corresponding groupc
+	 * on that cpu is in PSI_MEM_FULL state.
+	 */
+	if (groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)
+		state_mask |= (1 << PSI_MEM_FULL);
+
 	groupc->state_mask = state_mask;
 
 	write_seqcount_end(&groupc->seq);
@@ -823,17 +817,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	void *iter;
 
 	if (next->pid) {
+		bool identical_state;
+
 		psi_flags_change(next, 0, TSK_ONCPU);
 		/*
-		 * When moving state between tasks, the group that
-		 * contains them both does not change: we can stop
-		 * updating the tree once we reach the first common
-		 * ancestor. Iterate @next's ancestors until we
-		 * encounter @prev's state.
+		 * When switching between tasks that have an identical
+		 * runtime state, the cgroup that contains both tasks
+		 * does not change: we can stop updating the tree once
+		 * we reach the first common ancestor. Iterate @next's
+		 * ancestors only until we encounter @prev's ONCPU.
 		 */
+		identical_state = prev->psi_flags == next->psi_flags;
 		iter = NULL;
 		while ((group = iterate_groups(next, &iter))) {
-			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+			if (identical_state &&
+			    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
 				common = group;
 				break;
 			}
@@ -859,21 +857,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	}
 }
 
-void psi_memstall_tick(struct task_struct *task, int cpu)
-{
-	struct psi_group *group;
-	void *iter = NULL;
-
-	while ((group = iterate_groups(task, &iter))) {
-		struct psi_group_cpu *groupc;
-
-		groupc = per_cpu_ptr(group->pcpu, cpu);
-		write_seqcount_begin(&groupc->seq);
-		record_times(groupc, cpu, true);
-		write_seqcount_end(&groupc->seq);
-	}
-}
-
 /**
  * psi_memstall_enter - mark the beginning of a memory stall section
  * @flags: flags to handle nested sections
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 33d0daf..9e4e67a 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -144,14 +144,6 @@ static inline void psi_sched_switch(struct task_struct *prev,
 	psi_task_switch(prev, next, sleep);
 }
 
-static inline void psi_task_tick(struct rq *rq)
-{
-	if (static_branch_likely(&psi_disabled))
-		return;
-
-	if (unlikely(rq->curr->in_memstall))
-		psi_memstall_tick(rq->curr, cpu_of(rq));
-}
 #else /* CONFIG_PSI */
 static inline void psi_enqueue(struct task_struct *p, bool wakeup) {}
 static inline void psi_dequeue(struct task_struct *p, bool sleep) {}
@@ -159,7 +151,6 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
 static inline void psi_sched_switch(struct task_struct *prev,
 				    struct task_struct *next,
 				    bool sleep) {}
-static inline void psi_task_tick(struct rq *rq) {}
 #endif /* CONFIG_PSI */
 
 #ifdef CONFIG_SCHED_INFO

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip: sched/core] psi: Add PSI_CPU_FULL state
  2021-03-03  3:46 ` [PATCH v2 1/4] psi: Add PSI_CPU_FULL state Chengming Zhou
  2021-03-04  9:09   ` [tip: sched/core] " tip-bot2 for Chengming Zhou
@ 2021-03-06 11:42   ` tip-bot2 for Chengming Zhou
  1 sibling, 0 replies; 23+ messages in thread
From: tip-bot2 for Chengming Zhou @ 2021-03-06 11:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Muchun Song, Chengming Zhou, Peter Zijlstra (Intel),
	Ingo Molnar, Johannes Weiner, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     e7fcd762282332f765af2035a9568fb126fa3c01
Gitweb:        https://git.kernel.org/tip/e7fcd762282332f765af2035a9568fb126fa3c01
Author:        Chengming Zhou <zhouchengming@bytedance.com>
AuthorDate:    Wed, 03 Mar 2021 11:46:56 +08:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Sat, 06 Mar 2021 12:40:22 +01:00

psi: Add PSI_CPU_FULL state

The FULL state doesn't exist for the CPU resource at the system
level, but it does exist at the cgroup level, meaning all non-idle
tasks in a cgroup are delayed on the CPU resource, which is used by
others outside of the cgroup or throttled by the cgroup's cpu.max
configuration.
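
With this change, a cgroup's cpu.pressure file reports a "full" line
alongside "some", in the same format as io.pressure and
memory.pressure. Schematic output (the numbers are illustrative
only):

some avg10=0.32 avg60=0.18 avg300=0.04 total=1234567
full avg10=0.11 avg60=0.05 avg300=0.01 total=345678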

Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-2-zhouchengming@bytedance.com
---
 include/linux/psi_types.h |  3 ++-
 kernel/sched/psi.c        | 14 +++++++++++---
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index b95f321..0a23300 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -50,9 +50,10 @@ enum psi_states {
 	PSI_MEM_SOME,
 	PSI_MEM_FULL,
 	PSI_CPU_SOME,
+	PSI_CPU_FULL,
 	/* Only per-CPU, to weigh the CPU in the global average: */
 	PSI_NONIDLE,
-	NR_PSI_STATES = 6,
+	NR_PSI_STATES = 7,
 };
 
 enum psi_aggregators {
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 967732c..2293c45 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -34,7 +34,10 @@
  * delayed on that resource such that nobody is advancing and the CPU
  * goes idle. This leaves both workload and CPU unproductive.
  *
- * (Naturally, the FULL state doesn't exist for the CPU resource.)
+ * Naturally, the FULL state doesn't exist for the CPU resource at the
+ * system level, but it does exist at the cgroup level, meaning all non-idle
+ * tasks in a cgroup are delayed on the CPU resource which is used by others
+ * outside of the cgroup or throttled by the cgroup's cpu.max configuration.
  *
  *	SOME = nr_delayed_tasks != 0
  *	FULL = nr_delayed_tasks != 0 && nr_running_tasks == 0
@@ -225,6 +228,8 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
 		return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
 	case PSI_CPU_SOME:
 		return tasks[NR_RUNNING] > tasks[NR_ONCPU];
+	case PSI_CPU_FULL:
+		return tasks[NR_RUNNING] && !tasks[NR_ONCPU];
 	case PSI_NONIDLE:
 		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
 			tasks[NR_RUNNING];
@@ -678,8 +683,11 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
 		}
 	}
 
-	if (groupc->state_mask & (1 << PSI_CPU_SOME))
+	if (groupc->state_mask & (1 << PSI_CPU_SOME)) {
 		groupc->times[PSI_CPU_SOME] += delta;
+		if (groupc->state_mask & (1 << PSI_CPU_FULL))
+			groupc->times[PSI_CPU_FULL] += delta;
+	}
 
 	if (groupc->state_mask & (1 << PSI_NONIDLE))
 		groupc->times[PSI_NONIDLE] += delta;
@@ -1018,7 +1026,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 		group->avg_next_update = update_averages(group, now);
 	mutex_unlock(&group->avgs_lock);
 
-	for (full = 0; full < 2 - (res == PSI_CPU); full++) {
+	for (full = 0; full < 2; full++) {
 		unsigned long avg[3];
 		u64 total;
 		int w;

^ permalink raw reply related	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-03-06 11:43 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-03  3:46 [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization Chengming Zhou
2021-03-03  3:46 ` [PATCH v2 1/4] psi: Add PSI_CPU_FULL state Chengming Zhou
2021-03-04  9:09   ` [tip: sched/core] " tip-bot2 for Chengming Zhou
2021-03-06 11:42   ` tip-bot2 for Chengming Zhou
2021-03-03  3:46 ` [PATCH v2 2/4] psi: Use ONCPU state tracking machinery to detect reclaim Chengming Zhou
2021-03-03 14:56   ` Johannes Weiner
2021-03-03 15:52   ` Peter Zijlstra
2021-03-04  9:09   ` [tip: sched/core] " tip-bot2 for Chengming Zhou
2021-03-06 11:42   ` tip-bot2 for Chengming Zhou
2021-03-03  3:46 ` [PATCH v2 3/4] psi: pressure states are unlikely Chengming Zhou
2021-03-04  9:09   ` [tip: sched/core] psi: Pressure " tip-bot2 for Johannes Weiner
2021-03-06 11:42   ` tip-bot2 for Johannes Weiner
2021-03-03  3:46 ` [PATCH v2 4/4] psi: Optimize task switch inside shared cgroups Chengming Zhou
2021-03-03 14:57   ` Johannes Weiner
2021-03-03 15:53   ` Peter Zijlstra
2021-03-03 16:01     ` [External] " Muchun Song
2021-03-04  9:09   ` [tip: sched/core] " tip-bot2 for Chengming Zhou
2021-03-06 11:42   ` tip-bot2 for Chengming Zhou
2021-03-03 14:59 ` [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization Johannes Weiner
2021-03-03 15:32   ` Peter Zijlstra
2021-03-03 16:05     ` Peter Zijlstra
2021-03-03 19:38       ` Johannes Weiner
2021-03-04  0:14       ` [External] " Chengming Zhou
