linux-kernel.vger.kernel.org archive mirror
From: Chengming Zhou <zhouchengming@bytedance.com>
To: hannes@cmpxchg.org, surenb@google.com, mingo@redhat.com,
	peterz@infradead.org, tj@kernel.org, corbet@lwn.net,
	akpm@linux-foundation.org, rdunlap@infradead.org
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	songmuchun@bytedance.com, cgroups@vger.kernel.org,
	Chengming Zhou <zhouchengming@bytedance.com>
Subject: [PATCH 4/9] sched/psi: don't change task psi_flags when migrate CPU/group
Date: Thu, 21 Jul 2022 12:04:34 +0800
Message-ID: <20220721040439.2651-5-zhouchengming@bytedance.com>
In-Reply-To: <20220721040439.2651-1-zhouchengming@bytedance.com>

The current code uses psi_task_change() at every scheduling point, which
changes the task's psi_flags and then updates all of its psi_groups.

So we have to rely heavily on the task's scheduling state to calculate
what to set and what to clear at every scheduling point, which makes
the PSI stats tracking code complex and error prone.

In fact, a task's psi_flags only change at wakeup and sleep (apart from
the ONCPU state at switch); they don't change at all when the task
migrates between CPUs or cgroups.
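
(For reference, the per-task PSI state bits carried in task->psi_flags
are TSK_IOWAIT, TSK_MEMSTALL, TSK_RUNNING, TSK_MEMSTALL_RUNNING and
TSK_ONCPU, defined in include/linux/psi_types.h.)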

If we keep psi_flags unchanged across CPU/cgroup migration, we can
simply use task->psi_flags to know what to clear (migrate out) or set
(migrate in), which makes PSI stats tracking much simpler and more
efficient.
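
As a rough illustration of that replay idea, here is a user-space toy
model, not kernel code: the toy_group type, the toy_group_change()
helper and the bit values below are made up for the example, and only
psi_flags and the clear-on-migrate-out / set-on-migrate-in pattern
mirror what the patch does.

#include <stdio.h>

/* Toy flag bits; illustrative values only, not the kernel's. */
#define TSK_MEMSTALL	(1 << 1)
#define TSK_RUNNING	(1 << 2)

/* A toy stand-in for a PSI group's per-state task counts. */
struct toy_group { int nr_memstall, nr_running; };

/* Toy counterpart of psi_group_change(): apply a clear/set pair. */
static void toy_group_change(struct toy_group *g, int clear, int set)
{
	if (clear & TSK_MEMSTALL) g->nr_memstall--;
	if (clear & TSK_RUNNING)  g->nr_running--;
	if (set & TSK_MEMSTALL)   g->nr_memstall++;
	if (set & TSK_RUNNING)    g->nr_running++;
}

int main(void)
{
	/* psi_flags is set once at wakeup and untouched by migration. */
	int psi_flags = TSK_MEMSTALL | TSK_RUNNING;
	struct toy_group src = { 0, 0 }, dst = { 0, 0 };

	toy_group_change(&src, 0, psi_flags);	/* enqueue on source   */
	toy_group_change(&src, psi_flags, 0);	/* migrate out: replay */
	toy_group_change(&dst, 0, psi_flags);	/* migrate in:  replay */

	/* src drops back to 0/0, dst picks up exactly the same state. */
	printf("src: memstall=%d running=%d\n", src.nr_memstall, src.nr_running);
	printf("dst: memstall=%d running=%d\n", dst.nr_memstall, dst.nr_running);
	return 0;
}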

Note: ENQUEUE_WAKEUP only means waking a task from the sleep state; it
does not cover waking up a new task, so add psi_enqueue() in
wake_up_new_task().

Performance test on an Intel Xeon Platinum machine with a 3-level cgroup hierarchy:

1. before the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.034 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 8.210 [sec]

       8.210600 usecs/op
         121793 ops/sec

2. after the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.032 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 8.077 [sec]

       8.077648 usecs/op
         123798 ops/sec
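
(ops/sec above is just 10^6 / usecs-per-op, e.g. 1000000 / 8.077648 is
about 123798; the pipe benchmark thus improves by roughly 1.6%, from
8.210600 to 8.077648 usecs/op.)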

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 include/linux/sched.h |  3 ---
 kernel/sched/core.c   |  1 +
 kernel/sched/psi.c    | 24 ++++++++++---------
 kernel/sched/stats.h  | 54 +++++++++++++++++++++----------------------
 4 files changed, 40 insertions(+), 42 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 88b8817b827d..20a94786cad8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -879,9 +879,6 @@ struct task_struct {
 	unsigned			sched_reset_on_fork:1;
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
-#ifdef CONFIG_PSI
-	unsigned			sched_psi_wake_requeue:1;
-#endif
 
 	/* Force alignment to the next boundary: */
 	unsigned			:0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a463dbc92fcd..f5f2d3542b05 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4642,6 +4642,7 @@ void wake_up_new_task(struct task_struct *p)
 	post_init_entity_util_avg(p);
 
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
+	psi_enqueue(p, true);
 	trace_sched_wakeup_new(p);
 	check_preempt_curr(rq, p, WF_FORK);
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index e04041d8251b..6ba159fe2a4f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -796,22 +796,24 @@ static void psi_flags_change(struct task_struct *task, int clear, int set)
 	task->psi_flags |= set;
 }
 
-void psi_task_change(struct task_struct *task, int clear, int set)
+void psi_change_groups(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
 	struct psi_group *group;
 	void *iter = NULL;
-	u64 now;
+	u64 now = cpu_clock(cpu);
+
+	while ((group = iterate_groups(task, &iter)))
+		psi_group_change(group, cpu, clear, set, now, true);
+}
 
+void psi_task_change(struct task_struct *task, int clear, int set)
+{
 	if (!task->pid)
 		return;
 
 	psi_flags_change(task, clear, set);
-
-	now = cpu_clock(cpu);
-
-	while ((group = iterate_groups(task, &iter)))
-		psi_group_change(group, cpu, clear, set, now, true);
+	psi_change_groups(task, clear, set);
 }
 
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -1015,9 +1017,9 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 	 *   pick_next_task()
 	 *     rq_unlock()
 	 *                                rq_lock()
-	 *                                psi_task_change() // old cgroup
+	 *                                psi_change_groups() // old cgroup
 	 *                                task->cgroups = to
-	 *                                psi_task_change() // new cgroup
+	 *                                psi_change_groups() // new cgroup
 	 *                                rq_unlock()
 	 *     rq_lock()
 	 *   psi_sched_switch() // does deferred updates in new cgroup
@@ -1027,13 +1029,13 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 	task_flags = task->psi_flags;
 
 	if (task_flags)
-		psi_task_change(task, task_flags, 0);
+		psi_change_groups(task, task_flags, 0);
 
 	/* See comment above */
 	rcu_assign_pointer(task->cgroups, to);
 
 	if (task_flags)
-		psi_task_change(task, 0, task_flags);
+		psi_change_groups(task, 0, task_flags);
 
 	task_rq_unlock(rq, task, &rf);
 }
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index c39b467ece43..e930b8fa6253 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -107,6 +107,7 @@ __schedstats_from_se(struct sched_entity *se)
 }
 
 #ifdef CONFIG_PSI
+void psi_change_groups(struct task_struct *task, int clear, int set);
 void psi_task_change(struct task_struct *task, int clear, int set);
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep);
@@ -124,42 +125,46 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)
 	if (static_branch_likely(&psi_disabled))
 		return;
 
-	if (p->in_memstall)
-		set |= TSK_MEMSTALL_RUNNING;
+	if (!wakeup) {
+		if (p->psi_flags)
+			psi_change_groups(p, 0, p->psi_flags);
+		return;
+	}
 
-	if (!wakeup || p->sched_psi_wake_requeue) {
-		if (p->in_memstall)
+	/*
+	 * wakeup (including wakeup migrate) need to change task psi_flags,
+	 * specifically need to set TSK_RUNNING or TSK_MEMSTALL_RUNNING.
+	 * Since we clear task->psi_flags for wakeup migrated task, we need
+	 * to check task->psi_flags to see what should be set and clear.
+	 */
+	if (unlikely(p->in_memstall)) {
+		set |= TSK_MEMSTALL_RUNNING;
+		if (!(p->psi_flags & TSK_MEMSTALL))
 			set |= TSK_MEMSTALL;
-		if (p->sched_psi_wake_requeue)
-			p->sched_psi_wake_requeue = 0;
-	} else {
-		if (p->in_iowait)
-			clear |= TSK_IOWAIT;
 	}
+	if (p->psi_flags & TSK_IOWAIT)
+		clear |= TSK_IOWAIT;
 
 	psi_task_change(p, clear, set);
 }
 
 static inline void psi_dequeue(struct task_struct *p, bool sleep)
 {
-	int clear = TSK_RUNNING;
-
 	if (static_branch_likely(&psi_disabled))
 		return;
 
+	if (!sleep) {
+		if (p->psi_flags)
+			psi_change_groups(p, p->psi_flags, 0);
+		return;
+	}
+
 	/*
 	 * A voluntary sleep is a dequeue followed by a task switch. To
 	 * avoid walking all ancestors twice, psi_task_switch() handles
 	 * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
 	 * Do nothing here.
 	 */
-	if (sleep)
-		return;
-
-	if (p->in_memstall)
-		clear |= (TSK_MEMSTALL | TSK_MEMSTALL_RUNNING);
-
-	psi_task_change(p, clear, 0);
 }
 
 static inline void psi_ttwu_dequeue(struct task_struct *p)
@@ -169,21 +174,14 @@ static inline void psi_ttwu_dequeue(struct task_struct *p)
 	/*
 	 * Is the task being migrated during a wakeup? Make sure to
 	 * deregister its sleep-persistent psi states from the old
-	 * queue, and let psi_enqueue() know it has to requeue.
+	 * queue.
 	 */
-	if (unlikely(p->in_iowait || p->in_memstall)) {
+	if (unlikely(p->psi_flags)) {
 		struct rq_flags rf;
 		struct rq *rq;
-		int clear = 0;
-
-		if (p->in_iowait)
-			clear |= TSK_IOWAIT;
-		if (p->in_memstall)
-			clear |= TSK_MEMSTALL;
 
 		rq = __task_rq_lock(p, &rf);
-		psi_task_change(p, clear, 0);
-		p->sched_psi_wake_requeue = 1;
+		psi_task_change(p, p->psi_flags, 0);
 		__task_rq_unlock(rq, &rf);
 	}
 }
-- 
2.36.1



Thread overview: 38+ messages
2022-07-21  4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou
2022-07-21  4:04 ` [PATCH 1/9] sched/psi: fix periodic aggregation shut off Chengming Zhou
2022-07-25 15:34   ` Johannes Weiner
2022-07-25 15:39   ` Johannes Weiner
2022-07-26 13:28     ` Chengming Zhou
2022-07-21  4:04 ` [PATCH 2/9] sched/psi: optimize task switch inside shared cgroups again Chengming Zhou
2022-07-21  4:04 ` [PATCH 3/9] sched/psi: move private helpers to sched/stats.h Chengming Zhou
2022-07-25 16:39   ` Johannes Weiner
2022-07-21  4:04 ` Chengming Zhou [this message]
2022-07-21  4:04 ` [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou
2022-07-25 16:41   ` Johannes Weiner
2022-07-21  4:04 ` [PATCH 6/9] sched/psi: save percpu memory when !psi_cgroups_enabled Chengming Zhou
2022-07-25 16:47   ` Johannes Weiner
2022-07-21  4:04 ` [PATCH 7/9] sched/psi: cache parent psi_group to speed up groups iterate Chengming Zhou
2022-07-21  4:04 ` [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup Chengming Zhou
2022-07-25 16:52   ` Johannes Weiner
2022-07-26 13:38     ` [External] " Chengming Zhou
2022-07-26 17:54     ` Tejun Heo
2022-08-03 12:17       ` Chengming Zhou
2022-08-03 17:58         ` Tejun Heo
2022-08-03 19:22           ` Johannes Weiner
2022-08-03 19:48             ` Tejun Heo
2022-08-04 13:51             ` Chengming Zhou
2022-08-04 16:56               ` Johannes Weiner
2022-08-04  2:02           ` Chengming Zhou
2022-07-21  4:04 ` [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou
2022-07-21 10:00   ` kernel test robot
2022-07-21 22:10   ` kernel test robot
2022-07-22  3:30   ` Abel Wu
2022-07-22  6:13     ` Chengming Zhou
2022-07-22  7:14       ` Abel Wu
2022-07-22  7:33         ` Chengming Zhou
2022-07-25 18:26   ` Johannes Weiner
2022-07-26 13:55     ` [External] " Chengming Zhou
2022-07-27 11:28     ` Chengming Zhou
2022-07-27 13:00       ` Johannes Weiner
2022-07-27 15:09         ` Chengming Zhou
2022-07-27 16:07   ` Peter Zijlstra
