All of lore.kernel.org
 help / color / mirror / Atom feed
From: "tip-bot2 for Chengming Zhou" <tip-bot2@linutronix.de>
To: linux-tip-commits@vger.kernel.org
Cc: Muchun Song <songmuchun@bytedance.com>,
	Chengming Zhou <zhouchengming@bytedance.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	x86@kernel.org, linux-kernel@vger.kernel.org
Subject: [tip: sched/core] psi: Optimize task switch inside shared cgroups
Date: Sat, 06 Mar 2021 11:42:18 -0000	[thread overview]
Message-ID: <161503093807.398.7510792283650382775.tip-bot2@tip-bot2> (raw)
In-Reply-To: <20210303034659.91735-5-zhouchengming@bytedance.com>

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     4117cebf1a9fcbf35b9aabf0e37b6c5eea296798
Gitweb:        https://git.kernel.org/tip/4117cebf1a9fcbf35b9aabf0e37b6c5eea296798
Author:        Chengming Zhou <zhouchengming@bytedance.com>
AuthorDate:    Wed, 03 Mar 2021 11:46:59 +08:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Sat, 06 Mar 2021 12:40:23 +01:00

psi: Optimize task switch inside shared cgroups

The commit 36b238d57172 ("psi: Optimize switching tasks inside shared
cgroups") only update cgroups whose state actually changes during a
task switch only in task preempt case, not in task sleep case.

We actually don't need to clear and set TSK_ONCPU state for common cgroups
of next and prev task in sleep case, that can save many psi_group_change
especially when most activity comes from one leaf cgroup.

sleep before:
psi_dequeue()
  while ((group = iterate_groups(prev)))  # all ancestors
    psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
psi_task_switch()
  while ((group = iterate_groups(next)))  # all ancestors
    psi_group_change(next, .set=TSK_ONCPU)

sleep after:
psi_dequeue()
  nop
psi_task_switch()
  while ((group = iterate_groups(next)))  # until (prev & next)
    psi_group_change(next, .set=TSK_ONCPU)
  while ((group = iterate_groups(prev)))  # all ancestors
    psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)

When a voluntary sleep switches to another task, we remove one call of
psi_group_change() for every common cgroup ancestor of the two tasks.

Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-5-zhouchengming@bytedance.com
---
 kernel/sched/psi.c   | 35 +++++++++++++++++++++++++----------
 kernel/sched/stats.h | 28 ++++++++++++----------------
 2 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 3907a6b..ee3c5b4 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -840,20 +840,35 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		}
 	}
 
-	/*
-	 * If this is a voluntary sleep, dequeue will have taken care
-	 * of the outgoing TSK_ONCPU alongside TSK_RUNNING already. We
-	 * only need to deal with it during preemption.
-	 */
-	if (sleep)
-		return;
-
 	if (prev->pid) {
-		psi_flags_change(prev, TSK_ONCPU, 0);
+		int clear = TSK_ONCPU, set = 0;
+
+		/*
+		 * When we're going to sleep, psi_dequeue() lets us handle
+		 * TSK_RUNNING and TSK_IOWAIT here, where we can combine it
+		 * with TSK_ONCPU and save walking common ancestors twice.
+		 */
+		if (sleep) {
+			clear |= TSK_RUNNING;
+			if (prev->in_iowait)
+				set |= TSK_IOWAIT;
+		}
+
+		psi_flags_change(prev, clear, set);
 
 		iter = NULL;
 		while ((group = iterate_groups(prev, &iter)) && group != common)
-			psi_group_change(group, cpu, TSK_ONCPU, 0, true);
+			psi_group_change(group, cpu, clear, set, true);
+
+		/*
+		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
+		 * with dequeuing too, finish that for the rest of the hierarchy.
+		 */
+		if (sleep) {
+			clear &= ~TSK_ONCPU;
+			for (; group; group = iterate_groups(prev, &iter))
+				psi_group_change(group, cpu, clear, set, true);
+		}
 	}
 }
 
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 9e4e67a..dc218e9 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -84,28 +84,24 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)
 
 static inline void psi_dequeue(struct task_struct *p, bool sleep)
 {
-	int clear = TSK_RUNNING, set = 0;
+	int clear = TSK_RUNNING;
 
 	if (static_branch_likely(&psi_disabled))
 		return;
 
-	if (!sleep) {
-		if (p->in_memstall)
-			clear |= TSK_MEMSTALL;
-	} else {
-		/*
-		 * When a task sleeps, schedule() dequeues it before
-		 * switching to the next one. Merge the clearing of
-		 * TSK_RUNNING and TSK_ONCPU to save an unnecessary
-		 * psi_task_change() call in psi_sched_switch().
-		 */
-		clear |= TSK_ONCPU;
+	/*
+	 * A voluntary sleep is a dequeue followed by a task switch. To
+	 * avoid walking all ancestors twice, psi_task_switch() handles
+	 * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
+	 * Do nothing here.
+	 */
+	if (sleep)
+		return;
 
-		if (p->in_iowait)
-			set |= TSK_IOWAIT;
-	}
+	if (p->in_memstall)
+		clear |= TSK_MEMSTALL;
 
-	psi_task_change(p, clear, set);
+	psi_task_change(p, clear, 0);
 }
 
 static inline void psi_ttwu_dequeue(struct task_struct *p)

  parent reply	other threads:[~2021-03-06 11:43 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-03  3:46 [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization Chengming Zhou
2021-03-03  3:46 ` [PATCH v2 1/4] psi: Add PSI_CPU_FULL state Chengming Zhou
2021-03-04  9:09   ` [tip: sched/core] " tip-bot2 for Chengming Zhou
2021-03-06 11:42   ` tip-bot2 for Chengming Zhou
2021-03-03  3:46 ` [PATCH v2 2/4] psi: Use ONCPU state tracking machinery to detect reclaim Chengming Zhou
2021-03-03 14:56   ` Johannes Weiner
2021-03-03 15:52   ` Peter Zijlstra
2021-03-04  9:09   ` [tip: sched/core] " tip-bot2 for Chengming Zhou
2021-03-06 11:42   ` tip-bot2 for Chengming Zhou
2021-03-03  3:46 ` [PATCH v2 3/4] psi: pressure states are unlikely Chengming Zhou
2021-03-04  9:09   ` [tip: sched/core] psi: Pressure " tip-bot2 for Johannes Weiner
2021-03-06 11:42   ` tip-bot2 for Johannes Weiner
2021-03-03  3:46 ` [PATCH v2 4/4] psi: Optimize task switch inside shared cgroups Chengming Zhou
2021-03-03 14:57   ` Johannes Weiner
2021-03-03 15:53   ` Peter Zijlstra
2021-03-03 16:01     ` [External] " Muchun Song
2021-03-04  9:09   ` [tip: sched/core] " tip-bot2 for Chengming Zhou
2021-03-06 11:42   ` tip-bot2 for Chengming Zhou [this message]
2021-03-03 14:59 ` [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization Johannes Weiner
2021-03-03 15:32   ` Peter Zijlstra
2021-03-03 16:05     ` Peter Zijlstra
2021-03-03 19:38       ` Johannes Weiner
2021-03-04  0:14       ` [External] " Chengming Zhou

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=161503093807.398.7510792283650382775.tip-bot2@tip-bot2 \
    --to=tip-bot2@linutronix.de \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-tip-commits@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=songmuchun@bytedance.com \
    --cc=x86@kernel.org \
    --cc=zhouchengming@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.