All of lore.kernel.org
 help / color / mirror / Atom feed
From: Chengming Zhou <zhouchengming@bytedance.com>
To: hannes@cmpxchg.org, tj@kernel.org, mkoutny@suse.com, surenb@google.com
Cc: mingo@redhat.com, peterz@infradead.org,
	gregkh@linuxfoundation.org, corbet@lwn.net,
	cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, songmuchun@bytedance.com,
	Chengming Zhou <zhouchengming@bytedance.com>
Subject: [PATCH v4 05/10] sched/psi: optimize task switch inside shared cgroups again
Date: Fri, 26 Aug 2022 00:41:06 +0800	[thread overview]
Message-ID: <20220825164111.29534-6-zhouchengming@bytedance.com> (raw)
In-Reply-To: <20220825164111.29534-1-zhouchengming@bytedance.com>

Way back when PSI_MEM_FULL was accounted from the timer tick, task
switching could simply iterate next and prev to the common ancestor to
update TSK_ONCPU and be done.

Then memstall ticks were replaced with checking curr->in_memstall
directly in psi_group_change(). That meant that now if the task switch
was between a memstall and a !memstall task, we had to iterate through
the common ancestors at least ONCE to fix up their state_masks.

We added the identical_state filter to make sure the common ancestor
elimination was skipped in that case. It seems that was always a
little too eager, because it caused us to walk the common ancestors
*twice* instead of the required once: the iteration for next could
have stopped at the common ancestor; prev could have updated TSK_ONCPU
up to the common ancestor, then finish to the root without changing
any flags, just to get the new curr->in_memstall into the state_masks.

This patch recognizes this and makes it so that we walk to the root
exactly once if state_mask needs updating, which is simply catching up
on a missed optimization that could have been done in commit 7fae6c8171d2
("psi: Use ONCPU state tracking machinery to detect reclaim") directly.

Apart from this, it's also necessary for the next patch "sched/psi: remove
NR_ONCPU task accounting". Suppose we walk the common ancestors twice:

(1) psi_group_change(.clear = 0, .set = TSK_ONCPU)
(2) psi_group_change(.clear = TSK_ONCPU, .set = 0)

We previously used tasks[NR_ONCPU] to record TSK_ONCPU, tasks[NR_ONCPU]++
in (1) then tasks[NR_ONCPU]-- in (2), so tasks[NR_ONCPU] still be correct.

The next patch change to use one bit in state mask to record TSK_ONCPU,
PSI_ONCPU bit will be set in (1), but then be cleared in (2), which cause
the psi_group_cpu has task running on CPU but without PSI_ONCPU bit set!

With this patch, we will never walk the common ancestors twice, so won't
have above problem.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/psi.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 77d53c03a76f..d71dbc2356ff 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -820,20 +820,15 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	u64 now = cpu_clock(cpu);
 
 	if (next->pid) {
-		bool identical_state;
-
 		psi_flags_change(next, 0, TSK_ONCPU);
 		/*
-		 * When switching between tasks that have an identical
-		 * runtime state, the cgroup that contains both tasks
-		 * we reach the first common ancestor. Iterate @next's
-		 * ancestors only until we encounter @prev's ONCPU.
+		 * Set TSK_ONCPU on @next's cgroups. If @next shares any
+		 * ancestors with @prev, those will already have @prev's
+		 * TSK_ONCPU bit set, and we can stop the iteration there.
 		 */
-		identical_state = prev->psi_flags == next->psi_flags;
 		iter = NULL;
 		while ((group = iterate_groups(next, &iter))) {
-			if (identical_state &&
-			    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
 				common = group;
 				break;
 			}
@@ -877,10 +872,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 			psi_group_change(group, cpu, clear, set, now, wake_clock);
 
 		/*
-		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
-		 * with dequeuing too, finish that for the rest of the hierarchy.
+		 * TSK_ONCPU is handled up to the common ancestor. If there are
+		 * any other differences between the two tasks (e.g. prev goes
+		 * to sleep, or only one task is memstall), finish propagating
+		 * those differences all the way up to the root.
 		 */
-		if (sleep) {
+		if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
 			clear &= ~TSK_ONCPU;
 			for (; group; group = iterate_groups(prev, &iter))
 				psi_group_change(group, cpu, clear, set, now, wake_clock);
-- 
2.37.2


  parent reply	other threads:[~2022-08-25 16:44 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-25 16:41 [PATCH v4 00/10] sched/psi: some optimizations and extensions Chengming Zhou
2022-08-25 16:41 ` Chengming Zhou
2022-08-25 16:41 ` [PATCH v4 01/10] sched/psi: fix periodic aggregation shut off Chengming Zhou
2022-08-25 16:41   ` Chengming Zhou
2022-09-09 14:00   ` [tip: sched/psi] sched/psi: Fix " tip-bot2 for Chengming Zhou
2022-08-25 16:41 ` [PATCH v4 02/10] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou
2022-09-09 14:00   ` [tip: sched/psi] sched/psi: Don't " tip-bot2 for Chengming Zhou
2022-08-25 16:41 ` [PATCH v4 03/10] sched/psi: save percpu memory when !psi_cgroups_enabled Chengming Zhou
2022-08-25 16:41   ` Chengming Zhou
2022-09-09 14:00   ` [tip: sched/psi] sched/psi: Save " tip-bot2 for Chengming Zhou
2022-08-25 16:41 ` [PATCH v4 04/10] sched/psi: move private helpers to sched/stats.h Chengming Zhou
2022-09-09 14:00   ` [tip: sched/psi] sched/psi: Move " tip-bot2 for Chengming Zhou
2022-08-25 16:41 ` Chengming Zhou [this message]
2022-08-25 17:16   ` [PATCH v4 05/10] sched/psi: optimize task switch inside shared cgroups again Johannes Weiner
2022-08-25 17:16     ` Johannes Weiner
2022-09-09 14:00   ` [tip: sched/psi] sched/psi: Optimize " tip-bot2 for Chengming Zhou
2022-08-25 16:41 ` [PATCH v4 06/10] sched/psi: remove NR_ONCPU task accounting Chengming Zhou
2022-09-09 14:00   ` [tip: sched/psi] sched/psi: Remove " tip-bot2 for Johannes Weiner
2022-08-25 16:41 ` [PATCH v4 07/10] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou
2022-08-25 16:41   ` Chengming Zhou
2022-08-25 17:17   ` Johannes Weiner
2022-08-25 17:17     ` Johannes Weiner
2022-09-09 14:00   ` [tip: sched/psi] sched/psi: Add " tip-bot2 for Chengming Zhou
2022-08-25 16:41 ` [PATCH v4 08/10] sched/psi: consolidate cgroup_psi() Chengming Zhou
2022-08-25 16:41   ` Chengming Zhou
2022-09-09 14:00   ` [tip: sched/psi] sched/psi: Consolidate cgroup_psi() tip-bot2 for Chengming Zhou
2022-08-25 16:41 ` [PATCH v4 09/10] sched/psi: cache parent psi_group to speed up groups iterate Chengming Zhou
2022-08-25 16:41   ` Chengming Zhou
2022-09-09 14:00   ` [tip: sched/psi] sched/psi: Cache parent psi_group to speed up group iteration tip-bot2 for Chengming Zhou
2022-08-25 16:41 ` [PATCH v4 10/10] sched/psi: per-cgroup PSI accounting disable/re-enable interface Chengming Zhou
2022-09-07  9:03   ` [PATCH] sched/psi: Per-cgroup " Chengming Zhou
2022-09-09 14:00     ` [tip: sched/psi] " tip-bot2 for Chengming Zhou
2022-09-06 13:13 ` [PATCH v4 00/10] sched/psi: some optimizations and extensions Chengming Zhou
2022-09-06 13:13   ` Chengming Zhou
2022-09-06 14:43   ` Peter Zijlstra
2022-09-06 14:43     ` Peter Zijlstra
2022-09-07  1:55     ` Chengming Zhou
2022-09-07  1:55       ` Chengming Zhou

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220825164111.29534-6-zhouchengming@bytedance.com \
    --to=zhouchengming@bytedance.com \
    --cc=cgroups@vger.kernel.org \
    --cc=corbet@lwn.net \
    --cc=gregkh@linuxfoundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=mkoutny@suse.com \
    --cc=peterz@infradead.org \
    --cc=songmuchun@bytedance.com \
    --cc=surenb@google.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.