From: Chengming Zhou <zhouchengming@bytedance.com>
To: hannes@cmpxchg.org, tj@kernel.org, corbet@lwn.net,
	surenb@google.com, mingo@redhat.com, peterz@infradead.org,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com
Cc: cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, songmuchun@bytedance.com,
	Chengming Zhou <zhouchengming@bytedance.com>
Subject: [PATCH v2 02/10] sched/psi: optimize task switch inside shared cgroups again
Date: Mon,  8 Aug 2022 19:03:33 +0800	[thread overview]
Message-ID: <20220808110341.15799-3-zhouchengming@bytedance.com> (raw)
In-Reply-To: <20220808110341.15799-1-zhouchengming@bytedance.com>

commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
deferred the prev task's sleep handling to psi_task_switch(), so we don't
need to clear and set the TSK_ONCPU state for common cgroups.

    A
    |
    B
   / \
  C   D
 /     \
prev   next

After that commit, psi_task_switch() does:
1. psi_group_change(next, .set=TSK_ONCPU) for D
2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C
3. psi_group_change(prev, .clear=TSK_RUNNING) for B, A

But there is a limitation: if "prev->psi_flags == next->psi_flags" does
not hold, this common-cgroups optimization is disabled for both the
sleep switch and the running switch case (the guard responsible is
sketched after the two examples below). For example:

prev->in_memstall != next->in_memstall on a sleep switch:
1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A
2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C, B, A

prev->in_memstall != next->in_memstall on a running switch:
1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A
2. psi_group_change(prev, .clear=TSK_ONCPU) for C, B, A
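
For illustration, the walk over next's ancestors before this patch is
roughly the following (simplified from psi_task_switch(); clock handling
omitted). The identical_state guard is what disables the cut at the
common ancestor in the two cases above:

	/* Simplified sketch of the pre-patch loop */
	identical_state = prev->psi_flags == next->psi_flags;
	iter = NULL;
	while ((group = iterate_groups(next, &iter))) {
		/*
		 * Stop at the first ancestor where prev still holds
		 * ONCPU, but only when prev and next carry identical
		 * PSI flags; otherwise every ancestor of next gets a
		 * TSK_ONCPU update.
		 */
		if (identical_state &&
		    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
			common = group;
			break;
		}
		psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
	}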

The reason this limitation exists is that we consider a group to be in
PSI_MEM_FULL state if the CPU is actively reclaiming and nothing
productive could run even if it were runnable. So when the CPU's curr
changes from prev to next and their in_memstall status differs, we have
to update the PSI_MEM_FULL status of their common cgroups.
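
Concretely, tying PSI_MEM_FULL to the task that currently owns the CPU
looks roughly like this in psi_group_change() (a sketch; the exact code
may differ):

	/*
	 * Sketch: a memstall is FULL not only when nothing else is
	 * runnable, but also when the CPU's current task is itself
	 * reclaiming, so nothing productive could run even if it
	 * were runnable.
	 */
	if (unlikely(tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
		state_mask |= (1 << PSI_MEM_FULL);

Because such a test consults cpu_curr(), a curr change between tasks
with different in_memstall alters the MEM_FULL state of their common
groups even though their task counts do not change.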

This patch removes this limitation by making the PSI_MEM_FULL status in
psi_group_change() depend on the CPU curr's in_memstall status.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/psi.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 115a7e52fa23..9e8c5d9e585c 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -823,8 +823,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	u64 now = cpu_clock(cpu);
 
 	if (next->pid) {
-		bool identical_state;
-
 		psi_flags_change(next, 0, TSK_ONCPU);
 		/*
 		 * When switching between tasks that have an identical
@@ -832,11 +830,9 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * we reach the first common ancestor. Iterate @next's
 		 * ancestors only until we encounter @prev's ONCPU.
 		 */
-		identical_state = prev->psi_flags == next->psi_flags;
 		iter = NULL;
 		while ((group = iterate_groups(next, &iter))) {
-			if (identical_state &&
-			    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
 				common = group;
 				break;
 			}
@@ -883,7 +879,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
 		 * with dequeuing too, finish that for the rest of the hierarchy.
 		 */
-		if (sleep) {
+		if (sleep || unlikely(prev->in_memstall != next->in_memstall)) {
 			clear &= ~TSK_ONCPU;
 			for (; group; group = iterate_groups(prev, &iter))
 				psi_group_change(group, cpu, clear, set, now, wake_clock);
-- 
2.36.1


