From: Chengming Zhou <zhouchengming@bytedance.com>
To: hannes@cmpxchg.org, tj@kernel.org, corbet@lwn.net,
	surenb@google.com, mingo@redhat.com, peterz@infradead.org,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com
Cc: cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, songmuchun@bytedance.com,
	Chengming Zhou <zhouchengming@bytedance.com>
Subject: [PATCH v2 02/10] sched/psi: optimize task switch inside shared cgroups again
Date: Mon, 8 Aug 2022 19:03:33 +0800
Message-ID: <20220808110341.15799-3-zhouchengming@bytedance.com>
In-Reply-To: <20220808110341.15799-1-zhouchengming@bytedance.com>

commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
deferred the prev task's sleep handling to psi_task_switch(), so we don't
need to clear and set the TSK_ONCPU state for common cgroups.

        A
        |
        B
       / \
      C   D
     /     \
   prev    next

After that commit, psi_task_switch() does:
1. psi_group_change(next, .set=TSK_ONCPU) for D
2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C
3. psi_group_change(prev, .clear=TSK_RUNNING) for B, A

But there is a limitation: unless "prev->psi_flags == next->psi_flags"
holds, this common-cgroups optimization cannot be used for either the
sleep switch or the running switch case. For example:

prev->in_memstall != next->in_memstall on a sleep switch:
1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A
2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C, B, A

prev->in_memstall != next->in_memstall on a running switch:
1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A
2. psi_group_change(prev, .clear=TSK_ONCPU) for C, B, A

This limitation exists because we consider a group to be PSI_MEM_FULL
if the CPU is actively reclaiming and nothing productive could run even
if it were runnable. So when the CPU's curr changes from prev to next
and their in_memstall status differs, we have to change the PSI_MEM_FULL
status of their common cgroups.

This patch removes the limitation by making psi_group_change() update
the PSI_MEM_FULL status based on the CPU's curr->in_memstall status.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/psi.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 115a7e52fa23..9e8c5d9e585c 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -823,8 +823,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	u64 now = cpu_clock(cpu);
 
 	if (next->pid) {
-		bool identical_state;
-
 		psi_flags_change(next, 0, TSK_ONCPU);
 		/*
 		 * When switching between tasks that have an identical
@@ -832,11 +830,9 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * we reach the first common ancestor. Iterate @next's
 		 * ancestors only until we encounter @prev's ONCPU.
 		 */
-		identical_state = prev->psi_flags == next->psi_flags;
 		iter = NULL;
 		while ((group = iterate_groups(next, &iter))) {
-			if (identical_state &&
-			    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
 				common = group;
 				break;
 			}
@@ -883,7 +879,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
 		 * with dequeuing too, finish that for the rest of the hierarchy.
 		 */
-		if (sleep) {
+		if (sleep || unlikely(prev->in_memstall != next->in_memstall)) {
 			clear &= ~TSK_ONCPU;
 			for (; group; group = iterate_groups(prev, &iter))
 				psi_group_change(group, cpu, clear, set, now, wake_clock);
-- 
2.36.1
Thread overview: 48+ messages
2022-08-08 11:03 [PATCH v2 00/10] sched/psi: some optimization and extension Chengming Zhou
2022-08-08 11:03 ` Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 01/10] sched/psi: fix periodic aggregation shut off Chengming Zhou
2022-08-08 11:03 ` Chengming Zhou [this message]
2022-08-08 11:03 ` [PATCH v2 02/10] sched/psi: optimize task switch inside shared cgroups again Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 03/10] sched/psi: move private helpers to sched/stats.h Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 04/10] sched/psi: don't change task psi_flags when migrate CPU/group Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 05/10] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 06/10] sched/psi: save percpu memory when !psi_cgroups_enabled Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 07/10] sched/psi: remove NR_ONCPU task accounting Chengming Zhou
2022-08-16 10:40 ` Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 08/10] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface Chengming Zhou
2022-08-09 17:48 ` Tejun Heo
2022-08-09 17:48 ` Tejun Heo
2022-08-10 0:39 ` Chengming Zhou
2022-08-10 0:39 ` Chengming Zhou
2022-08-10 1:30 ` Chengming Zhou
2022-08-10 1:30 ` Chengming Zhou
2022-08-10 15:25 ` Johannes Weiner
2022-08-10 17:27 ` Tejun Heo
2022-08-11 2:09 ` Chengming Zhou
2022-08-15 13:23 ` Michal Koutný
2022-08-15 13:23 ` Michal Koutný
2022-08-23 6:18 ` Chengming Zhou
2022-08-23 6:18 ` Chengming Zhou
2022-08-23 15:35 ` Johannes Weiner
2022-08-23 15:43 ` Chengming Zhou
2022-08-23 15:43 ` Chengming Zhou
2022-08-23 16:20 ` Tejun Heo
2022-08-23 16:20 ` Tejun Heo
2022-08-12 10:14 ` Michal Koutný
2022-08-12 10:14 ` Michal Koutný
2022-08-12 12:36 ` Chengming Zhou
2022-08-12 12:36 ` Chengming Zhou
2022-08-15 13:23 ` Michal Koutný
2022-08-15 15:49 ` Johannes Weiner
2022-08-15 19:50 ` Tejun Heo
2022-08-15 19:50 ` Tejun Heo
2022-08-16 13:06 ` Chengming Zhou
2022-08-16 13:06 ` Chengming Zhou
2022-08-08 11:03 ` [PATCH v2 10/10] sched/psi: cache parent psi_group to speed up groups iterate Chengming Zhou
2022-08-15 13:25 ` [PATCH v2 00/10] sched/psi: some optimization and extension Michal Koutný
2022-08-15 13:25 ` Michal Koutný
2022-08-16 14:01 ` Chengming Zhou
2022-08-16 14:01 ` Chengming Zhou
2022-08-17 15:19 ` Chengming Zhou
2022-08-17 15:19 ` Chengming Zhou