From: Tejun Heo <tj@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Chengming Zhou <zhouchengming@bytedance.com>,
	surenb@google.com, mingo@redhat.com, peterz@infradead.org,
	corbet@lwn.net, akpm@linux-foundation.org, rdunlap@infradead.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	songmuchun@bytedance.com, cgroups@vger.kernel.org
Subject: Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup
Date: Tue, 26 Jul 2022 07:54:34 -1000
Message-ID: <YuAqWprKd6NsWs7C@slm.duckdns.org>
In-Reply-To: <Yt7KQc0nnOypB2b2@cmpxchg.org>

Hello,

On Mon, Jul 25, 2022 at 12:52:17PM -0400, Johannes Weiner wrote:
> On Thu, Jul 21, 2022 at 12:04:38PM +0800, Chengming Zhou wrote:
> > PSI accounts stalls for each cgroup separately and aggregates them
> > at each level of the hierarchy. This may cause non-negligible overhead
> > for some workloads that run deep in the hierarchy.
> > 
> > commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
> > makes PSI skip per-cgroup stall accounting and account only system-wide,
> > to avoid this per-level overhead.
> > 
> > For our use case, we also want leaf cgroup PSI accounted for userspace
> > adjustment on that cgroup, apart from only system-wide management.
> 
> I hear the overhead argument. But skipping accounting in intermediate
> levels is a bit odd and unprecedented in the cgroup interface. Once we
> do this, it's conceivable people would like to do the same thing for
> other stats and accounting, like for instance memory.stat.
> 
> Tejun, what are your thoughts on this?

Given that PSI requires on-the-spot recursive accumulation, unlike other
stats, it can add quite a bit of overhead. I'm sympathetic to the argument
because PSI can't be made cheaper just by the kernel being better (or at
least we don't know how to do that yet).

That said, "leaf-only" feels really hacky to me. My memory is hazy but
there's nothing preventing any cgroup from being skipped over when updating
PSI states, right? The state count propagation is recursive but it's each
task's state being propagated upwards not the child cgroup's, so we can skip
over any cgroup arbitrarily, i.e. we can at least turn off PSI reporting on
any given cgroup without worrying about affecting others. Am I correct?
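
To make the question concrete, here is a minimal userspace sketch of the
propagation as described above. It is not the kernel's PSI code: struct group,
the enabled flag, and task_change() are hypothetical names for illustration
only. Each task state change walks the ancestor chain and updates every
enabled group directly, so a disabled group is simply stepped over and its
ancestors still see the change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; not the kernel's struct psi_group. */
enum { NR_STATES = 2 };            /* e.g. 0 = running, 1 = memory stall */

struct group {
    struct group *parent;          /* NULL at the root */
    bool enabled;                  /* assumed per-group accounting switch */
    unsigned int tasks[NR_STATES]; /* tasks per state, at or below this group */
};

/*
 * Propagate one task's state change to every enabled ancestor.
 * Pass -1 for 'clear' or 'set' when there is no state to leave or enter.
 */
static void task_change(struct group *leaf, int clear, int set)
{
    for (struct group *g = leaf; g; g = g->parent) {
        if (!g->enabled)
            continue;              /* skipped group: ancestors are unaffected */
        if (clear >= 0) {
            assert(g->tasks[clear] > 0);
            g->tasks[clear]--;
        }
        if (set >= 0)
            g->tasks[set]++;
    }
}
```

In this model the counters of a parent never depend on the counters of a
child, only on the task itself, which is what would make skipping safe.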

Assuming the above isn't wrong, if we can figure out how we can re-enable
it, which is more difficult as the counters need to be resynchronized with
the current state, that'd be ideal. Then, we can just allow each cgroup to
enable / disable PSI reporting dynamically as they see fit.
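
A rough userspace sketch of what the resynchronization could look like,
assuming we can enumerate the tasks and their current states. Again, this is
not kernel code: struct pgroup, struct ptask, in_subtree() and group_enable()
are made-up names, and a real implementation would also have to exclude
concurrent state changes while rebuilding, which this sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model; names do not exist in the kernel. */
enum { NR_STATES = 2 };

struct pgroup {
    struct pgroup *parent;         /* NULL at the root */
    bool enabled;
    unsigned int tasks[NR_STATES]; /* stale while the group is disabled */
};

struct ptask {
    struct pgroup *group;          /* cgroup the task belongs to */
    int state;                     /* current state index, or -1 if none */
};

/* True if g is anc itself or a descendant of anc. */
static bool in_subtree(const struct pgroup *g, const struct pgroup *anc)
{
    for (; g; g = g->parent)
        if (g == anc)
            return true;
    return false;
}

/* Re-enable accounting: discard stale counts, rebuild from current tasks. */
static void group_enable(struct pgroup *g, const struct ptask *tasks, size_t n)
{
    memset(g->tasks, 0, sizeof(g->tasks));
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].state < 0 || !in_subtree(tasks[i].group, g))
            continue;
        assert(tasks[i].state < NR_STATES);
        g->tasks[tasks[i].state]++;
    }
    g->enabled = true;
}
```

Disabling in this model is just clearing the flag; all the cost lands on the
enable side, which matches the asymmetry noted above.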

Thanks.

-- 
tejun
