Re: [PATCH] Make SPLIT_RSS_COUNTING configurable

From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Daniel Colascione <dancol@google.com>
Cc: Michal Hocko <mhocko@kernel.org>, Qian Cai <cai@lca.pw>,
	Tim Murray <timmurray@google.com>,
	Suren Baghdasaryan <surenb@google.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: [PATCH] Make SPLIT_RSS_COUNTING configurable
Date: Mon, 7 Oct 2019 03:11:13 +0300	[thread overview]
Message-ID: <20191007001113.ad4lbxqjxexo5svq@box.shutemov.name> (raw)
In-Reply-To: <CAKOZuesqgEBSNvpsdw1QhVvYBNBUjAL0pu1x_b-C5q22Z7BZ4g@mail.gmail.com>

On Fri, Oct 04, 2019 at 06:45:21AM -0700, Daniel Colascione wrote:
> On Fri, Oct 4, 2019 at 6:26 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > On Fri, Oct 04, 2019 at 02:33:49PM +0200, Michal Hocko wrote:
> > > On Wed 02-10-19 19:08:16, Daniel Colascione wrote:
> > > > On Wed, Oct 2, 2019 at 6:56 PM Qian Cai <cai@lca.pw> wrote:
> > > > > > On Oct 2, 2019, at 4:29 PM, Daniel Colascione <dancol@google.com> wrote:
> > > > > >
> > > > > > Adding the correct linux-mm address.
> > > > > >
> > > > > >
> > > > > >> +config SPLIT_RSS_COUNTING
> > > > > >> +       bool "Per-thread mm counter caching"
> > > > > >> +       depends on MMU
> > > > > >> +       default y if NR_CPUS >= SPLIT_PTLOCK_CPUS
> > > > > >> +       help
> > > > > >> +         Cache mm counter updates in thread structures and
> > > > > >> +         flush them to visible per-process statistics in batches.
> > > > > >> +         Say Y here to slightly reduce cache contention in processes
> > > > > >> +         with many threads at the expense of decreasing the accuracy
> > > > > >> +         of memory statistics in /proc.
> > > > > >> +
> > > > > >> endmenu
> > > > >
> > > > > All those vague words are going to make developers almost
> > > > > impossible to decide the right selection here. It sounds like we
> > > > > should kill SPLIT_RSS_COUNTING at all to simplify the code as
> > > > > the benefit is so small vs the side-effect?
> > > >
> > > > Killing SPLIT_RSS_COUNTING would be my first choice; IME, on mobile
> > > > and a basic desktop, it doesn't make a difference. I figured making it
> > > > a knob would help allay concerns about the performance impact in more
> > > > extreme configurations.
> > >
> > > I do agree with Qian. Either it is really helpful (is it? probably on
> > > the number of cpus) and it should be auto-enabled or it should be
> > > dropped altogether. You cannot really expect people know how to enable
> > > this without a deep understanding of the MM internals. Not to mention
> > > all those users using distro kernels/configs.
> > >
> > > A config option sounds like a bad way forward.
> >
> > And I don't see much point anyway. Reading RSS counters from proc is
> > inherently racy. It can just either way after the read due to process
> > behaviour.
> 
> Split RSS accounting doesn't make reading from mm counters racy. It
> makes these counters *wrong*. We flush task mm counters to the
> mm_struct once every 64 page faults that a task incurs or when that
> task exits. That means that if a thread takes 63 page faults and then
> sleeps for a week, that thread's process's mm counters are wrong by 63
> pages *for a week*. And some processes have a lot of threads,
> compounding the error. Split RSS accounting means that memory usage
> numbers don't add up. I don't think it's unreasonable to want a mode
> where memory counters to agree with other indicators of system
> activity.

It's documented behaviour that is upstream for 9 years. Why is your workload
special?

The documentation suggests to use smaps if you want to have precise data.
Why would it not fly for you?

> Nobody has demonstrated that split RSS accounting actually helps in
> the real world.

The original commit 34e55232e59f ("mm: avoid false sharing of mm_counter")
shows numbers on cache misses. It's not a real world workload, but you
don't have any numbers at all to back your claim.

> But I've described above, concretely, how split RSS
> accounting hurts. I've been trying for over a year to either disable
> split RSS accounting or to let people opt out of it. If you won't
> remove split RSS accounting and you won't let me add a configuration
> knob that lets people opt out of it, what will you accept?

Keeping stats precise is welcome, but often expensive. It might be
negligible for small machine, but becomes a problem on multisocket
machine with dozens or hundreds of cores. We need to keep kernel scalable.

We have other stats that update asynchronously (i.e. /proc/vmstat). Would
you like to convert them too?

-- 
 Kirill A. Shutemov