Re: [PATCH] mm: Sync percpu mm RSS counters before querying

From: Andrew Morton <akpm@linux-foundation.org>
To: "Michal Koutný" <mkoutny@suse.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Christian Brauner <brauner@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Mike Christie <michael.christie@oracle.com>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Nicholas Piggin <npiggin@gmail.com>,
	Andrei Vagin <avagin@gmail.com>,
	Shakeel Butt <shakeelb@google.com>, Adam Majer <amajer@suse.com>,
	Jan Kara <jack@suse.cz>, Michal Hocko <mhocko@kernel.org>
Subject: Re: [PATCH] mm: Sync percpu mm RSS counters before querying
Date: Fri, 16 Jun 2023 12:31:44 -0700	[thread overview]
Message-ID: <20230616123144.bd2a8120dab25736c5c37297@linux-foundation.org> (raw)
In-Reply-To: <20230616180718.17725-1-mkoutny@suse.com>

On Fri, 16 Jun 2023 20:07:18 +0200 Michal Koutný <mkoutny@suse.com> wrote:

> An issue was observed with stats collected in struct rusage on ppc64le
> with 64kB pages. The percpu counters use batching with
> 	percpu_counter_batch = max(32, nr*2) # in PAGE_SIZE
> i.e. with larger pages but similar RSS consumption (bytes), there'll be
> less flushes and error more noticeable.

A fully detailed description of the issue would be helpful.  Obviously
"inaccuracy", but how bad?

> In this given case (getting consumption of exited child), we can request
> percpu counter's flush without worrying about contention with updaters.
> 
> Fortunately, the commit f1a7941243c1 ("mm: convert mm's rss stats into
> percpu_counter") didn't eradicate all traces of SPLIT_RSS_COUNTING and
> this mechanism already provided some synchronization points before
> reading stats.
> Therefore, use sync_mm_rss as carrier for percpu counters refreshes and
> forget SPLIT_RSS_COUNTING macro for good.
> 
> Impact of summing on a 8 CPU machine:
> Benchmark 1: taskset -c 1 ./shell-bench.sh
> 
> Before
>   Time (mean ± σ):      9.950 s ±  0.052 s    [User: 7.773 s, System: 2.023 s]
> 
> After
>   Time (mean ± σ):      9.990 s ±  0.070 s    [User: 7.825 s, System: 2.011 s]
> 
> cat >shell-bench.sh <<EOD
> for (( i = 0; i < 20000; i++ )); do
> 	/bin/true
> done
> EOD
> 
> The script is meant to stress fork-exit path (exit is where sync_mm_rss
> is most called, add_mm_rss_vec should be covered in fork).
> 
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2547,13 +2547,12 @@ static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
>  		*maxrss = hiwater_rss;
>  }
>  
> -#if defined(SPLIT_RSS_COUNTING)
> -void sync_mm_rss(struct mm_struct *mm);
> -#else
>  static inline void sync_mm_rss(struct mm_struct *mm)
>  {
> +	for (int i = 0; i < NR_MM_COUNTERS; ++i)
> +		percpu_counter_set(&mm->rss_stat[i],
> +				   percpu_counter_sum(&mm->rss_stat[i]));
>  }
> -#endif

Far too large to be inlined!  For six callsites it adds 1kb of text.

Why even modify the counter?  Can't <whatever this issue is> be
addressed by using percpu_counter_sum() in an appropriate place?

For unknown reasons percpu_counter_set() uses for_each_possible_cpu(). 
Probably just a mistake - percpu_counters are hotplug-aware and
for_each_online_cpu should suffice.

I'm really not liking percpu_counter_set().  It's only safe in
situations where the caller knows that no other CPU can be modifying
the counter.  I wonder if all the callers know that.  This situation
isn't aided by the lack of any documentation.