linux-mm.kvack.org archive mirror
* [RFC PATCH v2 0/2] mm: convert mm's rss stats to use atomic mode
@ 2024-04-18 14:20 Peng Zhang
  2024-04-18 14:20 ` [RFC PATCH v2 1/2] percpu_counter: introduce atomic mode for percpu_counter Peng Zhang
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Peng Zhang @ 2024-04-18 14:20 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, dennisszhou, shakeelb, jack, surenb, kent.overstreet,
	mhocko, vbabka, yuzhao, yu.ma, wangkefeng.wang, sunnanyong,
	zhangpeng362

From: ZhangPeng <zhangpeng362@huawei.com>

Since commit f1a7941243c1 ("mm: convert mm's rss stats into
percpu_counter"), the rss_stats have been converted into percpu_counter,
which changes the error margin from (nr_threads * 64) to approximately
(nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
performance regression on fork/exec/shell. Even after commit 14ef95be6f55
("kernel/fork: group allocation/free of per-cpu counters for mm struct"),
the performance of fork/exec/shell is still poor compared to previous
kernel versions.

To mitigate the performance regression, we delay the allocation of percpu
memory for rss_stats. Therefore, we convert mm's rss stats to use
percpu_counter's atomic mode. For single-threaded processes, rss_stat
stays in atomic mode, which avoids the memory consumption and performance
regression caused by percpu memory. For multi-threaded processes,
rss_stat is switched to percpu mode to reduce the error margin. The
switch from atomic mode to percpu mode happens only when the second
thread is created.
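
For illustration, a condensed sketch of the resulting counter update
path (taken from patch 2 of this series, with tracing omitted):

	static inline void add_mm_counter(struct mm_struct *mm, int member,
					  long value)
	{
		struct percpu_counter *fbc = &mm->rss_stat[member];

		if (percpu_counter_initialized(fbc))
			/* percpu mode: set up once the second thread exists */
			percpu_counter_add(fbc, value);
		else
			/* atomic mode: no percpu memory allocated yet */
			percpu_counter_atomic_add(fbc, value);
	}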

In lmbench tests, we get a 2% ~ 4% performance improvement for
fork_proc/exec_proc/shell_proc and a 6.7% performance improvement for
page_fault (measured before the batch mode series [1]).

The test results are as follows:
             base           base+revert        base+this patch

fork_proc    416.3ms        400.0ms  (3.9%)    398.6ms  (4.2%)
exec_proc    2095.9ms       2061.1ms (1.7%)    2047.7ms (2.3%)
shell_proc   3028.2ms       2954.7ms (2.4%)    2961.2ms (2.2%)
page_fault   0.3603ms       0.3358ms (6.8%)    0.3361ms (6.7%)

[1] https://lore.kernel.org/all/20240412064751.119015-1-wangkefeng.wang@huawei.com/

ChangeLog:
v1->v2:
- Convert rss_stats from atomic mode to percpu mode only when
  the second thread is created, as suggested by Jan Kara.
- Compared with v1, the performance data may differ because a
  different test machine was used.

ZhangPeng (2):
  percpu_counter: introduce atomic mode for percpu_counter
  mm: convert mm's rss stats to use atomic mode

 include/linux/mm.h             | 50 +++++++++++++++++++++++++++++-----
 include/linux/percpu_counter.h | 43 +++++++++++++++++++++++++++--
 include/trace/events/kmem.h    |  4 +--
 kernel/fork.c                  | 18 +++++++-----
 lib/percpu_counter.c           | 31 +++++++++++++++++++--
 5 files changed, 125 insertions(+), 21 deletions(-)

-- 
2.25.1




* [RFC PATCH v2 1/2] percpu_counter: introduce atomic mode for percpu_counter
  2024-04-18 14:20 [RFC PATCH v2 0/2] mm: convert mm's rss stats to use atomic mode Peng Zhang
@ 2024-04-18 14:20 ` Peng Zhang
  2024-04-18 19:40   ` Andrew Morton
  2024-04-26  8:11   ` Dennis Zhou
  2024-04-18 14:20 ` [RFC PATCH v2 2/2] mm: convert mm's rss stats to use atomic mode Peng Zhang
  2024-04-24  4:29 ` [RFC PATCH v2 0/2] " zhangpeng (AS)
  2 siblings, 2 replies; 13+ messages in thread
From: Peng Zhang @ 2024-04-18 14:20 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, dennisszhou, shakeelb, jack, surenb, kent.overstreet,
	mhocko, vbabka, yuzhao, yu.ma, wangkefeng.wang, sunnanyong,
	zhangpeng362

From: ZhangPeng <zhangpeng362@huawei.com>

Depending on whether the counters pointer is NULL, a percpu_counter can
now operate in two modes: atomic mode and percpu mode. Both modes are
implemented by grouping the s64 count and the atomic64_t count_atomic in
a union, so the value accumulated in atomic mode is preserved when the
counter is later switched to percpu mode. We also add interfaces for
adding and reading in atomic mode and for switching from atomic mode to
percpu mode.
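
A minimal usage sketch (illustrative only, condensed from the diff
below): a zero-initialized counter starts in atomic mode, and the
switch to percpu mode is one-way:

	struct percpu_counter fbc = {};	/* counters == NULL: atomic mode */
	s64 val;

	percpu_counter_atomic_add(&fbc, 1);	/* no percpu memory needed */
	val = percpu_counter_atomic_read(&fbc);	/* val == 1 */

	/* On allocation failure the counter simply stays in atomic mode. */
	if (!percpu_counter_switch_to_pcpu_many(&fbc, 1))
		percpu_counter_add(&fbc, 1);	/* now in percpu mode */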

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 include/linux/percpu_counter.h | 43 +++++++++++++++++++++++++++++++---
 lib/percpu_counter.c           | 31 ++++++++++++++++++++++--
 2 files changed, 69 insertions(+), 5 deletions(-)

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 3a44dd1e33d2..160f9734c0bb 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -21,7 +21,13 @@
 
 struct percpu_counter {
 	raw_spinlock_t lock;
-	s64 count;
+	/* Depending on whether counters is NULL, we can support two modes,
+	 * atomic mode using count_atomic and percpu mode using count.
+	 */
+	union {
+		s64 count;
+		atomic64_t count_atomic;
+	};
 #ifdef CONFIG_HOTPLUG_CPU
 	struct list_head list;	/* All percpu_counters are on a list */
 #endif
@@ -32,14 +38,14 @@ extern int percpu_counter_batch;
 
 int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount,
 			       gfp_t gfp, u32 nr_counters,
-			       struct lock_class_key *key);
+			       struct lock_class_key *key, bool switch_mode);
 
 #define percpu_counter_init_many(fbc, value, gfp, nr_counters)		\
 	({								\
 		static struct lock_class_key __key;			\
 									\
 		__percpu_counter_init_many(fbc, value, gfp, nr_counters,\
-					   &__key);			\
+					   &__key, false);		\
 	})
 
 
@@ -130,6 +136,20 @@ static inline bool percpu_counter_initialized(struct percpu_counter *fbc)
 	return (fbc->counters != NULL);
 }
 
+static inline s64 percpu_counter_atomic_read(struct percpu_counter *fbc)
+{
+	return atomic64_read(&fbc->count_atomic);
+}
+
+static inline void percpu_counter_atomic_add(struct percpu_counter *fbc,
+					     s64 amount)
+{
+	atomic64_add(amount, &fbc->count_atomic);
+}
+
+int percpu_counter_switch_to_pcpu_many(struct percpu_counter *fbc,
+				       u32 nr_counters);
+
 #else /* !CONFIG_SMP */
 
 struct percpu_counter {
@@ -260,6 +280,23 @@ static inline bool percpu_counter_initialized(struct percpu_counter *fbc)
 static inline void percpu_counter_sync(struct percpu_counter *fbc)
 {
 }
+
+static inline s64 percpu_counter_atomic_read(struct percpu_counter *fbc)
+{
+	return fbc->count;
+}
+
+static inline void percpu_counter_atomic_add(struct percpu_counter *fbc,
+					     s64 amount)
+{
+	percpu_counter_add(fbc, amount);
+}
+
+static inline int percpu_counter_switch_to_pcpu_many(struct percpu_counter *fbc,
+						     u32 nr_counters)
+{
+	return 0;
+}
 #endif	/* CONFIG_SMP */
 
 static inline void percpu_counter_inc(struct percpu_counter *fbc)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 44dd133594d4..95c4e038051a 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -153,7 +153,7 @@ EXPORT_SYMBOL(__percpu_counter_sum);
 
 int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount,
 			       gfp_t gfp, u32 nr_counters,
-			       struct lock_class_key *key)
+			       struct lock_class_key *key, bool switch_mode)
 {
 	unsigned long flags __maybe_unused;
 	size_t counter_size;
@@ -174,7 +174,8 @@ int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount,
 #ifdef CONFIG_HOTPLUG_CPU
 		INIT_LIST_HEAD(&fbc[i].list);
 #endif
-		fbc[i].count = amount;
+		if (likely(!switch_mode))
+			fbc[i].count = amount;
 		fbc[i].counters = (void *)counters + (i * counter_size);
 
 		debug_percpu_counter_activate(&fbc[i]);
@@ -357,6 +358,32 @@ bool __percpu_counter_limited_add(struct percpu_counter *fbc,
 	return good;
 }
 
+/*
+ * percpu_counter_switch_to_pcpu_many: Converts struct percpu_counters from
+ * atomic mode to percpu mode.
+ */
+int percpu_counter_switch_to_pcpu_many(struct percpu_counter *fbc,
+				       u32 nr_counters)
+{
+	static struct lock_class_key __key;
+	unsigned long flags;
+	int ret = 0;
+
+	if (percpu_counter_initialized(fbc))
+		return 0;
+
+	preempt_disable();
+	local_irq_save(flags);
+	if (likely(!percpu_counter_initialized(fbc)))
+		ret = __percpu_counter_init_many(fbc, 0,
+					GFP_ATOMIC|__GFP_NOWARN|__GFP_ZERO,
+					nr_counters, &__key, true);
+	local_irq_restore(flags);
+	preempt_enable();
+
+	return ret;
+}
+
 static int __init percpu_counter_startup(void)
 {
 	int ret;
-- 
2.25.1




* [RFC PATCH v2 2/2] mm: convert mm's rss stats to use atomic mode
  2024-04-18 14:20 [RFC PATCH v2 0/2] mm: convert mm's rss stats to use atomic mode Peng Zhang
  2024-04-18 14:20 ` [RFC PATCH v2 1/2] percpu_counter: introduce atomic mode for percpu_counter Peng Zhang
@ 2024-04-18 14:20 ` Peng Zhang
  2024-04-19  2:30   ` Rongwei Wang
  2024-04-24  4:29 ` [RFC PATCH v2 0/2] " zhangpeng (AS)
  2 siblings, 1 reply; 13+ messages in thread
From: Peng Zhang @ 2024-04-18 14:20 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, dennisszhou, shakeelb, jack, surenb, kent.overstreet,
	mhocko, vbabka, yuzhao, yu.ma, wangkefeng.wang, sunnanyong,
	zhangpeng362

From: ZhangPeng <zhangpeng362@huawei.com>

Since commit f1a7941243c1 ("mm: convert mm's rss stats into
percpu_counter"), the rss_stats have been converted into percpu_counter,
which changes the error margin from (nr_threads * 64) to approximately
(nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
performance regression on fork/exec/shell. Even after commit 14ef95be6f55
("kernel/fork: group allocation/free of per-cpu counters for mm struct"),
the performance of fork/exec/shell is still poor compared to previous
kernel versions.

To mitigate the performance regression, we delay the allocation of percpu
memory for rss_stats. Therefore, we convert mm's rss stats to use
percpu_counter's atomic mode. For single-threaded processes, rss_stat
stays in atomic mode, which avoids the memory consumption and performance
regression caused by percpu memory. For multi-threaded processes,
rss_stat is switched to percpu mode to reduce the error margin. The
switch from atomic mode to percpu mode happens only when the second
thread is created.

In lmbench tests, we get a 2% ~ 4% performance improvement for
fork_proc/exec_proc/shell_proc and a 6.7% performance improvement for
page_fault (measured before the batch mode series [1]).

The test results are as follows:

             base           base+revert        base+this patch

fork_proc    416.3ms        400.0ms  (3.9%)    398.6ms  (4.2%)
exec_proc    2095.9ms       2061.1ms (1.7%)    2047.7ms (2.3%)
shell_proc   3028.2ms       2954.7ms (2.4%)    2961.2ms (2.2%)
page_fault   0.3603ms       0.3358ms (6.8%)    0.3361ms (6.7%)

[1] https://lore.kernel.org/all/20240412064751.119015-1-wangkefeng.wang@huawei.com/

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 include/linux/mm.h          | 50 +++++++++++++++++++++++++++++++------
 include/trace/events/kmem.h |  4 +--
 kernel/fork.c               | 18 +++++++------
 3 files changed, 56 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d261e45bb29b..8f1bfbd54697 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2631,30 +2631,66 @@ static inline bool get_user_page_fast_only(unsigned long addr,
  */
 static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
 {
-	return percpu_counter_read_positive(&mm->rss_stat[member]);
+	struct percpu_counter *fbc = &mm->rss_stat[member];
+
+	if (percpu_counter_initialized(fbc))
+		return percpu_counter_read_positive(fbc);
+
+	return percpu_counter_atomic_read(fbc);
 }
 
 void mm_trace_rss_stat(struct mm_struct *mm, int member);
 
 static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
 {
-	percpu_counter_add(&mm->rss_stat[member], value);
+	struct percpu_counter *fbc = &mm->rss_stat[member];
+
+	if (percpu_counter_initialized(fbc))
+		percpu_counter_add(fbc, value);
+	else
+		percpu_counter_atomic_add(fbc, value);
 
 	mm_trace_rss_stat(mm, member);
 }
 
 static inline void inc_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_inc(&mm->rss_stat[member]);
-
-	mm_trace_rss_stat(mm, member);
+	add_mm_counter(mm, member, 1);
 }
 
 static inline void dec_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_dec(&mm->rss_stat[member]);
+	add_mm_counter(mm, member, -1);
+}
 
-	mm_trace_rss_stat(mm, member);
+static inline s64 mm_counter_sum(struct mm_struct *mm, int member)
+{
+	struct percpu_counter *fbc = &mm->rss_stat[member];
+
+	if (percpu_counter_initialized(fbc))
+		return percpu_counter_sum(fbc);
+
+	return percpu_counter_atomic_read(fbc);
+}
+
+static inline s64 mm_counter_sum_positive(struct mm_struct *mm, int member)
+{
+	struct percpu_counter *fbc = &mm->rss_stat[member];
+
+	if (percpu_counter_initialized(fbc))
+		return percpu_counter_sum_positive(fbc);
+
+	return percpu_counter_atomic_read(fbc);
+}
+
+static inline int mm_counter_switch_to_pcpu_many(struct mm_struct *mm)
+{
+	return percpu_counter_switch_to_pcpu_many(mm->rss_stat, NR_MM_COUNTERS);
+}
+
+static inline void mm_counter_destroy_many(struct mm_struct *mm)
+{
+	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
 }
 
 /* Optimized variant when folio is already known not to be anon */
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 6e62cc64cd92..a4e40ae6a8c8 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -399,8 +399,8 @@ TRACE_EVENT(rss_stat,
 		__entry->mm_id = mm_ptr_to_hash(mm);
 		__entry->curr = !!(current->mm == mm);
 		__entry->member = member;
-		__entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
-							    << PAGE_SHIFT);
+		__entry->size = (mm_counter_sum_positive(mm, member)
+							<< PAGE_SHIFT);
 	),
 
 	TP_printk("mm_id=%u curr=%d type=%s size=%ldB",
diff --git a/kernel/fork.c b/kernel/fork.c
index 99076dbe27d8..0214273798c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -823,7 +823,7 @@ static void check_mm(struct mm_struct *mm)
 			 "Please make sure 'struct resident_page_types[]' is updated as well");
 
 	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		long x = percpu_counter_sum(&mm->rss_stat[i]);
+		long x = mm_counter_sum(mm, i);
 
 		if (unlikely(x))
 			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
@@ -1301,16 +1301,10 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm))
 		goto fail_cid;
 
-	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
-				     NR_MM_COUNTERS))
-		goto fail_pcpu;
-
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
 	return mm;
 
-fail_pcpu:
-	mm_destroy_cid(mm);
 fail_cid:
 	destroy_context(mm);
 fail_nocontext:
@@ -1730,6 +1724,16 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	if (!oldmm)
 		return 0;
 
+	/*
+	 * For single-thread processes, rss_stat is in atomic mode, which
+	 * reduces the memory consumption and performance regression caused by
+	 * using percpu. For multiple-thread processes, rss_stat is switched to
+	 * the percpu mode to reduce the error margin.
+	 */
+	if (clone_flags & CLONE_THREAD)
+		if (mm_counter_switch_to_pcpu_many(oldmm))
+			return -ENOMEM;
+
 	if (clone_flags & CLONE_VM) {
 		mmget(oldmm);
 		mm = oldmm;
-- 
2.25.1




* Re: [RFC PATCH v2 1/2] percpu_counter: introduce atomic mode for percpu_counter
  2024-04-18 14:20 ` [RFC PATCH v2 1/2] percpu_counter: introduce atomic mode for percpu_counter Peng Zhang
@ 2024-04-18 19:40   ` Andrew Morton
  2024-04-19  2:55     ` zhangpeng (AS)
  2024-04-26  8:11   ` Dennis Zhou
  1 sibling, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2024-04-18 19:40 UTC (permalink / raw)
  To: Peng Zhang
  Cc: linux-mm, linux-kernel, dennisszhou, shakeelb, jack, surenb,
	kent.overstreet, mhocko, vbabka, yuzhao, yu.ma, wangkefeng.wang,
	sunnanyong

On Thu, 18 Apr 2024 22:20:07 +0800 Peng Zhang <zhangpeng362@huawei.com> wrote:

> From: ZhangPeng <zhangpeng362@huawei.com>
> 
> Depending on whether counters is NULL, we can support two modes:
> atomic mode and perpcu mode. We implement both modes by grouping
> the s64 count and atomic64_t count_atomic in a union. At the same time,
> we create the interface for adding and reading in atomic mode and for
> switching atomic mode to percpu mode.
> 

I think it would be better if we had a detailed code comment in an
appropriate place which fully describes the tradeoffs here.  Tell
people when they would benefit from using one mode versus the other.


> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -153,7 +153,7 @@ EXPORT_SYMBOL(__percpu_counter_sum);
>  
>  int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount,
>  			       gfp_t gfp, u32 nr_counters,
> -			       struct lock_class_key *key)
> +			       struct lock_class_key *key, bool switch_mode)
>  {
>  	unsigned long flags __maybe_unused;
>  	size_t counter_size;
> @@ -174,7 +174,8 @@ int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount,
>  #ifdef CONFIG_HOTPLUG_CPU
>  		INIT_LIST_HEAD(&fbc[i].list);
>  #endif
> -		fbc[i].count = amount;
> +		if (likely(!switch_mode))
> +			fbc[i].count = amount;
>  		fbc[i].counters = (void *)counters + (i * counter_size);
>  
>  		debug_percpu_counter_activate(&fbc[i]);
> @@ -357,6 +358,32 @@ bool __percpu_counter_limited_add(struct percpu_counter *fbc,
>  	return good;
>  }
>  
> +/*
> + * percpu_counter_switch_to_pcpu_many: Converts struct percpu_counters from
> + * atomic mode to percpu mode.

Describe what happens if that GFP_ATOMIC allocation fails.  We remain
in atomic mode, yes?

> + */
> +int percpu_counter_switch_to_pcpu_many(struct percpu_counter *fbc,
> +				       u32 nr_counters)
> +{
> +	static struct lock_class_key __key;
> +	unsigned long flags;
> +	bool ret = 0;
> +
> +	if (percpu_counter_initialized(fbc))
> +		return 0;
> +
> +	preempt_disable();
> +	local_irq_save(flags);

Do we need both?  Does local_irq_save() always disable preemption? 
This might not be the case for RT kernels, I always forget.

> +	if (likely(!percpu_counter_initialized(fbc)))
> +		ret = __percpu_counter_init_many(fbc, 0,
> +					GFP_ATOMIC|__GFP_NOWARN|__GFP_ZERO,
> +					nr_counters, &__key, true);
> +	local_irq_restore(flags);
> +	preempt_enable();
> +
> +	return ret;
> +}
> +

Why is there no API for switching back to atomic mode?




* Re: [RFC PATCH v2 2/2] mm: convert mm's rss stats to use atomic mode
  2024-04-18 14:20 ` [RFC PATCH v2 2/2] mm: convert mm's rss stats to use atomic mode Peng Zhang
@ 2024-04-19  2:30   ` Rongwei Wang
  2024-04-19  3:32     ` zhangpeng (AS)
  0 siblings, 1 reply; 13+ messages in thread
From: Rongwei Wang @ 2024-04-19  2:30 UTC (permalink / raw)
  To: Peng Zhang, linux-mm, linux-kernel
  Cc: akpm, dennisszhou, shakeelb, jack, surenb, kent.overstreet,
	mhocko, vbabka, yuzhao, yu.ma, wangkefeng.wang, sunnanyong



On 2024/4/18 22:20, Peng Zhang wrote:
> From: ZhangPeng <zhangpeng362@huawei.com>
>
> Since commit f1a7941243c1 ("mm: convert mm's rss stats into
> percpu_counter"), the rss_stats have converted into percpu_counter,
> which convert the error margin from (nr_threads * 64) to approximately
> (nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
> performance regression on fork/exec/shell. Even after commit 14ef95be6f55
> ("kernel/fork: group allocation/free of per-cpu counters for mm struct"),
> the performance of fork/exec/shell is still poor compared to previous
> kernel versions.
>
> To mitigate performance regression, we delay the allocation of percpu
> memory for rss_stats. Therefore, we convert mm's rss stats to use
> percpu_counter atomic mode. For single-thread processes, rss_stat is in
> atomic mode, which reduces the memory consumption and performance
> regression caused by using percpu. For multiple-thread processes,
> rss_stat is switched to the percpu mode to reduce the error margin.
> We convert rss_stats from atomic mode to percpu mode only when the
> second thread is created.
Hi, Zhang Peng

We also found this regression in lmbench recently. I have not tested
your patch, but it seems it will solve a lot of it.
And I see this patch does not fix the regression for multi-threaded
processes; is that because rss_stat is switched to percpu mode?
(If I'm wrong, please correct me.) It also seems percpu_counter has a
bad effect in exit_mmap().

If so, I'm wondering if we can further improve the exit_mmap() path in
the multi-threaded scenario, e.g. by determining which CPUs the process
has run on (mm_cpumask()? I'm not sure).
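
Something roughly like the following, untested, just to sketch the idea
(locking and CPU-hotplug handling elided):

	s64 sum = fbc->count;
	int cpu;

	/* only walk the CPUs this mm has actually run on */
	for_each_cpu(cpu, mm_cpumask(mm))
		sum += *per_cpu_ptr(fbc->counters, cpu);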

>
> After lmbench test, we can get 2% ~ 4% performance improvement
> for lmbench fork_proc/exec_proc/shell_proc and 6.7% performance
> improvement for lmbench page_fault (before batch mode[1]).
>
> The test results are as follows:
>
>               base           base+revert        base+this patch
>
> fork_proc    416.3ms        400.0ms  (3.9%)    398.6ms  (4.2%)
> exec_proc    2095.9ms       2061.1ms (1.7%)    2047.7ms (2.3%)
> shell_proc   3028.2ms       2954.7ms (2.4%)    2961.2ms (2.2%)
> page_fault   0.3603ms       0.3358ms (6.8%)    0.3361ms (6.7%)
I think the regression will become more obvious with more cores. How
many cores does your test machine have?

Thanks,
-wrw
>
> [1] https://lore.kernel.org/all/20240412064751.119015-1-wangkefeng.wang@huawei.com/
>
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>   include/linux/mm.h          | 50 +++++++++++++++++++++++++++++++------
>   include/trace/events/kmem.h |  4 +--
>   kernel/fork.c               | 18 +++++++------
>   3 files changed, 56 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d261e45bb29b..8f1bfbd54697 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2631,30 +2631,66 @@ static inline bool get_user_page_fast_only(unsigned long addr,
>    */
>   static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
>   {
> -	return percpu_counter_read_positive(&mm->rss_stat[member]);
> +	struct percpu_counter *fbc = &mm->rss_stat[member];
> +
> +	if (percpu_counter_initialized(fbc))
> +		return percpu_counter_read_positive(fbc);
> +
> +	return percpu_counter_atomic_read(fbc);
>   }
>   
>   void mm_trace_rss_stat(struct mm_struct *mm, int member);
>   
>   static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
>   {
> -	percpu_counter_add(&mm->rss_stat[member], value);
> +	struct percpu_counter *fbc = &mm->rss_stat[member];
> +
> +	if (percpu_counter_initialized(fbc))
> +		percpu_counter_add(fbc, value);
> +	else
> +		percpu_counter_atomic_add(fbc, value);
>   
>   	mm_trace_rss_stat(mm, member);
>   }
>   
>   static inline void inc_mm_counter(struct mm_struct *mm, int member)
>   {
> -	percpu_counter_inc(&mm->rss_stat[member]);
> -
> -	mm_trace_rss_stat(mm, member);
> +	add_mm_counter(mm, member, 1);
>   }
>   
>   static inline void dec_mm_counter(struct mm_struct *mm, int member)
>   {
> -	percpu_counter_dec(&mm->rss_stat[member]);
> +	add_mm_counter(mm, member, -1);
> +}
>   
> -	mm_trace_rss_stat(mm, member);
> +static inline s64 mm_counter_sum(struct mm_struct *mm, int member)
> +{
> +	struct percpu_counter *fbc = &mm->rss_stat[member];
> +
> +	if (percpu_counter_initialized(fbc))
> +		return percpu_counter_sum(fbc);
> +
> +	return percpu_counter_atomic_read(fbc);
> +}
> +
> +static inline s64 mm_counter_sum_positive(struct mm_struct *mm, int member)
> +{
> +	struct percpu_counter *fbc = &mm->rss_stat[member];
> +
> +	if (percpu_counter_initialized(fbc))
> +		return percpu_counter_sum_positive(fbc);
> +
> +	return percpu_counter_atomic_read(fbc);
> +}
> +
> +static inline int mm_counter_switch_to_pcpu_many(struct mm_struct *mm)
> +{
> +	return percpu_counter_switch_to_pcpu_many(mm->rss_stat, NR_MM_COUNTERS);
> +}
> +
> +static inline void mm_counter_destroy_many(struct mm_struct *mm)
> +{
> +	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
>   }
>   
>   /* Optimized variant when folio is already known not to be anon */
> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
> index 6e62cc64cd92..a4e40ae6a8c8 100644
> --- a/include/trace/events/kmem.h
> +++ b/include/trace/events/kmem.h
> @@ -399,8 +399,8 @@ TRACE_EVENT(rss_stat,
>   		__entry->mm_id = mm_ptr_to_hash(mm);
>   		__entry->curr = !!(current->mm == mm);
>   		__entry->member = member;
> -		__entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
> -							    << PAGE_SHIFT);
> +		__entry->size = (mm_counter_sum_positive(mm, member)
> +							<< PAGE_SHIFT);
>   	),
>   
>   	TP_printk("mm_id=%u curr=%d type=%s size=%ldB",
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 99076dbe27d8..0214273798c5 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -823,7 +823,7 @@ static void check_mm(struct mm_struct *mm)
>   			 "Please make sure 'struct resident_page_types[]' is updated as well");
>   
>   	for (i = 0; i < NR_MM_COUNTERS; i++) {
> -		long x = percpu_counter_sum(&mm->rss_stat[i]);
> +		long x = mm_counter_sum(mm, i);
>   
>   		if (unlikely(x))
>   			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
> @@ -1301,16 +1301,10 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>   	if (mm_alloc_cid(mm))
>   		goto fail_cid;
>   
> -	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
> -				     NR_MM_COUNTERS))
> -		goto fail_pcpu;
> -
>   	mm->user_ns = get_user_ns(user_ns);
>   	lru_gen_init_mm(mm);
>   	return mm;
>   
> -fail_pcpu:
> -	mm_destroy_cid(mm);
>   fail_cid:
>   	destroy_context(mm);
>   fail_nocontext:
> @@ -1730,6 +1724,16 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
>   	if (!oldmm)
>   		return 0;
>   
> +	/*
> +	 * For single-thread processes, rss_stat is in atomic mode, which
> +	 * reduces the memory consumption and performance regression caused by
> +	 * using percpu. For multiple-thread processes, rss_stat is switched to
> +	 * the percpu mode to reduce the error margin.
> +	 */
> +	if (clone_flags & CLONE_THREAD)
> +		if (mm_counter_switch_to_pcpu_many(oldmm))
> +			return -ENOMEM;
> +
>   	if (clone_flags & CLONE_VM) {
>   		mmget(oldmm);
>   		mm = oldmm;




* Re: [RFC PATCH v2 1/2] percpu_counter: introduce atomic mode for percpu_counter
  2024-04-18 19:40   ` Andrew Morton
@ 2024-04-19  2:55     ` zhangpeng (AS)
  0 siblings, 0 replies; 13+ messages in thread
From: zhangpeng (AS) @ 2024-04-19  2:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, dennisszhou, shakeelb, jack, surenb,
	kent.overstreet, mhocko, vbabka, yuzhao, yu.ma, wangkefeng.wang,
	sunnanyong

On 2024/4/19 3:40, Andrew Morton wrote:

> On Thu, 18 Apr 2024 22:20:07 +0800 Peng Zhang <zhangpeng362@huawei.com> wrote:
>
>> From: ZhangPeng <zhangpeng362@huawei.com>
>>
>> Depending on whether counters is NULL, we can support two modes:
>> atomic mode and perpcu mode. We implement both modes by grouping
>> the s64 count and atomic64_t count_atomic in a union. At the same time,
>> we create the interface for adding and reading in atomic mode and for
>> switching atomic mode to percpu mode.
>>
> I think it would be better if we had a detailed code comment in an
> appropriate place which fully describes the tradeoffs here.  Tell
> people when they would benefit from using one mode versus the other.
>
Thanks for your reply!

Yes, of course. I would add this before the union:

	/*
	 * Depending on whether counters is NULL, we can support two modes,
	 * atomic mode using count_atomic and percpu mode using count.
	 * Single-thread processes should use atomic mode to reduce memory
	 * consumption and performance regression.
	 * Multiple-thread processes should use percpu mode to reduce the
	 * error margin.
	 */
	union {
		s64 count;
		atomic64_t count_atomic;
	};

>> --- a/lib/percpu_counter.c
>> +++ b/lib/percpu_counter.c
>> @@ -153,7 +153,7 @@ EXPORT_SYMBOL(__percpu_counter_sum);
>>   
>>   int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount,
>>   			       gfp_t gfp, u32 nr_counters,
>> -			       struct lock_class_key *key)
>> +			       struct lock_class_key *key, bool switch_mode)
>>   {
>>   	unsigned long flags __maybe_unused;
>>   	size_t counter_size;
>> @@ -174,7 +174,8 @@ int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount,
>>   #ifdef CONFIG_HOTPLUG_CPU
>>   		INIT_LIST_HEAD(&fbc[i].list);
>>   #endif
>> -		fbc[i].count = amount;
>> +		if (likely(!switch_mode))
>> +			fbc[i].count = amount;
>>   		fbc[i].counters = (void *)counters + (i * counter_size);
>>   
>>   		debug_percpu_counter_activate(&fbc[i]);
>> @@ -357,6 +358,32 @@ bool __percpu_counter_limited_add(struct percpu_counter *fbc,
>>   	return good;
>>   }
>>   
>> +/*
>> + * percpu_counter_switch_to_pcpu_many: Converts struct percpu_counters from
>> + * atomic mode to percpu mode.
> Describe what happens if that GFP_ATOMIC allocation fails.  We remain
> in atomic mode, yes?

Yes. I'll add comments like:
  * Return: 0 if percpu_counter is already in percpu mode or successfully
  * switched to percpu mode; -ENOMEM if percpu memory allocation fails and
  * percpu_counter remains in atomic mode.
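
The caller side mirrors the copy_mm() hunk in patch 2; if the switch
fails, the mm's counters simply keep operating in atomic mode:

	if (clone_flags & CLONE_THREAD)
		if (mm_counter_switch_to_pcpu_many(oldmm))
			return -ENOMEM;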

>> + */
>> +int percpu_counter_switch_to_pcpu_many(struct percpu_counter *fbc,
>> +				       u32 nr_counters)
>> +{
>> +	static struct lock_class_key __key;
>> +	unsigned long flags;
>> +	bool ret = 0;
>> +
>> +	if (percpu_counter_initialized(fbc))
>> +		return 0;
>> +
>> +	preempt_disable();
>> +	local_irq_save(flags);
> Do we need both?  Does local_irq_save() always disable preemption?
> This might not be the case for RT kernels, I always forget.

Yes, we need both. For RT kernels, local_irq_save() doesn't mean
that preemption is disabled.

>> +	if (likely(!percpu_counter_initialized(fbc)))
>> +		ret = __percpu_counter_init_many(fbc, 0,
>> +					GFP_ATOMIC|__GFP_NOWARN|__GFP_ZERO,
>> +					nr_counters, &__key, true);
>> +	local_irq_restore(flags);
>> +	preempt_enable();
>> +
>> +	return ret;
>> +}
>> +
> Why is there no API for switching back to atomic mode?

At least for the current test scenario, we don't need to switch back
to atomic mode. A scenario for switching back would be a multi-threaded
process dropping back to a single thread once its other threads exit.
I could add this API later if needed.

-- 
Best Regards,
Peng




* Re: [RFC PATCH v2 2/2] mm: convert mm's rss stats to use atomic mode
  2024-04-19  2:30   ` Rongwei Wang
@ 2024-04-19  3:32     ` zhangpeng (AS)
  2024-04-20  3:13       ` Rongwei Wang
  0 siblings, 1 reply; 13+ messages in thread
From: zhangpeng (AS) @ 2024-04-19  3:32 UTC (permalink / raw)
  To: Rongwei Wang, linux-mm, linux-kernel
  Cc: akpm, dennisszhou, shakeelb, jack, surenb, kent.overstreet,
	mhocko, vbabka, yuzhao, yu.ma, wangkefeng.wang, sunnanyong

On 2024/4/19 10:30, Rongwei Wang wrote:

> On 2024/4/18 22:20, Peng Zhang wrote:
>> From: ZhangPeng <zhangpeng362@huawei.com>
>>
>> Since commit f1a7941243c1 ("mm: convert mm's rss stats into
>> percpu_counter"), the rss_stats have converted into percpu_counter,
>> which convert the error margin from (nr_threads * 64) to approximately
>> (nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
>> performance regression on fork/exec/shell. Even after commit 
>> 14ef95be6f55
>> ("kernel/fork: group allocation/free of per-cpu counters for mm 
>> struct"),
>> the performance of fork/exec/shell is still poor compared to previous
>> kernel versions.
>>
>> To mitigate performance regression, we delay the allocation of percpu
>> memory for rss_stats. Therefore, we convert mm's rss stats to use
>> percpu_counter atomic mode. For single-thread processes, rss_stat is in
>> atomic mode, which reduces the memory consumption and performance
>> regression caused by using percpu. For multiple-thread processes,
>> rss_stat is switched to the percpu mode to reduce the error margin.
>> We convert rss_stats from atomic mode to percpu mode only when the
>> second thread is created.
> Hi, Zhang Peng
>
> This regression we also found it in lmbench these days. I have not 
> test your patch, but it seems will solve a lot for it.
> And I see this patch not fix the regression in multi-threads, that's 
> because of the rss_stat switched to percpu mode?
> (If I'm wrong, please correct me.) And It seems percpu_counter also 
> has a bad effect in exit_mmap().
>
> If so, I'm wondering if we can further improving it on the exit_mmap() 
> path in multi-threads scenario, e.g. to determine which CPUs the 
> process has run on (mm_cpumask()? I'm not sure).
>
Hi, Rongwei,

Yes, this patch only fixes the regression for single-threaded processes.
How much of a negative effect does percpu_counter have in exit_mmap()?
IMHO, the mm counter updates there are already batched; maybe I'm
missing something?

>>
>> After lmbench test, we can get 2% ~ 4% performance improvement
>> for lmbench fork_proc/exec_proc/shell_proc and 6.7% performance
>> improvement for lmbench page_fault (before batch mode[1]).
>>
>> The test results are as follows:
>>
>>               base           base+revert        base+this patch
>>
>> fork_proc    416.3ms        400.0ms  (3.9%)    398.6ms  (4.2%)
>> exec_proc    2095.9ms       2061.1ms (1.7%)    2047.7ms (2.3%)
>> shell_proc   3028.2ms       2954.7ms (2.4%)    2961.2ms (2.2%)
>> page_fault   0.3603ms       0.3358ms (6.8%)    0.3361ms (6.7%)
> I think the regression will becomes more obvious if more cores. How 
> about your test machine?
>
Maybe multi-core is not a factor in the performance of the lmbench test here.
Both of my test machines have 96 cores.

> Thanks,
> -wrw
>>
>> [1] 
>> https://lore.kernel.org/all/20240412064751.119015-1-wangkefeng.wang@huawei.com/
>>
>> Suggested-by: Jan Kara <jack@suse.cz>
>> Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>> ---
>>   include/linux/mm.h          | 50 +++++++++++++++++++++++++++++++------
>>   include/trace/events/kmem.h |  4 +--
>>   kernel/fork.c               | 18 +++++++------
>>   3 files changed, 56 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index d261e45bb29b..8f1bfbd54697 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2631,30 +2631,66 @@ static inline bool 
>> get_user_page_fast_only(unsigned long addr,
>>    */
>>   static inline unsigned long get_mm_counter(struct mm_struct *mm, 
>> int member)
>>   {
>> -    return percpu_counter_read_positive(&mm->rss_stat[member]);
>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>> +
>> +    if (percpu_counter_initialized(fbc))
>> +        return percpu_counter_read_positive(fbc);
>> +
>> +    return percpu_counter_atomic_read(fbc);
>>   }
>>     void mm_trace_rss_stat(struct mm_struct *mm, int member);
>>     static inline void add_mm_counter(struct mm_struct *mm, int 
>> member, long value)
>>   {
>> -    percpu_counter_add(&mm->rss_stat[member], value);
>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>> +
>> +    if (percpu_counter_initialized(fbc))
>> +        percpu_counter_add(fbc, value);
>> +    else
>> +        percpu_counter_atomic_add(fbc, value);
>>         mm_trace_rss_stat(mm, member);
>>   }
>>     static inline void inc_mm_counter(struct mm_struct *mm, int member)
>>   {
>> -    percpu_counter_inc(&mm->rss_stat[member]);
>> -
>> -    mm_trace_rss_stat(mm, member);
>> +    add_mm_counter(mm, member, 1);
>>   }
>>     static inline void dec_mm_counter(struct mm_struct *mm, int member)
>>   {
>> -    percpu_counter_dec(&mm->rss_stat[member]);
>> +    add_mm_counter(mm, member, -1);
>> +}
>>   -    mm_trace_rss_stat(mm, member);
>> +static inline s64 mm_counter_sum(struct mm_struct *mm, int member)
>> +{
>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>> +
>> +    if (percpu_counter_initialized(fbc))
>> +        return percpu_counter_sum(fbc);
>> +
>> +    return percpu_counter_atomic_read(fbc);
>> +}
>> +
>> +static inline s64 mm_counter_sum_positive(struct mm_struct *mm, int 
>> member)
>> +{
>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>> +
>> +    if (percpu_counter_initialized(fbc))
>> +        return percpu_counter_sum_positive(fbc);
>> +
>> +    return percpu_counter_atomic_read(fbc);
>> +}
>> +
>> +static inline int mm_counter_switch_to_pcpu_many(struct mm_struct *mm)
>> +{
>> +    return percpu_counter_switch_to_pcpu_many(mm->rss_stat, 
>> NR_MM_COUNTERS);
>> +}
>> +
>> +static inline void mm_counter_destroy_many(struct mm_struct *mm)
>> +{
>> +    percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
>>   }
>>     /* Optimized variant when folio is already known not to be anon */
>> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
>> index 6e62cc64cd92..a4e40ae6a8c8 100644
>> --- a/include/trace/events/kmem.h
>> +++ b/include/trace/events/kmem.h
>> @@ -399,8 +399,8 @@ TRACE_EVENT(rss_stat,
>>           __entry->mm_id = mm_ptr_to_hash(mm);
>>           __entry->curr = !!(current->mm == mm);
>>           __entry->member = member;
>> -        __entry->size = 
>> (percpu_counter_sum_positive(&mm->rss_stat[member])
>> -                                << PAGE_SHIFT);
>> +        __entry->size = (mm_counter_sum_positive(mm, member)
>> +                            << PAGE_SHIFT);
>>       ),
>>         TP_printk("mm_id=%u curr=%d type=%s size=%ldB",
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 99076dbe27d8..0214273798c5 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -823,7 +823,7 @@ static void check_mm(struct mm_struct *mm)
>>                "Please make sure 'struct resident_page_types[]' is 
>> updated as well");
>>         for (i = 0; i < NR_MM_COUNTERS; i++) {
>> -        long x = percpu_counter_sum(&mm->rss_stat[i]);
>> +        long x = mm_counter_sum(mm, i);
>>             if (unlikely(x))
>>               pr_alert("BUG: Bad rss-counter state mm:%p type:%s 
>> val:%ld\n",
>> @@ -1301,16 +1301,10 @@ static struct mm_struct *mm_init(struct 
>> mm_struct *mm, struct task_struct *p,
>>       if (mm_alloc_cid(mm))
>>           goto fail_cid;
>>   -    if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
>> -                     NR_MM_COUNTERS))
>> -        goto fail_pcpu;
>> -
>>       mm->user_ns = get_user_ns(user_ns);
>>       lru_gen_init_mm(mm);
>>       return mm;
>>   -fail_pcpu:
>> -    mm_destroy_cid(mm);
>>   fail_cid:
>>       destroy_context(mm);
>>   fail_nocontext:
>> @@ -1730,6 +1724,16 @@ static int copy_mm(unsigned long clone_flags, 
>> struct task_struct *tsk)
>>       if (!oldmm)
>>           return 0;
>>   +    /*
>> +     * For single-thread processes, rss_stat is in atomic mode, which
>> +     * reduces the memory consumption and performance regression 
>> caused by
>> +     * using percpu. For multiple-thread processes, rss_stat is 
>> switched to
>> +     * the percpu mode to reduce the error margin.
>> +     */
>> +    if (clone_flags & CLONE_THREAD)
>> +        if (mm_counter_switch_to_pcpu_many(oldmm))
>> +            return -ENOMEM;
>> +
>>       if (clone_flags & CLONE_VM) {
>>           mmget(oldmm);
>>           mm = oldmm;
>
>
-- 
Best Regards,
Peng




* Re: [RFC PATCH v2 2/2] mm: convert mm's rss stats to use atomic mode
  2024-04-19  3:32     ` zhangpeng (AS)
@ 2024-04-20  3:13       ` Rongwei Wang
  2024-04-20  8:44         ` zhangpeng (AS)
  0 siblings, 1 reply; 13+ messages in thread
From: Rongwei Wang @ 2024-04-20  3:13 UTC (permalink / raw)
  To: zhangpeng (AS), linux-mm, linux-kernel
  Cc: akpm, dennisszhou, shakeelb, jack, surenb, kent.overstreet,
	mhocko, vbabka, yuzhao, yu.ma, wangkefeng.wang, sunnanyong



On 2024/4/19 11:32, zhangpeng (AS) wrote:
> On 2024/4/19 10:30, Rongwei Wang wrote:
>
>> On 2024/4/18 22:20, Peng Zhang wrote:
>>> From: ZhangPeng <zhangpeng362@huawei.com>
>>>
>>> Since commit f1a7941243c1 ("mm: convert mm's rss stats into
>>> percpu_counter"), the rss_stats have converted into percpu_counter,
>>> which convert the error margin from (nr_threads * 64) to approximately
>>> (nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
>>> performance regression on fork/exec/shell. Even after commit 
>>> 14ef95be6f55
>>> ("kernel/fork: group allocation/free of per-cpu counters for mm 
>>> struct"),
>>> the performance of fork/exec/shell is still poor compared to previous
>>> kernel versions.
>>>
>>> To mitigate performance regression, we delay the allocation of percpu
>>> memory for rss_stats. Therefore, we convert mm's rss stats to use
>>> percpu_counter atomic mode. For single-thread processes, rss_stat is in
>>> atomic mode, which reduces the memory consumption and performance
>>> regression caused by using percpu. For multiple-thread processes,
>>> rss_stat is switched to the percpu mode to reduce the error margin.
>>> We convert rss_stats from atomic mode to percpu mode only when the
>>> second thread is created.
>> Hi, Zhang Peng
>>
>> This regression we also found it in lmbench these days. I have not 
>> test your patch, but it seems will solve a lot for it.
>> And I see this patch not fix the regression in multi-threads, that's 
>> because of the rss_stat switched to percpu mode?
>> (If I'm wrong, please correct me.) And It seems percpu_counter also 
>> has a bad effect in exit_mmap().
>>
>> If so, I'm wondering if we can further improving it on the 
>> exit_mmap() path in multi-threads scenario, e.g. to determine which 
>> CPUs the process has run on (mm_cpumask()? I'm not sure).
>>
> Hi, Rongwei,
>
> Yes, this patch only fixes the regression in single-thread processes. How
> much bad effect does percpu_counter have in exit_mmap()? IMHO, the 
> addition
Actually, I'm not sure; I just found a small free-percpu hotspot in the
exit_mmap() path when comparing 4 cores vs 32 cores.

I can test more next.
> of mm counter is already in batch mode, maybe I miss something?
>
>>>
>>> After lmbench test, we can get 2% ~ 4% performance improvement
>>> for lmbench fork_proc/exec_proc/shell_proc and 6.7% performance
>>> improvement for lmbench page_fault (before batch mode[1]).
>>>
>>> The test results are as follows:
>>>
>>>               base           base+revert        base+this patch
>>>
>>> fork_proc    416.3ms        400.0ms  (3.9%)    398.6ms  (4.2%)
>>> exec_proc    2095.9ms       2061.1ms (1.7%)    2047.7ms (2.3%)
>>> shell_proc   3028.2ms       2954.7ms (2.4%)    2961.2ms (2.2%)
>>> page_fault   0.3603ms       0.3358ms (6.8%)    0.3361ms (6.7%)
>> I think the regression will becomes more obvious if more cores. How 
>> about your test machine?
>>
> Maybe multi-core is not a factor in the performance of the lmbench 
> test here.
> Both of my test machines have 96 cores.
>
>> Thanks,
>> -wrw
>>>
>>> [1] 
>>> https://lore.kernel.org/all/20240412064751.119015-1-wangkefeng.wang@huawei.com/
>>>
>>> Suggested-by: Jan Kara <jack@suse.cz>
>>> Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>> ---
>>>   include/linux/mm.h          | 50 
>>> +++++++++++++++++++++++++++++++------
>>>   include/trace/events/kmem.h |  4 +--
>>>   kernel/fork.c               | 18 +++++++------
>>>   3 files changed, 56 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index d261e45bb29b..8f1bfbd54697 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -2631,30 +2631,66 @@ static inline bool 
>>> get_user_page_fast_only(unsigned long addr,
>>>    */
>>>   static inline unsigned long get_mm_counter(struct mm_struct *mm, 
>>> int member)
>>>   {
>>> -    return percpu_counter_read_positive(&mm->rss_stat[member]);
>>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>>> +
>>> +    if (percpu_counter_initialized(fbc))
>>> +        return percpu_counter_read_positive(fbc);
>>> +
>>> +    return percpu_counter_atomic_read(fbc);
>>>   }
>>>     void mm_trace_rss_stat(struct mm_struct *mm, int member);
>>>     static inline void add_mm_counter(struct mm_struct *mm, int 
>>> member, long value)
>>>   {
>>> -    percpu_counter_add(&mm->rss_stat[member], value);
>>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>>> +
>>> +    if (percpu_counter_initialized(fbc))
>>> +        percpu_counter_add(fbc, value);
>>> +    else
>>> +        percpu_counter_atomic_add(fbc, value);
>>>         mm_trace_rss_stat(mm, member);
>>>   }
>>>     static inline void inc_mm_counter(struct mm_struct *mm, int member)
>>>   {
>>> -    percpu_counter_inc(&mm->rss_stat[member]);
>>> -
>>> -    mm_trace_rss_stat(mm, member);
>>> +    add_mm_counter(mm, member, 1);
>>>   }
>>>     static inline void dec_mm_counter(struct mm_struct *mm, int member)
>>>   {
>>> -    percpu_counter_dec(&mm->rss_stat[member]);
>>> +    add_mm_counter(mm, member, -1);
>>> +}
>>>   -    mm_trace_rss_stat(mm, member);
>>> +static inline s64 mm_counter_sum(struct mm_struct *mm, int member)
>>> +{
>>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>>> +
>>> +    if (percpu_counter_initialized(fbc))
>>> +        return percpu_counter_sum(fbc);
>>> +
>>> +    return percpu_counter_atomic_read(fbc);
>>> +}
>>> +
>>> +static inline s64 mm_counter_sum_positive(struct mm_struct *mm, int 
>>> member)
>>> +{
>>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>>> +
>>> +    if (percpu_counter_initialized(fbc))
>>> +        return percpu_counter_sum_positive(fbc);
>>> +
>>> +    return percpu_counter_atomic_read(fbc);
>>> +}
>>> +
>>> +static inline int mm_counter_switch_to_pcpu_many(struct mm_struct *mm)
>>> +{
>>> +    return percpu_counter_switch_to_pcpu_many(mm->rss_stat, 
>>> NR_MM_COUNTERS);
>>> +}
>>> +
>>> +static inline void mm_counter_destroy_many(struct mm_struct *mm)
>>> +{
>>> +    percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
>>>   }
>>>     /* Optimized variant when folio is already known not to be anon */
>>> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
>>> index 6e62cc64cd92..a4e40ae6a8c8 100644
>>> --- a/include/trace/events/kmem.h
>>> +++ b/include/trace/events/kmem.h
>>> @@ -399,8 +399,8 @@ TRACE_EVENT(rss_stat,
>>>           __entry->mm_id = mm_ptr_to_hash(mm);
>>>           __entry->curr = !!(current->mm == mm);
>>>           __entry->member = member;
>>> -        __entry->size = 
>>> (percpu_counter_sum_positive(&mm->rss_stat[member])
>>> -                                << PAGE_SHIFT);
>>> +        __entry->size = (mm_counter_sum_positive(mm, member)
>>> +                            << PAGE_SHIFT);
>>>       ),
>>>         TP_printk("mm_id=%u curr=%d type=%s size=%ldB",
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index 99076dbe27d8..0214273798c5 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -823,7 +823,7 @@ static void check_mm(struct mm_struct *mm)
>>>                "Please make sure 'struct resident_page_types[]' is 
>>> updated as well");
>>>         for (i = 0; i < NR_MM_COUNTERS; i++) {
>>> -        long x = percpu_counter_sum(&mm->rss_stat[i]);
>>> +        long x = mm_counter_sum(mm, i);
>>>             if (unlikely(x))
>>>               pr_alert("BUG: Bad rss-counter state mm:%p type:%s 
>>> val:%ld\n",
>>> @@ -1301,16 +1301,10 @@ static struct mm_struct *mm_init(struct 
>>> mm_struct *mm, struct task_struct *p,
>>>       if (mm_alloc_cid(mm))
>>>           goto fail_cid;
>>>   -    if (percpu_counter_init_many(mm->rss_stat, 0, 
>>> GFP_KERNEL_ACCOUNT,
>>> -                     NR_MM_COUNTERS))
>>> -        goto fail_pcpu;
>>> -
>>>       mm->user_ns = get_user_ns(user_ns);
>>>       lru_gen_init_mm(mm);
>>>       return mm;
>>>   -fail_pcpu:
>>> -    mm_destroy_cid(mm);
>>>   fail_cid:
>>>       destroy_context(mm);
>>>   fail_nocontext:
>>> @@ -1730,6 +1724,16 @@ static int copy_mm(unsigned long clone_flags, 
>>> struct task_struct *tsk)
>>>       if (!oldmm)
>>>           return 0;
>>>   +    /*
>>> +     * For single-thread processes, rss_stat is in atomic mode, which
>>> +     * reduces the memory consumption and performance regression 
>>> caused by
>>> +     * using percpu. For multiple-thread processes, rss_stat is 
>>> switched to
>>> +     * the percpu mode to reduce the error margin.
>>> +     */
>>> +    if (clone_flags & CLONE_THREAD)
>>> +        if (mm_counter_switch_to_pcpu_many(oldmm))
>>> +            return -ENOMEM;
>>> +
>>>       if (clone_flags & CLONE_VM) {
>>>           mmget(oldmm);
>>>           mm = oldmm;
>>
>>




* Re: [RFC PATCH v2 2/2] mm: convert mm's rss stats to use atomic mode
  2024-04-20  3:13       ` Rongwei Wang
@ 2024-04-20  8:44         ` zhangpeng (AS)
  0 siblings, 0 replies; 13+ messages in thread
From: zhangpeng (AS) @ 2024-04-20  8:44 UTC (permalink / raw)
  To: Rongwei Wang, linux-mm, linux-kernel
  Cc: akpm, dennisszhou, shakeelb, jack, surenb, kent.overstreet,
	mhocko, vbabka, yuzhao, yu.ma, wangkefeng.wang, sunnanyong

On 2024/4/20 11:13, Rongwei Wang wrote:

> On 2024/4/19 11:32, zhangpeng (AS) wrote:
>> On 2024/4/19 10:30, Rongwei Wang wrote:
>>
>>> On 2024/4/18 22:20, Peng Zhang wrote:
>>>> From: ZhangPeng <zhangpeng362@huawei.com>
>>>>
>>>> Since commit f1a7941243c1 ("mm: convert mm's rss stats into
>>>> percpu_counter"), the rss_stats have converted into percpu_counter,
>>>> which convert the error margin from (nr_threads * 64) to approximately
>>>> (nr_cpus ^ 2). However, the new percpu allocation in mm_init() 
>>>> causes a
>>>> performance regression on fork/exec/shell. Even after commit 
>>>> 14ef95be6f55
>>>> ("kernel/fork: group allocation/free of per-cpu counters for mm 
>>>> struct"),
>>>> the performance of fork/exec/shell is still poor compared to previous
>>>> kernel versions.
>>>>
>>>> To mitigate performance regression, we delay the allocation of percpu
>>>> memory for rss_stats. Therefore, we convert mm's rss stats to use
>>>> percpu_counter atomic mode. For single-thread processes, rss_stat 
>>>> is in
>>>> atomic mode, which reduces the memory consumption and performance
>>>> regression caused by using percpu. For multiple-thread processes,
>>>> rss_stat is switched to the percpu mode to reduce the error margin.
>>>> We convert rss_stats from atomic mode to percpu mode only when the
>>>> second thread is created.
>>> Hi, Zhang Peng
>>>
>>> This regression we also found it in lmbench these days. I have not 
>>> test your patch, but it seems will solve a lot for it.
>>> And I see this patch not fix the regression in multi-threads, that's 
>>> because of the rss_stat switched to percpu mode?
>>> (If I'm wrong, please correct me.) And It seems percpu_counter also 
>>> has a bad effect in exit_mmap().
>>>
>>> If so, I'm wondering if we can further improving it on the 
>>> exit_mmap() path in multi-threads scenario, e.g. to determine which 
>>> CPUs the process has run on (mm_cpumask()? I'm not sure).
>>>
>> Hi, Rongwei,
>>
>> Yes, this patch only fixes the regression in single-thread processes. 
>> How
>> much bad effect does percpu_counter have in exit_mmap()? IMHO, the 
>> addition
> Actually, I not sure, just found a little free percpu hotspot in 
> exit_mmap() path when comparing 4 core vs 32 cores.
>
> I can test more next.

Thanks, it would be better if there were test data.

>> of mm counter is already in batch mode, maybe I miss something?
>>
>>>>
>>>> After lmbench test, we can get 2% ~ 4% performance improvement
>>>> for lmbench fork_proc/exec_proc/shell_proc and 6.7% performance
>>>> improvement for lmbench page_fault (before batch mode[1]).
>>>>
>>>> The test results are as follows:
>>>>
>>>>               base           base+revert        base+this patch
>>>>
>>>> fork_proc    416.3ms        400.0ms  (3.9%)    398.6ms (4.2%)
>>>> exec_proc    2095.9ms       2061.1ms (1.7%)    2047.7ms (2.3%)
>>>> shell_proc   3028.2ms       2954.7ms (2.4%)    2961.2ms (2.2%)
>>>> page_fault   0.3603ms       0.3358ms (6.8%)    0.3361ms (6.7%)
>>> I think the regression will becomes more obvious if more cores. How 
>>> about your test machine?
>>>
>> Maybe multi-core is not a factor in the performance of the lmbench 
>> test here.
>> Both of my test machines have 96 cores.
>>
>>> Thanks,
>>> -wrw
>>>>
>>>> [1] 
>>>> https://lore.kernel.org/all/20240412064751.119015-1-wangkefeng.wang@huawei.com/
>>>>
>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>> Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>> ---
>>>>   include/linux/mm.h          | 50 
>>>> +++++++++++++++++++++++++++++++------
>>>>   include/trace/events/kmem.h |  4 +--
>>>>   kernel/fork.c               | 18 +++++++------
>>>>   3 files changed, 56 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index d261e45bb29b..8f1bfbd54697 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -2631,30 +2631,66 @@ static inline bool get_user_page_fast_only(unsigned long addr,
>>>>    */
>>>>   static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
>>>>   {
>>>> -    return percpu_counter_read_positive(&mm->rss_stat[member]);
>>>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>>>> +
>>>> +    if (percpu_counter_initialized(fbc))
>>>> +        return percpu_counter_read_positive(fbc);
>>>> +
>>>> +    return percpu_counter_atomic_read(fbc);
>>>>   }
>>>>  
>>>>   void mm_trace_rss_stat(struct mm_struct *mm, int member);
>>>>  
>>>>   static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
>>>>   {
>>>> -    percpu_counter_add(&mm->rss_stat[member], value);
>>>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>>>> +
>>>> +    if (percpu_counter_initialized(fbc))
>>>> +        percpu_counter_add(fbc, value);
>>>> +    else
>>>> +        percpu_counter_atomic_add(fbc, value);
>>>>  
>>>>       mm_trace_rss_stat(mm, member);
>>>>   }
>>>>  
>>>>   static inline void inc_mm_counter(struct mm_struct *mm, int member)
>>>>   {
>>>> -    percpu_counter_inc(&mm->rss_stat[member]);
>>>> -
>>>> -    mm_trace_rss_stat(mm, member);
>>>> +    add_mm_counter(mm, member, 1);
>>>>   }
>>>>  
>>>>   static inline void dec_mm_counter(struct mm_struct *mm, int member)
>>>>   {
>>>> -    percpu_counter_dec(&mm->rss_stat[member]);
>>>> +    add_mm_counter(mm, member, -1);
>>>> +}
>>>>  
>>>> -    mm_trace_rss_stat(mm, member);
>>>> +static inline s64 mm_counter_sum(struct mm_struct *mm, int member)
>>>> +{
>>>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>>>> +
>>>> +    if (percpu_counter_initialized(fbc))
>>>> +        return percpu_counter_sum(fbc);
>>>> +
>>>> +    return percpu_counter_atomic_read(fbc);
>>>> +}
>>>> +
>>>> +static inline s64 mm_counter_sum_positive(struct mm_struct *mm, int member)
>>>> +{
>>>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>>>> +
>>>> +    if (percpu_counter_initialized(fbc))
>>>> +        return percpu_counter_sum_positive(fbc);
>>>> +
>>>> +    return percpu_counter_atomic_read(fbc);
>>>> +}
>>>> +
>>>> +static inline int mm_counter_switch_to_pcpu_many(struct mm_struct *mm)
>>>> +{
>>>> +    return percpu_counter_switch_to_pcpu_many(mm->rss_stat, NR_MM_COUNTERS);
>>>> +}
>>>> +
>>>> +static inline void mm_counter_destroy_many(struct mm_struct *mm)
>>>> +{
>>>> +    percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
>>>>   }
>>>>  
>>>>   /* Optimized variant when folio is already known not to be anon */
>>>> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
>>>> index 6e62cc64cd92..a4e40ae6a8c8 100644
>>>> --- a/include/trace/events/kmem.h
>>>> +++ b/include/trace/events/kmem.h
>>>> @@ -399,8 +399,8 @@ TRACE_EVENT(rss_stat,
>>>>           __entry->mm_id = mm_ptr_to_hash(mm);
>>>>           __entry->curr = !!(current->mm == mm);
>>>>           __entry->member = member;
>>>> -        __entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
>>>> -                                << PAGE_SHIFT);
>>>> +        __entry->size = (mm_counter_sum_positive(mm, member)
>>>> +                            << PAGE_SHIFT);
>>>>       ),
>>>>  
>>>>       TP_printk("mm_id=%u curr=%d type=%s size=%ldB",
>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>> index 99076dbe27d8..0214273798c5 100644
>>>> --- a/kernel/fork.c
>>>> +++ b/kernel/fork.c
>>>> @@ -823,7 +823,7 @@ static void check_mm(struct mm_struct *mm)
>>>>                "Please make sure 'struct resident_page_types[]' is updated as well");
>>>>  
>>>>       for (i = 0; i < NR_MM_COUNTERS; i++) {
>>>> -        long x = percpu_counter_sum(&mm->rss_stat[i]);
>>>> +        long x = mm_counter_sum(mm, i);
>>>>  
>>>>           if (unlikely(x))
>>>>               pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
>>>> @@ -1301,16 +1301,10 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>>>>       if (mm_alloc_cid(mm))
>>>>           goto fail_cid;
>>>>  
>>>> -    if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
>>>> -                     NR_MM_COUNTERS))
>>>> -        goto fail_pcpu;
>>>> -
>>>>       mm->user_ns = get_user_ns(user_ns);
>>>>       lru_gen_init_mm(mm);
>>>>       return mm;
>>>>  
>>>> -fail_pcpu:
>>>> -    mm_destroy_cid(mm);
>>>>  fail_cid:
>>>>       destroy_context(mm);
>>>>   fail_nocontext:
>>>> @@ -1730,6 +1724,16 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
>>>>       if (!oldmm)
>>>>           return 0;
>>>>  
>>>> +    /*
>>>> +     * For single-thread processes, rss_stat is in atomic mode, which
>>>> +     * reduces the memory consumption and performance regression caused by
>>>> +     * using percpu. For multiple-thread processes, rss_stat is switched to
>>>> +     * the percpu mode to reduce the error margin.
>>>> +     */
>>>> +    if (clone_flags & CLONE_THREAD)
>>>> +        if (mm_counter_switch_to_pcpu_many(oldmm))
>>>> +            return -ENOMEM;
>>>> +
>>>>       if (clone_flags & CLONE_VM) {
>>>>           mmget(oldmm);
>>>>           mm = oldmm;
>>>
>>>
>
>
-- 
Best Regards,
Peng




* Re: [RFC PATCH v2 0/2] mm: convert mm's rss stats to use atomic mode
  2024-04-18 14:20 [RFC PATCH v2 0/2] mm: convert mm's rss stats to use atomic mode Peng Zhang
  2024-04-18 14:20 ` [RFC PATCH v2 1/2] percpu_counter: introduce atomic mode for percpu_counter Peng Zhang
  2024-04-18 14:20 ` [RFC PATCH v2 2/2] mm: convert mm's rss stats to use atomic mode Peng Zhang
@ 2024-04-24  4:29 ` zhangpeng (AS)
  2024-04-24  4:51   ` Dennis Zhou
  2 siblings, 1 reply; 13+ messages in thread
From: zhangpeng (AS) @ 2024-04-24  4:29 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, dennisszhou, shakeelb, jack, surenb, kent.overstreet,
	mhocko, vbabka, yuzhao, yu.ma, wangkefeng.wang, sunnanyong

On 2024/4/18 22:20, Peng Zhang wrote:

Any suggestions or opinions are welcome. Could someone please review
this patch series?
Thanks!

> From: ZhangPeng <zhangpeng362@huawei.com>
>
> Since commit f1a7941243c1 ("mm: convert mm's rss stats into
> percpu_counter"), the rss_stats have converted into percpu_counter,
> which convert the error margin from (nr_threads * 64) to approximately
> (nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
> performance regression on fork/exec/shell. Even after commit 14ef95be6f55
> ("kernel/fork: group allocation/free of per-cpu counters for mm struct"),
> the performance of fork/exec/shell is still poor compared to previous
> kernel versions.
>
> To mitigate performance regression, we delay the allocation of percpu
> memory for rss_stats. Therefore, we convert mm's rss stats to use
> percpu_counter atomic mode. For single-thread processes, rss_stat is in
> atomic mode, which reduces the memory consumption and performance
> regression caused by using percpu. For multiple-thread processes,
> rss_stat is switched to the percpu mode to reduce the error margin.
> We convert rss_stats from atomic mode to percpu mode only when the
> second thread is created.
>
> After lmbench test, we can get 2% ~ 4% performance improvement
> for lmbench fork_proc/exec_proc/shell_proc and 6.7% performance
> improvement for lmbench page_fault (before batch mode[1]).
>
> The test results are as follows:
>               base           base+revert        base+this patch
>
> fork_proc    416.3ms        400.0ms  (3.9%)    398.6ms  (4.2%)
> exec_proc    2095.9ms       2061.1ms (1.7%)    2047.7ms (2.3%)
> shell_proc   3028.2ms       2954.7ms (2.4%)    2961.2ms (2.2%)
> page_fault   0.3603ms       0.3358ms (6.8%)    0.3361ms (6.7%)
>
> [1] https://lore.kernel.org/all/20240412064751.119015-1-wangkefeng.wang@huawei.com/
>
> ChangeLog:
> v2->v1:
> - Convert rss_stats from atomic mode to percpu mode only when
>    the second thread is created per Jan Kara.
> - Compared with v1, the performance data may be different due to
>    different test machines.
>
> ZhangPeng (2):
>    percpu_counter: introduce atomic mode for percpu_counter
>    mm: convert mm's rss stats to use atomic mode
>
>   include/linux/mm.h             | 50 +++++++++++++++++++++++++++++-----
>   include/linux/percpu_counter.h | 43 +++++++++++++++++++++++++++--
>   include/trace/events/kmem.h    |  4 +--
>   kernel/fork.c                  | 18 +++++++-----
>   lib/percpu_counter.c           | 31 +++++++++++++++++++--
>   5 files changed, 125 insertions(+), 21 deletions(-)
>
-- 
Best Regards,
Peng




* Re: [RFC PATCH v2 0/2] mm: convert mm's rss stats to use atomic mode
  2024-04-24  4:29 ` [RFC PATCH v2 0/2] " zhangpeng (AS)
@ 2024-04-24  4:51   ` Dennis Zhou
  0 siblings, 0 replies; 13+ messages in thread
From: Dennis Zhou @ 2024-04-24  4:51 UTC (permalink / raw)
  To: zhangpeng (AS)
  Cc: linux-mm, linux-kernel, akpm, shakeelb, jack, surenb,
	kent.overstreet, mhocko, vbabka, yuzhao, yu.ma, wangkefeng.wang,
	sunnanyong

Hi Peng,

On Wed, Apr 24, 2024 at 12:29:25PM +0800, zhangpeng (AS) wrote:
> On 2024/4/18 22:20, Peng Zhang wrote:
> 
> Any suggestions or opinions are welcome. Could someone please review
> this patch series?
> Thanks!
> 

Sorry, I haven't been very active lately. This is what I remember
discussing a while back. I'll take a close look tomorrow.

Thanks,
Dennis

> > From: ZhangPeng <zhangpeng362@huawei.com>
> > 
> > Since commit f1a7941243c1 ("mm: convert mm's rss stats into
> > percpu_counter"), the rss_stats have converted into percpu_counter,
> > which convert the error margin from (nr_threads * 64) to approximately
> > (nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
> > performance regression on fork/exec/shell. Even after commit 14ef95be6f55
> > ("kernel/fork: group allocation/free of per-cpu counters for mm struct"),
> > the performance of fork/exec/shell is still poor compared to previous
> > kernel versions.
> > 
> > To mitigate performance regression, we delay the allocation of percpu
> > memory for rss_stats. Therefore, we convert mm's rss stats to use
> > percpu_counter atomic mode. For single-thread processes, rss_stat is in
> > atomic mode, which reduces the memory consumption and performance
> > regression caused by using percpu. For multiple-thread processes,
> > rss_stat is switched to the percpu mode to reduce the error margin.
> > We convert rss_stats from atomic mode to percpu mode only when the
> > second thread is created.
> > 
> > After lmbench test, we can get 2% ~ 4% performance improvement
> > for lmbench fork_proc/exec_proc/shell_proc and 6.7% performance
> > improvement for lmbench page_fault (before batch mode[1]).
> > 
> > The test results are as follows:
> >               base           base+revert        base+this patch
> > 
> > fork_proc    416.3ms        400.0ms  (3.9%)    398.6ms  (4.2%)
> > exec_proc    2095.9ms       2061.1ms (1.7%)    2047.7ms (2.3%)
> > shell_proc   3028.2ms       2954.7ms (2.4%)    2961.2ms (2.2%)
> > page_fault   0.3603ms       0.3358ms (6.8%)    0.3361ms (6.7%)
> > 
> > [1] https://lore.kernel.org/all/20240412064751.119015-1-wangkefeng.wang@huawei.com/
> > 
> > ChangeLog:
> > v2->v1:
> > - Convert rss_stats from atomic mode to percpu mode only when
> >    the second thread is created per Jan Kara.
> > - Compared with v1, the performance data may be different due to
> >    different test machines.
> > 
> > ZhangPeng (2):
> >    percpu_counter: introduce atomic mode for percpu_counter
> >    mm: convert mm's rss stats to use atomic mode
> > 
> >   include/linux/mm.h             | 50 +++++++++++++++++++++++++++++-----
> >   include/linux/percpu_counter.h | 43 +++++++++++++++++++++++++++--
> >   include/trace/events/kmem.h    |  4 +--
> >   kernel/fork.c                  | 18 +++++++-----
> >   lib/percpu_counter.c           | 31 +++++++++++++++++++--
> >   5 files changed, 125 insertions(+), 21 deletions(-)
> > 
> -- 
> Best Regards,
> Peng
> 



* Re: [RFC PATCH v2 1/2] percpu_counter: introduce atomic mode for percpu_counter
  2024-04-18 14:20 ` [RFC PATCH v2 1/2] percpu_counter: introduce atomic mode for percpu_counter Peng Zhang
  2024-04-18 19:40   ` Andrew Morton
@ 2024-04-26  8:11   ` Dennis Zhou
  2024-04-29  7:45     ` zhangpeng (AS)
  1 sibling, 1 reply; 13+ messages in thread
From: Dennis Zhou @ 2024-04-26  8:11 UTC (permalink / raw)
  To: Peng Zhang
  Cc: linux-mm, linux-kernel, akpm, dennisszhou, shakeelb, jack,
	surenb, kent.overstreet, mhocko, vbabka, yuzhao, yu.ma,
	wangkefeng.wang, sunnanyong

On Thu, Apr 18, 2024 at 10:20:07PM +0800, Peng Zhang wrote:
> From: ZhangPeng <zhangpeng362@huawei.com>
> 
> Depending on whether counters is NULL, we can support two modes:
> atomic mode and percpu mode. We implement both modes by grouping
> the s64 count and atomic64_t count_atomic in a union. At the same time,
> we create the interface for adding and reading in atomic mode and for
> switching atomic mode to percpu mode.
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>  include/linux/percpu_counter.h | 43 +++++++++++++++++++++++++++++++---
>  lib/percpu_counter.c           | 31 ++++++++++++++++++++++--
>  2 files changed, 69 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
> index 3a44dd1e33d2..160f9734c0bb 100644
> --- a/include/linux/percpu_counter.h
> +++ b/include/linux/percpu_counter.h
> @@ -21,7 +21,13 @@
>  
>  struct percpu_counter {
>  	raw_spinlock_t lock;
> -	s64 count;
> +	/* Depending on whether counters is NULL, we can support two modes,
> +	 * atomic mode using count_atomic and percpu mode using count.
> +	 */
> +	union {
> +		s64 count;
> +		atomic64_t count_atomic;
> +	};
>  #ifdef CONFIG_HOTPLUG_CPU
>  	struct list_head list;	/* All percpu_counters are on a list */
>  #endif
> @@ -32,14 +38,14 @@ extern int percpu_counter_batch;
>  
>  int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount,
>  			       gfp_t gfp, u32 nr_counters,
> -			       struct lock_class_key *key);
> +			       struct lock_class_key *key, bool switch_mode);

Nit: the lock_class_key at the end.
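i.e. keep the key as the last parameter, something along the lines of
(just sketching the parameter order, not a final signature):

	int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount,
				       gfp_t gfp, u32 nr_counters,
				       bool switch_mode,
				       struct lock_class_key *key);
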
>  
>  #define percpu_counter_init_many(fbc, value, gfp, nr_counters)		\
>  	({								\
>  		static struct lock_class_key __key;			\
>  									\
>  		__percpu_counter_init_many(fbc, value, gfp, nr_counters,\
> -					   &__key);			\
> +					   &__key, false);		\
>  	})
>  
>  
> @@ -130,6 +136,20 @@ static inline bool percpu_counter_initialized(struct percpu_counter *fbc)
>  	return (fbc->counters != NULL);
>  }
>  
> +static inline s64 percpu_counter_atomic_read(struct percpu_counter *fbc)
> +{
> +	return atomic64_read(&fbc->count_atomic);
> +}
> +
> +static inline void percpu_counter_atomic_add(struct percpu_counter *fbc,
> +					     s64 amount)
> +{
> +	atomic64_add(amount, &fbc->count_atomic);
> +}
> +
> +int percpu_counter_switch_to_pcpu_many(struct percpu_counter *fbc,
> +				       u32 nr_counters);
> +
>  #else /* !CONFIG_SMP */
>  
>  struct percpu_counter {
> @@ -260,6 +280,23 @@ static inline bool percpu_counter_initialized(struct percpu_counter *fbc)
>  static inline void percpu_counter_sync(struct percpu_counter *fbc)
>  {
>  }
> +
> +static inline s64 percpu_counter_atomic_read(struct percpu_counter *fbc)
> +{
> +	return fbc->count;
> +}
> +
> +static inline void percpu_counter_atomic_add(struct percpu_counter *fbc,
> +					     s64 amount)
> +{
> +	percpu_counter_add(fbc, amount);
> +}
> +
> +static inline int percpu_counter_switch_to_pcpu_many(struct percpu_counter *fbc,
> +						     u32 nr_counters)
> +{
> +	return 0;
> +}
>  #endif	/* CONFIG_SMP */
>  
>  static inline void percpu_counter_inc(struct percpu_counter *fbc)
> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> index 44dd133594d4..95c4e038051a 100644
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -153,7 +153,7 @@ EXPORT_SYMBOL(__percpu_counter_sum);
>  
>  int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount,
>  			       gfp_t gfp, u32 nr_counters,
> -			       struct lock_class_key *key)
> +			       struct lock_class_key *key, bool switch_mode)
>  {
>  	unsigned long flags __maybe_unused;
>  	size_t counter_size;
> @@ -174,7 +174,8 @@ int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount,
>  #ifdef CONFIG_HOTPLUG_CPU
>  		INIT_LIST_HEAD(&fbc[i].list);
>  #endif
> -		fbc[i].count = amount;
> +		if (likely(!switch_mode))
> +			fbc[i].count = amount;
>  		fbc[i].counters = (void *)counters + (i * counter_size);
>  
>  		debug_percpu_counter_activate(&fbc[i]);
> @@ -357,6 +358,32 @@ bool __percpu_counter_limited_add(struct percpu_counter *fbc,
>  	return good;
>  }
>  
> +/*
> + * percpu_counter_switch_to_pcpu_many: Converts struct percpu_counters from
> + * atomic mode to percpu mode.
> + */
> +int percpu_counter_switch_to_pcpu_many(struct percpu_counter *fbc,
> +				       u32 nr_counters)
> +{
> +	static struct lock_class_key __key;

This is an improper use of lockdep. Now all of the percpu_counters
initialized on this path will key off of this lock_class_key.
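
A rough sketch of what I mean (my own illustration, not code from the
patch): two unrelated users converted through this helper end up sharing
a single lockdep class for their fbc->lock, unlike counters set up via
percpu_counter_init_many(), where each call site supplies its own static
key through the macro.

	#include <linux/percpu_counter.h>

	static struct percpu_counter foo_stat;	/* hypothetical user A */
	static struct percpu_counter bar_stat;	/* hypothetical user B */

	void foo_go_percpu(void)
	{
		/* conversion uses the one static __key inside the helper */
		percpu_counter_switch_to_pcpu_many(&foo_stat, 1);
	}

	void bar_go_percpu(void)
	{
		/* same static __key again */
		percpu_counter_switch_to_pcpu_many(&bar_stat, 1);
		/*
		 * lockdep now treats foo_stat.lock and bar_stat.lock as the
		 * same class, so lock dependencies recorded against one are
		 * applied to the other, which can produce false positive
		 * reports and makes a per-subsystem split impossible.
		 */
	}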

> +	unsigned long flags;
> +	bool ret = 0;
> +
> +	if (percpu_counter_initialized(fbc))
> +		return 0;
> +
> +	preempt_disable();
> +	local_irq_save(flags);
> +	if (likely(!percpu_counter_initialized(fbc)))
> +		ret = __percpu_counter_init_many(fbc, 0,
> +					GFP_ATOMIC|__GFP_NOWARN|__GFP_ZERO,
> +					nr_counters, &__key, true);
> +	local_irq_restore(flags);
> +	preempt_enable();
> +
> +	return ret;
> +}

I'm staring at this API and I'm not in love with it. I think it hinges
on there being only one user of mm_stats beforehand, hence it's safe.
Generically though, I can't see why this is safe.

I need to give this a little more thought, but my gut reaction is I'd
rather this look like percpu_refcount where we can init dead minus the
percpu allocation. Maybe that's not quite right, but I'd feel better
about it than requiring external synchronization on a percpu_counter to
ensure that it's correct.
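
For reference, the shape I'm thinking of is roughly the existing
percpu_ref flow (from memory, so treat the details as approximate; the
counter variant would additionally defer the percpu allocation itself
until the switch):

	#include <linux/percpu-refcount.h>

	static void example_release(struct percpu_ref *ref)
	{
		/* hypothetical release callback */
	}

	int example_setup(struct percpu_ref *ref)
	{
		/* start out in atomic mode... */
		return percpu_ref_init(ref, example_release,
				       PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
	}

	void example_go_percpu(struct percpu_ref *ref)
	{
		/* ...and switch modes later through the API itself */
		percpu_ref_switch_to_percpu(ref);
	}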

> +
>  static int __init percpu_counter_startup(void)
>  {
>  	int ret;
> -- 
> 2.25.1
> 

Thanks,
Dennis



* Re: [RFC PATCH v2 1/2] percpu_counter: introduce atomic mode for percpu_counter
  2024-04-26  8:11   ` Dennis Zhou
@ 2024-04-29  7:45     ` zhangpeng (AS)
  0 siblings, 0 replies; 13+ messages in thread
From: zhangpeng (AS) @ 2024-04-29  7:45 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: linux-mm, linux-kernel, akpm, shakeelb, jack, surenb,
	kent.overstreet, mhocko, vbabka, yuzhao, yu.ma, wangkefeng.wang,
	sunnanyong

On 2024/4/26 16:11, Dennis Zhou wrote:

> On Thu, Apr 18, 2024 at 10:20:07PM +0800, Peng Zhang wrote:
>> From: ZhangPeng <zhangpeng362@huawei.com>
>>
>> Depending on whether counters is NULL, we can support two modes:
>> atomic mode and percpu mode. We implement both modes by grouping
>> the s64 count and atomic64_t count_atomic in a union. At the same time,
>> we create the interface for adding and reading in atomic mode and for
>> switching atomic mode to percpu mode.
>>
>> Suggested-by: Jan Kara <jack@suse.cz>
>> Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>> ---
>>   include/linux/percpu_counter.h | 43 +++++++++++++++++++++++++++++++---
>>   lib/percpu_counter.c           | 31 ++++++++++++++++++++++--
>>   2 files changed, 69 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
>> index 3a44dd1e33d2..160f9734c0bb 100644
>> --- a/include/linux/percpu_counter.h
>> +++ b/include/linux/percpu_counter.h
>> @@ -21,7 +21,13 @@
>>   
>>   struct percpu_counter {
>>   	raw_spinlock_t lock;
>> -	s64 count;
>> +	/* Depending on whether counters is NULL, we can support two modes,
>> +	 * atomic mode using count_atomic and percpu mode using count.
>> +	 */
>> +	union {
>> +		s64 count;
>> +		atomic64_t count_atomic;
>> +	};
>>   #ifdef CONFIG_HOTPLUG_CPU
>>   	struct list_head list;	/* All percpu_counters are on a list */
>>   #endif
>> @@ -32,14 +38,14 @@ extern int percpu_counter_batch;
>>   
>>   int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount,
>>   			       gfp_t gfp, u32 nr_counters,
>> -			       struct lock_class_key *key);
>> +			       struct lock_class_key *key, bool switch_mode);
> Nit: the lock_class_key at the end.

Got it.

>>   
>>   #define percpu_counter_init_many(fbc, value, gfp, nr_counters)		\
>>   	({								\
>>   		static struct lock_class_key __key;			\
>>   									\
>>   		__percpu_counter_init_many(fbc, value, gfp, nr_counters,\
>> -					   &__key);			\
>> +					   &__key, false);		\
>>   	})
>>   
>>   
>> @@ -130,6 +136,20 @@ static inline bool percpu_counter_initialized(struct percpu_counter *fbc)
>>   	return (fbc->counters != NULL);
>>   }
>>   
>> +static inline s64 percpu_counter_atomic_read(struct percpu_counter *fbc)
>> +{
>> +	return atomic64_read(&fbc->count_atomic);
>> +}
>> +
>> +static inline void percpu_counter_atomic_add(struct percpu_counter *fbc,
>> +					     s64 amount)
>> +{
>> +	atomic64_add(amount, &fbc->count_atomic);
>> +}
>> +
>> +int percpu_counter_switch_to_pcpu_many(struct percpu_counter *fbc,
>> +				       u32 nr_counters);
>> +
>>   #else /* !CONFIG_SMP */
>>   
>>   struct percpu_counter {
>> @@ -260,6 +280,23 @@ static inline bool percpu_counter_initialized(struct percpu_counter *fbc)
>>   static inline void percpu_counter_sync(struct percpu_counter *fbc)
>>   {
>>   }
>> +
>> +static inline s64 percpu_counter_atomic_read(struct percpu_counter *fbc)
>> +{
>> +	return fbc->count;
>> +}
>> +
>> +static inline void percpu_counter_atomic_add(struct percpu_counter *fbc,
>> +					     s64 amount)
>> +{
>> +	percpu_counter_add(fbc, amount);
>> +}
>> +
>> +static inline int percpu_counter_switch_to_pcpu_many(struct percpu_counter *fbc,
>> +						     u32 nr_counters)
>> +{
>> +	return 0;
>> +}
>>   #endif	/* CONFIG_SMP */
>>   
>>   static inline void percpu_counter_inc(struct percpu_counter *fbc)
>> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
>> index 44dd133594d4..95c4e038051a 100644
>> --- a/lib/percpu_counter.c
>> +++ b/lib/percpu_counter.c
>> @@ -153,7 +153,7 @@ EXPORT_SYMBOL(__percpu_counter_sum);
>>   
>>   int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount,
>>   			       gfp_t gfp, u32 nr_counters,
>> -			       struct lock_class_key *key)
>> +			       struct lock_class_key *key, bool switch_mode)
>>   {
>>   	unsigned long flags __maybe_unused;
>>   	size_t counter_size;
>> @@ -174,7 +174,8 @@ int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount,
>>   #ifdef CONFIG_HOTPLUG_CPU
>>   		INIT_LIST_HEAD(&fbc[i].list);
>>   #endif
>> -		fbc[i].count = amount;
>> +		if (likely(!switch_mode))
>> +			fbc[i].count = amount;
>>   		fbc[i].counters = (void *)counters + (i * counter_size);
>>   
>>   		debug_percpu_counter_activate(&fbc[i]);
>> @@ -357,6 +358,32 @@ bool __percpu_counter_limited_add(struct percpu_counter *fbc,
>>   	return good;
>>   }
>>   
>> +/*
>> + * percpu_counter_switch_to_pcpu_many: Converts struct percpu_counters from
>> + * atomic mode to percpu mode.
>> + */
>> +int percpu_counter_switch_to_pcpu_many(struct percpu_counter *fbc,
>> +				       u32 nr_counters)
>> +{
>> +	static struct lock_class_key __key;
> This is an improper use of lockdep. Now all of the percpu_counters
> initialized on this path will key off of this lock_class_key.

Sorry, I don't know much about lock_class_key, and I'm not sure I understand
why it's not appropriate here. Could you please help me with the details?

>> +	unsigned long flags;
>> +	bool ret = 0;
>> +
>> +	if (percpu_counter_initialized(fbc))
>> +		return 0;
>> +
>> +	preempt_disable();
>> +	local_irq_save(flags);
>> +	if (likely(!percpu_counter_initialized(fbc)))
>> +		ret = __percpu_counter_init_many(fbc, 0,
>> +					GFP_ATOMIC|__GFP_NOWARN|__GFP_ZERO,
>> +					nr_counters, &__key, true);
>> +	local_irq_restore(flags);
>> +	preempt_enable();
>> +
>> +	return ret;
>> +}
> I'm staring at this API and I'm not in love with it. I think it hinges
> on there being only one user of mm_stats beforehand, hence it's safe.
> Generically though, I can't see why this is safe.
>
> I need to give this a little more thought, but my gut reaction is I'd
> rather this look like percpu_refcount where we can init dead minus the
> percpu allocation. Maybe that's not quite right, but I'd feel better
> about it than requiring external synchronization on a percpu_counter to
> ensure that it's correct.

Maybe it would be better if I changed this API into an internal helper of
the mm counter code.

Sorry again, maybe percpu_refcount is better, but I don't understand the
sentence "we can init dead minus the percpu allocation". Please forgive my
ignorance...

Thank you very much for your review and valuable comments!

>> +
>>   static int __init percpu_counter_startup(void)
>>   {
>>   	int ret;
>> -- 
>> 2.25.1
>>
> Thanks,
> Dennis

-- 
Best Regards,
Peng



